Spark 1.0 embedded Hive libraries

2014-06-06 Thread Silvio Fiorito
Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore) used in SparkSQL? Are they forked with Spark-specific customizations, like Shark, or simply relabeled with a new package name (org.spark-project.hive)? I couldn't find any repos on Github

Re: Spark 1.0 embedded Hive libraries

2014-06-06 Thread Silvio Fiorito
- Patrick On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore) used in SparkSQL? Are they forked with Spark-specific customizations, like Shark, or simply

SparkSQL Hive partitioning support

2014-08-13 Thread Silvio Fiorito
Using the SchemaRDD insertInto method, is there any way to support partitions on a field in the RDD? If not, what's the alternative, register a table and do an insert into via SQL statement? Any plans to support partitioning via insertInto? What other options are there for inserting into a

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Silvio Fiorito
Yin, Michael, Thanks, I'll try the SQL route. From: Yin Huai huaiyin@gmail.com Date: Wednesday, August 13, 2014 at 5:04 PM To: Michael Armbrust mich...@databricks.com Cc: Silvio Fiorito silvio.fior...@granturing.com

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Silvio Fiorito
You also need to ensure you're using checkpointing and support recreating the context on driver failure as described in the docs here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node From: Matt Narrell

Re: Does HiveContext support Parquet?

2014-08-15 Thread Silvio Fiorito
Yes, you can write to Parquet tables. On Spark 1.0.2 all I had to do was include the parquet-hive-bundle-1.5.0.jar on my classpath. From: lyc yanchen@huawei.com Sent: Friday, August 15, 2014 7:30 PM To: u...@spark.incubator.apache.org

RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
There's really nothing special besides including that jar on your classpath. You just do selects, inserts, etc as you normally would. The same instructions here apply https://cwiki.apache.org/confluence/display/Hive/Parquet From:

RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
If you're using HiveContext then all metadata is in the Hive metastore as defined in hive-site.xml. Concurrent writes should be fine as long as you're using a concurrent metastore db. From: Flavio Pompermaier pomperma...@okkam.it Sent: 8/16/2014 1:26 PM

Re: How to use Spark Streaming from an HTTP api?

2014-08-18 Thread Silvio Fiorito
You need to create a custom receiver that submits the HTTP requests then deserializes the data and pushes it into the Streaming context. See here for an example: http://spark.apache.org/docs/latest/streaming-custom-receivers.html On 8/18/14, 6:20 PM, bumble123 tc1...@att.com wrote: I want to

Re: Does HiveContext support Parquet?

2014-08-18 Thread Silvio Fiorito
First the JAR needs to be deployed using the --jars argument. Then in your HQL code you need to use the DeprecatedParquetInputFormat and DeprecatedParquetOutputFormat as described here: https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12 This is because SparkSQL is based

Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
This is a long running Spark Streaming job running in YARN, Spark v1.0.2 on CDH5. The jobs will run for about 34-37 hours then die due to this FileNotFoundException. There’s very little CPU or RAM usage, I’m running 2 x cores, 2 x executors, 4g memory, YARN cluster mode. Here’s the stack

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
Thanks, I’ll go ahead and disable that setting for now. From: Aaron Davidson ilike...@gmail.com Date: Wednesday, August 20, 2014 at 3:20 PM To: Silvio Fiorito silvio.fior...@granturing.com Cc: user@spark.apache.org

Re: Does HiveContext support Parquet?

2014-08-27 Thread Silvio Fiorito
What Spark and Hadoop versions are you on? I have it working in my Spark app with the parquet-hive-bundle-1.5.0.jar bundled into my app fat-jar. I'm running Spark 1.0.2 and CDH5. Try bin/spark-shell --master local[*] --driver-class-path ~/parquet-hive-bundle-1.5.0.jar to see if that works. On

Re: save schemardd to hive

2014-09-02 Thread Silvio Fiorito
You can use saveAsTable or do an INSERT SparkSQL statement as well in case you need other Hive query features, like partitioning. On 9/2/14, 6:54 AM, centerqi hu cente...@gmail.com wrote: I got it import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ val hiveContext =
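For illustration, a rough sketch of both options against a Spark 1.1-era HiveContext (the SchemaRDD results, the table names, and the partition column are hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)

    // Option 1: write the SchemaRDD out as a managed Hive table
    results.saveAsTable("events_archive")

    // Option 2: register a temp table and run HiveQL, e.g. to target a specific partition
    results.registerTempTable("results_tmp")
    hiveContext.sql(
      "INSERT INTO TABLE events PARTITION (dt = '2014-09-02') SELECT * FROM results_tmp")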

Re: Store DStreams into Hive using Hive Streaming

2014-11-06 Thread Silvio Fiorito
Geovani, You can use HiveContext to do inserts into a Hive table in a Streaming app just as you would a batch app. A DStream is really a collection of RDDs so you can run the insert from within the foreachRDD. You just have to be careful that you’re not creating large amounts of small files.
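A minimal sketch of that pattern, assuming Spark 1.1/1.2, a StreamingContext ssc, a DStream events of some case class, and an existing Hive table events_table (all names hypothetical):

    import org.apache.spark.sql.hive.HiveContext

    // reuse one HiveContext for the life of the app rather than one per batch
    val hiveContext = new HiveContext(ssc.sparkContext)
    import hiveContext.createSchemaRDD   // implicit RDD[case class] -> SchemaRDD

    events.foreachRDD { rdd =>
      rdd.registerTempTable("batch_events")
      hiveContext.sql("INSERT INTO TABLE events_table SELECT * FROM batch_events")
    }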

Re: clean up of state in State Dstream

2014-12-12 Thread Silvio Fiorito
If you no longer need to maintain state for a key, just return None for that value and it gets removed. From: Sunil Yarram yvsu...@gmail.com Date: Friday, December 12, 2014 at 9:44 PM To: user@spark.apache.org

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
Hi Pierce, You shouldn’t have to use groupByKey because updateStateByKey will get a Seq of all the values for that key already. I used that for realtime sessionization as well. What I did was key my incoming events, then send them to updateStateByKey. The updateStateByKey function then
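A skeleton of that approach (the Event/Session types, the timeout logic, and the keyed DStream are hypothetical placeholders):

    // state update: fold the batch's new events into the session, and return None
    // once the session has timed out so Spark drops its state for that key
    def updateSession(newEvents: Seq[Event], state: Option[Session]): Option[Session] = {
      val session = state.getOrElse(Session.empty).addEvents(newEvents)
      if (session.timedOut) None else Some(session)
    }

    val sessions = eventsByUser.updateStateByKey(updateSession)   // DStream[(UserId, Session)]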

Re: Spark 1.2 Release Date

2014-12-18 Thread Silvio Fiorito
It’s on Maven Central already http://search.maven.org/#browse%7C717101892 On 12/18/14, 2:09 PM, Al M alasdair.mcbr...@gmail.com wrote: Is there a planned release date for Spark 1.2? I saw on the Spark Wiki https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we are

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
the expected output, thank you! On Thu, Dec 18, 2014 at 12:11 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Ok, I have a better idea of what you’re trying to do now. I think the prob might be the map. The first time the function runs, currentValue will be None. Using map on None

Re: Spark Streaming Threading Model

2014-12-19 Thread Silvio Fiorito
Batches will wait for the previous batch to finish. The monitoring console will show you the backlog of waiting batches. From: Asim Jalis asimja...@gmail.com Date: Friday, December 19, 2014 at 1:16 PM To: user user@spark.apache.org Subject:

RE: Publishing streaming results to web interface

2015-01-03 Thread Silvio Fiorito
Is this through a streaming app? I've done this before by publishing results out to a queue or message bus, with a web app listening on the other end. If it's just batch or infrequent you could save the results out to a file. From:

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Silvio Fiorito
Nick, Have you tried https://github.com/kaitoy/pcap4j I’ve used this in a Spark app already and didn’t have any issues. My use case was slightly different than yours, but you should give it a try. From: Nick Allen n...@nickallen.org Date: Friday, January 16, 2015 at

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Silvio Fiorito
Hey Todd, I don’t have an app to test against the thrift server, are you able to define custom SQL without using Tableau’s schema query? I guess it’s not possible to just use SparkSQL temp tables, you may have to use permanent Hive tables that are actually in the metastore so Tableau can

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Silvio Fiorito
One method I’ve used is to publish each batch to a message bus or queue with a custom UI listening on the other end, displaying the results in d3.js or some other app. As far as I’m aware there isn’t a tool that will directly take a DStream. Spark Notebook seems to have some support for

Re: SparkSQL + Tableau Connector

2015-02-19 Thread Silvio Fiorito
Great, glad it worked out! From: Todd Nist Date: Thursday, February 19, 2015 at 9:19 AM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: SparkSQL + Tableau Connector Hi Silvio, I got this working today using your suggestion with the Initial SQL and a Custom

Re: Multiple Spark Streaming receiver model

2015-01-06 Thread Silvio Fiorito
Hi Manjul, Each StreamingContext will have its own batch size. If that doesn’t work for the different sources you have then you would have to create different streaming apps. You can only create a new StreamingContext in the same Spark app, once you’ve stopped the previous one. Spark

Re: Spark Web UI Doesn't Open in Yarn-Client Mode

2015-02-14 Thread Silvio Fiorito
On YARN you need to first go to the resource manager UI, find your job, and click the link for the UI there. From: Puneet Kumar Ojha puneet.ku...@pubmatic.com Sent: Saturday, February 14, 2015 5:25 AM To: user@spark.apache.org Hi, I am running 3

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
Hi Todd, What you could do is run some SparkSQL commands immediately after the Thrift server starts up. Or does Tableau have some init SQL commands you could run? You can actually load data using SQL, such as: create temporary table people using org.apache.spark.sql.json options (path
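For reference, the full shape of that statement as run through the Thrift server or Tableau's Initial SQL (the JSON path is hypothetical):

    create temporary table people
    using org.apache.spark.sql.json
    options (path "hdfs:///data/people.json");

    cache table people;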

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
/*') ; Time taken: 0.34 seconds spark-sql select * from people; NULLMichael 30 Andy 19 Justin NULLMichael 30 Andy 19 Justin Time taken: 0.576 seconds From: Todd Nist Date: Tuesday, February 10, 2015 at 6:49 PM To: Silvio Fiorito Cc: user@spark.apache.orgmailto:user@spark.apache.org Subject

Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-10 Thread Silvio Fiorito
Have you tried Apache Daemon? http://commons.apache.org/proper/commons-daemon/procrun.html From: Wang, Ningjun (LNG-NPV) Date: Tuesday, March 10, 2015 at 11:47 PM To: user@spark.apache.org Subject: Is it possible to use windows service to start and stop spark

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
on the streaming side it really helps simplify running tests on batch outputs. If you’re having serialization issues you may need to look at using transient lazy initializers, to see if that helps? From: Michal Michalski Date: Tuesday, April 28, 2015 at 11:42 AM To: Silvio Fiorito Cc: user Subject: Re: Best

Re: Spark partitioning question

2015-04-28 Thread Silvio Fiorito
. From: Marius Danciu Date: Tuesday, April 28, 2015 at 9:53 AM To: Silvio Fiorito, user Subject: Re: Spark partitioning question Thank you Silvio, I am aware of groupBy limitations and this is subject for replacement. I did try repartitionAndSortWithinPartitions but then I end up with maybe too

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
If you need to keep the keys, you can use aggregateByKey to calculate an avg of the values: val step1 = data.aggregateByKey((0.0, 0))((a, b) => (a._1 + b, a._2 + 1), (a, b) => (a._1 + b._1, a._2 + b._2)) val avgByKey = step1.mapValues(i => i._1/i._2) Essentially, what this is doing is passing an
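The same computation spelled out with comments (assuming data is an RDD[(String, Double)]):

    val sumCounts = data.aggregateByKey((0.0, 0))(
      (acc, value) => (acc._1 + value, acc._2 + 1),   // fold one value into a (sum, count) accumulator
      (a, b) => (a._1 + b._1, a._2 + b._2))           // merge accumulators from different partitions

    val avgByKey = sumCounts.mapValues { case (sum, count) => sum / count }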

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
Hi Michal, Please try spark-testing-base by Holden. I’ve used it and it works well for unit testing batch and streaming jobs https://github.com/holdenk/spark-testing-base Thanks, Silvio From: Michal Michalski Date: Tuesday, April 28, 2015 at 11:32 AM To: user Subject: Best practices on

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
to the first statement in Silvio's answer (below). But my reasoning to deconstruct and reconstruct is missing something. Thanks again! On 04/28/2015 11:26 AM, Silvio Fiorito wrote: If you need to keep the keys, you can use aggregateByKey to calculate an avg of the values: val step1

RE: How to group multiple row data ?

2015-04-29 Thread Silvio Fiorito
I think you'd probably want to look at combineByKey. I'm on my phone so can't give you an example, but that's one solution I would try. You would then take the resulting RDD and go back to a DF if needed. From: bipin bipin@gmail.com Sent: 4/29/2015

Re: Stop Cluster Mode Running App

2015-05-07 Thread Silvio Fiorito
Hi James, If you’re on Spark 1.3 you can use the kill command in spark-submit to shut it down. You’ll need the driver id from the Spark UI or from when you submitted the app. spark-submit --master spark://master:7077 --kill driver-id Thanks, Silvio From: James King Date: Wednesday, May 6,

Re: It takes too long (30 seconds) to create Spark Context with SPARK/YARN

2015-05-11 Thread Silvio Fiorito
Could you upload the spark assembly to HDFS and then set spark.yarn.jar to the path where you uploaded it? That can help minimize start-up time. How long if you start just a spark shell? On 5/11/15, 11:15 AM, stanley wangshua...@yahoo.com wrote: I am running Spark jobs on YARN cluster. It
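Roughly, the setup looks like this (the assembly file name and HDFS path are hypothetical):

    hdfs dfs -mkdir -p /apps/spark
    hdfs dfs -put spark-assembly-1.3.1-hadoop2.6.0.jar /apps/spark/

    spark-submit --master yarn-cluster \
      --conf spark.yarn.jar=hdfs:///apps/spark/spark-assembly-1.3.1-hadoop2.6.0.jar \
      myapp.jar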

Re: Submit Spark application in cluster mode and supervised

2015-05-08 Thread Silvio Fiorito
If you’re using multiple masters with ZooKeeper then you should set your master URL to be spark://host01:7077,host02:7077 And the property spark.deploy.recoveryMode=ZOOKEEPER See here for more info: http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
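A sketch of the configuration (hostnames and the ZooKeeper quorum are hypothetical):

    # spark-env.sh on each master
    export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk01:2181,zk02:2181,zk03:2181"

    # submit against both masters, in cluster mode with supervision
    spark-submit --master spark://host01:7077,host02:7077 \
      --deploy-mode cluster --supervise myapp.jar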

Re: Does long-lived SparkContext hold on to executor resources?

2015-05-11 Thread Silvio Fiorito
You want to look at dynamic resource allocation, here: http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation On 5/11/15, 11:23 AM, stanley wangshua...@yahoo.com wrote: I am building an analytics app with Spark. I plan to use long-lived SparkContexts to minimize

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Silvio Fiorito
Yes, just put the Cassandra connector on the Spark classpath and set the connector config properties in the interpreter settings. From: Mohammed Guller Date: Monday, June 22, 2015 at 11:56 AM To: Matthew Johnson, shahid ashraf Cc: user@spark.apache.orgmailto:user@spark.apache.org Subject: RE:

Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
a mapPartitions or one of the other combineByKey APIs? From: Jianguo Li Date: Tuesday, June 23, 2015 at 9:46 AM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: workaround for groupByKey Thanks. Yes, unfortunately, they all need to be grouped. I guess I can

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
perhaps? From: Jianguo Li Date: Monday, June 22, 2015 at 6:21 PM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: workaround for groupByKey Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in the Learning Spark book: We can

Re: Verifying number of workers in Spark Streaming

2015-06-21 Thread Silvio Fiorito
If you look at your streaming app UI you should see how many tasks are executed each batch and on how many executors. This is dependent on the batch duration and block interval, which defaults to 200ms. So every block interval a partition will be generated. You can control the parallelism by

Re: createDirectStream and Stats

2015-06-20 Thread Silvio Fiorito
Are you sure you were using all 100 executors even with the receiver model? Because in receiver mode, the number of partitions is dependent on the batch duration and block interval. It may not necessarily map directly to the number of executors in your app unless you've adjusted the block

Re:

2015-06-25 Thread Silvio Fiorito
Hi Deepak, Have you tried specifying the minimum partitions when you load the file? I haven’t tried that myself against HDFS before, so I’m not sure if it will affect data locality. Ideally not, it should still maintain data locality but just more partitions. Once your job runs, you can check

Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Silvio Fiorito
Hi Daniel, As Peter pointed out you need the hadoop-azure JAR as well as the Azure storage SDK for Java (com.microsoft.azure:azure-storage). Even though the WASB driver is built for 2.7, I was still able to use the hadoop-azure JAR with Spark built for older Hadoop versions, back to 2.4 I

Re:

2015-06-25 Thread Silvio Fiorito
/apache/spark/SparkContext.scala#L1053 And the HDFS FileInputFormat implementation, that seems like a good option to try. You should be able to call conf.setLong(FileInputFormat.SPLIT_MAXSIZE, max). I hope that helps! From: ÐΞ€ρ@Ҝ (๏̯͡๏) Date: Thursday, June 25, 2015 at 5:49 PM To: Silvio Fiorito Cc

Re: how to increase parallelism ?

2015-06-24 Thread Silvio Fiorito
To: Silvio Fiorito Cc: user Subject: Re: how to increase parallelism ? What that did was run a repartition with 174 tasks AND the actual .filter.map stage with 500 tasks. It actually doubled the stages. On Wed, Jun 24, 2015 at 12:01 PM, Silvio Fiorito silvio.fior

Re: Registering custom metrics

2015-06-22 Thread Silvio Fiorito
Hi Gerard, Yes, you have to implement your own custom Metrics Source using the Coda Hale library. See here for some examples: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala

Re: Registering custom metrics

2015-06-22 Thread Silvio Fiorito
Sorry, replied to Gerard’s question vs yours. See here: Yes, you have to implement your own custom Metrics Source using the Coda Hale library. See here for some examples: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala

Re:

2015-06-26 Thread Silvio Fiorito
]], classOf[AvroKey[GenericRecord]], classOf[NullWritable], hadoopConf).map(_._1.datum.get(name)) println(input.partitions.size) From: ÐΞ€ρ@Ҝ (๏̯͡๏) Date: Friday, June 26, 2015 at 11:04 AM To: Silvio Fiorito Cc: user Subject: Re: <dependency> <groupId>org.apache.avro</groupId> <artifactId>avro

Re:

2015-06-26 Thread Silvio Fiorito
)). reduceByKey(_ + _). coalesce(1). sortBy(_._2, false). take(10). foreach(println) From: ÐΞ€ρ@Ҝ (๏̯͡๏) Date: Friday, June 26, 2015 at 10:18 AM To: Silvio Fiorito Cc: user Subject: Re: All these throw compilation error at newAPIHadoopFile 1) val hadoopConfiguration = new Configuration

Re:

2015-06-26 Thread Silvio Fiorito
No worries, glad to help! It also helped me as I had not worked directly with the Hadoop APIs for controlling splits. From: ÐΞ€ρ@Ҝ (๏̯͡๏) Date: Friday, June 26, 2015 at 1:31 PM To: Silvio Fiorito Cc: user Subject: Re: Silvio, Thanks for your responses and patience. It worked after i reshuffled

Re: how to increase parallelism ?

2015-06-24 Thread Silvio Fiorito
Hi Deepak, Parallelism is controlled by the number of partitions. In this case, how many partitions are there for the details RDD (likely 170). You can check by running “details.partitions.length”. If you want to increase parallelism you can do so by repartitioning, increasing the number of

Re: Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Silvio Fiorito
Hi, just answered in your other thread as well... Depending on your requirements, you can look at the updateStateByKey API From: Nipun Arora Date: Wednesday, June 17, 2015 at 10:51 PM To: user@spark.apache.org Subject: Iterative Programming by keeping data across

Re:

2015-06-17 Thread Silvio Fiorito
Depending on your requirements, you can look at the updateStateByKey API From: Nipun Arora Date: Wednesday, June 17, 2015 at 10:48 PM To: user@spark.apache.org Subject: no subject Hi, Is there anyway in spark streaming to keep data across multiple micro-batches?

Re: Dataframe Partitioning

2015-05-28 Thread Silvio Fiorito
That’s due to the config setting spark.sql.shuffle.partitions, which defaults to 200. From: Masf Date: Thursday, May 28, 2015 at 10:02 AM To: user@spark.apache.org Subject: Dataframe Partitioning Hi. I have 2 dataframes with 1 and 12 partitions respectively. When I do
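For example, on a small dataset you might drop it well below the default before running the join (the sqlContext, DataFrames, and column name are hypothetical):

    // Spark 1.x: lower the number of post-shuffle partitions
    sqlContext.setConf("spark.sql.shuffle.partitions", "12")
    val joined = df1.join(df2, df1("id") === df2("id"))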

Re: custom RDD in java

2015-07-01 Thread Silvio Fiorito
If all you’re doing is just dumping tables from SQLServer to HDFS, have you looked at Sqoop? Otherwise, if you need to run this in Spark could you just use the existing JdbcRDD? From: Shushant Arora Date: Wednesday, July 1, 2015 at 10:19 AM To: user Subject: custom RDD in java Hi Is it

Re: custom RDD in java

2015-07-01 Thread Silvio Fiorito
Sure, you can create custom RDDs. Haven’t done so in Java, but in Scala absolutely. From: Shushant Arora Date: Wednesday, July 1, 2015 at 1:44 PM To: Silvio Fiorito Cc: user Subject: Re: custom RDD in java ok..will evaluate these options but is it possible to create RDD in java? On Wed, Jul 1

Re: Spark performance issue

2015-07-03 Thread Silvio Fiorito
It’ll help to see the code or at least understand what transformations you’re using. Also, you have 15 nodes but not using all of them, so that means you may be losing data locality. You can see this in the job UI for Spark if any jobs do not have node or process local. From: diplomatic Guru

Re: Spark Streaming broadcast to all keys

2015-07-03 Thread Silvio Fiorito
updateStateByKey will run for all keys, whether they have new data in a batch or not so you should be able to still use it. On 7/3/15, 7:34 AM, micvog mich...@micvog.com wrote: UpdateStateByKey is useful but what if I want to perform an operation to all existing keys (not only the ones in

RE: .NET on Apache Spark?

2015-07-02 Thread Silvio Fiorito
Since Spark runs on the JVM, no, there isn't support for .NET. You should take a look at Dryad and Naiad instead. https://github.com/MicrosoftResearch/ From: Zwits daniel.van...@ortec-finance.com Sent: 7/2/2015 4:33 AM To:

Re: .NET on Apache Spark?

2015-07-05 Thread Silvio Fiorito
Joe Duffy, director of engineering on Microsoft's compiler team, made a comment about investigating F# type providers for Spark. https://twitter.com/xjoeduffyx/status/614076012372955136 From: Ashic Mahtab as...@live.com Sent: Sunday, July 5, 2015 1:29 PM To: Ruslan

Re: map vs foreach for sending data to external system

2015-07-02 Thread Silvio Fiorito
foreach absolutely runs on the executors. For sending data to an external system you should likely use foreachPartition in order to batch the output. Also if you want to limit the parallelism of the output action then you can use coalesce. What makes you think foreach is running on the driver?
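A rough sketch of that pattern (the external client and batch size are hypothetical):

    // coalesce first if you need to cap how many concurrent connections hit the external system
    rdd.coalesce(8).foreachPartition { records =>
      val client = ExternalSystemClient.connect()                  // hypothetical client, one per partition
      records.grouped(500).foreach(batch => client.send(batch))    // batch the writes instead of one call per record
      client.close()
    }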

Re: Partitioning of Dataframes

2015-05-22 Thread Silvio Fiorito
This is added to 1.4.0 https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote: Hi, wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a Dataframe again? Maybe like this:

Re: Left outer joining big data set with small lookups

2015-08-17 Thread Silvio Fiorito
executors due to lookup skew. Any more idea to tackle this issue in Spark Dataframe? Thanks Vijay On Aug 14, 2015, at 10:27 AM, Silvio Fiorito silvio.fior...@granturing.com wrote: You could cache the lookup DataFrames, it’ll then do a broadcast join. On 8/14/15, 9:39 AM

Re: Shuffle files lifecycle

2015-06-29 Thread Silvio Fiorito
Regarding 1 and 2, yes shuffle output is stored on the worker local disks and will be reused across jobs as long as they’re available. You can identify when they’re used by seeing skipped stages in the job UI. They are periodically cleaned up based on available space of the configured

RE: How to increase parallelism of a Spark cluster?

2015-08-02 Thread Silvio Fiorito
Can you share the transformations up to the foreachPartition? From: Sujit Pal sujitatgt...@gmail.com Sent: 8/2/2015 4:42 PM To: Igor Berman igor.ber...@gmail.com Cc: user user@spark.apache.org Subject: Re: How to increase parallelism of a Spark

Re: Spark DataFrames uses too many partition

2015-08-11 Thread Silvio Fiorito
You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200. On 8/11/15, 11:31 AM, Al M alasdair.mcbr...@gmail.com wrote: I am using DataFrames with Spark 1.4.1. I really like DataFrames but the partitioning makes no sense to me. I am loading

Re: Left outer joining big data set with small lookups

2015-08-14 Thread Silvio Fiorito
You could cache the lookup DataFrames, it’ll then do a broadcast join. On 8/14/15, 9:39 AM, VIJAYAKUMAR JAWAHARLAL sparkh...@data2o.io wrote: Hi I am facing huge performance problem when I am trying to left outer join very big data set (~140GB) with bunch of small lookups [Start schema
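A sketch of that idea (column names are hypothetical); on Spark 1.5+ there is also an explicit broadcast() hint in org.apache.spark.sql.functions:

    // caching the small lookup gives the planner its size, so joins under
    // spark.sql.autoBroadcastJoinThreshold (10MB by default) become broadcast joins
    lookupDF.cache().count()   // materialize the cache
    val joined = bigDF.join(lookupDF, bigDF("cust_id") === lookupDF("cust_id"), "left_outer")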

Re: Sessionization using updateStateByKey

2015-07-15 Thread Silvio Fiorito
Hi Cody, I’ve had success using updateStateByKey for real-time sessionization by aging off timed-out sessions (returning None in the update function). This was on a large commercial website with millions of hits per day. This was over a year ago so I don’t have access to the stats any longer

Re: Local Repartition

2015-07-20 Thread Silvio Fiorito
Hi Daniel, Coalesce, by default will not cause a shuffle. The second parameter when set to true will cause a full shuffle. This is actually what repartition does (calls coalesce with shuffle=true). It will attempt to keep colocated partitions together (as you describe) on the same executor.
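A quick illustration of the difference (the RDD and partition counts are arbitrary):

    val fewer = rdd.coalesce(10)                   // narrow dependency: merges colocated partitions, no shuffle
    val more  = rdd.coalesce(200, shuffle = true)  // full shuffle; same as rdd.repartition(200)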

RE: Dynamic Resource Allocation with Spark Streaming (Standalone Cluster, Spark 1.5.1)

2015-10-26 Thread Silvio Fiorito
Hi Matthias, Unless there was a change in 1.5, I'm afraid dynamic resource allocation is not yet supported in streaming apps. Thanks, Silvio Sent from my Lumia 930 From: Matthias Niehoff Sent: 10/26/2015 4:00 PM To:

RE: Concurrent execution of actions within a driver

2015-10-26 Thread Silvio Fiorito
There is a collectAsync action if you want to run them in parallel, but keep in mind the two jobs will need to share resources and you should use the FAIR scheduler. From: praveen S Sent: 10/26/2015 4:27 AM To:

RE: Maintaining overall cumulative data in Spark Streaming

2015-10-29 Thread Silvio Fiorito
You could use updateStateByKey. There's a stateful word count example on Github. https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/StatefulNetworkWordCount.scala From: Sandeep

RE: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Silvio Fiorito
It's something due to the columnar compression. I've seen similar intermittent issues when caching DataFrames. "sportingpulse.com" is a value in one of the columns of the DF. From: Ted Yu Sent: 10/30/2015 6:33 PM To: Zhang,

Re: key not found: sportingpulse.com in Spark SQL 1.5.0

2015-10-30 Thread Silvio Fiorito
I don’t believe I have it on 1.5.1. Are you able to test the data locally to confirm, or is it too large? From: "Zhang, Jingyu" <jingyu.zh...@news.com.au> Date: Friday, October 30, 2015 at 7:31 PM To: Silvio Fiorito <silv

Re: Maintaining overall cumulative data in Spark Streaming

2015-10-30 Thread Silvio Fiorito
In the update function you can return None for a key and it will remove it. If you’re restarting your app you can delete your checkpoint directory to start from scratch, rather than continuing from the previous state. From: Sandeep Giri

Re: how to merge two dataframes

2015-10-30 Thread Silvio Fiorito
Are you able to upgrade to Spark 1.5.1 and the latest Cassandra connector? It no longer requires a separate CassandraSQLContext. From: Yana Kadiyska Reply-To: "yana.kadiy...@gmail.com"

Re: Help! Stuck using withColumn

2015-08-26 Thread Silvio Fiorito
Hi Saif, In both cases you’re referencing columns that don’t exist in the current DataFrame. In the first email you did a select and then a withColumn for ‘month_date_cur’ on the resulting DF, but that column does not exist, because you did a select for only ‘month_balance’. In the second email

Re: Unbale to run Group BY on Large File

2015-09-02 Thread Silvio Fiorito
Unfortunately, groupBy is not the most efficient operation. What is it you’re trying to do? It may be possible with one of the other *byKey transformations. From: "SAHA, DEBOBROTA" Date: Wednesday, September 2, 2015 at 7:46 PM To: "'user@spark.apache.org'" Subject:

Re: Can Spark Provide Multiple Context Support?

2015-09-08 Thread Silvio Fiorito
Is the data from HDFS static or is it unique for each event in the stream? If it’s static, you can just create the SparkContext, load the files from HDFS, then start a StreamingContext with the existing SparkContext and go from there. From: Rachana Srivastava Date: Tuesday, September 8, 2015 at

Re: Where can I learn how to write udf?

2015-09-14 Thread Silvio Fiorito
Hi Saif, There are 2 types of UDFs: those used by SQL and those used by the Scala DSL. For SQL, you just register a function like so (this example is from the docs): sqlContext.udf.register("strLen", (s: String) => s.length) sqlContext.sql("select name, strLen(name) from people").show The
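For the DataFrame DSL side, a sketch using org.apache.spark.sql.functions.udf (the DataFrame and column names are hypothetical):

    import org.apache.spark.sql.functions.udf

    val strLen = udf((s: String) => s.length)
    df.select(df("name"), strLen(df("name")).as("name_len")).show()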

Re: Realtime Data Visualization Tool for Spark

2015-09-11 Thread Silvio Fiorito
So if you want to build your own from the ground up, then yes you could go the d3js route. Like Feynman also responded you could use something like Spark Notebook or Zeppelin to create some charts as well. It really depends on your intended audience and ultimate goal. If you just want some

Re: DataFrame repartition not repartitioning

2015-09-16 Thread Silvio Fiorito
You just need to assign it to a new variable: val avroFile = sqlContext.read.format("com.databricks.spark.avro").load(inFile) val repart = avroFile.repartition(10) repart.save(outFile, "parquet") From: Steve Annessa Date: Wednesday, September 16, 2015 at 2:08 PM To:

Re: Java REST custom receiver

2015-10-01 Thread Silvio Fiorito
When you say “receive messages” you mean acting as a REST endpoint, right? If so, it might be better to use JMS (or Kafka) option for a few reasons: The receiver will be deployed to any of the available executors, so your REST clients will need to be made aware of the IP where the receiver is

Re: Receiver and Parallelization

2015-09-25 Thread Silvio Fiorito
One thing you should look at is your batch duration and spark.streaming.blockInterval Those 2 things control how many partitions are generated for each RDD (batch) of the DStream when using a receiver (vs direct approach). So if you have a 2 second batch duration and the default blockInterval
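As a worked example (values are arbitrary), a 2-second batch with a 50 ms block interval yields 2000 / 50 = 40 partitions, and therefore up to 40 tasks, per batch:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("receiver-parallelism")
      .set("spark.streaming.blockInterval", "50ms")   // default is 200ms
    val ssc = new StreamingContext(conf, Seconds(2))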

Re: Testing with spark testing base

2015-12-05 Thread Silvio Fiorito
Yes, with IntelliJ you can set up a scalatest run configuration. You can also run directly from the sbt CLI by running “sbt test” From: Masf Date: Saturday, December 5, 2015 at 12:51 PM To: "user@spark.apache.org"

Re: Spark data frame

2015-12-22 Thread Silvio Fiorito
Michael, collect will bring down the results to the driver JVM, whereas the RDD or DataFrame would be cached on the executors (if it is cached). So, as Dean said, the driver JVM needs to have enough memory to store the results of collect. Thanks, Silvio From: Michael Segel

Re: why one of Stage is into Skipped section instead of Completed

2015-12-26 Thread Silvio Fiorito
Skipped stages result from existing shuffle output of a stage when re-running a transformation. The executors will have the output of the stage in their local dirs and Spark recognizes that, so rather than re-computing, it will start from the following stage. So, this is a good thing in that

Re: Strategies for propery load-balanced partitioning

2016-06-03 Thread Silvio Fiorito
Hi Saif! When you say this happens with spark-csv, are the files gzipped by any chance? GZip is non-splittable so if you’re seeing skew simply from loading data it could be you have some extremely large gzip files. So for a single stage job you will have those tasks lagging compared to the

Re: HiveContext standalone => without a Hive metastore

2016-05-26 Thread Silvio Fiorito
Hi Gerard, I’ve never had an issue using the HiveContext without a hive-site.xml configured. However, one issue you may have is if multiple users are starting the HiveContext from the same path, they’ll all be trying to store the default Derby metastore in the same location. Also, if you want

Re: Spark Streaming: BatchDuration and Processing time

2016-01-17 Thread Silvio Fiorito
It will just queue up the subsequent batches, however if this delay is constant you may start losing batches. It can handle spikes in processing time, but if you know you're consistently running over your batch duration you either need to increase the duration or look at enabling back pressure
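The back pressure switch referenced here (available since Spark 1.5):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.backpressure.enabled", "true")   // let the ingestion rate adapt to processing delay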

Re: Unit test with sqlContext

2016-02-04 Thread Silvio Fiorito
Hi Steve, Have you looked at the spark-testing-base package by Holden? It’s really useful for unit testing Spark apps as it handles all the bootstrapping for you. https://github.com/holdenk/spark-testing-base DataFrame examples are here:

Re: Spark with .NET

2016-02-09 Thread Silvio Fiorito
That’s just a .NET assembly (not related to Spark DataSets) but doesn’t look like they’re actually using it. It’s typically a default reference pulled in by the project templates. The code though is available from Mono here:

Re: map operation clears custom partitioner

2016-02-22 Thread Silvio Fiorito
You can use mapValues to ensure partitioning is not lost. From: Brian London Date: Monday, February 22, 2016 at 1:21 PM To: user Subject: map operation clears custom partitioner It
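A small sketch of why that matters (the pair RDD and its types are hypothetical):

    import org.apache.spark.HashPartitioner

    val partitioned  = pairs.partitionBy(new HashPartitioner(16))
    val viaMap       = partitioned.map { case (k, v) => (k, v + 1) }  // partitioner is discarded
    val viaMapValues = partitioned.mapValues(_ + 1)                   // partitioner is preserved

    // viaMapValues.partitioner == partitioned.partitioner, while viaMap.partitioner == None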

Re: visualize data from spark streaming

2016-01-20 Thread Silvio Fiorito
You’ve got a few options: * Use a notebook tool such as Zeppelin, Jupyter, or Spark Notebook to write up some visualizations which update in time with your streaming batches * Use Spark Streaming to push your batch results to another 3rd-party system with a BI tool that supports

Re: General Question (Spark Hive integration )

2016-01-21 Thread Silvio Fiorito
Also, just to clarify, it doesn’t read the whole table into memory unless you specifically cache it. From: Silvio Fiorito <silvio.fior...@granturing.com> Date: Thursday, January 21, 2016 at 10:02 PM To: "Balaraju.Kagidala Kagidala&

Re: General Question (Spark Hive integration )

2016-01-21 Thread Silvio Fiorito
Hi Bala, It depends on how your Hive table is configured. If you used partitioning and you are filtering on a partition column then it will only load the relevant partitions. If, however, you’re filtering on a non-partitioned column then it will have to read all the data and then filter as
