Re: Using Spark on Azure Blob Storage

2015-06-25 Thread Silvio Fiorito
Hi Daniel, As Peter pointed out you need the hadoop-azure JAR as well as the Azure storage SDK for Java (com.microsoft.azure:azure-storage). Even though the WASB driver is built for 2.7, I was still able to use the hadoop-azure JAR with Spark built for older Hadoop versions, back to 2.4 I
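
For reference, a minimal sketch of reading a WASB path once hadoop-azure and the Azure storage SDK are on the classpath (the storage account, container, and key below are placeholders):

    // Supply the storage key to the WASB filesystem driver (hypothetical account name).
    sc.hadoopConfiguration.set(
      "fs.azure.account.key.myaccount.blob.core.windows.net",
      "<storage-account-key>")

    val logs = sc.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/logs/")
    println(logs.count())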

Re:

2015-06-25 Thread Silvio Fiorito
/apache/spark/SparkContext.scala#L1053 And the HDFS FileInputFormat implementation, that seems like a good option to try. You should be able to call conf.setLong(FileInputFormat.SPLIT_MAXSIZE, max). I hope that helps! From: ÐΞ€ρ@Ҝ (๏̯͡๏) Date: Thursday, June 25, 2015 at 5:49 PM To: Silvio Fiorito Cc
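
A sketch of what that setting looks like from Spark, assuming the input path is a placeholder and that SPLIT_MAXSIZE is honoured by the new-API FileInputFormat:

    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

    // Cap input splits at 64 MB so the file is read as more, smaller partitions.
    sc.hadoopConfiguration.setLong(FileInputFormat.SPLIT_MAXSIZE, 64L * 1024 * 1024)

    val lines = sc
      .newAPIHadoopFile("/path/to/input", classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map(_._2.toString)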

Re: how to increase parallelism ?

2015-06-24 Thread Silvio Fiorito
To: Silvio Fiorito Cc: user Subject: Re: how to increase parallelism ? What that did was run a repartition with 174 tasks AND the actual .filter.map stage with 500 tasks. It actually doubled the stages. On Wed, Jun 24, 2015 at 12:01 PM, Silvio Fiorito silvio.fior

Re: how to increase parallelism ?

2015-06-24 Thread Silvio Fiorito
Hi Deepak, Parallelism is controlled by the number of partitions. In this case, how many partitions are there for the details RDD (likely 170)? You can check by running “details.partitions.length”. If you want to increase parallelism you can do so by repartitioning, increasing the number of
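
A quick sketch of the two steps mentioned above, using the names from the thread:

    details.partitions.length      // number of partitions, i.e. the max concurrent tasks

    // Repartition before the expensive stage to raise parallelism.
    val details500 = details.repartition(500)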

Re: workaround for groupByKey

2015-06-23 Thread Silvio Fiorito
a mapPartitions or one of the other combineByKey APIs? From: Jianguo Li Date: Tuesday, June 23, 2015 at 9:46 AM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: workaround for groupByKey Thanks. Yes, unfortunately, they all need to be grouped. I guess I can
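
As an illustration of the combineByKey route, a sketch that gathers distinct values per key with map-side combining (pairs is a hypothetical RDD[(String, String)]):

    val distinctByKey = pairs.combineByKey[Set[String]](
      (v: String) => Set(v),                       // createCombiner: start a set per key
      (acc: Set[String], v: String) => acc + v,    // mergeValue: add within a partition
      (a: Set[String], b: Set[String]) => a ++ b)  // mergeCombiners: merge across partitions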

Re: Code review - Spark SQL command-line client for Cassandra

2015-06-22 Thread Silvio Fiorito
Yes, just put the Cassandra connector on the Spark classpath and set the connector config properties in the interpreter settings. From: Mohammed Guller Date: Monday, June 22, 2015 at 11:56 AM To: Matthew Johnson, shahid ashraf Cc: user@spark.apache.org Subject: RE:
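
A minimal sketch of the connector setup (host, keyspace, and table are placeholders; in a notebook these properties go into the interpreter settings rather than code, and the connector version must match the Spark version):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf().set("spark.cassandra.connection.host", "10.0.0.1")
    val sc = new SparkContext(conf)

    // With the connector on the classpath, Cassandra tables can be read directly.
    val rows = sc.cassandraTable("my_keyspace", "my_table")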

Re: workaround for groupByKey

2015-06-22 Thread Silvio Fiorito
perhaps? From: Jianguo Li Date: Monday, June 22, 2015 at 6:21 PM To: Silvio Fiorito Cc: user@spark.apache.orgmailto:user@spark.apache.org Subject: Re: workaround for groupByKey Thanks for your suggestion. I guess aggregateByKey is similar to combineByKey. I read in the Learning Sparking We can

Re: Registering custom metrics

2015-06-22 Thread Silvio Fiorito
Hi Gerard, Yes, you have to implement your own custom Metrics Source using the Coda Hale library. See here for some examples: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala
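
A rough sketch of such a source with the Coda Hale (Dropwizard) metrics API; the counter and source name are hypothetical, and in some Spark versions the Source trait is package-private, so the class may need to live under an org.apache.spark package:

    import java.util.concurrent.atomic.AtomicLong
    import com.codahale.metrics.{Gauge, MetricRegistry}
    import org.apache.spark.metrics.source.Source

    object MyAppCounters { val recordsProcessed = new AtomicLong(0L) }

    class MyAppSource extends Source {
      override val sourceName = "myApp"
      override val metricRegistry = new MetricRegistry()

      metricRegistry.register(MetricRegistry.name("recordsProcessed"),
        new Gauge[Long] { override def getValue: Long = MyAppCounters.recordsProcessed.get() })
    }

The source then still has to be registered with Spark's metrics system and a sink configured in metrics.properties.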

Re: Registering custom metrics

2015-06-22 Thread Silvio Fiorito
Sorry, replied to Gerard’s question vs yours. See here: Yes, you have to implement your own custom Metrics Source using the Coda Hale library. See here for some examples: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/metrics/source/JvmSource.scala

Re: Verifying number of workers in Spark Streaming

2015-06-21 Thread Silvio Fiorito
If you look at your streaming app UI you should see how many tasks are executed each batch and on how many executors. This is dependent on the batch duration and block interval, which defaults to 200ms. So every block interval a partition will be generated. You can control the parallelism by
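
A sketch of the two knobs, with placeholder values (inputStream stands for whatever receiver stream the app creates):

    // Smaller block interval => more blocks per batch => more partitions/tasks.
    val conf = new org.apache.spark.SparkConf()
      .set("spark.streaming.blockInterval", "100")   // milliseconds in Spark 1.x

    // Or explicitly repartition the received data before the heavy stages:
    val widened = inputStream.repartition(16)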

Re: createDirectStream and Stats

2015-06-20 Thread Silvio Fiorito
Are you sure you were using all 100 executors even with the receiver model? Because in receiver mode, the number of partitions is dependent on the batch duration and block interval. It may not necessarily map directly to the number of executors in your app unless you've adjusted the block

Re: Iterative Programming by keeping data across micro-batches in spark-streaming?

2015-06-17 Thread Silvio Fiorito
Hi, just answered in your other thread as well... Depending on your requirements, you can look at the updateStateByKey API From: Nipun Arora Date: Wednesday, June 17, 2015 at 10:51 PM To: user@spark.apache.org Subject: Iterative Programming by keeping data across

Re:

2015-06-17 Thread Silvio Fiorito
Depending on your requirements, you can look at the updateStateByKey API From: Nipun Arora Date: Wednesday, June 17, 2015 at 10:48 PM To: user@spark.apache.org Subject: no subject Hi, Is there any way in spark streaming to keep data across multiple micro-batches?

Re: Dataframe Partitioning

2015-05-28 Thread Silvio Fiorito
That’s due to the config setting spark.sql.shuffle.partitions, which defaults to 200. From: Masf Date: Thursday, May 28, 2015 at 10:02 AM To: user@spark.apache.org Subject: Dataframe Partitioning Hi. I have 2 dataframes with 1 and 12 partitions respectively. When I do
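
For example, on a SQLContext or HiveContext (the value 12 is just illustrative):

    // Default is 200 shuffle partitions; lower it for small joins/aggregations.
    sqlContext.setConf("spark.sql.shuffle.partitions", "12")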

Re: Partitioning of Dataframes

2015-05-22 Thread Silvio Fiorito
This is added to 1.4.0 https://github.com/apache/spark/pull/5762 On 5/22/15, 8:48 AM, Karlson ksonsp...@siberie.de wrote: Hi, wouldn't df.rdd.partitionBy() return a new RDD that I would then need to make into a Dataframe again? Maybe like this:

Re: It takes too long (30 seconds) to create Spark Context with SPARK/YARN

2015-05-11 Thread Silvio Fiorito
Could you upload the spark assembly to HDFS and then set spark.yarn.jar to the path where you uploaded it? That can help minimize start-up time. How long if you start just a spark shell? On 5/11/15, 11:15 AM, stanley wangshua...@yahoo.com wrote: I am running Spark jobs on YARN cluster. It
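
For instance, in spark-defaults.conf (the HDFS path and assembly name are placeholders):

    spark.yarn.jar  hdfs:///user/spark/share/lib/spark-assembly.jar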

Re: Does long-lived SparkContext hold on to executor resources?

2015-05-11 Thread Silvio Fiorito
You want to look at dynamic resource allocation, here: http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation On 5/11/15, 11:23 AM, stanley wangshua...@yahoo.com wrote: I am building an analytics app with Spark. I plan to use long-lived SparkContexts to minimize
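
A sketch of the relevant settings (values are illustrative; the external shuffle service also has to be enabled on the workers):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")
      .set("spark.dynamicAllocation.maxExecutors", "20")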

Re: Submit Spark application in cluster mode and supervised

2015-05-08 Thread Silvio Fiorito
If you’re using multiple masters with ZooKeeper then you should set your master URL to be spark://host01:7077,host02:7077 And the property spark.deploy.recoveryMode=ZOOKEEPER See here for more info: http://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
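
The standby masters themselves need the recovery properties, typically passed through SPARK_DAEMON_JAVA_OPTS in spark-env.sh (ZooKeeper hosts below are placeholders):

    spark.deploy.recoveryMode    ZOOKEEPER
    spark.deploy.zookeeper.url   zk1:2181,zk2:2181,zk3:2181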

Re: Stop Cluster Mode Running App

2015-05-07 Thread Silvio Fiorito
Hi James, If you’re on Spark 1.3 you can use the kill command in spark-submit to shut it down. You’ll need the driver id from the Spark UI or from when you submitted the app. spark-submit --master spark://master:7077 --kill driver-id Thanks, Silvio From: James King Date: Wednesday, May 6,

RE: How to group multiple row data ?

2015-04-29 Thread Silvio Fiorito
I think you'd probably want to look at combineByKey. I'm on my phone so can't give you an example, but that's one solution I would try. You would then take the resulting RDD and go back to a DF if needed. From: bipin bipin@gmail.com Sent: 4/29/2015

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
on the streaming side it really helps simplify running tests on batch outputs. If you’re having serialization issues you may need to look at using transient lazy initializers to see if that helps. From: Michal Michalski Date: Tuesday, April 28, 2015 at 11:42 AM To: Silvio Fiorito Cc: user Subject: Re: Best

Re: Spark partitioning question

2015-04-28 Thread Silvio Fiorito
. From: Marius Danciu Date: Tuesday, April 28, 2015 at 9:53 AM To: Silvio Fiorito, user Subject: Re: Spark partitioning question Thank you Silvio, I am aware of groubBy limitations and this is subject for replacement. I did try repartitionAndSortWithinPartitions but then I end up with maybe too

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
If you need to keep the keys, you can use aggregateByKey to calculate an avg of the values: val step1 = data.aggregateByKey((0.0, 0))((a, b) => (a._1 + b, a._2 + 1), (a, b) => (a._1 + b._1, a._2 + b._2)) val avgByKey = step1.mapValues(i => i._1 / i._2) Essentially, what this is doing is passing an
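
The same recipe with sample data, for reference (output order may vary):

    val data = sc.parallelize(Seq(("a", 1.0), ("a", 3.0), ("b", 10.0)))

    val step1 = data.aggregateByKey((0.0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),        // accumulate sum and count within a partition
      (a, b)   => (a._1 + b._1, a._2 + b._2))      // merge partial sums/counts across partitions

    val avgByKey = step1.mapValues { case (sum, count) => sum / count }
    avgByKey.collect()    // e.g. Array((a,2.0), (b,10.0))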

Re: Best practices on testing Spark jobs

2015-04-28 Thread Silvio Fiorito
Hi Michal, Please try spark-testing-base by Holden. I’ve used it and it works well for unit testing batch and streaming jobs https://github.com/holdenk/spark-testing-base Thanks, Silvio From: Michal Michalski Date: Tuesday, April 28, 2015 at 11:32 AM To: user Subject: Best practices on
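
A tiny example of the shape such a test can take (class name and assertion are hypothetical; SharedSparkContext supplies the sc used below):

    import com.holdenkarau.spark.testing.SharedSparkContext
    import org.scalatest.FunSuite

    class WordCountSpec extends FunSuite with SharedSparkContext {
      test("counts words") {
        val counts = sc.parallelize(Seq("a b", "a"))
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .collectAsMap()
        assert(counts("a") === 2)
      }
    }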

Re: Calculating the averages for each KEY in a Pairwise (K,V) RDD ...

2015-04-28 Thread Silvio Fiorito
to the first statement in Silvio's answer (below). But my reasoning to deconstruct and reconstruct is missing something. Thanks again! On 04/28/2015 11:26 AM, Silvio Fiorito wrote: If you need to keep the keys, you can use aggregateByKey to calculate an avg of the values: val step1

Re: Is it possible to use windows service to start and stop spark standalone cluster

2015-03-10 Thread Silvio Fiorito
Have you tried Apache Daemon? http://commons.apache.org/proper/commons-daemon/procrun.html From: Wang, Ningjun (LNG-NPV) Date: Tuesday, March 10, 2015 at 11:47 PM To: user@spark.apache.org Subject: Is it possible to use windows service to start and stop spark

Re: SparkSQL + Tableau Connector

2015-02-19 Thread Silvio Fiorito
Great, glad it worked out! From: Todd Nist Date: Thursday, February 19, 2015 at 9:19 AM To: Silvio Fiorito Cc: user@spark.apache.org Subject: Re: SparkSQL + Tableau Connector Hi Silvio, I got this working today using your suggestion with the Initial SQL and a Custom

Re: Spark Web UI Doesn't Open in Yarn-Client Mode

2015-02-14 Thread Silvio Fiorito
on yarn you need to first go to the resource manager UI, find your job, and click the link for the UI there. From: Puneet Kumar Ojha puneet.ku...@pubmatic.com Sent: Saturday, February 14, 2015 5:25 AM To: user@spark.apache.org Hi, I am running 3

Re: Can spark job server be used to visualize streaming data?

2015-02-12 Thread Silvio Fiorito
One method I’ve used is to publish each batch to a message bus or queue with a custom UI listening on the other end, displaying the results in d3.js or some other app. As far as I’m aware there isn’t a tool that will directly take a DStream. Spark Notebook seems to have some support for
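
A very rough sketch of that pattern (results is a hypothetical DStream, and publish() stands in for whatever queue or message-bus client is used):

    results.foreachRDD { rdd =>
      val top = rdd.take(20)                    // keep the payload small for the UI
      top.foreach(record => publish(record))    // publish() is your own client, not shown here
    }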

Re: SparkSQL + Tableau Connector

2015-02-11 Thread Silvio Fiorito
Hey Todd, I don’t have an app to test against the thrift server; are you able to define custom SQL without using Tableau’s schema query? I guess it’s not possible to just use SparkSQL temp tables; you may have to use permanent Hive tables that are actually in the metastore so Tableau can

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
Hi Todd, What you could do is run some SparkSQL commands immediately after the Thrift server starts up. Or does Tableau have some init SQL commands you could run? You can actually load data using SQL, such as: create temporary table people using org.apache.spark.sql.json options (path
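
One way to issue such a statement programmatically, right after the Thrift server or HiveContext starts (the JSON path is a placeholder, and the cache step is optional):

    hiveContext.sql(
      """CREATE TEMPORARY TABLE people
        |USING org.apache.spark.sql.json
        |OPTIONS (path '/data/people.json')""".stripMargin)

    hiveContext.sql("CACHE TABLE people")   // keeps it in memory for repeated BI queries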

Re: SparkSQL + Tableau Connector

2015-02-10 Thread Silvio Fiorito
/*'); Time taken: 0.34 seconds spark-sql> select * from people; NULL Michael 30 Andy 19 Justin NULL Michael 30 Andy 19 Justin Time taken: 0.576 seconds From: Todd Nist Date: Tuesday, February 10, 2015 at 6:49 PM To: Silvio Fiorito Cc: user@spark.apache.org Subject

Re: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Silvio Fiorito
Nick, Have you tried https://github.com/kaitoy/pcap4j I’ve used this in a Spark app already and didn’t have any issues. My use case was slightly different than yours, but you should give it a try. From: Nick Allen n...@nickallen.org Date: Friday, January 16, 2015 at

Re: Multiple Spark Streaming receiver model

2015-01-06 Thread Silvio Fiorito
Hi Manjul, Each StreamingContext will have its own batch size. If that doesn’t work for the different sources you have then you would have to create different streaming apps. You can only create a new StreamingContext in the same Spark app, once you’ve stopped the previous one. Spark

RE: Publishing streaming results to web interface

2015-01-03 Thread Silvio Fiorito
Is this through a streaming app? I've done this before by publishing results out to a queue or message bus, with a web app listening on the other end. If it's just batch or infrequent you could save the results out to a file. From:

Re: Spark Streaming Threading Model

2014-12-19 Thread Silvio Fiorito
Batches will wait for the previous batch to finish. The monitoring console will show you the backlog of waiting batches. From: Asim Jalis asimja...@gmail.com Date: Friday, December 19, 2014 at 1:16 PM To: user user@spark.apache.org Subject:

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
Hi Pierce, You shouldn’t have to use groupByKey because updateStateByKey will get a Seq of all the values for that key already. I used that for realtime sessionization as well. What I did was key my incoming events, then send them to udpateStateByKey. The updateStateByKey function then
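
A minimal sketch of such an update function, using a running count per key rather than full session objects (keyedEvents is a hypothetical DStream[(String, Int)]):

    // The update function receives all new values for a key in the batch plus the
    // previous state; returning None removes the key from the state.
    def updateCount(newValues: Seq[Int], state: Option[Int]): Option[Int] = {
      if (newValues.isEmpty && state.isDefined) None          // drop keys that went quiet
      else Some(state.getOrElse(0) + newValues.sum)
    }

    ssc.checkpoint("/tmp/streaming-checkpoint")               // stateful ops require checkpointing
    val counts = keyedEvents.updateStateByKey(updateCount _)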

Re: Spark 1.2 Release Date

2014-12-18 Thread Silvio Fiorito
It’s on Maven Central already http://search.maven.org/#browse%7C717101892 On 12/18/14, 2:09 PM, Al M alasdair.mcbr...@gmail.com wrote: Is there a planned release date for Spark 1.2? I saw on the Spark Wiki https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage that we are

Re: Help with updateStateByKey

2014-12-18 Thread Silvio Fiorito
the expected output, thank you! On Thu, Dec 18, 2014 at 12:11 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Ok, I have a better idea of what you’re trying to do now. I think the prob might be the map. The first time the function runs, currentValue will be None. Using map on None

Re: clean up of state in State Dstream

2014-12-12 Thread Silvio Fiorito
If you no longer need to maintain state for a key, just return None for that value and it gets removed. From: Sunil Yarram yvsu...@gmail.com Date: Friday, December 12, 2014 at 9:44 PM To: user@spark.apache.org

Re: Store DStreams into Hive using Hive Streaming

2014-11-06 Thread Silvio Fiorito
Geovani, You can use HiveContext to do inserts into a Hive table in a Streaming app just as you would a batch app. A DStream is really a collection of RDDs so you can run the insert from within the foreachRDD. You just have to be careful that you’re not creating large amounts of small files.
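
Roughly what that can look like in Spark 1.3+ syntax (stream is the input DStream of lines; the Event schema and table names are made up, and older releases would use createSchemaRDD instead of toDF):

    case class Event(id: String, ts: Long)

    stream.foreachRDD { rdd =>
      import hiveContext.implicits._
      rdd.map(line => Event(line, System.currentTimeMillis()))
        .toDF()
        .registerTempTable("batch_events")

      // Plain HiveQL insert; watch out for generating lots of small files per batch.
      hiveContext.sql("INSERT INTO TABLE events SELECT id, ts FROM batch_events")
    }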

Re: save schemardd to hive

2014-09-02 Thread Silvio Fiorito
You can use saveAsTable or do an INSERT SparkSQL statement as well in case you need other Hive query features, like partitioning. On 9/2/14, 6:54 AM, centerqi hu cente...@gmail.com wrote: I got it import org.apache.spark.{SparkConf, SparkContext} import org.apache.spark.sql._ val hiveContext =

Re: Does HiveContext support Parquet?

2014-08-27 Thread Silvio Fiorito
What Spark and Hadoop versions are you on? I have it working in my Spark app with the parquet-hive-bundle-1.5.0.jar bundled into my app fat-jar. I'm running Spark 1.0.2 and CDH5. bin/spark-shell --master local[*] --driver-class-path ~/parquet-hive-bundle-1.5.0.jar To see if that works? On

Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
This is a long running Spark Streaming job running in YARN, Spark v1.0.2 on CDH5. The jobs will run for about 34-37 hours then die due to this FileNotFoundException. There’s very little CPU or RAM usage, I’m running 2 x cores, 2 x executors, 4g memory, YARN cluster mode. Here’s the stack

Re: Stage failure in BlockManager due to FileNotFoundException on long-running streaming job

2014-08-20 Thread Silvio Fiorito
Thanks, I’ll go ahead and disable that setting for now. From: Aaron Davidson ilike...@gmail.com Date: Wednesday, August 20, 2014 at 3:20 PM To: Silvio Fiorito silvio.fior...@granturing.com Cc: user@spark.apache.org

Re: How to use Spark Streaming from an HTTP api?

2014-08-18 Thread Silvio Fiorito
You need to create a custom receiver that submits the HTTP requests then deserializes the data and pushes it into the Streaming context. See here for an example: http://spark.apache.org/docs/latest/streaming-custom-receivers.html On 8/18/14, 6:20 PM, bumble123 tc1...@att.com wrote: I want to
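
For reference, a bare-bones polling receiver in that spirit (URL and interval are placeholders, and real code would need error handling):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.io.Source

    class HttpPollingReceiver(url: String, intervalMs: Long)
        extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

      override def onStart(): Unit = {
        new Thread("http-polling-receiver") {
          override def run(): Unit = {
            while (!isStopped()) {
              store(Source.fromURL(url).mkString)   // push each response into the stream
              Thread.sleep(intervalMs)
            }
          }
        }.start()
      }

      override def onStop(): Unit = ()   // the loop exits once isStopped() returns true
    }

    val responses = ssc.receiverStream(new HttpPollingReceiver("http://example.com/api", 5000))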

Re: Does HiveContext support Parquet?

2014-08-18 Thread Silvio Fiorito
First the JAR needs to be deployed using the --jars argument. Then in your HQL code you need to use the DeprecatedParquetInputFormat and DeprecatedParquetOutputFormat as described here https://cwiki.apache.org/confluence/display/Hive/Parquet#Parquet-Hive0.10-0.12 This is because SparkSQL is based
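
Put together, a table definition in that style would look roughly like this through HiveContext (columns are illustrative; the class names follow the Hive wiki page linked above):

    hiveContext.sql("""
      CREATE TABLE logs_parquet (id STRING, ts BIGINT)
      ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
      STORED AS
        INPUTFORMAT  'parquet.hive.DeprecatedParquetInputFormat'
        OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
    """)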

RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
There's really nothing special besides including that jar on your classpath. You just do selects, inserts, etc as you normally would. The same instructions here apply https://cwiki.apache.org/confluence/display/Hive/Parquet From:

RE: Does HiveContext support Parquet?

2014-08-16 Thread Silvio Fiorito
If you're using HiveContext then all metadata is in the Hive metastore as defined in hive-site.xml. Concurrent writes should be fine as long as you're using a concurrent metastore db. From: Flavio Pompermaier pomperma...@okkam.it Sent: 8/16/2014 1:26 PM

Re: Does HiveContext support Parquet?

2014-08-15 Thread Silvio Fiorito
Yes, you can write to Parquet tables. On Spark 1.0.2 all I had to do was include the parquet-hive-bundle-1.5.0.jar on my classpath. From: lyc yanchen@huawei.com Sent: Friday, August 15, 2014 7:30 PM To: u...@spark.incubator.apache.org

Re: spark streaming : what is the best way to make a driver highly available

2014-08-14 Thread Silvio Fiorito
You also need to ensure you're using checkpointing and support recreating the context on driver failure as described in the docs here: http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-the-driver-node From: Matt Narrell
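
The standard recreate-or-recover shape looks roughly like this (sparkConf, the checkpoint path, and the batch interval are placeholders):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(sparkConf, Seconds(10))
      ssc.checkpoint("hdfs:///checkpoints/my-app")
      // ... build the DStream graph here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate("hdfs:///checkpoints/my-app", createContext _)
    ssc.start()
    ssc.awaitTermination()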

SparkSQL Hive partitioning support

2014-08-13 Thread Silvio Fiorito
Using the SchemaRDD insertInto method, is there any way to support partitions on a field in the RDD? If not, what's the alternative, register a table and do an insert into via SQL statement? Any plans to support partitioning via insertInto? What other options are there for inserting into a
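
A sketch of the SQL-statement route being asked about, with made-up table and column names (on the oldest 1.x releases registerTempTable was registerAsTable):

    events.registerTempTable("staged_events")

    hiveContext.sql(
      "INSERT OVERWRITE TABLE events_partitioned PARTITION (dt = '2014-08-13') " +
      "SELECT id, payload FROM staged_events")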

Re: SparkSQL Hive partitioning support

2014-08-13 Thread Silvio Fiorito
Yin, Michael, Thanks, I'll try the SQL route. From: Yin Huai huaiyin@gmail.com Date: Wednesday, August 13, 2014 at 5:04 PM To: Michael Armbrust mich...@databricks.com Cc: Silvio Fiorito silvio.fior...@granturing.com

Spark 1.0 embedded Hive libraries

2014-06-06 Thread Silvio Fiorito
Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore) used in SparkSQL? Are they forked with Spark-specific customizations, like Shark, or simply relabeled with a new package name (org.spark-project.hive)? I couldn't find any repos on Github

Re: Spark 1.0 embedded Hive libraries

2014-06-06 Thread Silvio Fiorito
- Patrick On Fri, Jun 6, 2014 at 12:08 PM, Silvio Fiorito silvio.fior...@granturing.com wrote: Is there a repo somewhere with the code for the Hive dependencies (hive-exec, hive-serde, hive-metastore) used in SparkSQL? Are they forked with Spark-specific customizations, like Shark, or simply
