Re: make-distribution.sh failing at spark/R/lib/sparkr.zip

2015-08-12 Thread Ted Yu
I ran your command on Linux and it passed. Are you going to use SparkR? If so, consider including the following: -Psparkr Cheers On Wed, Aug 12, 2015 at 3:31 AM, MEETHU MATHEW meethu2...@yahoo.co.in wrote: Hi, I am trying to create a package using the make-distribution.sh script from the

Re: grouping by a partitioned key

2015-08-12 Thread Philip Weaver
Yes, I am partitioning using DataFrameWriter.partitionBy, which produces the keyed directory structure that you referenced in that link. On Tue, Aug 11, 2015 at 11:54 PM, Hemant Bhanawat hemant9...@gmail.com wrote: As far as I know, Spark SQL cannot process data on a per-partition-basis.
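
For reference, a minimal Scala sketch of the partitioned write being discussed (input path, output path, and column name are illustrative):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read.json("events.json")  // illustrative input

    // Writes one sub-directory per distinct key, e.g. .../key=a/part-r-...
    df.write.partitionBy("key").parquet("/data/events_by_key")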

Spark 1.2.2 build problem with Hive 0.12, bringing in wrong version of avro-mapred

2015-08-12 Thread java8964
Hi, This email is sent to both dev and user list, just want to see if someone familiar with Spark/Maven build procedure can provide any help. I am building Spark 1.2.2 with the following command: mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -Phive -Phive-0.12.0 The spark-assembly-1.2.2-hadoop2.2.0.jar

Re: avoid duplicate due to executor failure in spark stream

2015-08-12 Thread Cody Koeninger
Accumulators aren't going to work to communicate state changes between executors. You need external storage. On Tue, Aug 11, 2015 at 11:28 AM, Shushant Arora shushantaror...@gmail.com wrote: What if processing is neither idempotent nor in a transaction, say I am posting events to some
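
A hedged sketch of the pattern Cody describes: record what has already been processed in an external store keyed by a unique event ID, so a re-executed task does not post the same event twice. Here stream is the input DStream, and alreadyPosted, postEvent and markPosted are hypothetical helpers for whatever store is used, not Spark APIs:

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        events.foreach { e =>
          // Hypothetical external-store helpers; not part of Spark.
          if (!alreadyPosted(e.id)) {
            postEvent(e)      // the non-idempotent side effect
            markPosted(e.id)  // record it so a retried task skips it
          }
        }
      }
    }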

Spark - Standalone Vs YARN Vs Mesos

2015-08-12 Thread ๏̯͡๏
Do we have any comparisons in terms of resource utilization and scheduling when running Spark in the below three modes: 1) Standalone 2) over YARN 3) over Mesos? Can someone share resources (thoughts/URLs) on this area. -- Deepak

Unit Testing

2015-08-12 Thread Mohit Anchlia
Is there a way to run spark streaming methods in standalone eclipse environment to test out the functionality?
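
One common approach is a local StreamingContext fed by an in-memory queueStream, which runs inside a plain Eclipse/JUnit process with no cluster; a minimal sketch against the Spark 1.x API:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable

    val conf = new SparkConf().setMaster("local[*]").setAppName("streaming-test")
    val sc = new SparkContext(conf)
    val ssc = new StreamingContext(sc, Seconds(1))

    // Feed test data through an in-memory queue instead of a real source.
    val queue = mutable.Queue(sc.parallelize(Seq(1, 2, 3)))
    ssc.queueStream(queue).count().print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()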

spark's behavior about failed tasks

2015-08-12 Thread freedafeng
Hello there, I have a Spark job running in a 20-node cluster. The job is logically simple, just a mapPartitions and then a sum. The return value of the mapPartitions is an integer for each partition. The tasks got some random failures (which could be caused by 3rd-party key-value store connections). The
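
The job described is roughly the shape below (processPartition is a hypothetical stand-in for the per-partition work); note that by default a failed task is retried up to spark.task.maxFailures (4) times before the stage is failed:

    // One Int per partition, summed on the driver.
    val perPartition = rdd.mapPartitions { iter =>
      val n = processPartition(iter)  // hypothetical per-partition work
      Iterator(n)
    }
    val total = perPartition.reduce(_ + _)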

Re: Sorted Multiple Outputs

2015-08-12 Thread Eugene Morozov
Yiannis, sorry for late response. It is indeed not possible to create a new RDD inside of foreachPartition, so you have to write data manually. I haven’t tried that and haven’t got such an exception, but I’d assume you might try to write locally and then upload it into HDFS. FileSystem has a
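
A rough sketch of the "write locally, then upload" idea using the Hadoop FileSystem API inside foreachPartition (paths are illustrative, error handling omitted):

    import java.io.{File, PrintWriter}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    rdd.foreachPartition { records =>
      val local = File.createTempFile("part-", ".txt")
      val out = new PrintWriter(local)
      records.foreach(r => out.println(r.toString))
      out.close()

      // Upload the finished local file into HDFS.
      val fs = FileSystem.get(new Configuration())
      fs.copyFromLocalFile(new Path(local.getAbsolutePath),
                           new Path("hdfs:///output/" + local.getName))
    }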

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Tim Chen
You're referring to both fine grain and coarse grain? A desirable number of executors per node could be interesting, but it can't be guaranteed (or we could try to, and abort the job when that fails). How would you imagine this new option to actually work? Tim On Wed, Aug 12, 2015 at 11:48 AM, Ajay

Spark 1.3 + Parquet: Skipping data using statistics

2015-08-12 Thread YaoPau
I've seen this function referenced in a couple of places, first in this forum post https://forums.databricks.com/questions/951/why-should-i-use-parquet.html and in this talk by Michael Armbrust https://www.youtube.com/watch?v=6axUqHCu__Y around the 42nd minute. As I understand it, if you create a
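
The feature in question is Parquet filter pushdown: row groups whose min/max statistics cannot satisfy a filter are skipped at read time. A minimal sketch, written against the 1.4+ DataFrameReader API (in 1.3 the read is sqlContext.parquetFile; path and column are illustrative):

    // Enable pushdown of filters into the Parquet reader.
    sqlContext.setConf("spark.sql.parquet.filterPushdown", "true")

    val df = sqlContext.read.parquet("/data/events")
    // Row groups whose statistics exclude this predicate can be skipped.
    df.filter("event_date >= '2015-08-01'").count()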

Re: stopping spark stream app

2015-08-12 Thread Tathagata Das
stop() is a blocking method when stopGraceful is set to true. In that case, it obviously waits for all batches with data to complete processing. Why are you joining on the thread in streaming listener? The listener is just a callback listener and is NOT supposed to do any long running blocking

what is cause of, and how to recover from, unresponsive nodes w/ spark-ec2 script

2015-08-12 Thread AlexG
I'm using the spark-ec2 script to launch a 30 node r3.8xlarge cluster. Occasionally several nodes will become unresponsive: I will notice that hdfs complains it can't find some blocks, then when I go to restart hadoop, the messages indicate that the connection to some nodes timed out, then when I

Re: PySpark order-only window function issue

2015-08-12 Thread Davies Liu
This should be a bug, go ahead and open a JIRA for it, thanks! On Tue, Aug 11, 2015 at 6:41 AM, Maciej Szymkiewicz mszymkiew...@gmail.com wrote: Hello everyone, I am trying to use the PySpark API with window functions without specifying a partition clause. I mean something equivalent to this
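
For context, a rough Scala equivalent of an order-only window (no partition clause), sketched against the Spark 1.4 DataFrame window API (which at this point requires a HiveContext); df and the value column are illustrative:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.rowNumber

    // Window ordered over the whole dataset, with no partitionBy clause.
    val w = Window.orderBy("value")
    val ranked = df.select(df("value"), rowNumber().over(w).as("rn"))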

Re: Spark - Standalone Vs YARN Vs Mesos

2015-08-12 Thread Tim Chen
I'm not sure what you're looking for, since you can't really compare Standalone with YARN or Mesos: Standalone assumes the Spark workers/master own the cluster, while YARN/Mesos try to share the cluster among different applications/frameworks. And when you refer to resource

Re: grouping by a partitioned key

2015-08-12 Thread Hemant Bhanawat
Inline.. On Thu, Aug 13, 2015 at 5:06 AM, Eugene Morozov fathers...@list.ru wrote: Hemant, William, pls see inlined. On 12 Aug 2015, at 18:18, Philip Weaver philip.wea...@gmail.com wrote: Yes, I am partitioning using DataFrameWriter.partitionBy, which produces the keyed directory structure

Re: Very high latency to initialize a DataFrame from partitioned parquet database.

2015-08-12 Thread Cheng Lian
Hi Philip, What do you mean by saying still partitioned the same way? If you are trying to save the partition columns encoded in partition directories directly into Parquet files, and put all Parquet part-files into a single directory without creating any intermediate sub-directories, then

Re: How to minimize shuffling on Spark dataframe Join?

2015-08-12 Thread Abdullah Anwar
Hi Hemant, Thank you for your reply. I think the source of my dataframe is not partitioned on the key; it's an Avro file where 'id' is a field .. but I don't know how to read a file and at the same time configure the partition key. I couldn't find anything on SQLContext.read.load where you can set

Re: grouping by a partitioned key

2015-08-12 Thread Hemant Bhanawat
As far as I know, Spark SQL cannot process data on a per-partition basis. DataFrame.foreachPartition is the way. I haven't tried it, but the following looks like a not-so-sophisticated way of making Spark SQL partition aware.

Re: Spark DataFrames uses too many partitions

2015-08-12 Thread Alasdair McBride
Thanks Silvio! On 11 Aug 2015 17:44, Silvio Fiorito silvio.fior...@granturing.com wrote: You need to configure the spark.sql.shuffle.partitions parameter to a different value. It defaults to 200. On 8/11/15, 11:31 AM, Al M alasdair.mcbr...@gmail.com wrote: I am using DataFrames with
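
A short sketch of adjusting that setting on the SQLContext (the value 20 and the df1/df2 join are illustrative):

    // Defaults to 200; lower it for small data sets, raise it for large shuffles.
    sqlContext.setConf("spark.sql.shuffle.partitions", "20")

    val joined = df1.join(df2, "id")  // subsequent shuffles use 20 partitions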

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Tim Chen
Yes, the options are not that configurable yet, but I think it's not hard to change. I actually have a patch out specifically to make the amount of CPUs per executor configurable in coarse-grained mode, which will hopefully be merged in the next release. I think the open question now is whether, for fine-grained mode, we can limit
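
For reference, the coarse-grained Mesos knobs that do exist in this time frame look roughly like the sketch below (values illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.mesos.coarse", "true")   // coarse-grained mode
      .set("spark.cores.max", "16")        // cap on total cores across the cluster
      .set("spark.executor.memory", "4g")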

Re: Parquet without hadoop: Possible?

2015-08-12 Thread Cheng Lian
One thing to note is that it would be good to add an explicit file system scheme to the output path (i.e. file:///var/... instead of /var/...), esp. when you do have HDFS running, because in this case the data might be written to HDFS rather than your local file system if Spark found Hadoop
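
A minimal sketch of the suggestion, with the scheme spelled out so the write cannot silently land on HDFS (the path is illustrative):

    // Explicit scheme: always the local file system, even when
    // fs.defaultFS points at an HDFS namenode.
    df.write.parquet("file:///var/data/output.parquet")

    // Without the scheme, "/var/data/output.parquet" resolves against the
    // default file system, which may be HDFS.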

RE: Spark DataFrames uses too many partitions

2015-08-12 Thread Alasdair McBride
Thank you Hao; that was a fantastic response. I have raised SPARK-9782 for this. I also would love to have dynamic partitioning. I mentioned it in the Jira. On 12 Aug 2015 02:19, Cheng, Hao hao.ch...@intel.com wrote: That's a good question, we don't support reading small files in a single

How to Handle Update Operation from Spark to MongoDB

2015-08-12 Thread Deepesh Maheshwari
Hi, I am using the MongoDB-Hadoop connector to insert an RDD into MongoDB: rdd.saveAsNewAPIHadoopFile(file:///notapplicable, Object.class, BSONObject.class, MongoOutputFormat.class, outputConfig); But some of the RDD data needs to be inserted as an update operation in Mongo

Re: stopping spark stream app

2015-08-12 Thread Shushant Arora
calling jssc.stop(false/true,false/true) from a StreamingListener causes a deadlock, so I created another thread and called jssc.stop from that, but that too caused a deadlock if onBatchCompleted is not completed before jssc.stop(). So is it safe if I call System.exit(1) from another thread without

Re: Re: Possible issue for Spark SQL/DataFrame

2015-08-12 Thread Netwaver
It is space-separated data, just as below. And what is your thought about the second issue? Thank you. At 2015-08-10 15:20:39, Akhil Das ak...@sigmoidanalytics.com wrote: Isn't it space-separated data? It is not comma (,) separated nor pipe (|) separated data. Thanks Best Regards

What is the Effect of Serialization within Stages?

2015-08-12 Thread Mark Heimann
Hello everyone, I am wondering what the effect of serialization is within a stage. My understanding of Spark as an execution engine is that the data flow graph is divided into stages and a new stage always starts after an operation/transformation that cannot be pipelined (such as groupBy or

Parquet file organisation for 100GB+ dataframes

2015-08-12 Thread Ewan Leith
Hi all, Can anyone share their experiences working with storing and organising larger datasets with Spark? I've got a dataframe stored in Parquet on Amazon S3 (using EMRFS) which has a fairly complex nested schema (based on JSON files), which I can query in Spark, but the initial setup takes

Re: Spark DataFrames uses too many partitions

2015-08-12 Thread Al M
The DataFrames parallelism is currently controlled through the configuration option spark.sql.shuffle.partitions. The default value is 200. I have raised an improvement Jira to make it possible to specify the number of partitions in https://issues.apache.org/jira/browse/SPARK-9872 -- View this

Error writing to cassandra table using spark application

2015-08-12 Thread Nupur Kumar (BLOOMBERG/ 731 LEX)
Hello, I am doing this for the first time so feel free to let me know/forward this to where it needs to be if not here. I have a spark application that does some computations and writes the results to Cassandra. I had this set up working (and Cass table populating) on dev machines but when I
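
Assuming the spark-cassandra-connector is the write path being used, the write side typically looks like the sketch below (keyspace, table, column names, and the results RDD are illustrative):

    import com.datastax.spark.connector._

    // Each tuple maps onto the (id, score) columns of the target table.
    results.saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "score"))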

Is there any tool I can use to prove to a customer that Spark is faster than Hive?

2015-08-12 Thread Ladle
Hi, I have built the machine learning features and model using Apache Spark. I have built the same features using Hive and Java, and used Mahout to run the model. Now how can I show the customer that Apache Spark is faster than Hive? Is there any tool that shows the time? Regards,

Spark 1.4.1 py4j.Py4JException: Method read([]) does not exist

2015-08-12 Thread resonance
Hello everybody, I am programming with Pyspark in the Eclipse IDE and have been trying to transition to Spark 1.4.1 so that I may finally program using Python 3. (Good work with the progress!) The following program works in Spark 1.3.1 but throws an exception in Spark 1.4.1: from pyspark import

Re: Is there any tool I can use to prove to a customer that Spark is faster than Hive?

2015-08-12 Thread Nick Pentreath
Perhaps you could time the end-to-end runtime for each pipeline, and each stage? Though I'd be fairly confident that Spark will outperform Hive/Mahout on MR, that's not the only consideration - having everything on a single platform and the Spark / DataFrame API is a huge win just by itself
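
A simple way to produce those numbers is to time each run around the actions that actually force execution; a sketch (features is an illustrative DataFrame/RDD, and the Spark UI on port 4040 also reports per-job and per-stage times):

    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    // Actions (count, collect, save) are what trigger the actual work.
    val n = timed("feature extraction + count") { features.count() }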

Re: Possible issue for Spark SQL/DataFrame

2015-08-12 Thread Eugene Morozov
Cannot tell anything specific about the separator as it’s not clear how you create the schema from schemaString. Regarding the second issue - that’s expected, because there is a Map there and you cannot provide more than one value for the key. That’s why you see only the last “min” value. This is a
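
For whitespace-separated input, the usual pattern (per the Spark SQL programming guide) is to split both the schema string and each line on whitespace before building the DataFrame; a sketch with illustrative field names and input path:

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schemaString = "name age city"
    val schema = StructType(
      schemaString.split(" ").map(f => StructField(f, StringType, nullable = true)))

    val rows = sc.textFile("people.txt").map(_.split("\\s+")).map(p => Row(p: _*))
    val df = sqlContext.createDataFrame(rows, schema)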

Re: Is there any tool I can use to prove to a customer that Spark is faster than Hive?

2015-08-12 Thread Gourav Sengupta
You might also need to consider the maturity of Spark SQL vs HiveQL. Besides that, please read the following (which will soon be available as a part of the standard Amazon stack, in case it's not already): https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark%3A+Getting+Started. All that you

Does Spark optimization might miss to run transformation?

2015-08-12 Thread Eugene Morozov
Hi! I’d like to perform an action (store / print something) inside of a transformation (map or mapPartitions). This approach has some flaws, but there is a question: might it happen that Spark will optimise the (RDD or DataFrame) processing so that my mapPartitions simply won’t happen? -- Eugene Morozov
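
The concern comes from lazy evaluation: a transformation only runs when an action needs its output, so a side effect buried in mapPartitions may never execute at all, and may execute more than once if a task is retried. A sketch:

    val mapped = rdd.mapPartitions { iter =>
      println("side effect")  // nothing happens yet: transformations are lazy
      iter
    }

    // Only now does the side effect run, once per partition per (re)computation.
    mapped.count()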

Re: Controlling number of executors on Mesos vs YARN

2015-08-12 Thread Jerry Lam
Great stuff Tim. This will definitely make Mesos users' lives easier. Sent from my iPad On 2015-08-12, at 11:52, Haripriya Ayyalasomayajula aharipriy...@gmail.com wrote: Thanks Tim, Jerry. On Wed, Aug 12, 2015 at 1:18 AM, Tim Chen t...@mesosphere.io wrote: Yes the options are not that

Re: stopping spark stream app

2015-08-12 Thread Tathagata Das
Well, System.exit will not ensure all data was processed before shutdown. There should not be a deadlock if onBatchCompleted just starts the thread (that runs stop()) and completes. On Wed, Aug 12, 2015 at 1:50 AM, Shushant Arora shushantaror...@gmail.com wrote: calling
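
A sketch of that pattern in Scala (the Java StreamingContext has the same two-argument stop overload): onBatchCompleted only starts a separate thread that performs the graceful stop and returns immediately; shouldStop is a hypothetical shutdown flag:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        if (shouldStop) {
          new Thread("streaming-stopper") {
            override def run(): Unit =
              ssc.stop(stopSparkContext = true, stopGracefully = true)
          }.start()
          // Return immediately; never block inside the listener callback.
        }
      }
    })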

Re: Spark runs into an Infinite loop even if the tasks are completed successfully

2015-08-12 Thread Imran Rashid
yikes. Was this a one-time thing? Or does it happen consistently? can you turn on debug logging for o.a.s.scheduler (dunno if it will help, but maybe ...) On Tue, Aug 11, 2015 at 8:59 AM, Akhil Das ak...@sigmoidanalytics.com wrote: Hi My Spark job (running in local[*] with spark 1.4.1)

Re: Partitioning in spark streaming

2015-08-12 Thread Tathagata Das
Yes. On Wed, Aug 12, 2015 at 12:12 PM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! To write to hdfs I do need to use saveAs method? On Wed, Aug 12, 2015 at 12:01 PM, Tathagata Das t...@databricks.com wrote: This is how Spark does. It writes the task output to a uniquely-named
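
The two usual ways to write streaming output to HDFS, matching the saveAs methods mentioned above; a sketch with illustrative paths:

    // Built-in: one directory per batch, named <prefix>-<timestamp>.<suffix>
    stream.saveAsTextFiles("hdfs:///output/events", "txt")

    // Or explicitly, once per batch:
    stream.foreachRDD { (rdd, time) =>
      rdd.saveAsTextFile(s"hdfs:///output/events-${time.milliseconds}")
    }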