Re: External Shuffle service over yarn

2015-06-26 Thread Sandy Ryza
Hi Yash, One of the main advantages is that, if you turn dynamic allocation on, and executors are discarded, your application is still able to get at the shuffle data that they wrote out. -Sandy On Thu, Jun 25, 2015 at 11:08 PM, yash datta sau...@gmail.com wrote: Hi devs, Can someone point
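
For reference, a minimal sketch of the setup Sandy describes, assuming Spark 1.4 on YARN: dynamic allocation is paired with the external shuffle service so shuffle files remain readable after an executor is released. The executor bounds below are illustrative, and the YARN NodeManagers must additionally run the spark_shuffle auxiliary service (org.apache.spark.network.yarn.YarnShuffleService).

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: dynamic allocation lets idle executors be released;
    // the external shuffle service keeps their shuffle output readable afterwards.
    val conf = new SparkConf()
      .setAppName("dynamic-allocation-sketch")
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "1")   // illustrative bounds
      .set("spark.dynamicAllocation.maxExecutors", "20")
    val sc = new SparkContext(conf)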

Re: how to implement my own datasource?

2015-06-26 Thread 诺铁
thank you guys, I'll read the examples and give it a try. On Fri, Jun 26, 2015 at 2:47 AM, jimfcarroll jimfcarr...@gmail.com wrote: I'm not sure if this is what you're looking for, but we have several custom RDD implementations for internal data format/partitioning schemes. The Spark api is really
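
As a rough illustration of the custom-RDD route Jim mentions (the names here are hypothetical, not his internal format): a source only has to describe its splits in getPartitions and produce the records for one split in compute.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Hypothetical split descriptor: a real source would carry file paths,
    // offsets or connection details for its piece of the data here.
    class SimpleSplit(val index: Int) extends Partition

    // Minimal custom RDD: Spark asks getPartitions how the data is divided,
    // then calls compute once per split on the executors.
    class MyFormatRDD(sc: SparkContext, numSplits: Int) extends RDD[String](sc, Nil) {

      override protected def getPartitions: Array[Partition] =
        Array.tabulate[Partition](numSplits)(i => new SimpleSplit(i))

      override def compute(split: Partition, context: TaskContext): Iterator[String] =
        // Replace with reading your own data format for this split.
        Iterator(s"record from split ${split.index}")
    }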

Re: External Shuffle service over yarn

2015-06-26 Thread Aaron Davidson
A second advantage is that it allows individual Executors to go into GC pause (or even crash) while still allowing other Executors to read their shuffle data and make progress, which tends to improve the stability of memory-intensive jobs. On Thu, Jun 25, 2015 at 11:42 PM, Sandy Ryza sandy.r...@cloudera.com

Re: Spark for distributed dbms cluster

2015-06-26 Thread Akhil Das
Which distributed database are you referring to here? Spark can connect with almost all the databases out there (you just need to pass the Input/Output Format classes, or there are a bunch of connectors available as well). Thanks Best Regards On Fri, Jun 26, 2015 at 12:07 PM, louis.hust
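
A hedged sketch of the Input/Output Format route Akhil mentions; TextInputFormat is only a stand-in here, and a real database would supply its own InputFormat class (or a dedicated connector):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.io.{LongWritable, Text}
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    // Any storage system that ships a Hadoop InputFormat can be read like this;
    // swap TextInputFormat and the key/value classes for the database's own format.
    def readViaInputFormat(sc: SparkContext, inputDir: String): RDD[String] = {
      val hadoopConf = new Configuration()
      hadoopConf.set("mapreduce.input.fileinputformat.inputdir", inputDir)
      sc.newAPIHadoopRDD(hadoopConf, classOf[TextInputFormat],
          classOf[LongWritable], classOf[Text])
        .map { case (_, line) => line.toString }
    }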

Re: [SQL] codegen on wide dataset throws StackOverflow

2015-06-26 Thread Josh Rosen
Which Spark version are you using? Can you file a JIRA for this issue? On Thu, Jun 25, 2015 at 6:35 AM, Peter Rudenko petro.rude...@gmail.com wrote: Hi, i have a small but very wide dataset (2000 columns). Trying to optimize Dataframe pipeline for it, since it behaves very poorly comparing

Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Thank you, Jie! Very nice work! -- Nan Zhu http://codingcat.me On Friday, June 26, 2015 at 8:17 AM, Huang, Jie wrote: Correct. Your calculation is right! We have been aware of that kmeans performance drop also. According to our observation, it is caused by some unbalanced

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-26 Thread Patrick Wendell
Hey Tom - no one has voted on this yet, so I need to keep it open until people vote. But I'm not aware of specific things we are waiting for. Anyone else? - Patrick On Fri, Jun 26, 2015 at 7:10 AM, Tom Graves tgraves...@yahoo.com wrote: So is this open for vote then or are we waiting on other

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-26 Thread Tom Graves
So is this open for vote then or are we waiting on other things? Tom On Thursday, June 25, 2015 10:32 AM, Andrew Ash and...@andrewash.com wrote: I would guess that many tickets targeted at 1.4.1 were set that way during the tail end of the 1.4.0 voting process as people realized

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-26 Thread Ted Yu
I got the following when running the test suite: [INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null) [info] Compiling 2 Scala sources and 1 Java source to /home/hbase/spark-1.4.1/streaming/target/scala-2.10/test-classes... [error]

Re: [VOTE] Release Apache Spark 1.4.1

2015-06-26 Thread Ted Yu
Pardon. During an earlier test run, I got: StreamingContextSuite: - from no conf constructor - from no conf + spark home - from no conf + spark home + env - from conf with settings - from existing SparkContext - from existing

Re: R - Scala interface used in Spark?

2015-06-26 Thread Reynold Xin
You doing something for Haskell?? On Fri, Jun 26, 2015 at 5:21 PM, Vasili I. Galchin vigalc...@gmail.com wrote: How about Python?? On Friday, June 26, 2015, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: We don't use the rscala package in SparkR -- We have an in built R-JVM

Re: R - Scala interface used in Spark?

2015-06-26 Thread Reynold Xin
Take a look at this for Python: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals On Fri, Jun 26, 2015 at 6:06 PM, Reynold Xin r...@databricks.com wrote: You doing something for Haskell?? On Fri, Jun 26, 2015 at 5:21 PM, Vasili I. Galchin vigalc...@gmail.com wrote: How

Re: R - Scala interface used in Spark?

2015-06-26 Thread Vasili I. Galchin
thx Reynold! Vasya On Fri, Jun 26, 2015 at 7:03 PM, Reynold Xin r...@databricks.com wrote: Take a look at this for Python: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals On Fri, Jun 26, 2015 at 6:06 PM, Reynold Xin r...@databricks.com wrote: You doing something for

Re: R - Scala interface used in Spark?

2015-06-26 Thread Shivaram Venkataraman
We don't use the rscala package in SparkR -- we have an in-built R-JVM bridge that is customized to work with various deployment modes. You can find more details in my Spark Summit 2015 talk. Thanks Shivaram On Fri, Jun 26, 2015 at 3:19 PM, Vasili I. Galchin vigalc...@gmail.com wrote: A friend

Re: R - Scala interface used in Spark?

2015-06-26 Thread Vasili I. Galchin
URL please!! URL, please, of your work. On Friday, June 26, 2015, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: We don't use the rscala package in SparkR -- we have an in-built R-JVM bridge that is customized to work with various deployment modes. You can find more details in my

Re: R - Scala interface used in Spark?

2015-06-26 Thread Vasili I. Galchin
How about Python?? On Friday, June 26, 2015, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote: We don't use the rscala package in SparkR -- we have an in-built R-JVM bridge that is customized to work with various deployment modes. You can find more details in my Spark Summit 2015 talk.

Re: R - Scala interface used in Spark?

2015-06-26 Thread Shivaram Venkataraman
You can see the slides and video at https://spark-summit.org/2015/events/sparkr-the-past-the-present-and-the-future/ On Fri, Jun 26, 2015 at 5:19 PM, Vasili I. Galchin vigalc...@gmail.com wrote: URL please!! URL, please, of your work. On Friday, June 26, 2015, Shivaram Venkataraman

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Emrehan Tüzün
On Fri, Jun 26, 2015 at 12:30 PM, Sea 261810...@qq.com wrote: Hi all, I found a problem in Spark Streaming: when I use the time in the foreachRDD function, the timestamps come out looking very strange. val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,

R - Scala interface used in Spark?

2015-06-26 Thread Vasili I. Galchin
A friend sent the below: http://cran.r-project.org/web/packages/rscala/index.html Is this the glue between R and Scala that is used in Spark? Vasili

Time is ugly in Spark Streaming....

2015-06-26 Thread Sea
Hi all, I found a problem in Spark Streaming: when I use the time in the foreachRDD function, the timestamps come out looking very strange. val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet) dataStream.map(x => createGroup(x._2,

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Gerard Maas
Are you sharing the SimpleDateFormat instance? This looks a lot more like the non-thread-safe behaviour of SimpleDateFormat (which has claimed many unsuspecting victims over the years) than anything 'ugly' in Spark Streaming. Try writing the timestamps in millis to Kafka and compare. -kr, Gerard. On
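
A sketch of the fix Gerard hints at, assuming dataStream is a DStream[String] and outputPath is a hypothetical destination: take the batch time from foreachRDD and either write time.milliseconds directly, or build the (non-thread-safe) formatter per partition instead of sharing one instance across tasks.

    import java.text.SimpleDateFormat
    import java.util.Date

    import org.apache.spark.streaming.Time
    import org.apache.spark.streaming.dstream.DStream

    def writeWithTimestamps(dataStream: DStream[String], outputPath: String): Unit = {
      dataStream.foreachRDD { (rdd, time: Time) =>
        rdd.mapPartitions { records =>
          // One formatter per task, never shared across threads.
          val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
          val batchTime = fmt.format(new Date(time.milliseconds))
          records.map(record => s"$batchTime,$record")
        }.saveAsTextFile(s"$outputPath/${time.milliseconds}")
      }
    }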

RE: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Huang, Jie
Thanks. In general, we can see a stable trend in the Spark master branch and the latest release. We are also considering adding more benchmarks/workloads to this automated perf tool. Any comments and feedback are warmly welcome. Thank you Best Regards, Grace (Huang Jie) From: Nan Zhu

RE: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Huang, Jie
Correct. Your calculation is right! We have been aware of that kmeans performance drop also. According to our observation, it is caused by some unbalanced execution among different tasks, even though we used the same test data between different versions (i.e., it is not caused by data skew). And the

Re: [SQL] codegen on wide dataset throws StackOverflow

2015-06-26 Thread Peter Rudenko
I'm using spark-1.4.0. Sure, I will try to write up steps to reproduce and file a JIRA ticket. Thanks, Peter Rudenko On 2015-06-26 11:14, Josh Rosen wrote: Which Spark version are you using? Can you file a JIRA for this issue? On Thu, Jun 25, 2015 at 6:35 AM, Peter Rudenko petro.rude...@gmail.com
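
Not Peter's actual pipeline, just a sketch of how a "small but very wide" DataFrame (a few rows, roughly 2000 columns) can be built to exercise codegen, assuming an existing SQLContext on 1.4.0 with spark.sql.codegen enabled:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.functions.lit

    def wideDataFrame(sqlContext: SQLContext, numCols: Int = 2000): DataFrame = {
      // A handful of rows, then widen column by column.
      val narrow = sqlContext.createDataFrame((1 to 10).map(Tuple1.apply)).toDF("c0")
      (1 until numCols).foldLeft(narrow) { (df, i) => df.withColumn(s"c$i", lit(i)) }
    }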

Re: [SparkScore]Performance portal for Apache Spark - WW26

2015-06-26 Thread Nan Zhu
Hi Jie, Thank you very much for this work! Very helpful! I just would like to confirm that I understand the numbers correctly: if we take the running time of the 1.2 release as 100s, does 9.1% mean the running time is 109.1s, and does -4% mean it comes to 96s? If that’s the true meaning of the numbers,
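
In worked form, the reading Nan is asking about (and which Jie confirms above): positive percentages are regressions against the baseline release, negative ones are improvements.

    // Taking the Spark 1.2 release as a 100 s baseline:
    val baseline    = 100.0
    val regression  = baseline * (1 + 0.091)  // +9.1%  -> 109.1 s (slower)
    val improvement = baseline * (1 - 0.040)  // -4.0%  ->  96.0 s (faster)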

Re: Time is ugly in Spark Streaming....

2015-06-26 Thread Sea
Yes, I make it. ------------------ Original Message ------------------ From: Gerard Maas gerard.m...@gmail.com; Date: Friday, June 26, 2015, 5:40; To: Sea 261810...@qq.com; Cc: user@spark.apache.org; dev@spark.apache.org; Subject: Re: Time is ugly in Spark Streaming Are you