Re: Surprising Spark SQL benchmark

2014-11-05 Thread Marco Slot
Hi Patrick, We left the details of the configuration of Spark that we used out of the blog post for brevity, but we're happy to share them. We've done quite a bit of tuning to find the configuration settings that gave us the best query times and run the most queries. I think there might still be

Appropriate way to add a debug flag

2014-11-05 Thread Ganelin, Ilya
Hello all – I am working on https://issues.apache.org/jira/browse/SPARK-3694 and would like to understand the appropriate mechanism by which to check for a debug flag before printing a graph traversal of dependencies of an RDD or Task. I understand that I can use the logging utility and use

Re: src/main/resources/kv1.txt not found in example of HiveFromSpark

2014-11-05 Thread Marcelo Vanzin
Yeah, the code looks for the file in the source location, not in the packaged location. It's in the root of the examples jar; you can extract it to src/main/resources/kv1.txt in the local directory (creating the subdirs) and then you can run the example. Probably should be fixed though (bonus if
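
A minimal sketch of the workaround described in this message, assuming the examples jar is an ordinary zip archive with kv1.txt at its root. The helper name extract_resource and any jar path passed to it are hypothetical; the script is written in Python purely for illustration and is not part of Spark.

```python
import zipfile
from pathlib import Path


def extract_resource(jar_path, member="kv1.txt", dest_dir="src/main/resources"):
    """Copy one entry from a jar (a zip archive) into dest_dir.

    jar_path is whatever examples jar you have locally (hypothetical here).
    """
    dest = Path(dest_dir)
    # Create src/main/resources (and parents) if they do not exist yet.
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(jar_path) as jar:
        # read() raises KeyError if the entry is not in the archive.
        data = jar.read(member)
    target = dest / Path(member).name
    target.write_bytes(data)
    return target
```

Calling extract_resource with the path to the examples jar, from the directory where the example is launched, should leave the file at src/main/resources/kv1.txt, which is where the message says the code looks for it.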

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Reynold Xin
Hi all, We are excited to announce that the benchmark entry has been reviewed by the Sort Benchmark committee and Spark has officially won the Daytona GraySort contest in sorting 100TB of data. Our entry tied with a UCSD research team building high performance systems and we jointly set a new

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Nicholas Chammas
On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: I believe that benchmark has a pending certification on it. See http://sortbenchmark.org under Process. Regarding this comment, Reynold has just announced that this benchmark is now certified. -

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Nicholas Chammas
Steve Nunez, I believe the information behind the links below should address your concerns earlier about Databricks's submission to the Daytona Gray benchmark. On Wed, Nov 5, 2014 at 6:43 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: On Fri, Oct 31, 2014 at 3:45 PM, Nicholas Chammas

Re: Breaking the previous large-scale sort record with Spark

2014-11-05 Thread Matei Zaharia
Congrats to everyone who helped make this happen. And if anyone has even more machines they'd like us to run on next year, let us know :). Matei On Nov 5, 2014, at 3:11 PM, Reynold Xin r...@databricks.com wrote: Hi all, We are excited to announce that the benchmark entry has been

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Reynold Xin
Steve, I wouldn't say Hadoop MR is a 2001 Toyota Celica :) In either case, I updated the blog post to actually include CPU / disk / network measures. You should see that in any measure that matters to this benchmark, the old 2100 node cluster is vastly superior. The data even fit in memory! On

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Nicholas Chammas
Steve, Your original comment was about the *reproducibility* of the benchmark, which I was responding to. No one is suggesting you doubt the authenticity or results of the benchmark. For which no details or code have been released to allow others to reproduce it. I would encourage anyone doing

[VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure there is still great oversight of key components (in particular internal

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Timothy Chen
Hi Matei, Definitely in favor of moving into this model for exactly the reasons you mentioned. From the module list though, the module that I'm mostly involved with and is not listed is the Mesos integration piece. I believe we also need a maintainer for Mesos, and I wonder if there is someone

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Michael Armbrust
+1 (binding) On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia matei.zaha...@gmail.com wrote: BTW, my own vote is obviously +1 (binding). Matei On Nov 5, 2014, at 5:31 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Reynold Xin
+1 (binding) We are already doing this implicitly. In my experience, this can create longer term personal commitment, which usually leads to better design decisions if somebody knows they would need to look after something for a while. On Wed, Nov 5, 2014 at 5:33 PM, Matei Zaharia

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Nan Zhu
+1, with a question: will these maintainers clean up the pending PRs once we start to apply this model? Some patches have been open for a long time without being merged; some of them are periodically maintained (rebase, ping, etc.), while the others have just been phased out. Best, --

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
Hi Tim, We can definitely add one for that if the component grows larger or becomes harder to maintain. The main reason I didn't propose one is that the Mesos integration is actually a lot simpler than YARN at the moment, partly because we support several YARN versions that have incompatible

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Sandy Ryza
This seems like a good idea. An area that wasn't listed, but that I think could strongly benefit from maintainers, is the build. Having consistent oversight over Maven, SBT, and dependencies would allow us to avoid subtle breakages. Component maintainers have come up several times within the

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Patrick Wendell
I'm a +1 on this as well, I think it will be a useful model as we scale the project in the future and recognizes some informal process we have now. To respond to Sandy's comment: for changes that fall in between the component boundaries or are straightforward, my understanding of this model is

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Andrew Or
+1 2014-11-05 18:08 GMT-08:00 Patrick Wendell pwend...@gmail.com: I'm a +1 on this as well, I think it will be a useful model as we scale the project in the future and recognizes some informal process we have now. To respond to Sandy's comment: for changes that fall in between the

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Prashant Sharma
+1, Sounds good. Now I know whom to ping for what, even if I did not follow the whole history of the project very carefully. Prashant Sharma On Thu, Nov 6, 2014 at 7:01 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC

Re: Surprising Spark SQL benchmark

2014-11-05 Thread Matei Zaharia
Yup, the Hadoop nodes were from 2013, each with 64 GB RAM, 12 cores, 10 Gbps Ethernet and 12 disks. For 100 TB of data, the intermediate data could fit in memory on this cluster, which can make shuffle much faster than with intermediate data on SSDs. You can find the specs in
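
As a rough check of the in-memory claim in this message, the aggregate RAM can be computed from the per-node figure. The node count of 2100 is taken from Reynold's earlier message in this thread and is assumed to describe the same cluster; decimal units (1 TB = 1000 GB) are also an assumption.

```python
NODES = 2100          # assumed: node count from the earlier message in this thread
RAM_PER_NODE_GB = 64  # per-node RAM stated in this message
DATA_TB = 100         # size of the sorted data set


def aggregate_ram_tb(nodes, ram_per_node_gb):
    """Total cluster RAM in TB, using decimal units (1 TB = 1000 GB)."""
    return nodes * ram_per_node_gb / 1000


total_tb = aggregate_ram_tb(NODES, RAM_PER_NODE_GB)
print(f"{total_tb:.1f} TB aggregate RAM vs {DATA_TB} TB of data")
```

This ignores the memory taken by the operating system and the framework itself, so it only shows that the claim is plausible, not how much headroom there actually was.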

create_image.sh contains broken hadoop web link

2014-11-05 Thread Nicholas Chammas
As part of my work for SPARK-3821 https://issues.apache.org/jira/browse/SPARK-3821, I tried building an AMI today using create_image.sh. This line https://github.com/mesos/spark-ec2/blob/f6773584dd71afc49f1225be48439653313c0341/create_image.sh#L68 appears to be broken now (it wasn’t a week or so

Re: create_image.sh contains broken hadoop web link

2014-11-05 Thread Ted Yu
Have you seen this thread ? http://search-hadoop.com/m/LgpTk2Pnw6O/andrew+apache+mirrorsubj=Re+All+mirrored+download+links+from+the+Apache+Hadoop+site+are+broken Cheers On Wed, Nov 5, 2014 at 7:36 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: As part of my work for SPARK-3821

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Mark Hamstra
+1 (binding) On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: +1 on this proposal. On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Will these maintainers clean up the pending PRs once we start to apply this model? I

[Classloading] Strange class loading issue

2014-11-05 Thread Matt Cheah
Hi everyone, I'm running into a strange class loading issue when running a Spark job, using Spark 1.0.2. I'm running a process where some Java code is compiled dynamically into a jar and added to the Spark context via addJar(). It is also added to the class loader of the thread that created the

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Xiangrui Meng
+1 (binding) On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 (binding) On Wed, Nov 5, 2014 at 6:29 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: +1 on this proposal. On Wed, Nov 5, 2014 at 8:55 PM, Nan Zhu zhunanmcg...@gmail.com wrote: Will these

Re: create_image.sh contains broken hadoop web link

2014-11-05 Thread Nicholas Chammas
Nope, thanks for pointing me to it. Doesn't look like there is a resolution to the issue. Also, the link you pointed to appears to be broken now as well: http://apache.mesi.com.ar/hadoop/common/ Nick On Wed, Nov 5, 2014 at 10:43 PM, Ted Yu yuzhih...@gmail.com wrote: Have you seen this thread?

Re: create_image.sh contains broken hadoop web link

2014-11-05 Thread Ted Yu
The artifacts are in archive: http://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/ Cheers On Nov 5, 2014, at 8:07 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote: Nope, thanks for pointing me to it. Doesn't look like there is a resolution to the issue. Also, the link you

Re: create_image.sh contains broken hadoop web link

2014-11-05 Thread Nicholas Chammas
Yup, I just stumbled on that. I'll submit a PR to fix that link. Thanks Ted. On Wed, Nov 5, 2014 at 11:13 PM, Ted Yu yuzhih...@gmail.com wrote: The artifacts are in archive: http://archive.apache.org/dist/hadoop/common/hadoop-2.4.1/ Cheers On Nov 5, 2014, at 8:07 PM, Nicholas Chammas

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Wangfei (X)
+1 Sent from my iPhone. On Nov 5, 2014, at 20:06, Denny Lee denny.g@gmail.com wrote: +1 great idea. On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote: +1 (binding) On Wed, Nov 5, 2014 at 7:52 PM, Mark Hamstra m...@clearstorydata.com wrote: +1 (binding) On Wed, Nov 5, 2014 at

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Cheng Lian
+1 since this is already the de facto model we are using. On Thu, Nov 6, 2014 at 12:40 PM, Wangfei (X) wangf...@huawei.com wrote: +1 Sent from my iPhone. On Nov 5, 2014, at 20:06, Denny Lee denny.g@gmail.com wrote: +1 great idea. On Wed, Nov 5, 2014 at 20:04 Xiangrui Meng men...@gmail.com wrote:

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Jeremy Freeman
Great idea! +1 — Jeremy - jeremyfreeman.net @thefreemanlab On Nov 5, 2014, at 11:48 PM, Timothy Chen tnac...@gmail.com wrote: Matei that makes sense, +1 (non-binding) Tim On Wed, Nov 5, 2014 at 8:46 PM, Cheng Lian lian.cs@gmail.com wrote: +1 since this is

RE: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Cheng, Hao
+1, that will definitely speed up PR reviewing / merging. -Original Message- From: Cheng Lian [mailto:lian.cs@gmail.com] Sent: Thursday, November 6, 2014 12:46 PM To: dev Subject: Re: [VOTE] Designating maintainers for some Spark components +1 since this is already the de facto

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread jackylk
+1 Great idea!

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Kousuke Saruta
+1, It makes sense! - Kousuke (2014/11/05 17:31), Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list. Basically, as the Spark project scales up, we need to define a model to make sure

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Reza Zadeh
+1, sounds good. On Wed, Nov 5, 2014 at 9:19 PM, Kousuke Saruta saru...@oss.nttdata.co.jp wrote: +1, It makes sense! - Kousuke (2014/11/05 17:31), Matei Zaharia wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Xuefeng Wu
+1, it makes the work more focused and more consistent. Yours respectfully, Xuefeng Wu 吴雪峰 On Nov 6, 2014, at 9:31 AM, Matei Zaharia matei.zaha...@gmail.com wrote: Hi all, I wanted to share a discussion we've been having on the PMC list, as well as call for an official vote on it on a public list.

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Matei Zaharia
Several people asked about having maintainers review the PR queue for their modules regularly, and I like that idea. We have a new tool now to help with that in https://spark-prs.appspot.com. In terms of the set of open PRs itself, it is large but note that there are also 2800 *closed* PRs,

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Manoj Babu
+1 Cheers! Manoj. On Thu, Nov 6, 2014 at 12:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Several people asked about having maintainers review the PR queue for their modules regularly, and I like that idea. We have a new tool now to help with that in https://spark-prs.appspot.com. In

Re: [VOTE] Designating maintainers for some Spark components

2014-11-05 Thread Liquan Pei
+1 Liquan On Wed, Nov 5, 2014 at 11:32 PM, Manoj Babu manoj...@gmail.com wrote: +1 Cheers! Manoj. On Thu, Nov 6, 2014 at 12:51 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Several people asked about having maintainers review the PR queue for their modules regularly, and I like