Re: new Catalyst/SQL component merged into master

2014-03-20 Thread Heiko Braun
Congrats! That's a really impressive and useful addition to Spark. I just recently discovered a similar feature in pandas and really enjoyed using it. Regards, Heiko > On 21.03.2014 at 02:11, Reynold Xin wrote: > > Hi All, > > I'm excited to announce a new module in Spark (SPARK-1251). A

Re: Spark AMI

2014-03-20 Thread Patrick Wendell
It has a bunch of packages installed on it for various Spark dependencies (libfortran, numpy, scipy) and some helpful tools (dstat, iotop). On Thu, Mar 20, 2014 at 10:21 AM, Reynold Xin wrote: > It's mostly stock CentOS installation with some scripts. > > > > > On Thu, Mar 20, 2014 at 2:53 AM, Us

Re: Spark 0.9.1 release

2014-03-20 Thread Bhaskar Dutta
Thank You! We plan to test out 0.9.1 on YARN once it is out. Regards, Bhaskar On Fri, Mar 21, 2014 at 12:42 AM, Tom Graves wrote: > I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running > on YARN - JIRA and [SPARK-1051] On Yarn, executors don't doAs as > submitting user - J

Please subscribe me

2014-03-20 Thread twinkle sachdeva
Please subscribe me.

Re: new Catalyst/SQL component merged into master

2014-03-20 Thread Michael Armbrust
Hi Everyone, I'm very excited about merging this new feature into Spark! We have a lot of cool things in the pipeline, including: porting Shark's in-memory columnar format to Spark SQL, code generation for expression evaluation, and improved support for complex types in Parquet. I would love to h
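The in-memory columnar format mentioned above can be sketched in plain Python (this is only an illustrative model, not Spark's actual implementation): each field is stored contiguously, so evaluating an expression over one column touches only that column's values.

```python
# Illustrative sketch of a columnar layout: pivot row tuples into
# per-field columns, then evaluate a predicate against a single column.

def to_columnar(rows, fields):
    """Pivot a list of row tuples into a dict of per-field columns."""
    columns = {f: [] for f in fields}
    for row in rows:
        for f, v in zip(fields, row):
            columns[f].append(v)
    return columns

rows = [("a", 1), ("b", 2), ("c", 3)]
cols = to_columnar(rows, ["key", "value"])

# Evaluating "value > 1" scans only the 'value' column.
mask = [v > 1 for v in cols["value"]]
hits = [k for k, keep in zip(cols["key"], mask) if keep]
print(hits)  # ['b', 'c']
```

The payoff of this layout is that a scan over one column never deserializes the others, which is what makes columnar caching attractive for analytical queries.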

new Catalyst/SQL component merged into master

2014-03-20 Thread Reynold Xin
Hi All, I'm excited to announce a new module in Spark (SPARK-1251). After an initial review we've merged this into Spark as an alpha component to be included in Spark 1.0. This new component adds some exciting features, including: - schema-aware RDD programming via an experimental DSL - native Parq
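The "schema-aware" idea can be illustrated with a loose, plain-Python analogy (this is not the actual Spark SQL DSL): records carry named fields instead of being opaque tuples, so queries can be written against field names.

```python
# Loose analogy for schema-aware records: a query like
# "SELECT name WHERE age >= 18" written against named fields.
from typing import NamedTuple

class Person(NamedTuple):
    name: str
    age: int

people = [Person("Ann", 35), Person("Bob", 17), Person("Cal", 52)]

adults = [p.name for p in people if p.age >= 18]
print(adults)  # ['Ann', 'Cal']
```

Knowing the schema up front is also what lets a SQL layer plan and optimize queries rather than treating every record as an opaque blob.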

Re: Spark 0.9.1 release

2014-03-20 Thread Patrick Wendell
Thanks Tom, After I looked more at this patch I don't see how this could have regressed behavior for any users (it seems like it only pertains to warnings and instructions). So maybe the user mistook this patch for a different issue. https://github.com/apache/incubator-spark/pull/553/files - Pat

Re: Spark 0.9.1 release

2014-03-20 Thread Tom Graves
Thanks for the heads up, saw that and will make sure that is resolved before pulling into 0.9. Unless I'm missing something, they should just use sc.addJar to distribute the jar rather than relying on SPARK_YARN_APP_JAR. Tom On Thursday, March 20, 2014 3:31 PM, Patrick Wendell wrote: Hey

Re: Spark 0.9.1 release

2014-03-20 Thread Patrick Wendell
Hey Tom, > I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on > YARN - JIRA and [SPARK-1051] On Yarn, executors don't doAs as submitting > user - JIRA in. The pyspark one I would consider more of an enhancement so > might not be appropriate for a point release. Some

Re: Spark 0.9.1 release

2014-03-20 Thread Tom Graves
I'll pull [SPARK-1053] Should not require SPARK_YARN_APP_JAR when running on YARN - JIRA and  [SPARK-1051] On Yarn, executors don't doAs as submitting user - JIRA in.  The pyspark one I would consider more of an enhancement so might not be appropriate for a point release.  [SPARK-1053] Shoul

Re: Spark 0.9.1 release

2014-03-20 Thread Bhaskar Dutta
It will be great if "SPARK-1101: Umbrella for hardening Spark on YARN" can get into 0.9.1. Thanks, Bhaskar On Thu, Mar 20, 2014 at 5:37 AM, Tathagata Das wrote: > Hello everyone, > > Since the release of Spark 0.9, we have received a number

Re: Largest input data set observed for Spark.

2014-03-20 Thread Andrew Ash
Understood of course. Did the data fit comfortably in memory or did you experience memory pressure? I've had to do a fair amount of tuning when under memory pressure in the past (0.7.x) and was hoping that the handling of this scenario is improved in later Spark versions. On Thu, Mar 20, 2014 a

Re: Largest input data set observed for Spark.

2014-03-20 Thread Henry Saputra
Reynold, just curious, did you guys run it in AWS? - Henry On Thu, Mar 20, 2014 at 11:08 AM, Reynold Xin wrote: > Actually we just ran a job with 70TB+ compressed data on 28 worker nodes - > I didn't count the size of the uncompressed data, but I am guessing it is > somewhere between 200TB to 700

Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
I'm not really at liberty to discuss details of the job. It involves some expensive aggregated statistics, and took 10 hours to complete (mostly bottlenecked by network & I/O). On Thu, Mar 20, 2014 at 11:12 AM, Surendranauth Hiraman < suren.hira...@velos.io> wrote: > Reynold, > > How complex w

Re: Largest input data set observed for Spark.

2014-03-20 Thread Surendranauth Hiraman
Reynold, How complex was that job (I guess in terms of number of transforms and actions) and how long did that take to process? -Suren On Thu, Mar 20, 2014 at 2:08 PM, Reynold Xin wrote: > Actually we just ran a job with 70TB+ compressed data on 28 worker nodes - > I didn't count the size of

Re: Largest input data set observed for Spark.

2014-03-20 Thread Reynold Xin
Actually we just ran a job with 70TB+ compressed data on 28 worker nodes - I didn't count the size of the uncompressed data, but I am guessing it is somewhere between 200TB and 700TB. On Thu, Mar 20, 2014 at 12:23 AM, Usman Ghani wrote: > All, > What is the largest input data set y'all have com

Re: Spark AMI

2014-03-20 Thread Reynold Xin
It's mostly a stock CentOS installation with some scripts. On Thu, Mar 20, 2014 at 2:53 AM, Usman Ghani wrote: > Is there anything special about the Spark AMIs or are they just stock > CentOS installations? >

Spark AMI

2014-03-20 Thread Usman Ghani
Is there anything special about the Spark AMIs or are they just stock CentOS installations?

Re: how to config worker HA

2014-03-20 Thread qingyang li
I think I found the answer: apply(flags: Int, replication: Int): StorageLevel 2014-03-20 17:00 GM
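The `apply(flags: Int, replication: Int)` signature found above suggests a storage level is a set of option bits plus a replication count. A hypothetical sketch of that pattern in plain Python (the bit positions here are illustrative, not Spark's actual layout):

```python
# Illustrative bitmask model of a storage level: option flags packed
# into an int, plus a replication count. Flag positions are made up.
USE_DISK = 1 << 0
USE_MEMORY = 1 << 1
DESERIALIZED = 1 << 2

def make_storage_level(flags, replication):
    """Decode a flags int + replication count into named options."""
    return {
        "use_disk": bool(flags & USE_DISK),
        "use_memory": bool(flags & USE_MEMORY),
        "deserialized": bool(flags & DESERIALIZED),
        "replication": replication,
    }

# e.g. an in-memory, deserialized level replicated to 2 nodes:
level = make_storage_level(USE_MEMORY | DESERIALIZED, 2)
print(level)
```

Setting the replication count above 1 is the piece relevant to the worker-HA question in this thread: a replicated storage level keeps a second copy of each cached block on another node.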

Re: how to config worker HA

2014-03-20 Thread qingyang li
Can someone help me? 2014-03-12 21:26 GMT+08:00 qingyang li : > in addition: > on this site: > https://spark.apache.org/docs/0.9.0/scala-programming-guide.html#hadoop-datasets > , > I find an RDD can be stored using a different storage level on the web, > and also find StorageLevel's attribute

Re: Announcing the official Spark Job Server repo

2014-03-20 Thread andy petrella
Heya, That's cool you've already hacked something for this in the scripts! I have a related question: how would it actually work? I mean, to have this Job Server fault-tolerant using Marathon, I would guess that it will need to be itself a Mesos framework, and able to publish its resource needs.

[spark-streaming] Is this LocalInputDStream useful to someone?

2014-03-20 Thread Pascal Voitot Dev
Hi guys, In my recent blog post (http://mandubian.com/2014/03/08/zpark-ml-nio-1/), I needed an InputDStream helper looking like NetworkInputDStream to be able to push my data into a DStream in an async way. But I didn't want the remoting aspect, as my data source runs locally and nowhere else
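The pattern described above, a local push-based source feeding a stream consumer asynchronously with no remoting involved, can be sketched in plain Python using an in-process queue (hypothetical names, not the proposed LocalInputDStream code):

```python
# Sketch of an async local push source: a producer thread pushes items
# into an in-process queue; the consumer drains them as they arrive.
import queue
import threading

SENTINEL = object()

def producer(q):
    # Push data into the "stream" from a local source.
    for i in range(5):
        q.put(i)
    q.put(SENTINEL)  # signal end of stream

q = queue.Queue()
t = threading.Thread(target=producer, args=(q,))
t.start()

received = []
while True:
    item = q.get()
    if item is SENTINEL:
        break
    received.append(item)
t.join()
print(received)  # [0, 1, 2, 3, 4]
```

The point of the sketch is the decoupling: the producer never blocks on the consumer's pace, which is the async push behavior the post wanted without paying for a network receiver.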

Re: Largest input data set observed for Spark.

2014-03-20 Thread ligq
-- Original -- From: "Usman Ghani" Date: Thu, Mar 20, 2014 03:23 PM To: "user"; "dev" Subject: Largest input data set observed for Spark. All, What is the largest input data set y'all have come across that has been successfully proce

Largest input data set observed for Spark.

2014-03-20 Thread Usman Ghani
All, What is the largest input data set y'all have come across that has been successfully processed in production using Spark? Ballpark?