Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
Hi Sandy, Any resolution for the YARN failures? It's a blocker for running Spark on top of YARN. Thanks. Deb On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, I think this may be the same issue as described in https://issues.apache.org/jira/browse/SPARK-2121 . We

Re: [mllib] Add multiplying large scale matrices

2014-09-09 Thread 顾荣
Hi All, Sorry for my late reply! Yu Ishikawa, thanks for your interest in the Saury project. You are welcome to try it out. If you have questions about it, please email me. We keep improving performance and adding features for the project. Xiangrui, thanks for your encouragement. If you

parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
I've been looking at performance differences between Spark SQL queries against single Parquet tables vs. a unionAll of two tables. It's a significant difference, like 5 to 10x. Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason in general not to push projections and predicates down into the individual ParquetTableScans in a union? This would be a great case to add to ColumnPruning. Would be awesome if you could open a JIRA

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
Hi Deb, The current state of the art is to increase spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to automatically scale this based on the amount of memory requested, but it will still just be a heuristic. -Sandy On Tue, Sep 9, 2014 at 7:32 AM,

Re: RFC: Deprecating YARN-alpha API's

2014-09-09 Thread Sean Owen
FWIW consensus from Cloudera folk seems to be that there's no need or demand on this end for YARN alpha. It wouldn't have an impact if it were removed sooner even. It will be a small positive to reduce complexity by removing this support, making it a little easier to develop for current YARN

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Sep 9, 2014 at 10:17 AM, Cody Koeninger c...@koeninger.org wrote: Is there a reason in

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org wrote: Opened https://issues.apache.org/jira/browse/SPARK-3462 I'll take a look at ColumnPruning and see what I can do On Tue, Sep 9, 2014 at 12:46 PM, Michael Armbrust mich...@databricks.com wrote: On Tue, Sep

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Gary Malouf
I'm kind of surprised this was not run into before. Do people not segregate their data by day/week in the HDFS directory structure? On Tue, Sep 9, 2014 at 2:08 PM, Michael Armbrust mich...@databricks.com wrote: Thanks! On Tue, Sep 9, 2014 at 11:07 AM, Cody Koeninger c...@koeninger.org

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
I think usually people add these directories as multiple partitions of the same table instead of union. This actually allows us to efficiently prune directories when reading in addition to standard column pruning. On Tue, Sep 9, 2014 at 11:26 AM, Gary Malouf malouf.g...@gmail.com wrote: I'm

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
Hmm... I did try increasing it to a few GB but have not had a successful run yet. Any idea, if I am using say 40 executors, each running 16 GB, what the typical spark.yarn.executor.memoryOverhead is for say 100M x 10M matrices with a few billion ratings? On Tue, Sep 9, 2014 at 10:49 AM,

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
Maybe I'm missing something; I thought Parquet was generally a write-once format, and the sqlContext interface to it seems that way as well.
    d1.saveAsParquetFile("/foo/d1")
    // another day, another table, with same schema
    d2.saveAsParquetFile("/foo/d2")
Will give a directory structure like

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Patrick Wendell
I think what Michael means is people often use this to read existing partitioned Parquet tables that are defined in a Hive metastore rather than data generated directly from within Spark and then reading it back as a table. I'd expect the latter case to become more common, but for now most users

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Sandy Ryza
I would expect 2 GB would be enough or more than enough for 16 GB executors (unless ALS is using a bunch of off-heap memory?). You mentioned earlier in this thread that the property wasn't showing up in the Environment tab. Are you sure it's making it in? -Sandy On Tue, Sep 9, 2014 at 11:58
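To put the numbers from this thread side by side: the sketch below assumes that each executor's YARN container request is simply the executor heap plus spark.yarn.executor.memoryOverhead (both in megabytes), and it ignores any rounding YARN may apply to container sizes. The class and method names are hypothetical and are not part of Spark.

```java
// Sketch only: YarnContainerSizing is a hypothetical helper, not a Spark class.
// All sizes are in megabytes.
final class YarnContainerSizing {

    // Memory requested per executor container: JVM heap plus the overhead allowance.
    static long containerMb(long executorMemoryMb, long memoryOverheadMb) {
        return executorMemoryMb + memoryOverheadMb;
    }

    // Memory requested across all executors of the job.
    static long totalMb(int executors, long executorMemoryMb, long memoryOverheadMb) {
        return executors * containerMb(executorMemoryMb, memoryOverheadMb);
    }

    public static void main(String[] args) {
        long heapMb = 16L * 1024;     // 16 GB executors, as in Deb's question
        long overheadMb = 2L * 1024;  // 2 GB overhead, as Sandy suggests
        System.out.println("per container MB: " + containerMb(heapMb, overheadMb));
        System.out.println("40 executors MB:  " + totalMb(40, heapMb, overheadMb));
    }
}
```

Under these assumptions, 40 executors at 16 GB heap and 2 GB overhead ask the cluster for 18 GB per container, 720 GB in total. If a container exceeds its request, YARN kills it, which would show up as a lost executor.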

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Michael Armbrust
What Patrick said is correct. Two other points:
- In the 1.2 release we are hoping to beef up the support for working with partitioned Parquet independent of the metastore.
- You can actually do operations like INSERT INTO for Parquet tables to add data. This creates new Parquet files for each

yet another jenkins restart early thursday morning -- 730am PDT (and a brief update on our new jenkins infra)

2014-09-09 Thread shane knapp
since the power incident last thursday, the github pull request builder plugin is still not really working 100%. i found an open issue w/jenkins[1] that could definitely be affecting us, i will be pausing builds early thursday morning and then restarting jenkins. i'll send out a reminder

Re: Lost executor on YARN ALS iterations

2014-09-09 Thread Debasish Das
Last time it did not show up on the Environment tab, but I will give it another shot. The expected behavior is that this env variable will show up, right? On Tue, Sep 9, 2014 at 12:15 PM, Sandy Ryza sandy.r...@cloudera.com wrote: I would expect 2 GB would be enough or more than enough for 16 GB

Re: RFC: Deprecating YARN-alpha API's

2014-09-09 Thread Chester Chen
We were using it until recently; we are talking to our customers to see if we can get off it. Chester Alpine Data Labs On Tue, Sep 9, 2014 at 10:59 AM, Sean Owen so...@cloudera.com wrote: FWIW consensus from Cloudera folk seems to be that there's no need or demand on this end for YARN

Re: parquet predicate / projection pushdown into unionAll

2014-09-09 Thread Cody Koeninger
Ok, so looking at the optimizer code for the first time and trying the simplest rule that could possibly work,
    object UnionPushdown extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        // Push down filter into union
        case f @ Filter(condition, u @
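The idea behind the rule quoted above, rewriting a filter over a union into a union of filtered children so that each scan can apply the predicate itself, can be illustrated with a toy plan tree. The sketch below is written in plain Java with hypothetical classes (ToyPlan, ToyScan, ToyUnion, ToyFilter); it is not Catalyst and not the code from the message. A real rule would presumably also have to remap column references for each child of the union, which this sketch skips.

```java
// Toy logical plan; these classes are illustrative stand-ins, not Spark's Catalyst API.
abstract class ToyPlan {}

final class ToyScan extends ToyPlan {
    final String table;
    ToyScan(String table) { this.table = table; }
    @Override public String toString() { return "Scan(" + table + ")"; }
}

final class ToyUnion extends ToyPlan {
    final ToyPlan left;
    final ToyPlan right;
    ToyUnion(ToyPlan left, ToyPlan right) { this.left = left; this.right = right; }
    @Override public String toString() { return "Union(" + left + ", " + right + ")"; }
}

final class ToyFilter extends ToyPlan {
    final String condition;
    final ToyPlan child;
    ToyFilter(String condition, ToyPlan child) { this.condition = condition; this.child = child; }
    @Override public String toString() { return "Filter(" + condition + ", " + child + ")"; }
}

final class UnionPushdownSketch {
    // Rewrites Filter(c, Union(l, r)) into Union(Filter(c, l), Filter(c, r)), recursively.
    static ToyPlan pushDown(ToyPlan plan) {
        if (plan instanceof ToyFilter) {
            ToyFilter f = (ToyFilter) plan;
            ToyPlan child = pushDown(f.child);
            if (child instanceof ToyUnion) {
                ToyUnion u = (ToyUnion) child;
                // Copy the filter onto both sides and keep pushing in case of nested unions.
                return new ToyUnion(
                        pushDown(new ToyFilter(f.condition, u.left)),
                        pushDown(new ToyFilter(f.condition, u.right)));
            }
            return new ToyFilter(f.condition, child);
        }
        if (plan instanceof ToyUnion) {
            ToyUnion u = (ToyUnion) plan;
            return new ToyUnion(pushDown(u.left), pushDown(u.right));
        }
        return plan; // scans are leaves and are returned unchanged
    }

    public static void main(String[] args) {
        ToyPlan before = new ToyFilter("day = 1",
                new ToyUnion(new ToyScan("d1"), new ToyScan("d2")));
        // prints Union(Filter(day = 1, Scan(d1)), Filter(day = 1, Scan(d2)))
        System.out.println(pushDown(before));
    }
}
```

Once the filter sits directly above each scan, a scan-level optimization has a chance to apply it per table, which is the effect the thread is after.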

Junit spark tests

2014-09-09 Thread Sudershan Malpani
Hi all, I am calling an object which in turn calls a method inside a map over an RDD in Spark. While writing the tests, how can I mock that object's call? Currently I use doNothing().when(class).method(), but it gives a task not serializable exception. I tried making the class both spy

Re: Junit spark tests

2014-09-09 Thread Reynold Xin
Can you be a little bit more specific, maybe give a code snippet? On Tue, Sep 9, 2014 at 5:14 PM, Sudershan Malpani sudershan.malp...@gmail.com wrote: Hi all, I am calling an object which in turn is calling a method inside a map RDD in spark. While writing the tests how can I mock that

Re: Junit spark tests

2014-09-09 Thread Sudershan Malpani
Class1.java
    @Autowired
    private ClassX cx;
    public list method1(JavaPairRDD data){
      List list1 = new ArrayList();
      List list2 = new ArrayList();
      JavaPairRDD computed = data.map( new FunctionTuple2object, list() {
        public List call(object obj) throws
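One common cause of the "task not serializable" error described in this thread is that the function passed to map captures an object (here possibly the autowired ClassX, or a mock of it) that does not implement java.io.Serializable. The sketch below uses only the JDK and hypothetical class names (PlainHelper, SerializableHelper, Task) to show how Java serialization rejects such a capture. It illustrates the mechanism and is not a diagnosis of the poster's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Sketch only: all class names here are hypothetical stand-ins.
final class ClosureSerializationSketch {

    // Stand-in for a collaborator (or mock) that is not serializable.
    static class PlainHelper {
        int twice(int x) { return 2 * x; }
    }

    // Same collaborator, but marked Serializable.
    static class SerializableHelper implements Serializable {
        int twice(int x) { return 2 * x; }
    }

    // Stand-in for a function object shipped to executors: serializable itself,
    // but it carries a reference to whatever helper it captured.
    static class Task implements Serializable {
        final Object helper;
        Task(Object helper) { this.helper = helper; }
    }

    // Returns true if Java serialization accepts the whole object graph.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (NotSerializableException e) {
            return false; // some object reachable from o is not Serializable
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new Task(new PlainHelper())));        // prints false
        System.out.println(canSerialize(new Task(new SerializableHelper()))); // prints true
    }
}
```

If this is the cause, two options worth trying are creating the mock with Mockito's withSettings().serializable(), or constructing the collaborator inside the function rather than capturing a field of the enclosing class.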