Re: High Availability of Spark Driver

2015-08-28 Thread Ashish Rawat
Thanks Steve. I had not spent many brain cycles on analysing the YARN pieces, so your insights would be extremely useful. I was also considering ZooKeeper and the YARN registry for persisting state and sharing information. But for a basic POC, I used the file system and was able to 1. Preserve

Re: Feedback: Feature request

2015-08-28 Thread Cody Koeninger
I wrote some code for this a while back, and I'm pretty sure it didn't need access to anything private in the decision tree / random forest model. If people want it added to the API I can put together a PR. I think it's important to have separately parseable operators / operands though. E.g

Re: Feedback: Feature request

2015-08-28 Thread Manish Amde
Sounds good. It's a request I have seen a few times in the past, and I have needed it personally. Maybe Joseph Bradley has something to add. I think a JIRA to capture this would be great. We can move this discussion to the JIRA then. On Friday, August 28, 2015, Cody Koeninger c...@koeninger.org

Re: High Availability of Spark Driver

2015-08-28 Thread Chester Chen
Ashish and Steve, I am also working on a long-running Spark job on YARN and have just started to focus on failure recovery. This thread is really helpful. Chester On Fri, Aug 28, 2015 at 12:53 AM, Ashish Rawat ashish.ra...@guavus.com wrote: Thanks Steve. I had not spent many brain

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Jon Bender
Marcelo, Thanks for replying. After looking at my test again, I realized I had misinterpreted another, unrelated issue I'm seeing (note that I'm not using a pre-built binary; I had to build my own with YARN/Hive support because I want to use it on an older cluster, CDH5.1.0). I can start up a pyspark

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Yin Huai
-1 Found a problem when reading partitioned tables. Right now, we may create a SQL project/filter operator for every partition. When we have thousands of partitions, there will be a huge number of SQLMetrics (accumulators), which puts high memory pressure on the driver and then takes down the

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Marcelo Vanzin
Hi Jonathan, Can you be more specific about what problem you're running into? SPARK-6869 fixed the issue of pyspark vs. assembly jar by shipping the pyspark archives separately to YARN. With that fix in place, pyspark doesn't need to get anything from the Spark assembly, so it has no problems

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Luciano Resende
The binary archives seem to have some issues, which appear consistent across the few different ones (built for different versions of Hadoop) that I tried. tar -xvf spark-1.5.0-bin-hadoop2.6.tgz x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar x

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-28 Thread Nicholas Chammas
Hi Everybody! Thanks for participating in the spark-ec2 survey. The full results are publicly viewable here: https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics The gist of the results is as follows: Most people found spark-ec2 useful as an easy way to

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Jonathan Bender
-1 for regression on PySpark + YARN support. It seems like this JIRA https://issues.apache.org/jira/browse/SPARK-7733 added a requirement for Java 7 in the build process. Due to some quirks with the Java archive format changes between Java 6 and 7, using PySpark with a YARN uberjar seems to break

IOError on createDataFrame

2015-08-28 Thread fsacerdoti
Hello, Similar to the thread below [1], when I tried to create an RDD from a 4GB pandas dataframe, I encountered the error TypeError: cannot create an RDD from type: type 'list'. However, looking into the code shows this is raised from a generic except Exception: predicate
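
The masking effect described in the message above can be illustrated with a minimal sketch in plain Python, with no Spark dependency. All names here (infer_schema, create_masked, create_narrow, caught) are hypothetical and are not PySpark's actual code; the simulated IOError only stands in for whatever the real underlying failure was.

```python
def infer_schema(rows):
    """Hypothetical helper that fails for a reason unrelated to the input type."""
    raise IOError("simulated I/O failure while processing rows")


def create_masked(rows):
    """Broad handler: any failure is reported as a type problem."""
    try:
        return infer_schema(rows)
    except Exception:
        # The IOError is swallowed here and replaced by a misleading message.
        raise TypeError("cannot create an RDD from type: %s" % type(rows))


def create_narrow(rows):
    """Narrow handler: only genuine type problems are translated."""
    try:
        return infer_schema(rows)
    except TypeError:
        raise TypeError("cannot create an RDD from type: %s" % type(rows))


def caught(fn, rows):
    """Return the exception raised by fn(rows), or None if it succeeds."""
    try:
        fn(rows)
    except Exception as exc:
        return exc
    return None


if __name__ == "__main__":
    # The broad handler surfaces a TypeError; the narrow one lets the
    # underlying I/O error reach the caller unchanged.
    print(type(caught(create_masked, [1, 2, 3])).__name__)
    print(type(caught(create_narrow, [1, 2, 3])).__name__)
```

Under these assumptions, narrowing the except clause to the error the handler is meant to translate lets unrelated failures propagate with their original type and message.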