Thanks Steve. I had not spent many brain cycles on analysing the Yarn pieces,
so your insights would be extremely useful.
I was also considering ZooKeeper and the YARN registry for persisting state and
sharing information. But for a basic POC, I used the file system and was able to
1. Preserve
I wrote some code for this a while back; I'm pretty sure it didn't need access
to anything private in the decision tree / random forest model. If people
want it added to the API, I can put together a PR.
I think it's important to have separately parseable operators / operands
though. E.g
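To illustrate the "separately parseable operators / operands" point, here is a minimal sketch (the class and field names are illustrative, not the actual MLlib API): each split predicate keeps its feature index, operator, and threshold as separate fields, so a consumer can inspect them without re-parsing a rendered string like "feature_3 <= 0.5".

```python
from dataclasses import dataclass


@dataclass
class SplitPredicate:
    """A hypothetical structured representation of one tree split."""
    feature: int      # operand: feature index
    operator: str     # e.g. "<=" for continuous, "in" for categorical
    threshold: float  # operand: comparison value

    def to_string(self) -> str:
        # The string form is derived, not the source of truth.
        return f"feature_{self.feature} {self.operator} {self.threshold}"


p = SplitPredicate(feature=3, operator="<=", threshold=0.5)
print(p.operator, p.threshold)  # fields remain separately accessible
print(p.to_string())
```

The point is that consumers who want the rendered form can still get it, while tools that need the operator or operand alone never have to parse it back out.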
Sounds good. It's a request I have seen a few times in the past and have
needed personally. Maybe Joseph Bradley has something to add.
I think a JIRA to capture this would be great. We can move this discussion
to the JIRA then.
On Friday, August 28, 2015, Cody Koeninger c...@koeninger.org
Ashish and Steve,
I am also working on a long-running Spark job on YARN, and have just started to
focus on failure recovery. This thread of discussion is really helpful.
Chester
On Fri, Aug 28, 2015 at 12:53 AM, Ashish Rawat ashish.ra...@guavus.com
wrote:
Marcelo,
Thanks for replying -- after looking at my test again, I misinterpreted
another, unrelated issue I'm seeing (note I'm not using a pre-built
binary; I had to build my own with YARN/Hive support, as I want to use
it on an older cluster (CDH 5.1.0)).
I can start up a pyspark
-1
Found a problem when reading a partitioned table. Right now, we may create a
SQL project/filter operator for every partition. When we have thousands of
partitions, there will be a huge number of SQLMetrics (accumulators), which
causes high memory pressure on the driver and then takes down the
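The scaling problem described above can be made concrete with back-of-the-envelope arithmetic (the per-operator metric count and operator mix here are assumptions for illustration, not figures from the report):

```python
# Illustrative arithmetic only (not Spark code): why creating a
# project/filter operator per partition inflates the number of
# driver-side SQLMetrics accumulators.
metrics_per_operator = 3       # assumed: e.g. row count, size, time
operators_per_partition = 2    # one project + one filter, per the report
partitions = 10_000

# One operator pair per partition: every partition adds fresh accumulators.
per_partition_total = partitions * operators_per_partition * metrics_per_operator

# A single shared operator pair: the count is independent of partitions.
shared_total = operators_per_partition * metrics_per_operator

print(per_partition_total)  # 60000 accumulators tracked on the driver
print(shared_total)         # 6
```

Since every accumulator lives on the driver, the per-partition scheme grows linearly with partition count, which matches the reported memory pressure at thousands of partitions.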
Hi Jonathan,
Can you be more specific about what problem you're running into?
SPARK-6869 fixed the issue of pyspark vs. assembly jar by shipping the
pyspark archives separately to YARN. With that fix in place, pyspark
doesn't need to get anything from the Spark assembly, so it has no
problems
The binary archives seem to have some issues, which appear consistent across a
few of the different ones (different versions of Hadoop) that I tried.
tar -xvf spark-1.5.0-bin-hadoop2.6.tgz
x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
x
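One quick way to tell a corrupt download from an extraction quirk is to walk the whole archive programmatically before extracting. A small sketch (the filename from the thread would be passed in; nothing here is Spark-specific):

```python
import tarfile


def archive_ok(path: str) -> bool:
    """Return True if every entry of a .tgz can be listed without error.

    A truncated or corrupt gzip stream raises partway through the walk,
    so getmembers() acts as a full-archive integrity check.
    """
    try:
        with tarfile.open(path, "r:gz") as tar:
            tar.getmembers()  # forces reading every entry header
        return True
    except (OSError, EOFError, tarfile.TarError):
        return False
```

For example, `archive_ok("spark-1.5.0-bin-hadoop2.6.tgz")` returning False would point at a bad download rather than a tar version issue.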
Hi Everybody!
Thanks for participating in the spark-ec2 survey. The full results are
publicly viewable here:
https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics
The gist of the results is as follows:
Most people found spark-ec2 useful as an easy way to
-1 for regression on PySpark + YARN support
It seems like this JIRA https://issues.apache.org/jira/browse/SPARK-7733
added a requirement for Java 7 in the build process. Due to some quirks
with the Java archive format changes between Java 6 and 7, using PySpark
with a YARN uberjar seems to break
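As I understand the quirk being referenced, jars produced under Java 7 can use the ZIP64 extension (e.g. with very large archives or many entries), which older Python zip handling could not read. A hedged sanity check from the Python side, using the standard `zipfile` module (zipimport is stricter, but a failure here usually means PySpark cannot load modules from the jar either):

```python
import zipfile


def jar_readable(path: str) -> bool:
    """Return True if Python's zipfile can open the jar and list entries.

    A jar is just a zip archive, so this approximates whether Python-side
    code could read modules out of it at all.
    """
    try:
        with zipfile.ZipFile(path) as jar:
            return len(jar.namelist()) > 0
    except (OSError, zipfile.BadZipFile):
        return False
```

Running this against the assembly jar would quickly confirm whether the breakage is at the archive-format level rather than somewhere in PySpark itself.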
Hello,
Similar to the thread below [1], when I tried to create an RDD from a 4 GB
pandas DataFrame, I encountered the error
TypeError: cannot create an RDD from type: type 'list'
However, looking into the code shows this is raised from a generic except
Exception: predicate
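The failure mode described, where a broad except clause replaces the real error with a generic one, can be sketched in pure Python (the function and message below are illustrative, not Spark's actual code):

```python
def create_rdd_like(data):
    """Hypothetical sketch: a broad `except Exception` that swallows the
    real failure and re-raises a generic, misleading TypeError."""
    try:
        first = data[0]  # the real failure could happen anywhere in here
        return [first] + list(data[1:])
    except Exception:
        # The original cause is discarded; callers only ever see this
        # message, whatever actually went wrong above.
        raise TypeError("cannot create an RDD from type: %s" % type(data))
```

Calling this with a non-subscriptable object (say, a generator) really fails on `data[0]`, but the caller sees only the generic "cannot create an RDD" TypeError, which matches the confusing message reported above.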