Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Yin Huai
-1

Found a problem when reading partitioned tables. Right now, we may create a
SQL project/filter operator for every partition. When we have thousands of
partitions, there will be a huge number of SQLMetrics (accumulators), which
puts high memory pressure on the driver and then takes down the cluster
(long GC times cause various kinds of timeouts).

https://issues.apache.org/jira/browse/SPARK-10339

Will have a fix soon.
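
A minimal way to reproduce the symptom, as a sketch only (the table name and partition column are made up for illustration; the actual report is in SPARK-10339):

    # PySpark sketch: a scan over a table with thousands of partitions can create
    # one project/filter operator -- and its SQLMetrics accumulators -- per
    # partition, which piles up on the driver.
    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)  # assumes an existing SparkContext `sc`
    df = sqlContext.table("events_partitioned")  # e.g. partitioned by `dt` into ~10k partitions
    df.filter(df.dt >= "2015-01-01").count()     # watch driver heap / GC while this runs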

On Fri, Aug 28, 2015 at 3:18 PM, Jon Bender 
wrote:

> Marcelo,
>
> Thanks for replying -- after looking at my test again, I realized I had
> misinterpreted another, unrelated issue I'm seeing (note I'm not using a
> pre-built binary; I had to build my own with YARN/Hive support, as I want to
> use it on an older cluster (CDH 5.1.0)).
>
> I can start up a pyspark app on YARN, so I don't want to block this.  +1
>
> Best,
> Jonathan
>
> On Fri, Aug 28, 2015 at 2:34 PM, Marcelo Vanzin 
> wrote:
>
>> Hi Jonathan,
>>
>> Can you be more specific about what problem you're running into?
>>
>> SPARK-6869 fixed the issue of pyspark vs. assembly jar by shipping the
>> pyspark archives separately to YARN. With that fix in place, pyspark
>> doesn't need to get anything from the Spark assembly, so it has no
>> problems running on YARN. I just downloaded
>> spark-1.5.0-bin-hadoop2.6.tgz and tried that out, and pyspark works
>> fine on YARN for me.
>>
>>
>> On Fri, Aug 28, 2015 at 2:22 PM, Jonathan Bender
>>  wrote:
>> > -1 for regression on PySpark + YARN support
>> >
>> > It seems like this JIRA
>> https://issues.apache.org/jira/browse/SPARK-7733
>> > added a requirement for Java 7 in the build process.  Due to some quirks
>> > with the Java archive format changes between Java 6 and 7, using PySpark
>> > with a YARN uberjar seems to break when compiled with anything after
>> Java 6
>> > (see https://issues.apache.org/jira/browse/SPARK-1920 for reference).
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13890.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>>
>>
>> --
>> Marcelo
>>
>
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Jon Bender
Marcelo,

Thanks for replying -- after looking at my test again, I realized I had
misinterpreted another, unrelated issue I'm seeing (note I'm not using a
pre-built binary; I had to build my own with YARN/Hive support, as I want to
use it on an older cluster (CDH 5.1.0)).

I can start up a pyspark app on YARN, so I don't want to block this.  +1

Best,
Jonathan

On Fri, Aug 28, 2015 at 2:34 PM, Marcelo Vanzin  wrote:

> Hi Jonathan,
>
> Can you be more specific about what problem you're running into?
>
> SPARK-6869 fixed the issue of pyspark vs. assembly jar by shipping the
> pyspark archives separately to YARN. With that fix in place, pyspark
> doesn't need to get anything from the Spark assembly, so it has no
> problems running on YARN. I just downloaded
> spark-1.5.0-bin-hadoop2.6.tgz and tried that out, and pyspark works
> fine on YARN for me.
>
>
> On Fri, Aug 28, 2015 at 2:22 PM, Jonathan Bender
>  wrote:
> > -1 for regression on PySpark + YARN support
> >
> > It seems like this JIRA https://issues.apache.org/jira/browse/SPARK-7733
> > added a requirement for Java 7 in the build process.  Due to some quirks
> > with the Java archive format changes between Java 6 and 7, using PySpark
> > with a YARN uberjar seems to break when compiled with anything after
> Java 6
> > (see https://issues.apache.org/jira/browse/SPARK-1920 for reference).
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13890.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
>
> --
> Marcelo
>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Shivaram Venkataraman
I've seen similar tar file warnings, and in my case it was because I
was using the default tar on a MacBook. Using gnu-tar from Homebrew made
the warnings go away.

Thanks
Shivaram

On Fri, Aug 28, 2015 at 2:37 PM, Luciano Resende  wrote:
> The binary archives seem to have some issues, which appear consistent
> across a few of the different ones (different versions of Hadoop) that I tried.
>
>  tar -xvf spark-1.5.0-bin-hadoop2.6.tgz
>
> x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
> x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
> x spark-1.5.0-bin-hadoop2.6/README.md
> tar: copyfile unpack
> (spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc)
> failed: No such file or directory
>
> tar tzf spark-1.5.0-bin-hadoop2.3.tgz | grep SUCCESS.crc
> spark-1.5.0-bin-hadoop2.3/python/test_support/sql/orc_partitioned/._SUCCESS.crc
>
> This seems similar to a problem Avro release was having recently.
>
>
> On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin  wrote:
>>
>> Please vote on releasing the following candidate as Apache Spark version
>> 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes if
>> a majority of at least 3 +1 PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Spark 1.5.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>>
>> The tag to be voted on is v1.5.0-rc2:
>>
>> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>>
>> The release files, including signatures, digests, etc. can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>>
>> Release artifacts are signed with the following key:
>> https://people.apache.org/keys/committer/pwendell.asc
>>
>> The staging repository for this release (published as 1.5.0-rc2) can be
>> found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1141/
>>
>> The staging repository for this release (published as 1.5.0) can be found
>> at:
>> https://repository.apache.org/content/repositories/orgapachespark-1140/
>>
>> The documentation corresponding to this release can be found at:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>>
>>
>> ===
>> How can I help test this release?
>> ===
>> If you are a Spark user, you can help us test this release by taking an
>> existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>>
>> 
>> What justifies a -1 vote for this release?
>> 
>> This vote is happening towards the end of the 1.5 QA period, so -1 votes
>> should only occur for significant regressions from 1.4. Bugs already present
>> in 1.4, minor regressions, or bugs related to new features will not block
>> this release.
>>
>>
>> ===
>> What should happen to JIRA tickets still targeting 1.5.0?
>> ===
>> 1. It is OK for documentation patches to target 1.5.0 and still go into
>> branch-1.5, since documentation will be packaged separately from the
>> release.
>> 2. New features for non-alpha-modules should target 1.6+.
>> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
>> version.
>>
>>
>> ==
>> Major changes to help you focus your testing
>> ==
>>
>> As of today, Spark 1.5 contains more than 1000 commits from 220+
>> contributors. I've curated a list of important changes for 1.5. For the
>> complete list, please refer to Apache JIRA changelog.
>>
>> RDD/DataFrame/SQL APIs
>>
>> - New UDAF interface
>> - DataFrame hints for broadcast join
>> - expr function for turning a SQL expression into DataFrame column
>> - Improved support for NaN values
>> - StructType now supports ordering
>> - TimestampType precision is reduced to 1us
>> - 100 new built-in expressions, including date/time, string, math
>> - memory and local disk only checkpointing
>>
>> DataFrame/SQL Backend Execution
>>
>> - Code generation on by default
>> - Improved join, aggregation, shuffle, sorting with cache friendly
>> algorithms and external algorithms
>> - Improved window function performance
>> - Better metrics instrumentation and reporting for DF/SQL execution plans
>>
>> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>>
>> - Dynamic allocation support in all resource managers (Mesos, YARN,
>> Standalone)
>> - Improved Mesos support (framework authentication, roles, dynamic
>> allocation, constraints)
>> - Improved YARN support (dynamic allocation with preferred locations)
>> - Improved Hive support 

Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-28 Thread Nicholas Chammas
Hi Everybody!

Thanks for participating in the spark-ec2 survey. The full results are
publicly viewable here:

https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics

The gist of the results is as follows:

Most people found spark-ec2 useful as an easy way to get a working Spark
cluster to run a quick experiment or do some benchmarking without having to
do a lot of manual configuration or setup work.

Many people lamented the slow launch times of spark-ec2, problems getting
it to launch clusters within a VPC, and broken Ganglia installs. Some also
mentioned that Hadoop 2 didn't work as expected.

Wish list items for spark-ec2 included faster launches, selectable Hadoop 2
versions, and more configuration options.

If you'd like to add your own feedback to what's already there, I've
decided to leave the survey open for a few more days:

http://goo.gl/forms/erct2s6KRR

As noted before, your results are anonymous and public.

Thanks again for participating! I hope this has been useful to the
community.

Nick

On Tue, Aug 25, 2015 at 1:31 PM Nicholas Chammas 
wrote:

> Final chance to fill out the survey!
>
> http://goo.gl/forms/erct2s6KRR
>
> I'm gonna close it to new responses tonight and send out a summary of the
> results.
>
> Nick
>
> On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I'm planning to close the survey to further responses early next week.
>>
>> If you haven't chimed in yet, the link to the survey is here:
>>
>> http://goo.gl/forms/erct2s6KRR
>>
>> We already have some great responses, which you can view. I'll share a
>> summary after the survey is closed.
>>
>> Cheers!
>>
>> Nick
>>
>>
>> On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Howdy folks!
>>>
>>> I’m interested in hearing about what people think of spark-ec2
>>>  outside of the
>>> formal JIRA process. Your answers will all be anonymous and public.
>>>
>>> If the embedded form below doesn’t work for you, you can use this link
>>> to get the same survey:
>>>
>>> http://goo.gl/forms/erct2s6KRR
>>>
>>> Cheers!
>>> Nick
>>> ​
>>>
>>


Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Luciano Resende
The binary archives seem to have some issues, which appear consistent
across a few of the different ones (different versions of Hadoop) that I tried.

 tar -xvf spark-1.5.0-bin-hadoop2.6.tgz

x spark-1.5.0-bin-hadoop2.6/lib/spark-examples-1.5.0-hadoop2.6.0.jar
x spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar
x spark-1.5.0-bin-hadoop2.6/lib/spark-1.5.0-yarn-shuffle.jar
x spark-1.5.0-bin-hadoop2.6/README.md
tar: copyfile unpack
(spark-1.5.0-bin-hadoop2.6/python/test_support/sql/orc_partitioned/SUCCESS.crc)
failed: No such file or directory

tar tzf spark-1.5.0-bin-hadoop2.3.tgz | grep SUCCESS.crc
spark-1.5.0-bin-hadoop2.3/python/test_support/sql/orc_partitioned/._SUCCESS.crc

This seems similar to a problem the Avro release was having recently.
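
For reference, a quick way to check an archive for those AppleDouble ("._*") entries from Python (the script and the local path are illustrative, not part of the release tooling):

    import tarfile

    # List any AppleDouble resource-fork entries bundled into the tarball.
    with tarfile.open("spark-1.5.0-bin-hadoop2.6.tgz", "r:gz") as tgz:
        apple_double = [m.name for m in tgz.getmembers()
                        if m.name.rsplit("/", 1)[-1].startswith("._")]
        print("\n".join(apple_double) or "no AppleDouble entries found")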


On Tue, Aug 25, 2015 at 9:28 PM, Reynold Xin  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.5.0. The vote is open until Friday, Aug 29, 2015 at 5:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.5.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
>
> The tag to be voted on is v1.5.0-rc2:
>
> https://github.com/apache/spark/tree/727771352855dbb780008c449a877f5aaa5fc27a
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release (published as 1.5.0-rc2) can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1141/
>
> The staging repository for this release (published as 1.5.0) can be found
> at:
> https://repository.apache.org/content/repositories/orgapachespark-1140/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc2-docs/
>
>
> ===
> How can I help test this release?
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
>
> 
> What justifies a -1 vote for this release?
> 
> This vote is happening towards the end of the 1.5 QA period, so -1 votes
> should only occur for significant regressions from 1.4. Bugs already
> present in 1.4, minor regressions, or bugs related to new features will not
> block this release.
>
>
> ===
> What should happen to JIRA tickets still targeting 1.5.0?
> ===
> 1. It is OK for documentation patches to target 1.5.0 and still go into
> branch-1.5, since documentation will be packaged separately from the
> release.
> 2. New features for non-alpha-modules should target 1.6+.
> 3. Non-blocker bug fixes should target 1.5.1 or 1.6.0, or drop the target
> version.
>
>
> ==
> Major changes to help you focus your testing
> ==
>
> As of today, Spark 1.5 contains more than 1000 commits from 220+
> contributors. I've curated a list of important changes for 1.5. For the
> complete list, please refer to Apache JIRA changelog.
>
> RDD/DataFrame/SQL APIs
>
> - New UDAF interface
> - DataFrame hints for broadcast join
> - expr function for turning a SQL expression into DataFrame column
> - Improved support for NaN values
> - StructType now supports ordering
> - TimestampType precision is reduced to 1us
> - 100 new built-in expressions, including date/time, string, math
> - memory and local disk only checkpointing
>
> DataFrame/SQL Backend Execution
>
> - Code generation on by default
> - Improved join, aggregation, shuffle, sorting with cache friendly
> algorithms and external algorithms
> - Improved window function performance
> - Better metrics instrumentation and reporting for DF/SQL execution plans
>
> Data Sources, Hive, Hadoop, Mesos and Cluster Management
>
> - Dynamic allocation support in all resource managers (Mesos, YARN,
> Standalone)
> - Improved Mesos support (framework authentication, roles, dynamic
> allocation, constraints)
> - Improved YARN support (dynamic allocation with preferred locations)
> - Improved Hive support (metastore partition pruning, metastore
> connectivity to 0.13 to 1.2, internal Hive upgrade to 1.2)
> - Support persisting data in Hive compatible format in metastore
> - Support data partitioning for JSON data sources
> - Parquet improvements (upgrade to 1.7, predicate pushdown, faster
> metadata discovery and schema merging, support reading non-standard legacy

Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Marcelo Vanzin
Hi Jonathan,

Can you be more specific about what problem you're running into?

SPARK-6869 fixed the issue of pyspark vs. assembly jar by shipping the
pyspark archives separately to YARN. With that fix in place, pyspark
doesn't need to get anything from the Spark assembly, so it has no
problems running on YARN. I just downloaded
spark-1.5.0-bin-hadoop2.6.tgz and tried that out, and pyspark works
fine on YARN for me.
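
For anyone who wants to repeat that check, a minimal smoke test along these lines (the script name, submit flags, and job are illustrative, not the exact commands used above):

    # Submit with something like:  ./bin/spark-submit --master yarn-client pi_smoke.py
    from random import random

    from pyspark import SparkContext

    sc = SparkContext(appName="pyspark-yarn-smoke-test")
    n = 100000
    # Monte Carlo estimate of pi, just to exercise executors on YARN.
    inside = sc.parallelize(range(n)).map(
        lambda _: 1 if random() ** 2 + random() ** 2 < 1.0 else 0).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % (4.0 * inside / n))
    sc.stop()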


On Fri, Aug 28, 2015 at 2:22 PM, Jonathan Bender
 wrote:
> -1 for regression on PySpark + YARN support
>
> It seems like this JIRA https://issues.apache.org/jira/browse/SPARK-7733
> added a requirement for Java 7 in the build process.  Due to some quirks
> with the Java archive format changes between Java 6 and 7, using PySpark
> with a YARN uberjar seems to break when compiled with anything after Java 6
> (see https://issues.apache.org/jira/browse/SPARK-1920 for reference).
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13890.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>



-- 
Marcelo

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: [VOTE] Release Apache Spark 1.5.0 (RC2)

2015-08-28 Thread Jonathan Bender
-1 for regression on PySpark + YARN support

It seems like this JIRA https://issues.apache.org/jira/browse/SPARK-7733
added a requirement for Java 7 in the build process.  Due to some quirks
with the Java archive format changes between Java 6 and 7, using PySpark
with a YARN uberjar seems to break when compiled with anything after Java 6
(see https://issues.apache.org/jira/browse/SPARK-1920 for reference).
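
A quick diagnostic for that class of failure, as a sketch only (the jar path is an assumption): older PySpark-on-YARN setups import pyspark straight from the assembly jar on PYTHONPATH, so if Python's zipimport cannot read the assembly, those imports break.

    import zipimport

    assembly = "lib/spark-assembly-1.5.0-hadoop2.6.0.jar"  # adjust to your build
    try:
        zipimport.zipimporter(assembly)
        print("zipimport can read the assembly")
    except zipimport.ZipImportError as e:
        print("zipimport failed: %s" % e)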



--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-5-0-RC2-tp13826p13890.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



IOError on createDataFrame

2015-08-28 Thread fsacerdoti
Hello,

Similar to the thread below [1], when I tried to create an RDD from a 4GB
pandas dataframe I encountered the error

TypeError: cannot create an RDD from type: 

However, looking into the code shows this is raised from a generic "except
Exception:" clause (pyspark/sql/context.py:238 in spark-1.4.1). A
debugging session reveals the true error: SPARK_LOCAL_DIRS ran out of
space:

-> rdd = self._sc.parallelize(data)
(Pdb) 
*IOError: (28, 'No space left on device')*

In this case, creating an RDD from a large matrix (~50 million rows) is required
for us. I'm a bit concerned about Spark's process here:

   a. turning the dataframe into records (data.to_records)
   b. writing it to tmp
   c. reading it back again in scala.

Is there a better way? The intention would be to operate on slices of this
large dataframe using numpy operations via spark's transformations and
actions.

Thanks,
FDS
 
1. https://www.mail-archive.com/user@spark.apache.org/msg35139.html
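
One workaround for the "is there a better way?" question above, as a sketch only (it assumes the pandas frame `df` and the `sqlContext` from the original session; the paths and HDFS location are made up): persist the frame to a shared filesystem and let the executors read it, instead of shipping 4GB of records through the driver's temp directory.

    # 1. Dump the pandas DataFrame as newline-delimited JSON records
    #    (needs a pandas version whose to_json supports the `lines` argument).
    df.to_json("/tmp/big_frame.json", orient="records", lines=True)

    # 2. Copy the file somewhere all executors can read, e.g.:
    #      hdfs dfs -put /tmp/big_frame.json /data/big_frame.json

    # 3. Load it as a distributed DataFrame instead of parallelizing from the driver.
    sdf = sqlContext.read.json("hdfs:///data/big_frame.json")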





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/IOError-on-createDataFrame-tp13888.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: High Availability of Spark Driver

2015-08-28 Thread Chester Chen
Ashish and Steve
 I am also working on a long-running YARN Spark job and have just started to
focus on failure recovery. This thread of discussion is really helpful.

Chester

On Fri, Aug 28, 2015 at 12:53 AM, Ashish Rawat 
wrote:

> Thanks Steve. I had not spent many brain cycles on analysing the Yarn
> pieces; your insights would be extremely useful.
>
> I was also considering Zookeeper and Yarn registry for persisting state
> and sharing information. But for a basic POC, I used the file system and
> was able to
>
>1. Preserve Executors.
>2. Reconnect Executors back to Driver by storing the Executor
>endpoints info into a local file system. When driver restarts, use this
>info to send update driver message to executor endpoints. Executors can
>then update all of their Akka endpoints and reconnect.
>3. Reregister Block Manager and report back blocks. This utilises most
>of Spark’s existing code, I only had to update the BlockManagerMaster
>endpoint in executors.
>
> Surprisingly, Spark components took the restart in a much better way than
> I had anticipated and readily accepted new work :-)
>
> I am still figuring out other complexities around preserving RDD lineage
> and computation. From my initial analysis, preserving the whole computation
> might be complex and may not be required. Perhaps, the lineage of only the
> cached RDDs can be preserved to recover any lost blocks.
>
> I am definitely not underestimating the effort, both within Spark and
> around interfacing with Yarn, but just trying to emphasise that a single
> node failure leading to a full application restart does not seem right for a
> long-running service. Thoughts?
>
> Regards,
> Ashish
>
> From: Steve Loughran 
> Date: Thursday, 27 August 2015 4:19 pm
> To: Ashish Rawat 
> Cc: "dev@spark.apache.org" 
> Subject: Re: High Availability of Spark Driver
>
>
> On 27 Aug 2015, at 08:42, Ashish Rawat  wrote:
>
> Hi Patrick,
>
> As discussed in another thread, we are looking for a solution to the
> problem of lost state on Spark Driver failure. Can you please share Spark’s
> long term strategy for resolving this problem.
>
> <-- Original Mail Content Below -->
>
> We have come across the problem of Spark Applications (on Yarn) requiring
> a restart in case of Spark Driver (or application master) going down. This
> is hugely inconvenient for long running applications which are maintaining
> a big state in memory. The repopulation of state in itself may require a
> downtime of many minutes, which is not acceptable for most live systems.
>
> As you would have noticed that Yarn community has acknowledged "long
> running services" as an important class of use cases, and thus identified
> and removed problems in working with long running services in Yarn.
>
> http://hortonworks.com/blog/support-long-running-services-hadoop-yarn-clusters/
>
>
> Yeah, I spent a lot of time on that, or at least using the features, in
> other work under YARN-896, summarised in
> http://www.slideshare.net/steve_l/yarn-services
>
> It would be great if Spark, which is the most important processing engine
> on Yarn,
>
>
> If you look at the CPU-hours going into the big Hadoop clusters, it's
> actually MR work and things behind Hive. But these apps don't attempt HA.
>
> Why not? It requires whatever maintains the overall app status (spark: the
> driver) to persist that state in a way where it can be rebuilt. A restarted
> AM with the "retain containers" feature turned on gets nothing back from
> YARN except the list of previous allocated containers, and is left to sort
> itself out.
>
> also figures out issues in working with long running Spark applications
> and publishes recommendations or make framework changes for removing those.
> The need to keep the application running in case of Driver and Application
> Master failure, seems to be an important requirement from this perspective.
> The two most compelling use cases being:
>
>1. Huge state of historical data in *Spark Streaming*, required for
>stream processing
>2. Very large cached tables in *Spark SQL* (very close to our use case
>where we periodically cache RDDs and query using Spark SQL)
>
>
>
> Generally spark streaming is viewed as the big need here, but yes,
> long-lived cached data matters.
>
> Bear in mind that before Spark 1.5, you can't run any spark YARN app for
> longer than the expiry time of your delegation tokens, so in a secure
> cluster you have a limit of a couple of days anyway. Unless your cluster is
> particularly unreliable, AM failures are usually pretty unlikely in such a
> short timespan. Container failure is more likely as 1) you have more of
> them and 2) if you have pre-emption turned on in the scheduler or are
> pushing the work out to a label containing spot VMs, they will fail.
>
> In our analysis, for both of these use cases, a working HA solution can be
> built by
>
>1. Preserving the state of executors (not killing them on driver
>failures

Re: Feedback: Feature request

2015-08-28 Thread Manish Amde
Sounds good. It's a request I have seen a few times in the past and have
needed personally. Maybe Joseph Bradley has something to add.

I think a JIRA to capture this will be great. We can move this discussion
to the JIRA then.

On Friday, August 28, 2015, Cody Koeninger  wrote:

> I wrote some code for this a while back, pretty sure it didn't need access
> to anything private in the decision tree / random forest model.  If people
> want it added to the api I can put together a PR.
>
> I think it's important to have separately parseable operators / operands
> though.  E.g
>
> "lhs":0,"op":"<=","rhs":-35.0
> On Aug 28, 2015 12:03 AM, "Manish Amde"  > wrote:
>
>> Hi James,
>>
>> It's a good idea. A JSON format is more convenient for visualization
>> though a little inconvenient to read. How about toJson() method? It might
>> make the mllib api inconsistent across models though.
>>
>> You should probably create a JIRA for this.
>>
>> CC: dev list
>>
>> -Manish
>>
>> On Aug 26, 2015, at 11:29 AM, Murphy, James > > wrote:
>>
>> Hey all,
>>
>>
>>
>> In working with the DecisionTree classifier, I found it difficult to
>> extract rules that could easily facilitate visualization with libraries
>> like D3.
>>
>>
>>
>> So for example, using : print(model.toDebugString()), I get the following
>> result =
>>
>>
>>
>>If (feature 0 <= -35.0)
>>
>>   If (feature 24 <= 176.0)
>>
>> Predict: 2.1
>>
>>   If (feature 24 = 176.0)
>>
>> Predict: 4.2
>>
>>   Else (feature 24 > 176.0)
>>
>> Predict: 6.3
>>
>> Else (feature 0 > -35.0)
>>
>>   If (feature 24 <= 11.0)
>>
>> Predict: 4.5
>>
>>   Else (feature 24 > 11.0)
>>
>> Predict: 10.2
>>
>>
>>
>> But ideally, I could see results in a more parseable format like JSON:
>>
>>
>>
>> {
>>
>> "node": [
>>
>> {
>>
>> "name":"node1",
>>
>> "rule":"feature 0 <= -35.0",
>>
>> "children":[
>>
>> {
>>
>>   "name":"node2",
>>
>>   "rule":"feature 24 <= 176.0",
>>
>>   "children":[
>>
>>   {
>>
>>   "name":"node4",
>>
>>   "rule":"feature 20 < 116.0",
>>
>>   "predict":  2.1
>>
>>   },
>>
>>   {
>>
>>   "name":"node5",
>>
>>   "rule":"feature 20 = 116.0",
>>
>>   "predict": 4.2
>>
>>   },
>>
>>   {
>>
>>   "name":"node5",
>>
>>   "rule":"feature 20 > 116.0",
>>
>>   "predict": 6.3
>>
>>   }
>>
>>   ]
>>
>> },
>>
>> {
>>
>> "name":"node3",
>>
>> "rule":"feature 0 > -35.0",
>>
>>   "children":[
>>
>>   {
>>
>>   "name":"node7",
>>
>>   "rule":"feature 3 <= 11.0",
>>
>>   "predict": 4.5
>>
>>   },
>>
>>   {
>>
>>   "name":"node8",
>>
>>   "rule":"feature 3 > 11.0",
>>
>>   "predict": 10.2
>>
>>   }
>>
>>   ]
>>
>> }
>>
>>
>>
>> ]
>>
>> }
>>
>> ]
>>
>> }
>>
>>
>>
>> Food for thought!
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Jim
>>
>>
>>
>>


Re: Feedback: Feature request

2015-08-28 Thread Cody Koeninger
I wrote some code for this a while back, pretty sure it didn't need access
to anything private in the decision tree / random forest model.  If people
want it added to the api I can put together a PR.

I think it's important to have separately parseable operators / operands
though.  E.g

"lhs":0,"op":"<=","rhs":-35.0
On Aug 28, 2015 12:03 AM, "Manish Amde"  wrote:

> Hi James,
>
> It's a good idea. A JSON format is more convenient for visualization
> though a little inconvenient to read. How about toJson() method? It might
> make the mllib api inconsistent across models though.
>
> You should probably create a JIRA for this.
>
> CC: dev list
>
> -Manish
>
> On Aug 26, 2015, at 11:29 AM, Murphy, James 
> wrote:
>
> Hey all,
>
>
>
> In working with the DecisionTree classifier, I found it difficult to
> extract rules that could easily facilitate visualization with libraries
> like D3.
>
>
>
> So for example, using : print(model.toDebugString()), I get the following
> result =
>
>
>
>If (feature 0 <= -35.0)
>
>   If (feature 24 <= 176.0)
>
> Predict: 2.1
>
>   If (feature 24 = 176.0)
>
> Predict: 4.2
>
>   Else (feature 24 > 176.0)
>
> Predict: 6.3
>
> Else (feature 0 > -35.0)
>
>   If (feature 24 <= 11.0)
>
> Predict: 4.5
>
>   Else (feature 24 > 11.0)
>
> Predict: 10.2
>
>
>
> But ideally, I could see results in a more parseable format like JSON:
>
>
>
> {
>
> "node": [
>
> {
>
> "name":"node1",
>
> "rule":"feature 0 <= -35.0",
>
> "children":[
>
> {
>
>   "name":"node2",
>
>   "rule":"feature 24 <= 176.0",
>
>   "children":[
>
>   {
>
>   "name":"node4",
>
>   "rule":"feature 20 < 116.0",
>
>   "predict":  2.1
>
>   },
>
>   {
>
>   "name":"node5",
>
>   "rule":"feature 20 = 116.0",
>
>   "predict": 4.2
>
>   },
>
>   {
>
>   "name":"node5",
>
>   "rule":"feature 20 > 116.0",
>
>   "predict": 6.3
>
>   }
>
>   ]
>
> },
>
> {
>
> "name":"node3",
>
> "rule":"feature 0 > -35.0",
>
>   "children":[
>
>   {
>
>   "name":"node7",
>
>   "rule":"feature 3 <= 11.0",
>
>   "predict": 4.5
>
>   },
>
>   {
>
>   "name":"node8",
>
>   "rule":"feature 3 > 11.0",
>
>   "predict": 10.2
>
>   }
>
>   ]
>
> }
>
>
>
> ]
>
> }
>
> ]
>
> }
>
>
>
> Food for thought!
>
>
>
> Thanks,
>
>
>
> Jim
>
>
>
>


Re: High Availability of Spark Driver

2015-08-28 Thread Ashish Rawat
Thanks Steve. I had not spent many brain cycles on analysing the Yarn pieces;
your insights would be extremely useful.

I was also considering Zookeeper and Yarn registry for persisting state and 
sharing information. But for a basic POC, I used the file system and was able to

  1.  Preserve Executors.
  2.  Reconnect Executors back to the Driver by storing the Executor endpoint info
in a local file system (a bare-bones sketch of this bookkeeping follows after this
list). When the driver restarts, use this info to send an update-driver message to
the executor endpoints. Executors can then update all of their Akka endpoints and
reconnect.
  3.  Reregister Block Manager and report back blocks. This utilises most of 
Spark’s existing code, I only had to update the BlockManagerMaster endpoint in 
executors.
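
A bare-bones illustration of the "store endpoints, reload on restart" bookkeeping from step 2 above (the file path, record format, and notification step are made up for illustration; the actual Spark RPC re-registration is not shown):

    import json
    import os

    ENDPOINT_FILE = "/var/run/spark/executor_endpoints.json"

    def save_endpoints(endpoints):
        # endpoints: a list of dicts like {"executorId": "1", "host": "w1", "port": 43210}
        with open(ENDPOINT_FILE, "w") as f:
            json.dump(endpoints, f)

    def load_endpoints():
        if not os.path.exists(ENDPOINT_FILE):
            return []
        with open(ENDPOINT_FILE) as f:
            return json.load(f)

    # On driver restart, the recovered endpoints would be used to send an
    # "update driver" message to each executor so it can re-register.
    for ep in load_endpoints():
        print("would notify executor %(executorId)s at %(host)s:%(port)s" % ep)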

Surprisingly, Spark components took the restart in a much better way than I had
anticipated and readily accepted new work :-)

I am still figuring out other complexities around preserving RDD lineage and 
computation. From my initial analysis, preserving the whole computation might 
be complex and may not be required. Perhaps, the lineage of only the cached 
RDDs can be preserved to recover any lost blocks.

I am definitely not underestimating the effort, both within Spark and around
interfacing with Yarn, but just trying to emphasise that a single node failure
leading to a full application restart does not seem right for a long-running
service. Thoughts?

Regards,
Ashish

From: Steve Loughran mailto:ste...@hortonworks.com>>
Date: Thursday, 27 August 2015 4:19 pm
To: Ashish Rawat mailto:ashish.ra...@guavus.com>>
Cc: "dev@spark.apache.org" 
mailto:dev@spark.apache.org>>
Subject: Re: High Availability of Spark Driver


On 27 Aug 2015, at 08:42, Ashish Rawat 
mailto:ashish.ra...@guavus.com>> wrote:

Hi Patrick,

As discussed in another thread, we are looking for a solution to the problem of 
lost state on Spark Driver failure. Can you please share Spark’s long term 
strategy for resolving this problem.

<-- Original Mail Content Below -->

We have come across the problem of Spark Applications (on Yarn) requiring a 
restart in case of Spark Driver (or application master) going down. This is 
hugely inconvenient for long running applications which are maintaining a big 
state in memory. The repopulation of state in itself may require a downtime of 
many minutes, which is not acceptable for most live systems.

As you would have noticed that Yarn community has acknowledged "long running 
services" as an important class of use cases, and thus identified and removed 
problems in working with long running services in Yarn.
http://hortonworks.com/blog/support-long-running-services-hadoop-yarn-clusters/


Yeah, I spent a lot of time on that, or at least using the features, in other 
work under YARN-896, summarised in 
http://www.slideshare.net/steve_l/yarn-services

It would be great if Spark, which is the most important processing engine on 
Yarn,

If you look at the CPU-hours going into the big Hadoop clusters, it's
actually MR work and things behind Hive. But these apps don't attempt HA.

Why not? It requires whatever maintains the overall app status (spark: the 
driver) to persist that state in a way where it can be rebuilt. A restarted AM 
with the "retain containers" feature turned on gets nothing back from YARN 
except the list of previous allocated containers, and is left to sort itself 
out.

also figures out issues in working with long running Spark applications and 
publishes recommendations or make framework changes for removing those. The 
need to keep the application running in case of Driver and Application Master 
failure, seems to be an important requirement from this perspective. The two 
most compelling use cases being:

  1.  Huge state of historical data in Spark Streaming, required for stream 
processing
  2.  Very large cached tables in Spark SQL (very close to our use case where 
we periodically cache RDDs and query using Spark SQL)


Generally spark streaming is viewed as the big need here, but yes, long-lived 
cached data matters.

Bear in mind that before Spark 1.5, you can't run any spark YARN app for longer 
than the expiry time of your delegation tokens, so in a secure cluster you have 
a limit of a couple of days anyway. Unless your cluster is particularly 
unreliable, AM failures are usually pretty unlikely in such a short timespan. 
Container failure is more likely as 1) you have more of them and 2) if you have 
pre-emption turned on in the scheduler or are pushing the work out to a label 
containing spot VMs, they will fail.

In our analysis, for both of these use cases, a working HA solution can be 
built by

  1.  Preserving the state of executors (not killing them on driver failures)

This is a critical one


  1.  Persisting some meta info required by Spark SQL and Block Manager.

again, needs a failure tolerant storage mechanism. HDFS and ZK can work 
together here, but your code needs to handle all the corner cases of 
in