Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Denny Lee
+1 Tested on OS X

Tested Scala 2.10.3, Spark SQL with Hive 0.12 / Hadoop 2.5, Thrift Server,
MLlib SVD


On Fri Dec 12 2014 at 8:57:16 PM Mark Hamstra 
wrote:

> +1
>
> On Fri, Dec 12, 2014 at 8:00 PM, Josh Rosen  wrote:
> >
> > +1.  Tested using spark-perf and the Spark EC2 scripts.  I didn’t notice
> > any performance regressions that could not be attributed to changes of
> > default configurations.  To be more specific, when running Spark 1.2.0
> with
> > the Spark 1.1.0 settings of spark.shuffle.manager=hash and
> > spark.shuffle.blockTransferService=nio, there was no performance
> regression
> > and, in fact, there were significant performance improvements for some
> > workloads.
> >
> > In Spark 1.2.0, the new default settings are spark.shuffle.manager=sort
> > and spark.shuffle.blockTransferService=netty.  With these new settings,
> I
> > noticed a performance regression in the scala-sort-by-key-int spark-perf
> > test.  However, Spark 1.1.0 and 1.1.1 exhibit a similar performance
> > regression for that same test when run with spark.shuffle.manager=sort,
> so
> > this regression seems explainable by the change of defaults.  Besides
> this,
> > most of the other tests ran at the same speeds or faster with the new
> 1.2.0
> > defaults.  Also, keep in mind that this is a somewhat artificial micro
> > benchmark; I have heard anecdotal reports from many users that their real
> > workloads have run faster with 1.2.0.
> >
> > Based on these results, I’m comfortable giving a +1 on 1.2.0 RC2.
> >
> > - Josh
> >
> > On December 11, 2014 at 9:52:39 AM, Sandy Ryza (sandy.r...@cloudera.com)
> > wrote:
> >
> > +1 (non-binding). Tested on Ubuntu against YARN.
> >
> > On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin 
> wrote:
> >
> > > +1
> > >
> > > Tested on OS X.
> > >
> > > On Wednesday, December 10, 2014, Patrick Wendell 
> > > wrote:
> > >
> > > > Please vote on releasing the following candidate as Apache Spark
> > version
> > > > 1.2.0!
> > > >
> > > > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
> > > >
> > > >
> > >
> > https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
> > > >
> > > > The release files, including signatures, digests, etc. can be found
> at:
> > > > http://people.apache.org/~pwendell/spark-1.2.0-rc2/
> > > >
> > > > Release artifacts are signed with the following key:
> > > > https://people.apache.org/keys/committer/pwendell.asc
> > > >
> > > > The staging repository for this release can be found at:
> > > >
> > https://repository.apache.org/content/repositories/orgapachespark-1055/
> > > >
> > > > The documentation corresponding to this release can be found at:
> > > > http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
> > > >
> > > > Please vote on releasing this package as Apache Spark 1.2.0!
> > > >
> > > > The vote is open until Saturday, December 13, at 21:00 UTC and passes
> > > > if a majority of at least 3 +1 PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Spark 1.2.0
> > > > [ ] -1 Do not release this package because ...
> > > >
> > > > To learn more about Apache Spark, please see
> > > > http://spark.apache.org/
> > > >
> > > > == What justifies a -1 vote for this release? ==
> > > > This vote is happening relatively late into the QA period, so
> > > > -1 votes should only occur for significant regressions from
> > > > 1.0.2. Bugs already present in 1.1.X, minor
> > > > regressions, or bugs related to new features will not block this
> > > > release.
> > > >
> > > > == What default changes should I be aware of? ==
> > > > 1. The default value of "spark.shuffle.blockTransferService" has
> been
> > > > changed to "netty"
> > > > --> Old behavior can be restored by switching to "nio"
> > > >
> > > > 2. The default value of "spark.shuffle.manager" has been changed to
> > > "sort".
> > > > --> Old behavior can be restored by setting "spark.shuffle.manager"
> to
> > > > "hash".
> > > >
> > > > == How does this differ from RC1 ==
> > > > This has fixes for a handful of issues identified - some of the
> > > > notable fixes are:
> > > >
> > > > [Core]
> > > > SPARK-4498: Standalone Master can fail to recognize completed/failed
> > > > applications
> > > >
> > > > [SQL]
> > > > SPARK-4552: Query for empty parquet table in spark sql hive get
> > > > IllegalArgumentException
> > > > SPARK-4753: Parquet2 does not prune based on OR filters on partition
> > > > columns
> > > > SPARK-4761: With JDBC server, set Kryo as default serializer and
> > > > disable reference tracking
> > > > SPARK-4785: When called with arguments referring column fields, PMOD
> > > > throws NPE
> > > >
> > > > - Patrick
> > > >
> > > > 
> -
> > > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > 
> > > > For additional commands, e-mail: dev-h...@spark.apache.org
> > > 
> > > >
> > > >
> > >
> >
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Mark Hamstra
+1

On Fri, Dec 12, 2014 at 8:00 PM, Josh Rosen  wrote:
>
> +1.  Tested using spark-perf and the Spark EC2 scripts.  I didn’t notice
> any performance regressions that could not be attributed to changes of
> default configurations.  To be more specific, when running Spark 1.2.0 with
> the Spark 1.1.0 settings of spark.shuffle.manager=hash and
> spark.shuffle.blockTransferService=nio, there was no performance regression
> and, in fact, there were significant performance improvements for some
> workloads.
>
> In Spark 1.2.0, the new default settings are spark.shuffle.manager=sort
> and spark.shuffle.blockTransferService=netty.  With these new settings, I
> noticed a performance regression in the scala-sort-by-key-int spark-perf
> test.  However, Spark 1.1.0 and 1.1.1 exhibit a similar performance
> regression for that same test when run with spark.shuffle.manager=sort, so
> this regression seems explainable by the change of defaults.  Besides this,
> most of the other tests ran at the same speeds or faster with the new 1.2.0
> defaults.  Also, keep in mind that this is a somewhat artificial micro
> benchmark; I have heard anecdotal reports from many users that their real
> workloads have run faster with 1.2.0.
>
> Based on these results, I’m comfortable giving a +1 on 1.2.0 RC2.
>
> - Josh
>
> On December 11, 2014 at 9:52:39 AM, Sandy Ryza (sandy.r...@cloudera.com)
> wrote:
>
> +1 (non-binding). Tested on Ubuntu against YARN.
>
> On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin  wrote:
>
> > +1
> >
> > Tested on OS X.
> >
> > On Wednesday, December 10, 2014, Patrick Wendell 
> > wrote:
> >
> > > Please vote on releasing the following candidate as Apache Spark
> version
> > > 1.2.0!
> > >
> > > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):
> > >
> > >
> >
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
> > >
> > > The release files, including signatures, digests, etc. can be found at:
> > > http://people.apache.org/~pwendell/spark-1.2.0-rc2/
> > >
> > > Release artifacts are signed with the following key:
> > > https://people.apache.org/keys/committer/pwendell.asc
> > >
> > > The staging repository for this release can be found at:
> > >
> https://repository.apache.org/content/repositories/orgapachespark-1055/
> > >
> > > The documentation corresponding to this release can be found at:
> > > http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/
> > >
> > > Please vote on releasing this package as Apache Spark 1.2.0!
> > >
> > > The vote is open until Saturday, December 13, at 21:00 UTC and passes
> > > if a majority of at least 3 +1 PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Spark 1.2.0
> > > [ ] -1 Do not release this package because ...
> > >
> > > To learn more about Apache Spark, please see
> > > http://spark.apache.org/
> > >
> > > == What justifies a -1 vote for this release? ==
> > > This vote is happening relatively late into the QA period, so
> > > -1 votes should only occur for significant regressions from
> > > 1.0.2. Bugs already present in 1.1.X, minor
> > > regressions, or bugs related to new features will not block this
> > > release.
> > >
> > > == What default changes should I be aware of? ==
> > > 1. The default value of "spark.shuffle.blockTransferService" has been
> > > changed to "netty"
> > > --> Old behavior can be restored by switching to "nio"
> > >
> > > 2. The default value of "spark.shuffle.manager" has been changed to
> > "sort".
> > > --> Old behavior can be restored by setting "spark.shuffle.manager" to
> > > "hash".
> > >
> > > == How does this differ from RC1 ==
> > > This has fixes for a handful of issues identified - some of the
> > > notable fixes are:
> > >
> > > [Core]
> > > SPARK-4498: Standalone Master can fail to recognize completed/failed
> > > applications
> > >
> > > [SQL]
> > > SPARK-4552: Query for empty parquet table in spark sql hive get
> > > IllegalArgumentException
> > > SPARK-4753: Parquet2 does not prune based on OR filters on partition
> > > columns
> > > SPARK-4761: With JDBC server, set Kryo as default serializer and
> > > disable reference tracking
> > > SPARK-4785: When called with arguments referring column fields, PMOD
> > > throws NPE
> > >
> > > - Patrick
> > >
> > > -
> > > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> 
> > > For additional commands, e-mail: dev-h...@spark.apache.org
> > 
> > >
> > >
> >
>


Re: [VOTE] Release Apache Spark 1.2.0 (RC2)

2014-12-12 Thread Josh Rosen
+1.  Tested using spark-perf and the Spark EC2 scripts.  I didn’t notice any 
performance regressions that could not be attributed to changes of default 
configurations.  To be more specific, when running Spark 1.2.0 with the Spark 
1.1.0 settings of spark.shuffle.manager=hash and 
spark.shuffle.blockTransferService=nio, there was no performance regression 
and, in fact, there were significant performance improvements for some 
workloads.

In Spark 1.2.0, the new default settings are spark.shuffle.manager=sort and 
spark.shuffle.blockTransferService=netty.  With these new settings, I noticed a 
performance regression in the scala-sort-by-key-int spark-perf test.  However, 
Spark 1.1.0 and 1.1.1 exhibit a similar performance regression for that same 
test when run with spark.shuffle.manager=sort, so this regression seems 
explainable by the change of defaults.  Besides this, most of the other tests 
ran at the same speeds or faster with the new 1.2.0 defaults.  Also, keep in 
mind that this is a somewhat artificial micro benchmark; I have heard anecdotal 
reports from many users that their real workloads have run faster with 1.2.0.

Based on these results, I’m comfortable giving a +1 on 1.2.0 RC2.

- Josh

On December 11, 2014 at 9:52:39 AM, Sandy Ryza (sandy.r...@cloudera.com) wrote:

+1 (non-binding). Tested on Ubuntu against YARN.  

On Thu, Dec 11, 2014 at 9:38 AM, Reynold Xin  wrote:  

> +1  
>  
> Tested on OS X.  
>  
> On Wednesday, December 10, 2014, Patrick Wendell   
> wrote:  
>  
> > Please vote on releasing the following candidate as Apache Spark version  
> > 1.2.0!  
> >  
> > The tag to be voted on is v1.2.0-rc2 (commit a428c446e2):  
> >  
> >  
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=a428c446e23e628b746e0626cc02b7b3cadf588e
>   
> >  
> > The release files, including signatures, digests, etc. can be found at:  
> > http://people.apache.org/~pwendell/spark-1.2.0-rc2/  
> >  
> > Release artifacts are signed with the following key:  
> > https://people.apache.org/keys/committer/pwendell.asc  
> >  
> > The staging repository for this release can be found at:  
> > https://repository.apache.org/content/repositories/orgapachespark-1055/  
> >  
> > The documentation corresponding to this release can be found at:  
> > http://people.apache.org/~pwendell/spark-1.2.0-rc2-docs/  
> >  
> > Please vote on releasing this package as Apache Spark 1.2.0!  
> >  
> > The vote is open until Saturday, December 13, at 21:00 UTC and passes  
> > if a majority of at least 3 +1 PMC votes are cast.  
> >  
> > [ ] +1 Release this package as Apache Spark 1.2.0  
> > [ ] -1 Do not release this package because ...  
> >  
> > To learn more about Apache Spark, please see  
> > http://spark.apache.org/  
> >  
> > == What justifies a -1 vote for this release? ==  
> > This vote is happening relatively late into the QA period, so  
> > -1 votes should only occur for significant regressions from  
> > 1.0.2. Bugs already present in 1.1.X, minor  
> > regressions, or bugs related to new features will not block this  
> > release.  
> >  
> > == What default changes should I be aware of? ==  
> > 1. The default value of "spark.shuffle.blockTransferService" has been  
> > changed to "netty"  
> > --> Old behavior can be restored by switching to "nio"  
> >  
> > 2. The default value of "spark.shuffle.manager" has been changed to  
> "sort".  
> > --> Old behavior can be restored by setting "spark.shuffle.manager" to  
> > "hash".  
> >  
> > == How does this differ from RC1 ==  
> > This has fixes for a handful of issues identified - some of the  
> > notable fixes are:  
> >  
> > [Core]  
> > SPARK-4498: Standalone Master can fail to recognize completed/failed  
> > applications  
> >  
> > [SQL]  
> > SPARK-4552: Query for empty parquet table in spark sql hive get  
> > IllegalArgumentException  
> > SPARK-4753: Parquet2 does not prune based on OR filters on partition  
> > columns  
> > SPARK-4761: With JDBC server, set Kryo as default serializer and  
> > disable reference tracking  
> > SPARK-4785: When called with arguments referring column fields, PMOD  
> > throws NPE  
> >  
> > - Patrick  
> >  
> > -  
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org   
> > For additional commands, e-mail: dev-h...@spark.apache.org  
>   
> >  
> >  
>  
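
For reference, a minimal Scala sketch of pinning the Spark 1.1 behavior on a 1.2.0
deployment, based on the two default changes listed above under "== What default
changes should I be aware of? ==". Setting the keys on SparkConf is shown; the same
keys can also go in spark-defaults.conf or be passed with --conf to spark-submit.
The application name is illustrative.

    import org.apache.spark.{SparkConf, SparkContext}

    // Restore the 1.1 shuffle defaults when running 1.2.0 (1.2.0 defaults in comments).
    val conf = new SparkConf()
      .setAppName("restore-1.1-shuffle-defaults")        // illustrative app name
      .set("spark.shuffle.manager", "hash")              // 1.2.0 default: "sort"
      .set("spark.shuffle.blockTransferService", "nio")  // 1.2.0 default: "netty"
    val sc = new SparkContext(conf)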


one hot encoding

2014-12-12 Thread Lochana Menikarachchi
Do we have one-hot encoding in Spark MLlib 1.1.1 or 1.2.0? It wasn't
available in 1.1.0.

Thanks.
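
For reference, a minimal hand-rolled sketch using the MLlib linear-algebra API that
already ships in 1.1.x, in case a dedicated encoder is not built in: map a categorical
index in [0, k) to a k-dimensional sparse indicator vector. The function name is
illustrative.

    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    // One-hot encode a categorical value already indexed as 0 .. numCategories-1.
    def oneHot(index: Int, numCategories: Int): Vector =
      Vectors.sparse(numCategories, Array(index), Array(1.0))

    val v = oneHot(2, 5)  // category 2 of 5 -> (0.0, 0.0, 1.0, 0.0, 0.0), stored sparsely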

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: IBM open-sources Spark Kernel

2014-12-12 Thread Robert C Senkbeil

Hi Sam,

We developed the Spark Kernel with a focus on the newest version of the
IPython message protocol (5.0) for the upcoming IPython 3.0 release.

We are building around Apache Spark's REPL, which is used in the current
Spark Shell implementation.

The Spark Kernel was designed to be extensible through magics (
https://github.com/ibm-et/spark-kernel/blob/master/docs/MAGICS.md),
providing functionality that might be needed outside the Scala interpreter.

Finally, a big part of our focus is on application development. Because of
this, we are providing a client library for applications to connect to the
Spark Kernel without needing to implement the ZeroMQ protocol.

Signed,
Chip Senkbeil



From:   Sam Bessalah 
To: Robert C Senkbeil/Austin/IBM@IBMUS
Date:   12/12/2014 04:20 PM
Subject: Re: IBM open-sources Spark Kernel



Wow. Thanks. Can't wait to try this out.
Great job.
How is it different from IScala or ISpark?


On Dec 12, 2014 11:17 PM, "Robert C Senkbeil"  wrote:



  We are happy to announce a developer preview of the Spark Kernel which
  enables remote applications to dynamically interact with Spark. You can
  think of the Spark Kernel as a remote Spark Shell that uses the IPython
  notebook interface to provide a common entrypoint for any application.
  The
  Spark Kernel obviates the need to submit jars using spark-submit, and can
  replace the existing Spark Shell.

  You can try out the Spark Kernel today by installing it from our github
  repo at https://github.com/ibm-et/spark-kernel. To help you get a demo
  environment up and running quickly, the repository also includes a
  Dockerfile and a Vagrantfile to build a Spark Kernel container and
  connect
  to it from an IPython notebook.

  We have included a number of documents with the project to help explain
  it
  and provide how-to information:

  * A high-level overview of the Spark Kernel and its client library (
  
https://issues.apache.org/jira/secure/attachment/12683624/Kernel%20Architecture.pdf

  ).

  * README (https://github.com/ibm-et/spark-kernel/blob/master/README.md) -
  building and testing the kernel, and deployment options including
  building
  the Docker container and packaging the kernel.

  * IPython instructions (
  https://github.com/ibm-et/spark-kernel/blob/master/docs/IPYTHON.md) -
  setting up the development version of IPython and connecting a Spark
  Kernel.

  * Client library tutorial (
  https://github.com/ibm-et/spark-kernel/blob/master/docs/CLIENT.md) -
  building and using the client library to connect to a Spark Kernel.

  * Magics documentation (
  https://github.com/ibm-et/spark-kernel/blob/master/docs/MAGICS.md) - the
  magics in the kernel and how to write your own.

  We think the Spark Kernel will be useful for developing applications for
  Spark, and we are making it available with the intention of improving
  these
  capabilities within the context of the Spark community (
  https://issues.apache.org/jira/browse/SPARK-4605). We will continue to
  develop the codebase and welcome your comments and suggestions.


  Signed,

  Chip Senkbeil
  IBM Emerging Technology Software Engineer

Re: CrossValidator API in new spark.ml package

2014-12-12 Thread DB Tsai
Okay, I got it. In Estimator, fit(dataset: SchemaRDD, paramMaps:
Array[ParamMap]): Seq[M] can be overridden to implement a
regularization path. Correct me if I'm wrong.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai


On Fri, Dec 12, 2014 at 11:37 AM, DB Tsai  wrote:
> Hi Xiangrui,
>
> It seems that it's stateless, so it will be hard to implement a
> regularization path. Any suggestions for extending it? Thanks.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
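
For reference, a rough, framework-agnostic sketch of the warm-start idea behind
overriding fit(dataset, paramMaps) as described above: fit the models for all
requested regularization values in one pass, initializing each fit from the previous
solution instead of starting from scratch. All names here (Model, fitOne, fitPath)
are illustrative, not existing Spark APIs.

    // Fit models along a regularization path, reusing the previous coefficients
    // as the starting point for the next (less regularized) fit.
    case class Model(weights: Array[Double])

    def fitPath(
        data: Seq[(Double, Array[Double])],  // (label, features)
        regValues: Seq[Double],
        fitOne: (Seq[(Double, Array[Double])], Double, Option[Array[Double]]) => Model)
      : Seq[Model] = {
      var warmStart: Option[Array[Double]] = None
      // Descending regularization: the most regularized model is cheapest to fit
      // and is a good initializer for the next point on the path.
      regValues.sorted(Ordering[Double].reverse).map { reg =>
        val m = fitOne(data, reg, warmStart)
        warmStart = Some(m.weights)
        m
      }
    }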

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
protobuf errors come from the missing -Phadoop-2.3 profile

On Fri, Dec 12, 2014 at 2:34 PM, Sean Owen  wrote:
>
> What errors do you see? protobuf errors usually mean you didn't build
> for the right version of Hadoop, but if you are using -Phadoop-2.3 or
> better -Phadoop-2.4 that should be fine. Yes, a stack trace would be
> good. I'm still not sure what error you are seeing.
>
> On Fri, Dec 12, 2014 at 10:32 PM, Ganelin, Ilya
>  wrote:
> > Hi Sean - I should clarify : I was able to build the master but when
> running
> > I hit really random looking protobuf errors (just starting up a spark
> > shell), I can try doing a build later today and give the exact stack
> trace.
> >
> > I know that 5.2 is running 1.1 but I believe the latest and greatest MLlib
> > is much fresher than the one in 1.1 and specifically includes fixes for ALS
> > to help it scale better.
> >
> > I had built with the exact flags you suggested below. After doing so I tried
> > to run the test suite and run a spark shell without success. Might you have
> > any other suggestions? Thanks!
> >
> >
> >
> > Sent with Good (www.good.com)
> >
> >
> >
> > -Original Message-
> > From: Sean Owen [so...@cloudera.com]
> > Sent: Friday, December 12, 2014 04:54 PM Eastern Standard Time
> > To: Ganelin, Ilya
> > Cc: dev
> > Subject: Re: Newest ML-Lib on Spark 1.1
> >
> > Could you specify what problems you're seeing? there is nothing
> > special about the CDH distribution at all.
> >
> > The latest and greatest is 1.1, and that is what is in CDH 5.2. You
> > can certainly compile even master for CDH and get it to work though.
> >
> > The safest build flags should be "-Phadoop-2.4
> > -Dhadoop.version=2.5.0-cdh5.2.1".
> >
> > 5.3 is just around the corner, and includes 1.2, which is also just
> > around the corner.
> >
> > On Fri, Dec 12, 2014 at 9:41 PM, Ganelin, Ilya
> >  wrote:
> >> Hi all – we’re running CDH 5.2 and would be interested in having the
> >> latest and greatest ML Lib version on our cluster (with YARN). Could
> anyone
> >> help me out in terms of figuring out what build profiles to use to get
> this
> >> to play well? Will I be able to update ML-Lib independently of updating
> the
> >> rest of spark to 1.2 and beyond? I ran into numerous issues trying to
> build
> >> 1.2 against CDH’s Hadoop deployment. Alternately, if anyone has managed
> to
> >> get the trunk successfully built and tested against Cloudera’s YARN and
> >> Hadoop for 5.2 I would love some help. Thanks!
> >> 
> >>
> >> The information contained in this e-mail is confidential and/or
> >> proprietary to Capital One and/or its affiliates. The information
> >> transmitted herewith is intended only for use by the individual or
> entity to
> >> which it is addressed.  If the reader of this message is not the
> intended
> >> recipient, you are hereby notified that any review, retransmission,
> >> dissemination, distribution, copying or other use of, or taking of any
> >> action in reliance upon this information is strictly prohibited. If you
> have
> >> received this communication in error, please contact the sender and
> delete
> >> the material from your computer.
> >
> >
> > 
> >
> > The information contained in this e-mail is confidential and/or
> proprietary
> > to Capital One and/or its affiliates. The information transmitted
> herewith
> > is intended only for use by the individual or entity to which it is
> > addressed.  If the reader of this message is not the intended recipient,
> you
> > are hereby notified that any review, retransmission, dissemination,
> > distribution, copying or other use of, or taking of any action in
> reliance
> > upon this information is strictly prohibited. If you have received this
> > communication in error, please contact the sender and delete the material
> > from your computer.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Sean Owen
What errors do you see? protobuf errors usually mean you didn't build
for the right version of Hadoop, but if you are using -Phadoop-2.3 or
better -Phadoop-2.4 that should be fine. Yes, a stack trace would be
good. I'm still not sure what error you are seeing.

On Fri, Dec 12, 2014 at 10:32 PM, Ganelin, Ilya
 wrote:
> Hi Sean - I should clarify : I was able to build the master but when running
> I hit really random looking protobuf errors (just starting up a spark
> shell), I can try doing a build later today and give the exact stack trace.
>
> I know that 5.2 is running 1.1 but I believe the latest and greatest MLlib
> is much fresher than the one in 1.1 and specifically includes fixes for ALS
> to help it scale better.
>
> I had built with the exact flags you suggested below. After doing so I tried
> to run the test suite and run a spark shell without success. Might you have
> any other suggestions? Thanks!
>
>
>
> Sent with Good (www.good.com)
>
>
>
> -Original Message-
> From: Sean Owen [so...@cloudera.com]
> Sent: Friday, December 12, 2014 04:54 PM Eastern Standard Time
> To: Ganelin, Ilya
> Cc: dev
> Subject: Re: Newest ML-Lib on Spark 1.1
>
> Could you specify what problems you're seeing? there is nothing
> special about the CDH distribution at all.
>
> The latest and greatest is 1.1, and that is what is in CDH 5.2. You
> can certainly compile even master for CDH and get it to work though.
>
> The safest build flags should be "-Phadoop-2.4
> -Dhadoop.version=2.5.0-cdh5.2.1".
>
> 5.3 is just around the corner, and includes 1.2, which is also just
> around the corner.
>
> On Fri, Dec 12, 2014 at 9:41 PM, Ganelin, Ilya
>  wrote:
>> Hi all – we’re running CDH 5.2 and would be interested in having the
>> latest and greatest ML Lib version on our cluster (with YARN). Could anyone
>> help me out in terms of figuring out what build profiles to use to get this
>> to play well? Will I be able to update ML-Lib independently of updating the
>> rest of spark to 1.2 and beyond? I ran into numerous issues trying to build
>> 1.2 against CDH’s Hadoop deployment. Alternately, if anyone has managed to
>> get the trunk successfully built and tested against Cloudera’s YARN and
>> Hadoop for 5.2 I would love some help. Thanks!
>> 
>>
>> The information contained in this e-mail is confidential and/or
>> proprietary to Capital One and/or its affiliates. The information
>> transmitted herewith is intended only for use by the individual or entity to
>> which it is addressed.  If the reader of this message is not the intended
>> recipient, you are hereby notified that any review, retransmission,
>> dissemination, distribution, copying or other use of, or taking of any
>> action in reliance upon this information is strictly prohibited. If you have
>> received this communication in error, please contact the sender and delete
>> the material from your computer.
>
>
> 
>
> The information contained in this e-mail is confidential and/or proprietary
> to Capital One and/or its affiliates. The information transmitted herewith
> is intended only for use by the individual or entity to which it is
> addressed.  If the reader of this message is not the intended recipient, you
> are hereby notified that any review, retransmission, dissemination,
> distribution, copying or other use of, or taking of any action in reliance
> upon this information is strictly prohibited. If you have received this
> communication in error, please contact the sender and delete the material
> from your computer.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



RE: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Ganelin, Ilya
Hi Sean - I should clarify : I was able to build the master but when running I 
hit really random looking protobuf errors (just starting up a spark shell), I 
can try doing a build later today and give the exact stack trace.

I know that 5.2 is running 1.1 but I believe the latest and greatest MLlib is
much fresher than the one in 1.1 and specifically includes fixes for ALS to
help it scale better.

I had built with the exact flags you suggested below. After doing so I tried to 
run the test suite and run a spark shell without success. Might you have any
other suggestions? Thanks!



Sent with Good (www.good.com)


-Original Message-
From: Sean Owen [so...@cloudera.com]
Sent: Friday, December 12, 2014 04:54 PM Eastern Standard Time
To: Ganelin, Ilya
Cc: dev
Subject: Re: Newest ML-Lib on Spark 1.1


Could you specify what problems you're seeing? there is nothing
special about the CDH distribution at all.

The latest and greatest is 1.1, and that is what is in CDH 5.2. You
can certainly compile even master for CDH and get it to work though.

The safest build flags should be "-Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.1".

5.3 is just around the corner, and includes 1.2, which is also just
around the corner.

On Fri, Dec 12, 2014 at 9:41 PM, Ganelin, Ilya
 wrote:
> Hi all – we’re running CDH 5.2 and would be interested in having the latest 
> and greatest ML Lib version on our cluster (with YARN). Could anyone help me 
> out in terms of figuring out what build profiles to use to get this to play 
> well? Will I be able to update ML-Lib independently of updating the rest of 
> spark to 1.2 and beyond? I ran into numerous issues trying to build 1.2 
> against CDH’s Hadoop deployment. Alternately, if anyone has managed to get 
> the trunk successfully built and tested against Cloudera’s YARN and Hadoop 
> for 5.2 I would love some help. Thanks!
> 
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates. The information transmitted herewith is 
> intended only for use by the individual or entity to which it is addressed.  
> If the reader of this message is not the intended recipient, you are hereby 
> notified that any review, retransmission, dissemination, distribution, 
> copying or other use of, or taking of any action in reliance upon this 
> information is strictly prohibited. If you have received this communication 
> in error, please contact the sender and delete the material from your 
> computer.


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


IBM open-sources Spark Kernel

2014-12-12 Thread Robert C Senkbeil



We are happy to announce a developer preview of the Spark Kernel which
enables remote applications to dynamically interact with Spark. You can
think of the Spark Kernel as a remote Spark Shell that uses the IPython
notebook interface to provide a common entrypoint for any application. The
Spark Kernel obviates the need to submit jars using spark-submit, and can
replace the existing Spark Shell.

You can try out the Spark Kernel today by installing it from our github
repo at https://github.com/ibm-et/spark-kernel. To help you get a demo
environment up and running quickly, the repository also includes a
Dockerfile and a Vagrantfile to build a Spark Kernel container and connect
to it from an IPython notebook.

We have included a number of documents with the project to help explain it
and provide how-to information:

* A high-level overview of the Spark Kernel and its client library (
https://issues.apache.org/jira/secure/attachment/12683624/Kernel%20Architecture.pdf
).

* README (https://github.com/ibm-et/spark-kernel/blob/master/README.md) -
building and testing the kernel, and deployment options including building
the Docker container and packaging the kernel.

* IPython instructions (
https://github.com/ibm-et/spark-kernel/blob/master/docs/IPYTHON.md) -
setting up the development version of IPython and connecting a Spark
Kernel.

* Client library tutorial (
https://github.com/ibm-et/spark-kernel/blob/master/docs/CLIENT.md) -
building and using the client library to connect to a Spark Kernel.

* Magics documentation (
https://github.com/ibm-et/spark-kernel/blob/master/docs/MAGICS.md) - the
magics in the kernel and how to write your own.

We think the Spark Kernel will be useful for developing applications for
Spark, and we are making it available with the intention of improving these
capabilities within the context of the Spark community (
https://issues.apache.org/jira/browse/SPARK-4605). We will continue to
develop the codebase and welcome your comments and suggestions.


Signed,

Chip Senkbeil
IBM Emerging Technology Software Engineer
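
For reference, a minimal sketch of building the demo container from the repository
above; the image tag is arbitrary and the Dockerfile is assumed to live at the
repository root as described.

    git clone https://github.com/ibm-et/spark-kernel.git
    cd spark-kernel
    docker build -t spark-kernel .   # then connect an IPython notebook as per docs/IPYTHON.md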

Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Sean Owen
Could you specify what problems you're seeing? there is nothing
special about the CDH distribution at all.

The latest and greatest is 1.1, and that is what is in CDH 5.2. You
can certainly compile even master for CDH and get it to work though.

The safest build flags should be "-Phadoop-2.4 -Dhadoop.version=2.5.0-cdh5.2.1".

5.3 is just around the corner, and includes 1.2, which is also just
around the corner.

On Fri, Dec 12, 2014 at 9:41 PM, Ganelin, Ilya
 wrote:
> Hi all – we’re running CDH 5.2 and would be interested in having the latest 
> and greatest ML Lib version on our cluster (with YARN). Could anyone help me 
> out in terms of figuring out what build profiles to use to get this to play 
> well? Will I be able to update ML-Lib independently of updating the rest of 
> spark to 1.2 and beyond? I ran into numerous issues trying to build 1.2 
> against CDH’s Hadoop deployment. Alternately, if anyone has managed to get 
> the trunk successfully built and tested against Cloudera’s YARN and Hadoop 
> for 5.2 I would love some help. Thanks!
> 
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates. The information transmitted herewith is 
> intended only for use by the individual or entity to which it is addressed.  
> If the reader of this message is not the intended recipient, you are hereby 
> notified that any review, retransmission, dissemination, distribution, 
> copying or other use of, or taking of any action in reliance upon this 
> information is strictly prohibited. If you have received this communication 
> in error, please contact the sender and delete the material from your 
> computer.

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
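
For reference, a sketch of a full distribution build against CDH 5.2 using the flags
Sean suggests above; the --tgz option and the -Pyarn/-Phive profiles are assumptions
about a typical deployment, not something prescribed in this thread.

    ./make-distribution.sh --tgz -Pyarn -Phive -Phadoop-2.4 \
      -Dhadoop.version=2.5.0-cdh5.2.1 -DskipTests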



Re: Newest ML-Lib on Spark 1.1

2014-12-12 Thread Debasish Das
For CDH this works well for me...tested till 5.1...

./make-distribution -Dhadoop.version=2.3.0-cdh5.1.0 -Phadoop-2.3 -Pyarn
-Phive -DskipTests

To build with hive thriftserver support for spark-sql

On Fri, Dec 12, 2014 at 1:41 PM, Ganelin, Ilya 
wrote:
>
> Hi all – we’re running CDH 5.2 and would be interested in having the
> latest and greatest ML Lib version on our cluster (with YARN). Could anyone
> help me out in terms of figuring out what build profiles to use to get this
> to play well? Will I be able to update ML-Lib independently of updating the
> rest of spark to 1.2 and beyond? I ran into numerous issues trying to build
> 1.2 against CDH’s Hadoop deployment. Alternately, if anyone has managed to
> get the trunk successfully built and tested against Cloudera’s YARN and
> Hadoop for 5.2 I would love some help. Thanks!
> 
>
> The information contained in this e-mail is confidential and/or
> proprietary to Capital One and/or its affiliates. The information
> transmitted herewith is intended only for use by the individual or entity
> to which it is addressed.  If the reader of this message is not the
> intended recipient, you are hereby notified that any review,
> retransmission, dissemination, distribution, copying or other use of, or
> taking of any action in reliance upon this information is strictly
> prohibited. If you have received this communication in error, please
> contact the sender and delete the material from your computer.
>


Newest ML-Lib on Spark 1.1

2014-12-12 Thread Ganelin, Ilya
Hi all – we’re running CDH 5.2 and would be interested in having the latest and 
greatest ML Lib version on our cluster (with YARN). Could anyone help me out in 
terms of figuring out what build profiles to use to get this to play well? Will 
I be able to update ML-Lib independently of updating the rest of spark to 1.2 
and beyond? I ran into numerous issues trying to build 1.2 against CDH’s Hadoop 
deployment. Alternately, if anyone has managed to get the trunk successfully 
built and tested against Cloudera’s YARN and Hadoop for 5.2 I would love some 
help. Thanks!


The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.


CrossValidator API in new spark.ml package

2014-12-12 Thread DB Tsai
Hi Xiangrui,

It seems that it's stateless, so it will be hard to implement a
regularization path. Any suggestions for extending it? Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: zinc invocation examples

2014-12-12 Thread Patrick Wendell
Hey York - I'm sending some feedback off-list, feel free to open a PR as well.


On Tue, Dec 9, 2014 at 12:05 PM, York, Brennon
 wrote:
> Patrick, I've nearly completed a basic build-out for the SPARK-4501 issue
> (at https://github.com/brennonyork/spark/tree/SPARK-4501) and it would be
> great to get your initial read on it. Per this thread I need to add in the
> -scala-home call to zinc, but it's close to ready for a PR.
>
> On 12/5/14, 2:10 PM, "Patrick Wendell"  wrote:
>
>>One thing I created a JIRA for a while back was to have a similar
>>script to "sbt/sbt" that transparently downloads Zinc, Scala, and
>>Maven in a subdirectory of Spark and sets it up correctly. I.e.
>>"build/mvn".
>>
>>Outside of brew for MacOS there aren't good Zinc packages, and it's a
>>pain to figure out how to set it up.
>>
>>https://issues.apache.org/jira/browse/SPARK-4501
>>
>>Prashant Sharma looked at this for a bit but I don't think he's
>>working on it actively any more, so if someone wanted to do this, I'd
>>be extremely grateful.
>>
>>- Patrick
>>
>>On Fri, Dec 5, 2014 at 11:05 AM, Ryan Williams
>> wrote:
>>> fwiw I've been using `zinc -scala-home $SCALA_HOME -nailed -start`
>>>which:
>>>
>>> - starts a nailgun server as well,
>>> - uses my installed scala 2.{10,11}, as opposed to zinc's default 2.9.2:
>>> "If no options are passed to locate a version of Scala then Scala 2.9.2
>>> is used by default (which is bundled with zinc)."
>>>
>>> The latter seems like it might be especially important.
>>>
>>>
>>> On Thu Dec 04 2014 at 4:25:32 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Oh, derp. I just assumed from looking at all the options that there was
 something to it. Thanks Sean.

 On Thu Dec 04 2014 at 7:47:33 AM Sean Owen  wrote:

 > You just run it once with "zinc -start" and leave it running as a
 > background process on your build machine. You don't have to do
 > anything for each build.
 >
 > On Wed, Dec 3, 2014 at 3:44 PM, Nicholas Chammas
 >  wrote:
 > > https://github.com/apache/spark/blob/master/docs/building-spark.md#speeding-up-compilation-with-zinc
 > >
 > > Could someone summarize how they invoke zinc as part of a regular
 > > build-test-etc. cycle?
 > >
 > > I'll add it in to the aforelinked page if appropriate.
 > >
 > > Nick
 >

>>
>>-
>>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>For additional commands, e-mail: dev-h...@spark.apache.org
>>
>
> 
>
> The information contained in this e-mail is confidential and/or proprietary 
> to Capital One and/or its affiliates. The information transmitted herewith is 
> intended only for use by the individual or entity to which it is addressed.  
> If the reader of this message is not the intended recipient, you are hereby 
> notified that any review, retransmission, dissemination, distribution, 
> copying or other use of, or taking of any action in reliance upon this 
> information is strictly prohibited. If you have received this communication 
> in error, please contact the sender and delete the material from your 
> computer.
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
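
For reference, a sketch of the cycle implied by the commands quoted above: start the
zinc server once and reuse it across builds. The mvn flags are illustrative and should
match whatever profiles your build needs; the -shutdown step assumes your zinc version
provides it.

    zinc -scala-home "$SCALA_HOME" -nailed -start   # start once; uses the installed Scala, not the bundled 2.9.2
    mvn -DskipTests clean package                   # compiles through the running zinc/nailgun server
    zinc -shutdown                                  # stop the server when finished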



Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
ok, we're back up w/all new jenkins workers.  i'll be keeping an eye on
these pretty closely today for any build failures caused by the new
systems, and if things look bleak, i'll switch back to the original five.

thanks for your patience!

On Fri, Dec 12, 2014 at 8:47 AM, shane knapp  wrote:

> downtime is extended to 10am PST so that i can finish testing the numpy
> upgrade...  besides that, everything looks good and the system updates and
> reboots went off w/o a hitch.
>
> shane
>
> On Fri, Dec 12, 2014 at 7:26 AM, shane knapp  wrote:
>
>> reminder:  jenkins is going down NOW.
>>
>> On Thu, Dec 11, 2014 at 3:08 PM, shane knapp  wrote:
>>
>>> here's the plan...  reboots, of course, come last.  :)
>>>
>>> pause build queue at 7am, kill off (and eventually retrigger) any
>>> stragglers at 8am.  then begin maintenance:
>>>
>>> all systems:
>>> * yum update all servers (amp-jekins-master, amp-jenkins-slave-{01..05},
>>> amp-jenkins-worker-{01..08})
>>> * reboots
>>>
>>> jenkins slaves:
>>> * install python2.7 (along side 2.6, which would remain the default)
>>> * install numpy 1.9.1 (currently on 1.4, breaking some spark branch
>>> builds)
>>> * add new slaves to the master, remove old ones (keep them around just
>>> in case)
>>>
>>> there will be no jenkins system or plugin upgrades at this time.  things
>>> there seems to be working just fine!
>>>
>>> i'm expecting to be up and building by 9am at the latest.  i'll update
>>> this thread w/any new time estimates.
>>>
>>> word.
>>>
>>> shane, your rained-in devops guy :)
>>>
>>> On Wed, Dec 10, 2014 at 11:28 AM, shane knapp 
>>> wrote:
>>>
 reminder -- this is happening friday morning @ 730am!

 On Mon, Dec 1, 2014 at 5:10 PM, shane knapp 
 wrote:

> i'll send out a reminder next week, but i wanted to give a heads up:
>  i'll be bringing down the entire jenkins infrastructure for reboots and
> system updates.
>
> please let me know if there are any conflicts with this, thanks!
>
> shane
>


>>>
>>
>


Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
downtime is extended to 10am PST so that i can finish testing the numpy
upgrade...  besides that, everything looks good and the system updates and
reboots went off w/o a hitch.

shane

On Fri, Dec 12, 2014 at 7:26 AM, shane knapp  wrote:

> reminder:  jenkins is going down NOW.
>
> On Thu, Dec 11, 2014 at 3:08 PM, shane knapp  wrote:
>
>> here's the plan...  reboots, of course, come last.  :)
>>
>> pause build queue at 7am, kill off (and eventually retrigger) any
>> stragglers at 8am.  then begin maintenance:
>>
>> all systems:
>> * yum update all servers (amp-jekins-master, amp-jenkins-slave-{01..05},
>> amp-jenkins-worker-{01..08})
>> * reboots
>>
>> jenkins slaves:
>> * install python2.7 (along side 2.6, which would remain the default)
>> * install numpy 1.9.1 (currently on 1.4, breaking some spark branch
>> builds)
>> * add new slaves to the master, remove old ones (keep them around just in
>> case)
>>
>> there will be no jenkins system or plugin upgrades at this time.  things
>> there seems to be working just fine!
>>
>> i'm expecting to be up and building by 9am at the latest.  i'll update
>> this thread w/any new time estimates.
>>
>> word.
>>
>> shane, your rained-in devops guy :)
>>
>> On Wed, Dec 10, 2014 at 11:28 AM, shane knapp 
>> wrote:
>>
>>> reminder -- this is happening friday morning @ 730am!
>>>
>>> On Mon, Dec 1, 2014 at 5:10 PM, shane knapp  wrote:
>>>
 i'll send out a reminder next week, but i wanted to give a heads up:
  i'll be bringing down the entire jenkins infrastructure for reboots and
 system updates.

 please let me know if there are any conflicts with this, thanks!

 shane

>>>
>>>
>>
>


Re: Tachyon in Spark

2014-12-12 Thread Haoyuan Li
Junfeng, by the off-heap solution, did you mean "rdd.persist(OFF_HEAP)"?
That feature is different from the lineage feature. You can use this
feature (rdd.persist(OFF_HEAP)) now for any Spark version later than 1.0.0
with Tachyon without a problem.

Regarding Reynold's last email, those are good points. Tachyon had provided
this a while ago. We are working on enhancing this feature and the
integration part with Spark.

Thanks,

Haoyuan

On Fri, Dec 12, 2014 at 5:06 AM, Jun Feng Liu  wrote:
>
> I think lineage is the key feature of Tachyon for reproducing an RDD when
> an error happens. Otherwise, there would have to be data replicas among
> Tachyon nodes to ensure redundancy for fault tolerance - I think Tachyon
> is trying to avoid going down that path. Does it mean the off-heap solution
> is not ready yet if Tachyon lineage does not work right now?
>
> Best Regards
>
>
> Jun Feng Liu
> IBM China Systems & Technology Laboratory in Beijing
> Phone: 86-10-82452683
> E-mail: liuj...@cn.ibm.com
> BLD 28, ZGC Software Park
> No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
> China
>
> Reynold Xin
> 2014/12/12 10:22
> To: Andrew Ash
> Cc: Jun Feng Liu/China/IBM@IBMCN, "dev@spark.apache.org"
> Subject: Re: Tachyon in Spark
>
>
>
>
> Actually HY emailed me offline about this and this is supported in the
> latest version of Tachyon. It is a hard problem to push this into storage;
> need to think about how to handle isolation, resource allocation, etc.
>
>
> https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java
>
> On Thu, Dec 11, 2014 at 3:54 PM, Reynold Xin  wrote:
>
> > I don't think the lineage thing is even turned on in Tachyon - it was
> > mostly a research prototype, so I don't think it'd make sense for us to
> use
> > that.
> >
> >
> > On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash 
> wrote:
> >
> >> I'm interested in understanding this as well.  One of the main ways
> >> Tachyon
> >> is supposed to realize performance gains without sacrificing durability
> is
> >> by storing the lineage of data rather than full copies of it (similar to
> >> Spark).  But if Spark isn't sending lineage information into Tachyon,
> then
> >> I'm not sure how this isn't a durability concern.
> >>
> >> On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu 
> wrote:
> >>
> >> > Does Spark today really leverage Tachyon lineage to process data? It seems
> >> > like the application should call the createDependency function in TachyonFS
> >> > to create a new lineage node. But I did not find any place that calls it in
> >> > Spark code. Did I miss anything?
> >> >
> >> > Best Regards
> >> >
> >> >
> >> > Jun Feng Liu
> >> > IBM China Systems & Technology Laboratory in Beijing
> >> > Phone: 86-10-82452683
> >> > E-mail: liuj...@cn.ibm.com
> >> > BLD 28, ZGC Software Park
> >> > No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
> >> > China
> >> >
> >> >
> >> >
> >> >
> >> >
> >>
> >
> >
>
>

-- 
Haoyuan Li
AMPLab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/
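
For reference, a minimal Scala sketch of the rdd.persist(OFF_HEAP) usage mentioned
above. The Tachyon master URL and input path are placeholders, and spark.tachyonStore.url
must point at a running Tachyon master for off-heap storage to work.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("tachyon-off-heap-example")
      .set("spark.tachyonStore.url", "tachyon://localhost:19998")  // placeholder master URL
    val sc = new SparkContext(conf)

    val rdd = sc.textFile("hdfs:///path/to/input")  // placeholder input path
    rdd.persist(StorageLevel.OFF_HEAP)              // blocks are stored in Tachyon, off the JVM heap
    rdd.count()                                     // materializes the RDD into Tachyon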


Re: jenkins downtime: 730-930am, 12/12/14

2014-12-12 Thread shane knapp
reminder:  jenkins is going down NOW.

On Thu, Dec 11, 2014 at 3:08 PM, shane knapp  wrote:

> here's the plan...  reboots, of course, come last.  :)
>
> pause build queue at 7am, kill off (and eventually retrigger) any
> stragglers at 8am.  then begin maintenance:
>
> all systems:
> * yum update all servers (amp-jekins-master, amp-jenkins-slave-{01..05},
> amp-jenkins-worker-{01..08})
> * reboots
>
> jenkins slaves:
> * install python2.7 (along side 2.6, which would remain the default)
> * install numpy 1.9.1 (currently on 1.4, breaking some spark branch builds)
> * add new slaves to the master, remove old ones (keep them around just in
> case)
>
> there will be no jenkins system or plugin upgrades at this time.  things
> there seems to be working just fine!
>
> i'm expecting to be up and building by 9am at the latest.  i'll update
> this thread w/any new time estimates.
>
> word.
>
> shane, your rained-in devops guy :)
>
> On Wed, Dec 10, 2014 at 11:28 AM, shane knapp  wrote:
>
>> reminder -- this is happening friday morning @ 730am!
>>
>> On Mon, Dec 1, 2014 at 5:10 PM, shane knapp  wrote:
>>
>>> i'll send out a reminder next week, but i wanted to give a heads up:
>>>  i'll be bringing down the entire jenkins infrastructure for reboots and
>>> system updates.
>>>
>>> please let me know if there are any conflicts with this, thanks!
>>>
>>> shane
>>>
>>
>>
>


Re: Tachyon in Spark

2014-12-12 Thread Jun Feng Liu
I think lineage is the key feature of Tachyon for reproducing an RDD when
an error happens. Otherwise, there would have to be data replicas among
Tachyon nodes to ensure redundancy for fault tolerance - I think Tachyon is
trying to avoid going down that path. Does it mean the off-heap solution is
not ready yet if Tachyon lineage does not work right now?
 
Best Regards
 
Jun Feng Liu
IBM China Systems & Technology Laboratory in Beijing



Phone: 86-10-82452683
E-mail: liuj...@cn.ibm.com
BLD 28, ZGC Software Park
No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
China


Reynold Xin
2014/12/12 10:22
To: Andrew Ash
Cc: Jun Feng Liu/China/IBM@IBMCN, "dev@spark.apache.org"
Subject: Re: Tachyon in Spark






Actually HY emailed me offline about this and this is supported in the
latest version of Tachyon. It is a hard problem to push this into storage;
need to think about how to handle isolation, resource allocation, etc.

https://github.com/amplab/tachyon/blob/master/core/src/main/java/tachyon/master/Dependency.java


On Thu, Dec 11, 2014 at 3:54 PM, Reynold Xin  wrote:

> I don't think the lineage thing is even turned on in Tachyon - it was
> mostly a research prototype, so I don't think it'd make sense for us to 
use
> that.
>
>
> On Thu, Dec 11, 2014 at 3:51 PM, Andrew Ash  
wrote:
>
>> I'm interested in understanding this as well.  One of the main ways
>> Tachyon
>> is supposed to realize performance gains without sacrificing durability 
is
>> by storing the lineage of data rather than full copies of it (similar 
to
>> Spark).  But if Spark isn't sending lineage information into Tachyon, 
then
>> I'm not sure how this isn't a durability concern.
>>
>> On Wed, Dec 10, 2014 at 5:47 AM, Jun Feng Liu  
wrote:
>>
>> > Does Spark today really leverage Tachyon lineage to process data? It seems
>> > like the application should call the createDependency function in TachyonFS
>> > to create a new lineage node. But I did not find any place that calls it in
>> > Spark code. Did I miss anything?
>> >
>> > Best Regards
>> >
>> >
>> > Jun Feng Liu
>> > IBM China Systems & Technology Laboratory in Beijing
>> > Phone: 86-10-82452683
>> > E-mail: liuj...@cn.ibm.com
>> > BLD 28, ZGC Software Park
>> > No.8 Rd. Dong Bei Wang West, Dist. Haidian, Beijing 100193
>> > China
>> >
>> >
>> >
>> >
>> >
>>
>
>



JavaScript run-time contribution to Spark

2014-12-12 Thread Ohad Assulin
Hello there.
I am running the Internet Technologies Lab at HUJI (http://new.huji.ac.il/en).
A team of my students would like to contribute a JavaScript run-time (
node.js/v8 based) to Spark.

I wonder:
(1) What do you think about the necessity of such a project?
(2) Where should we get started? We only have experience as Spark users.
What are the right docs? Who should we talk to? What architecture guidance
should we follow? etc.

Thanks!
Ohad Assulin