Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
BTW, after I revert SPARK-784, I can see all the jars under
lib_managed/jars.

On Tue, Nov 17, 2015 at 2:46 PM, Jeff Zhang  wrote:

> Hi Josh,
>
> I noticed that the comments in https://github.com/apache/spark/pull/9575 say
> that the DataNucleus-related jars will still be copied to lib_managed/jars.
> But I don't see any jars under lib_managed/jars. The weird thing is that I
> see the jars on another machine, but I can't see them on my laptop even
> after I delete the whole Spark project and start from scratch. Could it be
> related to the environment? I tried adding the following code to
> SparkBuild.scala to track the issue down, and it shows that `jars` is empty.
> Any thoughts on that?
>
>
> deployDatanucleusJars := {
>   val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data)
> .filter(_.getPath.contains("org.datanucleus"))
>   // this is what I added
>   println("*")
>   println("fullClasspath:"+fullClasspath)
>   println("assembly:"+assembly)
>   println("jars:"+jars.map(_.getAbsolutePath()).mkString(","))
>   //
>
>
> On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang  wrote:
>
>> This is the exception I got
>>
>> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default
>> database after error: Class
>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>> javax.jdo.JDOFatalUserException: Class
>> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
>> at
>> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>> at
>> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>> at
>> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>> at
>> org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
>> at
>> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
>> at
>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
>> at
>> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
>> at
>> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
>> at
>> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
>> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>> at
>> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>> at
>> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
>> at
>> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>>
>> On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang  wrote:
>>
>>> It's about the DataNucleus-related jars, which are needed by Spark SQL.
>>> Without these jars, I cannot call the DataFrame-related API (I have
>>> HiveContext enabled).
>>>
>>>
>>>
>>> On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen 
>>> wrote:
>>>
 As of https://github.com/apache/spark/pull/9575, Spark's build will no
 longer place every dependency JAR into lib_managed. Can you say more about
 how this affected spark-shell for you (maybe share a stacktrace)?

 On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang  wrote:

>
> Sometimes the jars under lib_managed are missing, and even after I rebuild
> Spark, the jars under lib_managed are still not downloaded. This causes
> spark-shell to fail because of the missing jars. Has anyone hit this weird
> issue?
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff 

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
Hi Josh,

I noticed that the comments in https://github.com/apache/spark/pull/9575 say
that the DataNucleus-related jars will still be copied to lib_managed/jars. But
I don't see any jars under lib_managed/jars. The weird thing is that I see
the jars on another machine, but I can't see them on my laptop even after
I delete the whole Spark project and start from scratch. Could it be related
to the environment? I tried adding the following code to SparkBuild.scala to
track the issue down, and it shows that `jars` is empty. Any thoughts on that?


deployDatanucleusJars := {
  val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data)
.filter(_.getPath.contains("org.datanucleus"))
  // this is what I added
  println("*")
  println("fullClasspath:"+fullClasspath)
  println("assembly:"+assembly)
  println("jars:"+jars.map(_.getAbsolutePath()).mkString(","))
  //


On Mon, Nov 16, 2015 at 4:51 PM, Jeff Zhang  wrote:

> This is the exception I got
>
> 15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default
> database after error: Class
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
> javax.jdo.JDOFatalUserException: Class
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
> at
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
> at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
> at
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
> at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
> at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
> at
> org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
> at
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
> at
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
> at
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> at
> org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
> at
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
>
> On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang  wrote:
>
>> It's about the DataNucleus-related jars, which are needed by Spark SQL.
>> Without these jars, I cannot call the DataFrame-related API (I have
>> HiveContext enabled).
>>
>>
>>
>> On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen 
>> wrote:
>>
>>> As of https://github.com/apache/spark/pull/9575, Spark's build will no
>>> longer place every dependency JAR into lib_managed. Can you say more about
>>> how this affected spark-shell for you (maybe share a stacktrace)?
>>>
>>> On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang  wrote:
>>>

 Sometimes the jars under lib_managed are missing, and even after I rebuild
 Spark, the jars under lib_managed are still not downloaded. This causes
 spark-shell to fail because of the missing jars. Has anyone hit this weird
 issue?



 --
 Best Regards

 Jeff Zhang

>>>
>>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Saisai Shao
Kafka now has built-in support for managing offset metadata itself besides ZK;
it is easy to use and to switch to from the current ZK implementation. I think
the question here is whether we need to manage offsets at the Spark Streaming
level or leave that to the user.

If you want to manage offsets at the user level, with Spark offering a
convenient API, I think Cody's patch (
https://issues.apache.org/jira/browse/SPARK-10963) could satisfy your needs.

If you hope to let Spark Streaming manage offsets for you (transparently to
the user), I had a PR for that before, but the community is inclined to leave
this at the user level.

On Tue, Nov 17, 2015 at 9:27 AM, Nick Evans  wrote:

> The only dependency on Zookeeper I see is here:
> https://github.com/apache/spark/blob/1c5475f1401d2233f4c61f213d1e2c2ee9673067/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L244-L247
>
> If that's the only line that depends on Zookeeper, we could probably try
> to implement an abstract offset manager that could be switched out in
> favour of the new offset management system, yes? I
> know kafka.consumer.Consumer currently depends on Zookeeper, but I'm
> guessing this library will eventually be updated to use the new method.
>
> On Mon, Nov 16, 2015 at 5:28 PM, Cody Koeninger 
> wrote:
>
>> There are already private methods in the code for interacting with
>> Kafka's offset management api.
>>
>> There's a jira for making those methods public, but TD has been reluctant
>> to merge it
>>
>> https://issues.apache.org/jira/browse/SPARK-10963
>>
>> I think adding any ZK specific behavior to spark is a bad idea, since ZK
>> may no longer be the preferred storage location for Kafka offsets within
>> the next year.
>>
>>
>>
>> On Mon, Nov 16, 2015 at 9:53 AM, Nick Evans  wrote:
>>
>>> I really like the Streaming receiverless API for Kafka streaming jobs,
>>> but I'm finding the manual offset management adds a fair bit of complexity.
>>> I'm sure that others feel the same way, so I'm proposing that we add the
>>> ability to have consumer offsets managed via an easy-to-use API. This would
>>> be done similarly to how it is done in the receiver API.
>>>
>>> I haven't written any code yet, but I've looked at the current version
>>> of the codebase and have an idea of how it could be done.
>>>
>>> To keep the size of the pull requests small, I propose that the
>>> following distinct features are added in order:
>>>
>>>1. If a group ID is set in the Kafka params, and also if fromOffsets
>>>is not passed in to createDirectStream, then attempt to resume from the
>>>remembered offsets for that group ID.
>>>2. Add a method on KafkaRDDs that commits the offsets for that
>>>KafkaRDD to Zookeeper.
>>>3. Update the Python API with any necessary changes.
>>>
>>> My goal is to not break the existing API while adding the new
>>> functionality.
>>>
>>> One point that I'm not sure of is regarding the first point. I'm not
>>> sure whether it's a better idea to set the group ID as mentioned through
>>> Kafka params, or to define a new overload of createDirectStream that
>>> expects the group ID in place of the fromOffsets param. I think the latter
>>> is a cleaner interface, but I'm not sure whether adding a new param is a
>>> good idea.
>>>
>>> If anyone has any feedback on this general approach, I'd be very
>>> grateful. I'm going to open a JIRA in the next couple days and begin
>>> working on the first point, but I think comments from the community would
>>> be very helpful on building a good API here.
>>>
>>>
>>
>
>
> --
> *Nick Evans* 
> P. (613) 793-5565
> LinkedIn  | Website 
>
>


Re: let spark streaming sample come to stop

2015-11-16 Thread Bryan Cutler
Hi Renyi,

This is the intended behavior of the streaming HdfsWordCount example. It
makes use of a 'textFileStream', which monitors an HDFS directory for any
newly created files and pushes them into a DStream. It is meant to be run
indefinitely, unless interrupted by Ctrl-C, for example.

-bryan
On Nov 13, 2015 10:52 AM, "Renyi Xiong"  wrote:

> Hi,
>
> I tried to run the following 1.4.1 sample by putting a words.txt under
> localdir
>
> bin\run-example org.apache.spark.examples.streaming.HdfsWordCount localdir
>
> 2 questions
>
> 1. It does not pick up words.txt, because it's 'old' I guess - is there any
> option to have it picked up?
> 2. I managed to put in a 'new' file on the fly, which got picked up, but after
> processing, the program doesn't stop (it keeps generating empty RDDs instead).
> Is there any option to make it stop when no new files come in (otherwise it
> blocks others when I want to run multiple samples)?
>
> Thanks,
> Renyi.
>
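
For question 2 above, one hedged option is to bound the run from the driver
instead of waiting for new files forever. A minimal sketch (the directory,
durations and the local master are illustrative, not part of the original
example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("BoundedHdfsWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(2))

// textFileStream only picks up files created after the stream starts,
// which is why a pre-existing words.txt is ignored
val words = ssc.textFileStream("localdir").flatMap(_.split(" "))
words.map(w => (w, 1)).reduceByKey(_ + _).print()

ssc.start()
// run for a bounded amount of wall-clock time rather than indefinitely
ssc.awaitTerminationOrTimeout(60 * 1000)
ssc.stop(stopSparkContext = true, stopGracefully = true)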


Mesos cluster dispatcher doesn't respect most args from the submit req

2015-11-16 Thread Jo Voordeckers
Hi all,

I'm running the Mesos cluster dispatcher; however, when I submit jobs,
things like JVM args, classpath order and the UI port aren't added to the
command line executed by the Mesos scheduler. In fact it only cares about
the class, the jar and the number of cores / amount of memory.

https://github.com/jayv/spark/blob/mesos_cluster_params/core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosClusterScheduler.scala#L412-L424

I've made an attempt at adding a few of the args that I believe are useful
to the MesosClusterScheduler class, which seems to solve my problem.

Please have a look:

https://github.com/apache/spark/pull/9752

Thanks

- Jo Voordeckers


Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
The only dependency on Zookeeper I see is here:
https://github.com/apache/spark/blob/1c5475f1401d2233f4c61f213d1e2c2ee9673067/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/ReliableKafkaReceiver.scala#L244-L247

If that's the only line that depends on Zookeeper, we could probably try to
implement an abstract offset manager that could be switched out in favour
of the new offset management system, yes? I know kafka.consumer.Consumer
currently depends on Zookeeper, but I'm guessing this library will
eventually be updated to use the new method.

On Mon, Nov 16, 2015 at 5:28 PM, Cody Koeninger  wrote:

> There are already private methods in the code for interacting with Kafka's
> offset management api.
>
> There's a jira for making those methods public, but TD has been reluctant
> to merge it
>
> https://issues.apache.org/jira/browse/SPARK-10963
>
> I think adding any ZK specific behavior to spark is a bad idea, since ZK
> may no longer be the preferred storage location for Kafka offsets within
> the next year.
>
>
>
> On Mon, Nov 16, 2015 at 9:53 AM, Nick Evans  wrote:
>
>> I really like the Streaming receiverless API for Kafka streaming jobs,
>> but I'm finding the manual offset management adds a fair bit of complexity.
>> I'm sure that others feel the same way, so I'm proposing that we add the
>> ability to have consumer offsets managed via an easy-to-use API. This would
>> be done similarly to how it is done in the receiver API.
>>
>> I haven't written any code yet, but I've looked at the current version of
>> the codebase and have an idea of how it could be done.
>>
>> To keep the size of the pull requests small, I propose that the following
>> distinct features are added in order:
>>
>>1. If a group ID is set in the Kafka params, and also if fromOffsets
>>is not passed in to createDirectStream, then attempt to resume from the
>>remembered offsets for that group ID.
>>2. Add a method on KafkaRDDs that commits the offsets for that
>>KafkaRDD to Zookeeper.
>>3. Update the Python API with any necessary changes.
>>
>> My goal is to not break the existing API while adding the new
>> functionality.
>>
>> One point that I'm not sure of is regarding the first point. I'm not sure
>> whether it's a better idea to set the group ID as mentioned through Kafka
>> params, or to define a new overload of createDirectStream that expects the
>> group ID in place of the fromOffsets param. I think the latter is a cleaner
>> interface, but I'm not sure whether adding a new param is a good idea.
>>
>> If anyone has any feedback on this general approach, I'd be very
>> grateful. I'm going to open a JIRA in the next couple days and begin
>> working on the first point, but I think comments from the community would
>> be very helpful on building a good API here.
>>
>>
>


-- 
*Nick Evans* 
P. (613) 793-5565
LinkedIn  | Website 


Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Jeff Zhang
+1

On Tue, Nov 17, 2015 at 7:43 AM, Joseph Bradley 
wrote:

> That sounds useful; would you mind submitting a JIRA (and a PR if you're
> willing)?
> Thanks,
> Joseph
>
> On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier 
> wrote:
>
>> Hi,
>>
>> MLUtils.loadLibSVMFile verifies that indices are 1-based and
>> increasing, and otherwise triggers an error. I'd like to suggest that
>> the error message be a little more informative. I ran into this when
>> loading a malformed file. Exactly what gets printed isn't too crucial;
>> maybe you would want to print something else. All that matters is to
>> give some context so that the user can find the problem more quickly.
>>
>> Hope this helps in some way.
>>
>> Robert Dodier
>>
>> PS.
>>
>> diff --git
>> a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
>> b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
>> index 81c2f0c..6f5f680 100644
>> --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
>> +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
>> @@ -91,7 +91,7 @@ object MLUtils {
>>  val indicesLength = indices.length
>>  while (i < indicesLength) {
>>val current = indices(i)
>> -  require(current > previous, "indices should be one-based
>> and in ascending order" )
>> +  require(current > previous, "indices should be one-based
>> and in ascending order; found current=" + current + ", previous=" +
>> previous + "; line=\"" + line + "\"" )
>>previous = current
>>i += 1
>>  }
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


-- 
Best Regards

Jeff Zhang


Re: Spark 1.4.2 release and votes conversation?

2015-11-16 Thread Andrew Lee
I did, and it passes all of our test cases, so I'm wondering what I missed. I
know there is the memory-leak spill JIRA SPARK-11293, but I'm not sure whether
that will go into 1.4.2 or 1.4.3, etc.




From: Reynold Xin 
Sent: Friday, November 13, 2015 1:31 PM
To: Andrew Lee
Cc: dev@spark.apache.org
Subject: Re: Spark 1.4.2 release and votes conversation?

In the interim, you can just build it off branch-1.4 if you want.


On Fri, Nov 13, 2015 at 1:30 PM, Reynold Xin 
mailto:r...@databricks.com>> wrote:
I actually tried to build a binary for 1.4.2 and wanted to start voting, but 
there was an issue with the release script that failed the jenkins job. Would 
be great to kick off a 1.4.2 release.


On Fri, Nov 13, 2015 at 1:00 PM, Andrew Lee 
mailto:alee...@hotmail.com>> wrote:

Hi All,


I'm wondering whether Spark 1.4.2 has been voted on by any chance, or whether I
have overlooked it and we are targeting 1.4.3?


By looking at the JIRA

https://issues.apache.org/jira/browse/SPARK/fixforversion/12332833/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel


All issues are resolved and there are no blockers. Does anyone know what
happened to this release?



Or was there a recommendation to skip it and ask users to use Spark 1.5.2
instead?




Re: Unchecked contribution (JIRA and PR)

2015-11-16 Thread Joseph Bradley
Hi Sergio,

Apart from apologies about limited review bandwidth (from me too!), I
wanted to add: It would be interesting to hear what feedback you've gotten
from users of your package.  Perhaps you could collect feedback by (a)
emailing the user list and (b) adding a note in the Spark Packages pointing
to the JIRA, and encouraging users to add their comments directly to the
JIRA.  That'd be a nice way to get a sense of use cases and priority.

Thanks for your patience,
Joseph

On Wed, Nov 4, 2015 at 7:23 AM, Sergio Ramírez  wrote:

> OK, for me, time is not a problem. I was just worried that there was no
> movement on those issues. I think they are good contributions. For example,
> I have found no complex discretization algorithm in MLlib, which is rare.
> My algorithm, a Spark implementation of the well-known discretizer developed
> by Fayyad and Irani, could be considered a good starting point for the
> discretization part. Furthermore, it is also supported by two scientific
> articles.
>
> Anyway, I uploaded these two algorithms as two different packages to
> spark-packages.org, but I would like to contribute directly to MLlib. I
> understand you have a lot of requests, and it is not possible to include
> all the contributions made by the Spark community.
>
> I'll be patient and ready to collaborate.
>
> Thanks again
>
>
> On 03/11/15 16:30, Jerry Lam wrote:
>
> Sergio, you are not alone for sure. Check the RowSimilarity implementation
> [SPARK-4823]. It has been there for 6 months. It is very likely that
> contributions which don't get merged into the version of Spark they were
> developed against will never be merged, because Spark changes quite
> significantly from version to version if the algorithm depends a lot on
> internal APIs.
>
> On Tue, Nov 3, 2015 at 10:24 AM, Reynold Xin  wrote:
>
>> Sergio,
>>
>> Usually it takes a lot of effort to get something merged into Spark
>> itself, especially for relatively new algorithms that might not have
>> established themselves yet. I will leave it to MLlib maintainers to comment on
>> the specifics of the individual algorithms proposed here.
>>
>> Just another general comment: we have been working on making packages be
>> as easy to use as possible for Spark users. Right now it only requires a
>> simple flag to pass to the spark-submit script to include a package.
>>
>>
>> On Tue, Nov 3, 2015 at 2:49 AM, Sergio Ramírez < 
>> sramire...@ugr.es> wrote:
>>
>>> Hello all:
>>>
>>> I developed two packages for MLlib in March. These have also been uploaded
>>> to the spark-packages repository. Associated with these packages, I created
>>> two JIRA issues and the corresponding pull requests, which are listed
>>> below:
>>>
>>> https://github.com/apache/spark/pull/5184
>>> https://github.com/apache/spark/pull/5170
>>>
>>> https://issues.apache.org/jira/browse/SPARK-6531
>>> https://issues.apache.org/jira/browse/SPARK-6509
>>>
>>> These remain unassigned in JIRA and unverified in GitHub.
>>>
>>> Could anyone explain why they are still in this state? Is this normal?
>>>
>>> Thanks!
>>>
>>> Sergio R.
>>>
>>> --
>>>
>>> Sergio Ramírez Gallego
>>> Research group on Soft Computing and Intelligent Information Systems,
>>> Dept. Computer Science and Artificial Intelligence,
>>> University of Granada, Granada, Spain.
>>> Email: srami...@decsai.ugr.es
>>> Research Group URL: http://sci2s.ugr.es/
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: 
>>> dev-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
>
> Sergio Ramírez Gallego
> Research group on Soft Computing and Intelligent Information Systems,
> Dept. Computer Science and Arti

Re: Spark Implementation of XGBoost

2015-11-16 Thread Joseph Bradley
One comment about
"""
1) I agree the sorting method you suggested is a very efficient way to
handle the unordered categorical variables in binary classification
and regression. I propose we have a Spark ML Transformer to do the
sorting and encoding, bringing the benefits to many tree based
methods. How about I open a jira for this?
"""

--> MLlib trees do this currently, so you could check out that code as an
example.
I'm not sure how this would work as a generic transformer, though; it seems
more like an internal part of space-partitioning algorithms.



On Tue, Oct 27, 2015 at 5:04 PM, Meihua Wu 
wrote:

> Hi DB Tsai,
>
> Thank you again for your insightful comments!
>
> 1) I agree the sorting method you suggested is a very efficient way to
> handle the unordered categorical variables in binary classification
> and regression. I propose we have a Spark ML Transformer to do the
> sorting and encoding, bringing the benefits to many tree based
> methods. How about I open a jira for this?
>
> 2) For L2/L1 regularization vs learning rate (I use this name instead of
> shrinkage to avoid confusion), I have the following observations:
>
> Suppose G and H are the sum (over the data assigned to a leaf node) of
> the 1st and 2nd derivative of the loss evaluated at f_m, respectively.
> Then for this leaf node,
>
> * With a learning rate eta, f_{m+1} = f_m - G/H*eta
>
> * With a L2 regularization coefficient lambda, f_{m+1} =f_m - G/(H+lambda)
>
> If H>0 (convex loss), both approach lead to "shrinkage":
>
> * For the learning rate approach, the percentage of shrinkage is
> uniform for any leaf node.
>
> * For L2 regularization, the percentage of shrinkage would adapt to
> the number of instances assigned to a leaf node: more instances =>
> larger G and H => less shrinkage. This behavior is intuitive to me. If
> the value estimated from this node is based on a large amount of data,
> the value should be reliable and less shrinkage is needed.
>
> I suppose we could have something similar for L1.
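
A tiny numeric sketch of the two update rules compared above (all numbers are
made up) to make the shrinkage difference concrete:

def lrUpdate(f: Double, g: Double, h: Double, eta: Double): Double    = f - (g / h) * eta
def l2Update(f: Double, g: Double, h: Double, lambda: Double): Double = f - g / (h + lambda)

// a leaf backed by few instances (small G and H) shrinks a lot under L2 ...
println(l2Update(0.0, g = 2.0, h = 4.0, lambda = 4.0))     // -0.25 (vs -0.5 unregularized)
// ... while a leaf backed by many instances shrinks much less
println(l2Update(0.0, g = 200.0, h = 400.0, lambda = 4.0)) // about -0.495
// the learning-rate update shrinks every leaf by the same fraction
println(lrUpdate(0.0, g = 2.0, h = 4.0, eta = 0.5))        // -0.25
println(lrUpdate(0.0, g = 200.0, h = 400.0, eta = 0.5))    // -0.25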
>
> I am not aware of theoretical results to conclude which method is
> better. Likely to be dependent on the data at hand. Implementing
> learning rate is on my radar for version 0.2. I should be able to add
> it in a week or so. I will send you a note once it is done.
>
> Thanks,
>
> Meihua
>
> On Tue, Oct 27, 2015 at 1:02 AM, DB Tsai  wrote:
> > Hi Meihua,
> >
> > For categorical features, the ordinal issue can be solved by trying
> > all kinds of different partitions, 2^(q-1) - 1 of them for q values, into two
> > groups. However, it's computationally expensive. In Hastie's book, in
> > 9.2.4, the trees can be trained by sorting the residuals and being
> > learnt as if they are ordered. It can be proven that it will give the
> > optimal solution. I have a proof that this works for learning
> > regression trees through variance reduction.
> >
> > I'm also interested in understanding how the L1 and L2 regularization
> > within the boosting works (and if it helps with overfitting more than
> > shrinkage).
> >
> > Thanks.
> >
> > Sincerely,
> >
> > DB Tsai
> > --
> > Web: https://www.dbtsai.com
> > PGP Key ID: 0xAF08DF8D
> >
> >
> > On Mon, Oct 26, 2015 at 8:37 PM, Meihua Wu 
> wrote:
> >> Hi DB Tsai,
> >>
> >> Thank you very much for your interest and comment.
> >>
> >> 1) feature sub-sample is per-node, like random forest.
> >>
> >> 2) The current code heavily exploits the tree structure to speed up
> >> the learning (such as processing multiple learning node in one pass of
> >> the training data). So a generic GBM is likely to be a different
> >> codebase. Do you have any nice reference of efficient GBM? I am more
> >> than happy to look into that.
> >>
> >> 3) The algorithm accept training data as a DataFrame with the
> >> featureCol indexed by VectorIndexer. You can specify which variable is
> >> categorical in the VectorIndexer. Please note that currently all
> >> categorical variables are treated as ordered. If you want some
> >> categorical variables as unordered, you can pass the data through
> >> OneHotEncoder before the VectorIndexer. I do have a plan to handle
> >> unordered categorical variable using the approach in RF in Spark ML
> >> (Please see roadmap in the README.md)
> >>
> >> Thanks,
> >>
> >> Meihua
> >>
> >>
> >>
> >> On Mon, Oct 26, 2015 at 4:06 PM, DB Tsai  wrote:
> >>> Interesting. For feature sub-sampling, is it per-node or per-tree? Do
> >>> you think you can implement generic GBM and have it merged as part of
> >>> Spark codebase?
> >>>
> >>> Sincerely,
> >>>
> >>> DB Tsai
> >>> --
> >>> Web: https://www.dbtsai.com
> >>> PGP Key ID: 0xAF08DF8D
> >>>
> >>>
> >>> On Mon, Oct 26, 2015 at 11:42 AM, Meihua Wu
> >>>  wrote:
>  Hi Spark User/Dev,
> 
>  Inspired by the success of XGBoost, I have created a Spark package for
>  gradient boosting tree with 2nd order approximation of arbitrary
>  us

Re: slightly more informative error message in MLUtils.loadLibSVMFile

2015-11-16 Thread Joseph Bradley
That sounds useful; would you mind submitting a JIRA (and a PR if you're
willing)?
Thanks,
Joseph

On Fri, Oct 23, 2015 at 12:43 PM, Robert Dodier 
wrote:

> Hi,
>
> MLUtils.loadLibSVMFile verifies that indices are 1-based and
> increasing, and otherwise triggers an error. I'd like to suggest that
> the error message be a little more informative. I ran into this when
> loading a malformed file. Exactly what gets printed isn't too crucial,
> maybe you would want to print something else, all that matters is to
> give some context so that the user can find the problem more quickly.
>
> Hope this helps in some way.
>
> Robert Dodier
>
> PS.
>
> diff --git a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> index 81c2f0c..6f5f680 100644
> --- a/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> +++ b/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
> @@ -91,7 +91,7 @@ object MLUtils {
>  val indicesLength = indices.length
>  while (i < indicesLength) {
>val current = indices(i)
> -  require(current > previous, "indices should be one-based
> and in ascending order" )
> +  require(current > previous, "indices should be one-based
> and in ascending order; found current=" + current + ", previous=" +
> previous + "; line=\"" + line + "\"" )
>previous = current
>i += 1
>  }
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
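
A hedged illustration of the constraint that require() enforces, runnable from
spark-shell (the path and file contents below are made up): feature indices
must be one-based and strictly ascending within each line.

//   valid line:    1.0 1:0.5 3:0.25 7:1.0
//   invalid line:  1.0 3:0.25 1:0.5 7:1.0   <- out-of-order indices trigger the error
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "/tmp/sample_libsvm.txt")
println(data.first())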


Re: Persisting DStreams

2015-11-16 Thread Jean-Baptiste Onofré

Hi Fernando,

the "persistence" of a DStream is defined depending of the StorageLevel.

Window is not related to persistence: it's the processing of multiple 
DStream in one, a kind of "gather of DStreams". The transformation is 
applied on a "slide window". For instance, you define a window of 3 
DStreams, and you apply the transformation on this window.


I hope it helps.

Regards
JB

On 11/16/2015 07:38 PM, Fernando O. wrote:

Hi all,
I was wondering if someone could give me a brief explanation or
point me in the right direction in the code for where DStream
persistence is done.
I'm looking at DStream.java but all it does is setting the StorageLevel,
and neither WindowedDStream or ReducedWindowedDStream seem to change
that behaviour a lot:
WindowedDStream delegates on the parent and
ReducedWindowedDStream calls super and delegates on an aggregated DStream

So it seems like persist will only change the StorageLevel and then you
need to compute the DStream to persist the inner RDDs

So basically what I'm trying to verify is: persist is just a lazy
method that sets the storage level, and the actual
persistence takes place when you compute the DStream, right?






--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Persisting DStreams

2015-11-16 Thread Fernando O.
Hi all,
   I was wondering if someone could give me a brief explanation or point me
in the right direction in the code for where DStream persistence is done.
I'm looking at DStream.java but all it does is setting the StorageLevel,
and neither WindowedDStream or ReducedWindowedDStream seem to change that
behaviour a lot:
WindowedDStream delegates on the parent and
ReducedWindowedDStream calls super and delegates on an aggregated DStream

So it seems like persist will only change the StorageLevel and then you
need to compute the DStream to persist the inner RDDs

So basically what I'm trying to verify is: persist is just a lazy method that
sets the storage level, and the actual persistence takes place
when you compute the DStream, right?
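
That is how it reads in the code. A minimal sketch of the behaviour (host,
port and durations are illustrative): persist() only records the storage
level, and the RDD generated for each batch is cached when that batch is
actually computed.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("persist-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val windowed = lines.window(Seconds(30), Seconds(10))

// lazy: this only records the storage level on the DStream
windowed.persist(StorageLevel.MEMORY_ONLY_SER)

// the per-batch RDDs are actually cached only when each batch is computed here
windowed.count().print()

ssc.start()
ssc.awaitTermination()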


Re: Sort Merge Join from the filesystem

2015-11-16 Thread Alex Nastetsky
Done, thanks.

On Mon, Nov 9, 2015 at 7:23 PM, Cheng, Hao  wrote:

> Yes, we definitely need to think about how to handle this case; it is probably
> even more common than the case where both tables are already sorted/partitioned.
> Can you jump to the JIRA and leave a comment there?
>
>
>
> *From:* Alex Nastetsky [mailto:alex.nastet...@vervemobile.com]
> *Sent:* Tuesday, November 10, 2015 3:03 AM
> *To:* Cheng, Hao
> *Cc:* Reynold Xin; dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> Thanks for creating that ticket.
>
>
>
> Another thing I was thinking of, is doing this type of join between
> dataset A which is already partitioned/sorted on disk and dataset B, which
> gets generated during the run of the application.
>
>
>
> Dataset B would need something like repartitionAndSortWithinPartitions to
> be performed on it, using the same partitioner that was used with dataset
> A. Then dataset B could be joined with dataset A without needing to write
> it to disk first (unless it's too big to fit in memory, then it would need
> to be [partially] spilled).
>
>
>
> On Wed, Nov 4, 2015 at 7:51 PM, Cheng, Hao  wrote:
>
> Yes, we probably need more changes to the data source API if we want to
> implement it in a generic way.
>
> BTW, I created the JIRA by copying most of the words from Alex. :)
>
>
>
> https://issues.apache.org/jira/browse/SPARK-11512
>
>
>
>
>
> *From:* Reynold Xin [mailto:r...@databricks.com]
> *Sent:* Thursday, November 5, 2015 1:36 AM
> *To:* Alex Nastetsky
> *Cc:* dev@spark.apache.org
> *Subject:* Re: Sort Merge Join from the filesystem
>
>
>
> It's not supported yet, and not sure if there is a ticket for it. I don't
> think there is anything fundamentally hard here either.
>
>
>
>
>
> On Wed, Nov 4, 2015 at 6:37 AM, Alex Nastetsky <
> alex.nastet...@vervemobile.com> wrote:
>
> (this is kind of a cross-post from the user list)
>
>
>
> Does Spark support doing a sort merge join on two datasets on the file
> system that have already been partitioned the same with the same number of
> partitions and sorted within each partition, without needing to
> repartition/sort them again?
>
>
>
> This functionality exists in
>
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
>
> - Pig (USING 'merge')
>
> - MapReduce (CompositeInputFormat)
>
>
>
> If this is not supported in Spark, is a ticket already open for it? Does
> the Spark architecture present unique difficulties to having this feature?
>
>
>
> It is very useful to have this ability, as you can prepare dataset A to be
> joined with dataset B before B even exists, by pre-processing A with a
> partition/sort.
>
>
>
> Thanks.
>
>
>
>
>
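
A hedged RDD-level sketch of the "prepare B to match A" idea from the quoted
message (the paths, key extraction and partition count are illustrative; this
is not the DataFrame-level support being requested in SPARK-11512):

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("presorted-join").setMaster("local[2]"))
val partitioner = new HashPartitioner(8)

// dataset A: prepared ahead of time with the agreed partitioner and sort order
val a = sc.textFile("/data/a").map(line => (line.split(",")(0), line))
  .repartitionAndSortWithinPartitions(partitioner)

// dataset B: produced during the run, brought to the same layout before the join
val b = sc.textFile("/data/b").map(line => (line.split(",")(0), line))
  .repartitionAndSortWithinPartitions(partitioner)

// because both sides now share the same partitioner, the join itself adds no extra shuffle
a.join(b).take(5).foreach(println)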


Re: Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Cody Koeninger
There are already private methods in the code for interacting with Kafka's
offset management api.

There's a jira for making those methods public, but TD has been reluctant
to merge it

https://issues.apache.org/jira/browse/SPARK-10963

I think adding any ZK specific behavior to spark is a bad idea, since ZK
may no longer be the preferred storage location for Kafka offsets within
the next year.



On Mon, Nov 16, 2015 at 9:53 AM, Nick Evans  wrote:

> I really like the Streaming receiverless API for Kafka streaming jobs, but
> I'm finding the manual offset management adds a fair bit of complexity. I'm
> sure that others feel the same way, so I'm proposing that we add the
> ability to have consumer offsets managed via an easy-to-use API. This would
> be done similarly to how it is done in the receiver API.
>
> I haven't written any code yet, but I've looked at the current version of
> the codebase and have an idea of how it could be done.
>
> To keep the size of the pull requests small, I propose that the following
> distinct features are added in order:
>
>1. If a group ID is set in the Kafka params, and also if fromOffsets
>is not passed in to createDirectStream, then attempt to resume from the
>remembered offsets for that group ID.
>2. Add a method on KafkaRDDs that commits the offsets for that
>KafkaRDD to Zookeeper.
>3. Update the Python API with any necessary changes.
>
> My goal is to not break the existing API while adding the new
> functionality.
>
> One point that I'm not sure of is regarding the first point. I'm not sure
> whether it's a better idea to set the group ID as mentioned through Kafka
> params, or to define a new overload of createDirectStream that expects the
> group ID in place of the fromOffsets param. I think the latter is a cleaner
> interface, but I'm not sure whether adding a new param is a good idea.
>
> If anyone has any feedback on this general approach, I'd be very grateful.
> I'm going to open a JIRA in the next couple days and begin working on the
> first point, but I think comments from the community would be very helpful
> on building a good API here.
>
>


Streaming Receiverless Kafka API + Offset Management

2015-11-16 Thread Nick Evans
I really like the Streaming receiverless API for Kafka streaming jobs, but
I'm finding the manual offset management adds a fair bit of complexity. I'm
sure that others feel the same way, so I'm proposing that we add the
ability to have consumer offsets managed via an easy-to-use API. This would
be done similarly to how it is done in the receiver API.

I haven't written any code yet, but I've looked at the current version of
the codebase and have an idea of how it could be done.

To keep the size of the pull requests small, I propose that the following
distinct features are added in order:

   1. If a group ID is set in the Kafka params, and also if fromOffsets is
   not passed in to createDirectStream, then attempt to resume from the
   remembered offsets for that group ID.
   2. Add a method on KafkaRDDs that commits the offsets for that KafkaRDD
   to Zookeeper.
   3. Update the Python API with any necessary changes.

My goal is to not break the existing API while adding the new functionality.

One point that I'm not sure of is regarding the first point. I'm not sure
whether it's a better idea to set the group ID as mentioned through Kafka
params, or to define a new overload of createDirectStream that expects the
group ID in place of the fromOffsets param. I think the latter is a cleaner
interface, but I'm not sure whether adding a new param is a good idea.

If anyone has any feedback on this general approach, I'd be very grateful.
I'm going to open a JIRA in the next couple days and begin working on the
first point, but I think comments from the community would be very helpful
on building a good API here.
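
For reference, a hedged sketch of the manual bookkeeping the direct API
requires today. The broker address, topic and the saveOffsets helper are
illustrative; only KafkaUtils, HasOffsetRanges and OffsetRange come from the
existing spark-streaming-kafka API.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val conf = new SparkConf().setAppName("direct-kafka-offsets").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// hypothetical placeholder for whatever offset store the application chooses
def saveOffsets(ranges: Array[OffsetRange]): Unit =
  ranges.foreach(r => println(s"${r.topic} ${r.partition} -> ${r.untilOffset}"))

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("my-topic"))

stream.foreachRDD { rdd =>
  // the offset ranges covered by this batch; the application must persist them itself
  // (the cast works only on the RDDs produced directly by the direct stream)
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process rdd here, then record progress ...
  saveOffsets(ranges)
}

ssc.start()
ssc.awaitTermination()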


Re: releasing Spark 1.4.2

2015-11-16 Thread Ted Yu
See this thread:

http://search-hadoop.com/m/q3RTtLKc2ctNPcq&subj=Re+Spark+1+4+2+release+and+votes+conversation+

> On Nov 15, 2015, at 10:53 PM, Niranda Perera  wrote:
> 
> Hi, 
> 
> I am wondering when spark 1.4.2 will be released?
> 
> is it in the voting stage at the moment?
> 
> rgds
> 
> -- 
> Niranda 
> @n1r44
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/


Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
This is the exception I got

15/11/16 16:50:48 WARN metastore.HiveMetaStore: Retrying creating default
database after error: Class
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
javax.jdo.JDOFatalUserException: Class
org.datanucleus.api.jdo.JDOPersistenceManagerFactory was not found.
at
javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1175)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
at org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
at
org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
at
org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
at
org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:57)
at
org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:593)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:571)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:620)
at
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:461)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:66)
at
org.apache.hadoop.hive.metastore.RetryingHMSHandler.getProxy(RetryingHMSHandler.java:72)
at
org.apache.hadoop.hive.metastore.HiveMetaStore.newRetryingHMSHandler(HiveMetaStore.java:5762)
at
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:199)
at
org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:74)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at
org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1521)
at
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:86)
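
A quick, hedged sanity check that can be pasted into spark-shell to confirm
what the stack trace above implies, namely whether the DataNucleus JDO binding
is on the driver classpath at all:

try {
  Class.forName("org.datanucleus.api.jdo.JDOPersistenceManagerFactory")
  println("DataNucleus JDO implementation is on the classpath")
} catch {
  case _: ClassNotFoundException =>
    println("DataNucleus is missing; pass the datanucleus-* jars explicitly, e.g. via --jars")
}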

On Mon, Nov 16, 2015 at 4:47 PM, Jeff Zhang  wrote:

> It's about the DataNucleus-related jars, which are needed by Spark SQL.
> Without these jars, I cannot call the DataFrame-related API (I have
> HiveContext enabled).
>
>
>
> On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen 
> wrote:
>
>> As of https://github.com/apache/spark/pull/9575, Spark's build will no
>> longer place every dependency JAR into lib_managed. Can you say more about
>> how this affected spark-shell for you (maybe share a stacktrace)?
>>
>> On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang  wrote:
>>
>>>
>>> Sometimes the jars under lib_managed are missing, and even after I rebuild
>>> Spark, the jars under lib_managed are still not downloaded. This causes
>>> spark-shell to fail because of the missing jars. Has anyone hit this weird
>>> issue?
>>>
>>>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>>
>>
>>
>
>
> --
> Best Regards
>
> Jeff Zhang
>



-- 
Best Regards

Jeff Zhang


Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
It's about the DataNucleus-related jars, which are needed by Spark SQL.
Without these jars, I cannot call the DataFrame-related API (I have
HiveContext enabled).



On Mon, Nov 16, 2015 at 4:10 PM, Josh Rosen 
wrote:

> As of https://github.com/apache/spark/pull/9575, Spark's build will no
> longer place every dependency JAR into lib_managed. Can you say more about
> how this affected spark-shell for you (maybe share a stacktrace)?
>
> On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang  wrote:
>
>>
>> Sometimes the jars under lib_managed are missing, and even after I rebuild
>> Spark, the jars under lib_managed are still not downloaded. This causes
>> spark-shell to fail because of the missing jars. Has anyone hit this weird issue?
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>


-- 
Best Regards

Jeff Zhang


Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Mark Hamstra
FiloDB is also closely related.  https://github.com/tuplejump/FiloDB

On Mon, Nov 16, 2015 at 12:24 AM, Nick Pentreath 
wrote:

> Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop
> input/output format support:
> https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java
>
> On Mon, Nov 16, 2015 at 7:52 AM, Reynold Xin  wrote:
>
>> This (updates) is something we are going to think about in the next
>> release or two.
>>
>> On Thu, Nov 12, 2015 at 8:57 AM, Cristian O <
>> cristian.b.op...@googlemail.com> wrote:
>>
>>> Sorry, apparently only replied to Reynold, meant to copy the list as
>>> well, so I'm self replying and taking the opportunity to illustrate with an
>>> example.
>>>
>>> Basically I want to conceptually do this:
>>>
>>> val bigDf = sqlContext.sparkContext.parallelize((1 to 100)).map(i => 
>>> (i, 1)).toDF("k", "v")
>>> val deltaDf = sqlContext.sparkContext.parallelize(Array(1, 5)).map(i => 
>>> (i, 1)).toDF("k", "v")
>>>
>>> bigDf.cache()
>>>
>>> bigDf.registerTempTable("big")
>>> deltaDf.registerTempTable("delta")
>>>
>>> val newBigDf = sqlContext.sql("SELECT big.k, big.v + IF(delta.v is null, 0, 
>>> delta.v) FROM big LEFT JOIN delta on big.k = delta.k")
>>>
>>> newBigDf.cache()
>>> bigDf.unpersist()
>>>
>>>
>>> This is essentially an update of keys "1" and "5" only, in a dataset
>>> of 1 million keys.
>>>
>>> This can be achieved efficiently if the join would preserve the cached
>>> blocks that have been unaffected, and only copy and mutate the 2 affected
>>> blocks corresponding to the matching join keys.
>>>
>>> Statistics can determine which blocks actually need mutating. Note also
>>> that shuffling is not required assuming both dataframes are pre-partitioned
>>> by the same key K.
>>>
>>> In SQL this could actually be expressed as an UPDATE statement or for a
>>> more generalized use as a MERGE UPDATE:
>>> https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
>>>
>>> While this may seem like a very special case optimization, it would
>>> effectively implement UPDATE support for cached DataFrames, for both
>>> optimal and non-optimal usage.
>>>
>>> I appreciate there's quite a lot here, so thank you for taking the time
>>> to consider it.
>>>
>>> Cristian
>>>
>>>
>>>
>>> On 12 November 2015 at 15:49, Cristian O <
>>> cristian.b.op...@googlemail.com> wrote:
>>>
 Hi Reynold,

 Thanks for your reply.

 Parquet may very well be used as the underlying implementation, but
 this is more than about a particular storage representation.

 There are a few things here that are inter-related and open different
 possibilities, so it's hard to structure, but I'll give it a try:

 1. Checkpointing DataFrames - while a DF can be saved locally as
 parquet, just using that as a checkpoint would currently require explicitly
 reading it back. A proper checkpoint implementation would just save
 (perhaps asynchronously) and prune the logical plan while allowing to
 continue using the same DF, now backed by the checkpoint.

 It's important to prune the logical plan to avoid all kinds of issues
 that may arise from unbounded expansion with iterative use-cases, like this
 one I encountered recently:
 https://issues.apache.org/jira/browse/SPARK-11596

 But really what I'm after here is:

 2. Efficient updating of cached DataFrames - The main use case here is
 keeping a relatively large dataset cached and updating it iteratively from
 streaming. For example one would like to perform ad-hoc queries on an
 incrementally updated, cached DataFrame. I expect this is already becoming
 an increasingly common use case. Note that the dataset may require merging
 (like adding) or overrriding values by key, so simply appending is not
 sufficient.

 This is very similar in concept with updateStateByKey for regular RDDs,
 i.e. an efficient copy-on-write mechanism, albeit perhaps at CachedBatch
 level  (the row blocks for the columnar representation).

 This can be currently simulated with UNION or (OUTER) JOINs however is
 very inefficient as it requires copying and recaching the entire dataset,
 and unpersisting the original one. There are also the aforementioned
 problems with unbounded logical plans (physical plans are fine)

 These two together, checkpointing and updating cached DataFrames, would
 give fault-tolerant efficient updating of DataFrames, meaning streaming
 apps can take advantage of the compact columnar representation and Tungsten
 optimisations.

 I'm not quite sure if something like this can be achieved by other
 means or has been investigated before, hence why I'm looking for feedback
 here.

 While one could use external data stores, they would have the added IO
 penalty, plus most of what's availab

Re: Support for local disk columnar storage for DataFrames

2015-11-16 Thread Nick Pentreath
Cloudera's Kudu also looks interesting here (getkudu.io) - Hadoop
input/output format support:
https://github.com/cloudera/kudu/blob/master/java/kudu-mapreduce/src/main/java/org/kududb/mapreduce/KuduTableInputFormat.java

On Mon, Nov 16, 2015 at 7:52 AM, Reynold Xin  wrote:

> This (updates) is something we are going to think about in the next
> release or two.
>
> On Thu, Nov 12, 2015 at 8:57 AM, Cristian O <
> cristian.b.op...@googlemail.com> wrote:
>
>> Sorry, apparently only replied to Reynold, meant to copy the list as
>> well, so I'm self replying and taking the opportunity to illustrate with an
>> example.
>>
>> Basically I want to conceptually do this:
>>
>> val bigDf = sqlContext.sparkContext.parallelize((1 to 100)).map(i => (i, 
>> 1)).toDF("k", "v")
>> val deltaDf = sqlContext.sparkContext.parallelize(Array(1, 5)).map(i => 
>> (i, 1)).toDF("k", "v")
>>
>> bigDf.cache()
>>
>> bigDf.registerTempTable("big")
>> deltaDf.registerTempTable("delta")
>>
>> val newBigDf = sqlContext.sql("SELECT big.k, big.v + IF(delta.v is null, 0, 
>> delta.v) FROM big LEFT JOIN delta on big.k = delta.k")
>>
>> newBigDf.cache()
>> bigDf.unpersist()
>>
>>
>> This is essentially an update of keys "1" and "5" only, in a dataset
>> of 1 million keys.
>>
>> This can be achieved efficiently if the join would preserve the cached
>> blocks that have been unaffected, and only copy and mutate the 2 affected
>> blocks corresponding to the matching join keys.
>>
>> Statistics can determine which blocks actually need mutating. Note also
>> that shuffling is not required assuming both dataframes are pre-partitioned
>> by the same key K.
>>
>> In SQL this could actually be expressed as an UPDATE statement or for a
>> more generalized use as a MERGE UPDATE:
>> https://technet.microsoft.com/en-us/library/bb522522(v=sql.105).aspx
>>
>> While this may seem like a very special case optimization, it would
>> effectively implement UPDATE support for cached DataFrames, for both
>> optimal and non-optimal usage.
>>
>> I appreciate there's quite a lot here, so thank you for taking the time
>> to consider it.
>>
>> Cristian
>>
>>
>>
>> On 12 November 2015 at 15:49, Cristian O > > wrote:
>>
>>> Hi Reynold,
>>>
>>> Thanks for your reply.
>>>
>>> Parquet may very well be used as the underlying implementation, but this
>>> is more than about a particular storage representation.
>>>
>>> There are a few things here that are inter-related and open different
>>> possibilities, so it's hard to structure, but I'll give it a try:
>>>
>>> 1. Checkpointing DataFrames - while a DF can be saved locally as
>>> parquet, just using that as a checkpoint would currently require explicitly
>>> reading it back. A proper checkpoint implementation would just save
>>> (perhaps asynchronously) and prune the logical plan while allowing to
>>> continue using the same DF, now backed by the checkpoint.
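
A minimal sketch of the explicit round-trip described in point 1, assuming the
usual spark-shell sqlContext (someDf and the path are hypothetical):

import org.apache.spark.sql.{DataFrame, SaveMode}

// someDf stands in for whatever DataFrame is being iterated on
def manualCheckpoint(df: DataFrame, path: String): DataFrame = {
  df.write.mode(SaveMode.Overwrite).parquet(path) // materialize the current state
  sqlContext.read.parquet(path)                   // re-read: the logical plan is pruned
}

val checkpointed = manualCheckpoint(someDf, "/tmp/df-checkpoint")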
>>>
>>> It's important to prune the logical plan to avoid all kinds of issues
>>> that may arise from unbounded expansion with iterative use-cases, like this
>>> one I encountered recently:
>>> https://issues.apache.org/jira/browse/SPARK-11596
>>>
>>> But really what I'm after here is:
>>>
>>> 2. Efficient updating of cached DataFrames - The main use case here is
>>> keeping a relatively large dataset cached and updating it iteratively from
>>> streaming. For example one would like to perform ad-hoc queries on an
>>> incrementally updated, cached DataFrame. I expect this is already becoming
>>> an increasingly common use case. Note that the dataset may require merging
>>> (like adding) or overrriding values by key, so simply appending is not
>>> sufficient.
>>>
>>> This is very similar in concept with updateStateByKey for regular RDDs,
>>> i.e. an efficient copy-on-write mechanism, albeit perhaps at CachedBatch
>>> level  (the row blocks for the columnar representation).
>>>
>>> This can be currently simulated with UNION or (OUTER) JOINs however is
>>> very inefficient as it requires copying and recaching the entire dataset,
>>> and unpersisting the original one. There are also the aforementioned
>>> problems with unbounded logical plans (physical plans are fine)
>>>
>>> These two together, checkpointing and updating cached DataFrames, would
>>> give fault-tolerant efficient updating of DataFrames, meaning streaming
>>> apps can take advantage of the compact columnar representation and Tungsten
>>> optimisations.
>>>
>>> I'm not quite sure if something like this can be achieved by other means
>>> or has been investigated before, hence why I'm looking for feedback here.
>>>
>>> While one could use external data stores, they would have the added IO
>>> penalty, plus most of what's available at the moment is either HDFS
>>> (extremely inefficient for updates) or key-value stores that have 5-10x
>>> space overhead over columnar formats.
>>>
>>> Thanks,
>>> Cristian
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 12 November 2015 at 03:31, Reynold Xin  wrote:
>>>
 Thanks for the ema

Re: Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Josh Rosen
As of https://github.com/apache/spark/pull/9575, Spark's build will no
longer place every dependency JAR into lib_managed. Can you say more about
how this affected spark-shell for you (maybe share a stacktrace)?

On Mon, Nov 16, 2015 at 12:03 AM, Jeff Zhang  wrote:

>
> Sometimes the jars under lib_managed are missing, and even after I rebuild
> Spark, the jars under lib_managed are still not downloaded. This causes
> spark-shell to fail because of the missing jars. Has anyone hit this weird issue?
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Does anyone meet the issue that jars under lib_managed is never downloaded ?

2015-11-16 Thread Jeff Zhang
Sometimes the jars under lib_managed are missing, and even after I rebuild
Spark, the jars under lib_managed are still not downloaded. This causes
spark-shell to fail because of the missing jars. Has anyone hit this weird issue?



-- 
Best Regards

Jeff Zhang