Checkpoint directory structure

2015-09-23 Thread Bin Wang
I found the checkpoint directory structure is like this:

-rw-r--r-- 1 root root 134820 2015-09-23 16:55 /user/root/checkpoint/checkpoint-144299850
-rw-r--r-- 1 root root 134768 2015-09-23 17:00 /user/root/checkpoint/checkpoint-144299880
-rw-r--r-- 1 root root 134895
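
For context, these checkpoint-<time> files are written by Spark Streaming once checkpointing is enabled on the context, roughly as in the sketch below (batch interval, app name, and directory are illustrative); Spark typically keeps only the most recent few files and prunes older ones:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: enable metadata checkpointing so the driver
// periodically writes checkpoint-<batchTime> files into the
// given directory, one per completed batch.
val conf = new SparkConf().setAppName("CheckpointLayoutDemo")
val ssc = new StreamingContext(conf, Seconds(30))
ssc.checkpoint("/user/root/checkpoint")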

Re: SparkR package path

2015-09-23 Thread Hossein
Yes, I think exposing SparkR in CRAN can significantly expand the reach of both SparkR and Spark itself to a larger community of data scientists (and statisticians). I have been getting questions on how to use SparkR in RStudio. Most of these folks have a Spark cluster and wish to talk to it from

Re: Why does Filter return a DataFrame object in DataFrame.scala?

2015-09-23 Thread Reynold Xin
There is an implicit conversion in scope:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L153

/**
 * An implicit conversion function internal to this class for us to avoid doing
 * "new DataFrame(...)" everywhere.
 */
@inline
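
To make the trick concrete, here is a self-contained sketch of the same pattern; Plan and Frame are illustrative stand-ins for Spark's actual LogicalPlan and DataFrame:

import scala.language.implicitConversions

case class Plan(description: String)

class Frame(val plan: Plan) {
  // The implicit conversion: any Plan produced inside this class is
  // silently wrapped in a new Frame, so methods declared to return a
  // Frame can simply build and return a Plan.
  @inline private implicit def toFrame(p: Plan): Frame = new Frame(p)

  def filter(predicate: String): Frame =
    Plan(s"Filter($predicate) <- ${plan.description}") // converted implicitly
}

object Demo extends App {
  val f = new Frame(Plan("scan"))
  println(f.filter("age > 21").plan.description) // Filter(age > 21) <- scan
}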

using Codahale counters in source

2015-09-23 Thread Steve Loughran
Quick question: is it OK to use Codahale Metric classes (e.g. Counter) in the source as generic thread-safe counters, with the option of hooking them to a Codahale metrics registry if there is one in the Spark context? The Counter class does extend LongAdder, which is by Doug Lea and promises to
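
For reference, using a Codahale Counter standalone, and optionally attaching it to a registry when one is available, looks roughly like this (the metric name is illustrative):

import com.codahale.metrics.{Counter, MetricRegistry}

// A Counter is a plain thread-safe counter on its own; registering
// it is optional and only needed when a registry exists.
val recordsRead = new Counter()
recordsRead.inc()        // +1, safe to call from multiple threads
recordsRead.inc(128L)    // +128

val registry: Option[MetricRegistry] = Some(new MetricRegistry)
registry.foreach(_.register("source.recordsRead", recordsRead))

println(recordsRead.getCount) // 129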

RFC: packaging Spark without assemblies

2015-09-23 Thread Marcelo Vanzin
Hey all,

This is something that we've discussed several times internally but never really had much time to look into; as time passes, it's increasingly becoming an issue for us, and I'd like to throw some ideas around about how to fix it. So, without further ado:

RE: SparkR package path

2015-09-23 Thread Sun, Rui
The SparkR package is not a standalone R package: it is actually the R API of Spark and needs to cooperate with a matching version of Spark. So exposing it on CRAN does not make things easier for R users, as they still need to download a matching Spark distribution, unless we expose a bundled SparkR package to CRAN

Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
I've attached the full log. The error is like this:

15/09/23 17:47:39 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist: hdfs://
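
For reference, the usual driver-side recovery pattern in this situation is StreamingContext.getOrCreate, sketched below with illustrative names and paths; it rebuilds the context from the checkpoint directory when a checkpoint exists and falls back to the creating function otherwise:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/root/checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoveryDemo")
  val ssc = new StreamingContext(conf, Seconds(30))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream graph here ...
  ssc
}

// Recover from the checkpoint if present, otherwise start fresh.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()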

Re: RFC: packaging Spark without assemblies

2015-09-23 Thread Patrick Wendell
I think it would be a big improvement to get rid of it. It's not how jars are supposed to be packaged, and it has caused problems in many different contexts over the years. For me, a key step in moving away would be to fully audit/understand all compatibility implications of removing it. If other

Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
BTW, I just killed the application and restarted it. Then the application could not recover from the checkpoint because some RDDs were lost. So I wonder: if there is a failure in the application, is it possible that it won't be able to recover from the checkpoint? Bin Wang on Wed, Sep 23, 2015

Re: Checkpoint directory structure

2015-09-23 Thread Tathagata Das
Could you provide the logs showing when and how you are seeing this error?

On Wed, Sep 23, 2015 at 6:32 PM, Bin Wang wrote:
> BTW, I just killed the application and restarted it. Then the application
> could not recover from the checkpoint because some RDDs were lost. So I wonder:
> if

Get only updated RDDs from or after updateStateByKey

2015-09-23 Thread Bin Wang
I've read the source code and it seems to be impossible, but I'd like to confirm it. It would be a very useful feature. For example, I need to store the state of a DStream in my database, in order to recover it after the next redeploy. But I only need to save the updated ones. Saving all keys into the database
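
One workaround, sketched below with illustrative names: carry an "updated in this batch" flag inside the state itself, set it whenever a key receives new values, and persist only the flagged entries. Here events is assumed to be a DStream[(String, Int)]:

import org.apache.spark.streaming.dstream.DStream

case class StateValue(count: Long, updatedThisBatch: Boolean)

def onlyUpdated(events: DStream[(String, Int)]): DStream[(String, StateValue)] = {
  val updated = events.updateStateByKey[StateValue] {
    (newValues: Seq[Int], old: Option[StateValue]) =>
      if (newValues.isEmpty)
        old.map(_.copy(updatedThisBatch = false)) // key untouched this batch
      else
        Some(StateValue(old.map(_.count).getOrElse(0L) + newValues.sum,
                        updatedThisBatch = true))
  }
  // Persist only the entries whose state changed in this batch,
  // e.g. via foreachRDD on the returned stream.
  updated.filter { case (_, s) => s.updatedThisBatch }
}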