Get only updated RDDs from or after updateStateByKey

2015-09-23 Thread Bin Wang
I've read the source code and it seems to be impossible, but I'd like to confirm it. It would be a very useful feature. For example, I need to store the state of a DStream into my database in order to recover it after the next redeploy, but I only need to save the updated keys. Saving all keys into the database
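A minimal sketch of the pattern being asked about, assuming Spark Streaming's `updateStateByKey`. Since Spark returns the full state stream (all keys), one workaround is to carry an "updated this batch" flag inside the state itself and filter on it before writing out. The `State` case class, the flag, and `saveToDatabase` are illustrative assumptions, not Spark APIs:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical state type: a running count plus a flag marking whether
// this key received new values in the current batch.
case class State(count: Long, updatedInLastBatch: Boolean)

def updateFn(newValues: Seq[Long], old: Option[State]): Option[State] =
  if (newValues.isEmpty)
    old.map(_.copy(updatedInLastBatch = false)) // key untouched this batch
  else
    Some(State(old.map(_.count).getOrElse(0L) + newValues.sum,
               updatedInLastBatch = true))

// Given some kvStream: DStream[(String, Long)] defined elsewhere:
// val stateStream = kvStream.updateStateByKey(updateFn)
// stateStream
//   .filter { case (_, s) => s.updatedInLastBatch } // keep only changed keys
//   .foreachRDD(rdd => rdd.foreachPartition(saveToDatabase)) // hypothetical sink
```

This does not avoid Spark materializing the full state internally; it only avoids writing unchanged keys to the external store.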

Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
I've attached the full log. The error is like this: 15/09/23 17:47:39 ERROR yarn.ApplicationMaster: User class threw exception: java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist: hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1

RE: SparkR package path

2015-09-23 Thread Sun, Rui
The SparkR package is not a standalone R package: it is actually the R API of Spark and needs to cooperate with a matching version of Spark. So exposing it on CRAN does not make things easier for R users, as they would still need to download a matching Spark distribution, unless we publish a bundled SparkR package to CRAN (pa

Re: Checkpoint directory structure

2015-09-23 Thread Tathagata Das
Could you provide the logs on when and how you are seeing this error? On Wed, Sep 23, 2015 at 6:32 PM, Bin Wang wrote: > BTW, I just kill the application and restart it. Then the application > cannot recover from checkpoint because of some lost of RDD. So I'm wonder, > if there are some failure

Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
BTW, I just killed the application and restarted it. Then the application could not recover from the checkpoint because some RDDs were lost. So I wonder: if there is a failure in the application, might it not be able to recover from the checkpoint? Bin Wang wrote on Wed, Sep 23, 2015 at 6:58 PM: > I find the c
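For context on the recovery path under discussion, this is a sketch of the standard Spark Streaming driver-restart pattern: the whole DStream graph must be built inside a factory function so it can be replayed from the checkpoint. The checkpoint path and app name here are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///user/root/checkpoint" // assumed path

// All DStream setup must live inside this factory; on a cold start it runs,
// on restart the graph is instead deserialized from the checkpoint.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-recovery-sketch")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define input DStreams and transformations here ...
  ssc
}

// Recovers from checkpointDir if it exists, otherwise builds a fresh context.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```

Note that recovery can still fail if the checkpointed metadata references RDD blocks or files that no longer exist, which appears to be the failure mode described in this thread.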

Re: RFC: packaging Spark without assemblies

2015-09-23 Thread Patrick Wendell
I think it would be a big improvement to get rid of it. It's not how jars are supposed to be packaged and it has caused problems in many different contexts over the years. For me, a key step in moving away would be to fully audit/understand all compatibility implications of removing it. If other peo

RFC: packaging Spark without assemblies

2015-09-23 Thread Marcelo Vanzin
Hey all, This is something that we've discussed several times internally, but never really had much time to look into; but as time passes by, it's increasingly becoming an issue for us and I'd like to throw some ideas around about how to fix it. So, without further ado: https://github.com/vanzin/

Re: SparkR package path

2015-09-23 Thread Hossein
Yes, I think exposing SparkR on CRAN can significantly expand the reach of both SparkR and Spark itself to a larger community of data scientists (and statisticians). I have been getting questions on how to use SparkR in RStudio. Most of these folks have a Spark cluster and wish to talk to it from

Checkpoint directory structure

2015-09-23 Thread Bin Wang
I find the checkpoint directory structure is like this:
-rw-r--r-- 1 root root 134820 2015-09-23 16:55 /user/root/checkpoint/checkpoint-144299850
-rw-r--r-- 1 root root 134768 2015-09-23 17:00 /user/root/checkpoint/checkpoint-144299880
-rw-r--r-- 1 root root 134895 2015-0

using Codahale counters in source

2015-09-23 Thread Steve Loughran
Quick question: is it OK to use Codahale Metric classes (e.g. Counter) in source as generic thread-safe counters, with the option of hooking them to a Codahale metrics registry if there is one in the spark context? The Counter class does extend LongAdder, which is by Doug Lea and promises to
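The usage being proposed can be sketched as follows with the Dropwizard (Codahale) Metrics API. The registry and metric names are illustrative; the `Counter` operations (`inc`, `getCount`) are the real API and are thread-safe:

```scala
import com.codahale.metrics.{Counter, MetricRegistry}

// A Counter can be used standalone as a thread-safe long counter...
val standalone = new Counter()
standalone.inc()

// ...or registered with a MetricRegistry so it is reported alongside
// other metrics if a registry/reporter is wired up (names are assumed).
val registry = new MetricRegistry()
val recordsRead: Counter = registry.counter("spark.source.records.read")

recordsRead.inc()     // increment by 1
recordsRead.inc(41L)  // increment by an arbitrary amount
// recordsRead.getCount is now 42
```

This keeps the instrumentation decoupled: code increments the `Counter` unconditionally, and hooking it to a registry (and hence to Spark's metrics system) is an optional wiring step.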