Checkpoint directory structure

2015-09-23 Thread Bin Wang
I find the checkpoint directory structure is like this:

-rw-r--r--   1 root root 134820 2015-09-23 16:55
/user/root/checkpoint/checkpoint-144299850
-rw-r--r--   1 root root 134768 2015-09-23 17:00
/user/root/checkpoint/checkpoint-144299880
-rw-r--r--   1 root root 134895 2015-09-23 17:05
/user/root/checkpoint/checkpoint-144299910
-rw-r--r--   1 root root 134899 2015-09-23 17:10
/user/root/checkpoint/checkpoint-144299940
-rw-r--r--   1 root root 134913 2015-09-23 17:15
/user/root/checkpoint/checkpoint-144299970
-rw-r--r--   1 root root 134928 2015-09-23 17:20
/user/root/checkpoint/checkpoint-14430
-rw-r--r--   1 root root 134987 2015-09-23 17:25
/user/root/checkpoint/checkpoint-144300030
-rw-r--r--   1 root root 134944 2015-09-23 17:30
/user/root/checkpoint/checkpoint-144300060
-rw-r--r--   1 root root 134956 2015-09-23 17:35
/user/root/checkpoint/checkpoint-144300090
-rw-r--r--   1 root root 135244 2015-09-23 17:40
/user/root/checkpoint/checkpoint-144300120
drwxr-xr-x   - root root  0 2015-09-23 18:48
/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2
drwxr-xr-x   - root root  0 2015-09-23 17:44
/user/root/checkpoint/receivedBlockMetadata


I restarted Spark and it reads from
/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2. But it seems
that some RDDs in it are lost, so it is not able to recover. Meanwhile, I
see other files in checkpoint/, such as
/user/root/checkpoint/checkpoint-144300120. What are those used for?
Can I recover my data from them?
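
For context, a Spark Streaming application normally recovers from these files via
StreamingContext.getOrCreate: the checkpoint-<time> files are the periodically
written streaming metadata checkpoints, while the UUID-named subdirectory typically
holds checkpointed RDD data and receivedBlockMetadata holds receiver metadata. A
minimal sketch of such a driver, with a placeholder path and batch interval:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedApp {
  // Placeholder checkpoint location; in the listing above this is /user/root/checkpoint.
  val checkpointDir = "hdfs:///user/root/checkpoint"

  // Called only when no usable checkpoint exists; builds the full DStream graph.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("CheckpointedApp")
    val ssc = new StreamingContext(conf, Seconds(300)) // placeholder batch interval
    ssc.checkpoint(checkpointDir)
    // ... define sources, transformations and outputs here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the StreamingContext from the most recent
    // checkpoint-<time> file; the RDD data it references must still be present.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}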


Re: SparkR package path

2015-09-23 Thread Hossein
Yes, I think exposing SparkR on CRAN can significantly expand the reach of
both SparkR and Spark itself to a larger community of data scientists (and
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of these
folks have a Spark cluster and wish to talk to it from RStudio. While that is a
bigger task, for now a first step could be to not require them to download the
Spark source and run a script named install-dev.sh. I filed SPARK-10776 to
track this.


--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> As Rui says it would be good to understand the use case we want to
> support (supporting CRAN installs could be one for example). I don't
> think it should be very hard to do as the RBackend itself doesn't use
> the R source files. The RRDD does use it and the value comes from
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
> AFAIK -- So we could introduce a new config flag that can be used for
> this new mode.
>
> Thanks
> Shivaram
>
> On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui  wrote:
> > Hossein,
> >
> >
> >
> > Any strong reason to download and install SparkR source package
> separately
> > from the Spark distribution?
> >
> > An R user can simply download the spark distribution, which contains
> SparkR
> > source and binary package, and directly use sparkR. No need to install
> > SparkR package at all.
> >
> >
> >
> > From: Hossein [mailto:fal...@gmail.com]
> > Sent: Tuesday, September 22, 2015 9:19 AM
> > To: dev@spark.apache.org
> > Subject: SparkR package path
> >
> >
> >
> > Hi dev list,
> >
> >
> >
> > The SparkR backend assumes SparkR source files are located under
> > "SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
> > This setting makes sense for Spark developers, but if an R user downloads
> > and installs the SparkR source package, the source files are going to be
> > placed in different locations.
> >
> >
> >
> > In the R runtime it is easy to find the location of package files using
> > path.package("SparkR"). But we need to make some changes to the R backend
> > and/or spark-submit so that the JVM process learns the location of worker.R,
> > daemon.R, and shell.R from the R runtime.
> >
> >
> >
> > Do you think this change is feasible?
> >
> >
> >
> > Thanks,
> >
> > --Hossein
>
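
A rough sketch of the config-flag approach Shivaram suggests above; the key name
spark.r.libDir and the helper object are hypothetical, not Spark's actual RUtils code:

import org.apache.spark.SparkConf

object RLibResolver {
  // Hypothetical config key; the real name would be settled as part of SPARK-10776.
  val RLibDirKey = "spark.r.libDir"

  // Prefer an explicitly configured location (e.g. where a CRAN-installed SparkR
  // package lives, discoverable in R via path.package("SparkR")), and fall back to
  // the developer layout created by R/install-dev.sh under SPARK_HOME.
  def sparkRLibDir(conf: SparkConf): String = {
    conf.getOption(RLibDirKey)
      .orElse(sys.env.get("SPARK_HOME").map(home => s"$home/R/lib"))
      .getOrElse(throw new IllegalStateException(
        s"Set $RLibDirKey or SPARK_HOME so the JVM can locate worker.R and daemon.R"))
  }
}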


Re: Why Filter return a DataFrame object in DataFrame.scala?

2015-09-23 Thread Reynold Xin
There is an implicit conversion in scope

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L153


  /**
   * An implicit conversion function internal to this class for us to avoid
   * doing "new DataFrame(...)" everywhere.
   */
  @inline private implicit def logicalPlanToDataFrame(logicalPlan: LogicalPlan): DataFrame = {
    new DataFrame(sqlContext, logicalPlan)
  }
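
To see the mechanism in isolation, here is a tiny self-contained example of the same
pattern; Plan, Wrapped, and Api are made-up names for illustration, not Spark classes:

case class Plan(description: String)
case class Wrapped(plan: Plan)

class Api {
  // The implicit conversion the compiler applies when a Plan appears where a Wrapped is expected.
  @inline private implicit def planToWrapped(plan: Plan): Wrapped = Wrapped(plan)

  // Declared to return Wrapped, but the body builds a Plan; the in-scope conversion
  // bridges the gap, just like logicalPlanToDataFrame does for filter() in DataFrame.scala.
  def filter(predicate: String): Wrapped = Plan(s"Filter($predicate)")
}

object Demo extends App {
  println(new Api().filter("age > 21")) // Wrapped(Plan(Filter(age > 21)))
}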


On Tue, Sep 22, 2015 at 10:57 PM, qiuhai <986775...@qq.com> wrote:

> Hi,
>   Recently, I have been reading the Spark SQL source code (version 1.5).
>   In DataFrame.scala, there is a function named filter at line 737:
>
> *def filter(condition: Column): DataFrame = Filter(condition.expr,
> logicalPlan)*
>
>   The function returns a Filter object, but its declared return type is DataFrame.
>
>
>   thanks.
>
>
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Why-Filter-return-a-DataFrame-object-in-DataFrame-scala-tp14295.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
>
>


using Codahale counters in source

2015-09-23 Thread Steve Loughran



Quick question: is it OK to use Codahale Metric classes (e.g. Counter) in 
source as generic thread-safe counters, with the option of hooking them to a 
Codahale metrics registry if there is one in the spark context?

The Counter class does extend LongAdder, which is by Doug Lea and promises to be a 
better-performing long counter when update contention is low, such as when 
precisely one thread is doing the updates.

I've done that in other projects, and it works relatively well: you don't have to 
add extra code for metrics, you just have to make sure you implement counters that 
are relevant and, when registering them, give them useful names.

-Steve
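
A minimal sketch of the pattern described above, using the Dropwizard (Codahale)
Counter and MetricRegistry APIs directly; the class and metric names are made up
for illustration:

import com.codahale.metrics.{Counter, MetricRegistry}

class BlockFetchCounters {
  // Plain Codahale counters: thread-safe and cheap to increment even without a registry.
  val bytesRead = new Counter()
  val blocksFetched = new Counter()

  // Hook the same instances into a registry if one is available from the context.
  def registerWith(registry: MetricRegistry, prefix: String): Unit = {
    registry.register(MetricRegistry.name(prefix, "bytesRead"), bytesRead)
    registry.register(MetricRegistry.name(prefix, "blocksFetched"), blocksFetched)
  }
}

// Usage from the hot path:
//   counters.bytesRead.inc(blockSize)
//   counters.blocksFetched.inc()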

RFC: packaging Spark without assemblies

2015-09-23 Thread Marcelo Vanzin
Hey all,

This is something that we've discussed several times internally, but
never really had much time to look into; but as time passes by, it's
increasingly becoming an issue for us and I'd like to throw some ideas
around about how to fix it.

So, without further ado:
https://github.com/vanzin/spark/pull/2/files

(You can comment there or click "View" to read the formatted document.
I thought that would be easier than sharing on Google Drive or Box or
something.)

It would be great to get people's feedback, especially if there are
strong reasons for the assemblies that I'm not aware of.


-- 
Marcelo




RE: SparkR package path

2015-09-23 Thread Sun, Rui
The SparkR package is not a standalone R package: it is actually the R API of Spark 
and needs to co-operate with a matching version of Spark, so exposing it on CRAN does 
not really ease things for R users, since they would still need to download a matching 
Spark distribution, unless we publish a bundled SparkR package (packaged with Spark) 
to CRAN. Is that desirable? Actually, normal users who are not developers are not 
required to download the Spark source, build it, and install the SparkR package. They 
just need to download a Spark distribution and then use SparkR.

For using SparkR in RStudio, there is documentation at 
https://github.com/apache/spark/tree/master/R



From: Hossein [mailto:fal...@gmail.com]
Sent: Thursday, September 24, 2015 1:42 AM
To: shiva...@eecs.berkeley.edu
Cc: Sun, Rui; dev@spark.apache.org
Subject: Re: SparkR package path

Yes, I think exposing SparkR on CRAN can significantly expand the reach of both 
SparkR and Spark itself to a larger community of data scientists (and 
statisticians).

I have been getting questions on how to use SparkR in RStudio. Most of these folks 
have a Spark cluster and wish to talk to it from RStudio. While that is a bigger 
task, for now a first step could be to not require them to download the Spark source 
and run a script named install-dev.sh. I filed SPARK-10776 to track this.


--Hossein

On Tue, Sep 22, 2015 at 7:21 PM, Shivaram Venkataraman wrote:
As Rui says it would be good to understand the use case we want to
support (supporting CRAN installs could be one for example). I don't
think it should be very hard to do as the RBackend itself doesn't use
the R source files. The RRDD does use it and the value comes from
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RUtils.scala#L29
AFAIK -- So we could introduce a new config flag that can be used for
this new mode.

Thanks
Shivaram

On Mon, Sep 21, 2015 at 8:15 PM, Sun, Rui wrote:
> Hossein,
>
>
>
> Any strong reason to download and install SparkR source package separately
> from the Spark distribution?
>
> An R user can simply download the spark distribution, which contains SparkR
> source and binary package, and directly use sparkR. No need to install
> SparkR package at all.
>
>
>
> From: Hossein [mailto:fal...@gmail.com]
> Sent: Tuesday, September 22, 2015 9:19 AM
> To: dev@spark.apache.org
> Subject: SparkR package path
>
>
>
> Hi dev list,
>
>
>
> The SparkR backend assumes SparkR source files are located under
> "SPARK_HOME/R/lib/." This directory is created by running R/install-dev.sh.
> This setting makes sense for Spark developers, but if an R user downloads
> and installs the SparkR source package, the source files are going to be
> placed in different locations.
>
>
>
> In the R runtime it is easy to find the location of package files using
> path.package("SparkR"). But we need to make some changes to the R backend and/or
> spark-submit so that the JVM process learns the location of worker.R,
> daemon.R, and shell.R from the R runtime.
>
>
>
> Do you think this change is feasible?
>
>
>
> Thanks,
>
> --Hossein



Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
I've attached the full log. The error is like this:

15/09/23 17:47:39 ERROR yarn.ApplicationMaster: User class threw exception:
java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist:
hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909
java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist:
hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909
at scala.Predef$.require(Predef.scala:233)
at
org.apache.spark.rdd.ReliableCheckpointRDD.<init>(ReliableCheckpointRDD.scala:45)
at
org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1227)
at
org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1227)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
at org.apache.spark.SparkContext.checkpointFile(SparkContext.scala:1226)
at
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:112)
at
org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:109)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at
scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at
org.apache.spark.streaming.dstream.DStreamCheckpointData.restore(DStreamCheckpointData.scala:109)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:487)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
at scala.collection.immutable.List.foreach(List.scala:318)
at
org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.streaming.DStreamGraph.restoreCheckpointData(DStreamGraph.scala:153)
at
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:158)
at
org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
at
org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:837)
at com.appadhoc.data.main.StatCounter$.main(StatCounter.scala:51)
at com.appadhoc.data.main.StatCounter.main(StatCounter.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
15/09/23 17:47:39 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception:
java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist:
hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909)
15/09/23 17:47:39 INFO spark.SparkContext: Invoking stop() from shutdown
hook
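
For reference, a quick way to check from a small standalone program whether the RDD
checkpoint directory named in the error still exists on HDFS; a sketch using the
standard Hadoop FileSystem API:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CheckRddCheckpoint extends App {
  // The directory reported missing in the stack trace above.
  val rddDir = new Path("hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/" +
    "d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909")
  val fs = FileSystem.get(new URI(rddDir.toString), new Configuration())
  println(s"$rddDir exists: ${fs.exists(rddDir)}")
}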


Tathagata Das wrote on Thursday, September 24, 2015 at 9:45 AM:

> Could you provide the logs on when and how you are seeing this error?
>
> On Wed, Sep 23, 2015 at 6:32 PM, Bin Wang  wrote:
>
>> BTW, I just killed the application and restarted 

Re: RFC: packaging Spark without assemblies

2015-09-23 Thread Patrick Wendell
I think it would be a big improvement to get rid of it. It's not how
jars are supposed to be packaged, and it has caused problems in many
different contexts over the years.

For me a key step in moving away would be to fully audit/understand
all compatibility implications of removing it. If other people are
supportive of this plan I can offer to help spend some time thinking
about any potential corner cases, etc.

- Patrick

On Wed, Sep 23, 2015 at 3:13 PM, Marcelo Vanzin  wrote:
> Hey all,
>
> This is something that we've discussed several times internally, but
> never really had much time to look into; but as time passes by, it's
> increasingly becoming an issue for us and I'd like to throw some ideas
> around about how to fix it.
>
> So, without further ado:
> https://github.com/vanzin/spark/pull/2/files
>
> (You can comment there or click "View" to read the formatted document.
> I thought that would be easier than sharing on Google Drive or Box or
> something.)
>
> It would be great to get people's feedback, especially if there are
> strong reasons for the assemblies that I'm not aware of.
>
>
> --
> Marcelo
>
>




Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
BTW, I just killed the application and restarted it. Then the application
could not recover from the checkpoint because some RDDs were lost. So I'm
wondering: if there is a failure in the application, is it possible that it
won't be able to recover from the checkpoint?

Bin Wang wrote on Wednesday, September 23, 2015 at 6:58 PM:

> I find the checkpoint directory structure is like this:
>
> -rw-r--r--   1 root root 134820 2015-09-23 16:55
> /user/root/checkpoint/checkpoint-144299850
> -rw-r--r--   1 root root 134768 2015-09-23 17:00
> /user/root/checkpoint/checkpoint-144299880
> -rw-r--r--   1 root root 134895 2015-09-23 17:05
> /user/root/checkpoint/checkpoint-144299910
> -rw-r--r--   1 root root 134899 2015-09-23 17:10
> /user/root/checkpoint/checkpoint-144299940
> -rw-r--r--   1 root root 134913 2015-09-23 17:15
> /user/root/checkpoint/checkpoint-144299970
> -rw-r--r--   1 root root 134928 2015-09-23 17:20
> /user/root/checkpoint/checkpoint-14430
> -rw-r--r--   1 root root 134987 2015-09-23 17:25
> /user/root/checkpoint/checkpoint-144300030
> -rw-r--r--   1 root root 134944 2015-09-23 17:30
> /user/root/checkpoint/checkpoint-144300060
> -rw-r--r--   1 root root 134956 2015-09-23 17:35
> /user/root/checkpoint/checkpoint-144300090
> -rw-r--r--   1 root root 135244 2015-09-23 17:40
> /user/root/checkpoint/checkpoint-144300120
> drwxr-xr-x   - root root  0 2015-09-23 18:48
> /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2
> drwxr-xr-x   - root root  0 2015-09-23 17:44
> /user/root/checkpoint/receivedBlockMetadata
>
>
> I restarted Spark and it reads from
> /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2. But it seems
> that some RDDs in it are lost, so it is not able to recover. Meanwhile, I
> see other files in checkpoint/, such as
> /user/root/checkpoint/checkpoint-144300120. What are those used for?
> Can I recover my data from them?
>


Re: Checkpoint directory structure

2015-09-23 Thread Tathagata Das
Could you provide the logs on when and how you are seeing this error?

On Wed, Sep 23, 2015 at 6:32 PM, Bin Wang  wrote:

> BTW, I just killed the application and restarted it. Then the application
> could not recover from the checkpoint because some RDDs were lost. So I'm
> wondering: if there is a failure in the application, is it possible that it
> won't be able to recover from the checkpoint?
>
> Bin Wang wrote on Wednesday, September 23, 2015 at 6:58 PM:
>
>> I find the checkpoint directory structure is like this:
>>
>> -rw-r--r--   1 root root 134820 2015-09-23 16:55
>> /user/root/checkpoint/checkpoint-144299850
>> -rw-r--r--   1 root root 134768 2015-09-23 17:00
>> /user/root/checkpoint/checkpoint-144299880
>> -rw-r--r--   1 root root 134895 2015-09-23 17:05
>> /user/root/checkpoint/checkpoint-144299910
>> -rw-r--r--   1 root root 134899 2015-09-23 17:10
>> /user/root/checkpoint/checkpoint-144299940
>> -rw-r--r--   1 root root 134913 2015-09-23 17:15
>> /user/root/checkpoint/checkpoint-144299970
>> -rw-r--r--   1 root root 134928 2015-09-23 17:20
>> /user/root/checkpoint/checkpoint-14430
>> -rw-r--r--   1 root root 134987 2015-09-23 17:25
>> /user/root/checkpoint/checkpoint-144300030
>> -rw-r--r--   1 root root 134944 2015-09-23 17:30
>> /user/root/checkpoint/checkpoint-144300060
>> -rw-r--r--   1 root root 134956 2015-09-23 17:35
>> /user/root/checkpoint/checkpoint-144300090
>> -rw-r--r--   1 root root 135244 2015-09-23 17:40
>> /user/root/checkpoint/checkpoint-144300120
>> drwxr-xr-x   - root root  0 2015-09-23 18:48
>> /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2
>> drwxr-xr-x   - root root  0 2015-09-23 17:44
>> /user/root/checkpoint/receivedBlockMetadata
>>
>>
>> I restarted Spark and it reads from
>> /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2. But it seems
>> that some RDDs in it are lost, so it is not able to recover. Meanwhile, I
>> see other files in checkpoint/, such as
>> /user/root/checkpoint/checkpoint-144300120. What are those used for?
>> Can I recover my data from them?
>>
>


Get only updated RDDs from or after updateStateByKey

2015-09-23 Thread Bin Wang
I've read the source code and it seems to be impossible, but I'd like to
confirm it.

It would be a very useful feature. For example, I need to store the state of a
DStream in my database in order to recover it after the next redeploy. But I
only need to save the updated entries; saving all keys into the database is a
lot of waste.

Looking at the source code, I think it could be added easily: StateDStream has
access to prevStateRDD, so it could compute a diff. Is there any chance of
adding this as an API on StateDStream? If so, I can work on this feature.

If not, is there any workaround or hack I can use to do this myself?
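
One possible workaround along these lines, sketched against the public
updateStateByKey API; the Tracked wrapper and the Long running count are made up
for illustration, and the idea is simply to record inside the state whether a key
received new data in the current batch and filter on that flag before writing:

import org.apache.spark.streaming.dstream.DStream

case class Tracked(value: Long, updatedThisBatch: Boolean)

object UpdatedOnly {
  def trackUpdates(events: DStream[(String, Long)]): DStream[(String, Tracked)] = {
    val updateFunc = (newValues: Seq[Long], state: Option[Tracked]) => {
      val previous = state.map(_.value).getOrElse(0L)
      if (newValues.isEmpty) {
        // No new data for this key in this batch: keep the state, clear the flag.
        Some(Tracked(previous, updatedThisBatch = false))
      } else {
        Some(Tracked(previous + newValues.sum, updatedThisBatch = true))
      }
    }
    events.updateStateByKey(updateFunc)
  }
}

// Then persist only the keys touched in the current batch, e.g.:
//   trackUpdates(events)
//     .filter { case (_, t) => t.updatedThisBatch }
//     .foreachRDD(rdd => rdd.foreachPartition(partition => /* save to database */ ()))

This trades a slightly larger state object for not having to diff against prevStateRDD.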