Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
Data that are not updated should already have been saved earlier: when data
are added to the DStream for the first time, they should be considered
updated. So saving the same data again is a waste.

What is the community working on? Is there any doc or discussion that I can
look at? Thanks.



Shixiong Zhu <zsxw...@gmail.com>于2015年9月24日周四 下午4:27写道:

> For data that are not updated, where do you save them? Or do you only want
> to avoid accessing the database for those that are not updated?
>
> Besides, the community is working on optimizing "updateStateByKey"'s
> performance. Hopefully it will be delivered soon.
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 13:45 GMT+08:00 Bin Wang <wbi...@gmail.com>:
>
>> I've read the source code and it seems to be impossible, but I'd like to
>> confirm it.
>>
>> It would be a very useful feature. For example, I need to store the state of
>> the DStream into my database, in order to recover it after the next redeploy.
>> But I only need to save the updated ones. Saving all keys into the database
>> is a lot of waste.
>>
>> From the source code, I think it could be added easily: StateDStream can
>> get prevStateRDD, so it can compute a diff. Is there any chance to add
>> this as an API of StateDStream? If so, I can work on this feature.
>>
>> If not, is there any workaround or hack to do this myself?
>>
>
>


Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
That seems like a workaround. But I don't know how to get the database
connection on the worker nodes.

Shixiong Zhu <zsxw...@gmail.com>于2015年9月24日周四 下午5:37写道:

> Could you write your update func like this?
>
> val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])])
> => {
>   iterator.flatMap { case (key, values, stateOption) =>
> if (values.isEmpty) {
>   // don't access database
> } else {
>   // update to new state and save to database
> }
> // return new state
>   }
> }
>
> and use this overload:
>
> def updateStateByKey[S: ClassTag](
>   updateFunc: (Seq[V], Option[S]) => Option[S],
>   partitioner: Partitioner
> ): DStream[(K, S)]
>
> There is a JIRA: https://issues.apache.org/jira/browse/SPARK-2629, but it
> doesn't have a doc now...
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 17:26 GMT+08:00 Bin Wang <wbi...@gmail.com>:
>
>> Data that are not updated should already have been saved earlier: when data
>> are added to the DStream for the first time, they should be considered
>> updated. So saving the same data again is a waste.
>>
>> What is the community working on? Is there any doc or discussion that I
>> can look at? Thanks.
>>
>>
>>
>> Shixiong Zhu <zsxw...@gmail.com>于2015年9月24日周四 下午4:27写道:
>>
>>> For data that are not updated, where do you save them? Or do you only want
>>> to avoid accessing the database for those that are not updated?
>>>
>>> Besides, the community is working on optimizing "updateStateByKey"'s
>>> performance. Hopefully it will be delivered soon.
>>>
>>> Best Regards,
>>> Shixiong Zhu
>>>
>>> 2015-09-24 13:45 GMT+08:00 Bin Wang <wbi...@gmail.com>:
>>>
>>>> I've read the source code and it seems to be impossible, but I'd like
>>>> to confirm it.
>>>>
>>>> It would be a very useful feature. For example, I need to store the state of
>>>> the DStream into my database, in order to recover it after the next redeploy.
>>>> But I only need to save the updated ones. Saving all keys into the database
>>>> is a lot of waste.
>>>>
>>>> From the source code, I think it could be added easily: StateDStream
>>>> can get prevStateRDD, so it can compute a diff. Is there any chance to add
>>>> this as an API of StateDStream? If so, I can work on this feature.
>>>>
>>>> If not, is there any workaround or hack to do this myself?
>>>>
>>>
>>>
>
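For readers following the thread, here is a minimal, self-contained sketch of the pattern suggested above, assuming the iterator-based updateStateByKey overload that takes a partitioner and a rememberPartitioner flag. The saveToDb helper, the value types, and the partition count are illustrative assumptions, not anything confirmed in the thread:

import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.dstream.DStream

// Hypothetical persistence helper -- replace with a real database call.
def saveToDb(key: String, count: Int): Unit = ()

def trackState(pairs: DStream[(String, Int)]): DStream[(String, Int)] = {
  val updateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) =>
    it.map { case (key, values, state) =>
      if (values.isEmpty) {
        // No new values for this key in this batch: keep the old state, skip the database.
        (key, state.getOrElse(0))
      } else {
        // New values arrived: compute the new state and persist only this key.
        val newCount = state.getOrElse(0) + values.sum
        saveToDb(key, newCount)
        (key, newCount)
      }
    }
  pairs.updateStateByKey(updateFunc, new HashPartitioner(4), rememberPartitioner = true)
}

The key point is that only keys with new values in the current batch touch the database; untouched keys simply carry their previous state forward.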


Re: Get only updated RDDs from or after updateStateBykey

2015-09-24 Thread Bin Wang
Thanks, that looks good, though it is a bit of a hack.

And here is another question: updateStateByKey computes over all the state from
the beginning, but in many situations we only need to update the keys that have
incoming data. This could be a big improvement in speed and resource usage.
Will this be supported in the future?

Shixiong Zhu <zsxw...@gmail.com>于2015年9月24日周四 下午6:01写道:

> You can create the connection like this:
>
> val updateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])])
> => {
>   val dbConnection = create a db connection
>   iterator.flatMap { case (key, values, stateOption) =>
> if (values.isEmpty) {
>   // don't access database
> } else {
>   // update to new state and save to database
> }
> // return new state
>   }
>   TaskContext.get().addTaskCompletionListener(_ => dbConnection.disconnect())
> }
>
>
> Best Regards,
> Shixiong Zhu
>
> 2015-09-24 17:42 GMT+08:00 Bin Wang <wbi...@gmail.com>:
>
>> That seems like a workaround. But I don't know how to get the database
>> connection on the worker nodes.
>>
>> Shixiong Zhu <zsxw...@gmail.com>于2015年9月24日周四 下午5:37写道:
>>
>>> Could you write your update func like this?
>>>
>>> val updateFunc = (iterator: Iterator[(String, Seq[Int],
>>> Option[Int])]) => {
>>>   iterator.flatMap { case (key, values, stateOption) =>
>>> if (values.isEmpty) {
>>>   // don't access database
>>> } else {
>>>   // update to new state and save to database
>>> }
>>> // return new state
>>>   }
>>> }
>>>
>>> and use this overload:
>>>
>>> def updateStateByKey[S: ClassTag](
>>>   updateFunc: (Seq[V], Option[S]) => Option[S],
>>>   partitioner: Partitioner
>>> ): DStream[(K, S)]
>>>
>>> There is a JIRA: https://issues.apache.org/jira/browse/SPARK-2629, but it
>>> doesn't have a doc now...
>>>
>>>
>>> Best Regards,
>>> Shixiong Zhu
>>>
>>> 2015-09-24 17:26 GMT+08:00 Bin Wang <wbi...@gmail.com>:
>>>
>>>> Data that are not updated should already have been saved earlier: when data
>>>> are added to the DStream for the first time, they should be considered
>>>> updated. So saving the same data again is a waste.
>>>>
>>>> What is the community working on? Is there any doc or discussion that I
>>>> can look at? Thanks.
>>>>
>>>>
>>>>
>>>> Shixiong Zhu <zsxw...@gmail.com>于2015年9月24日周四 下午4:27写道:
>>>>
>>>>> For data that are not updated, where do you save them? Or do you only want
>>>>> to avoid accessing the database for those that are not updated?
>>>>>
>>>>> Besides, the community is working on optimizing "updateStateByKey"'s
>>>>> performance. Hopefully it will be delivered soon.
>>>>>
>>>>> Best Regards,
>>>>> Shixiong Zhu
>>>>>
>>>>> 2015-09-24 13:45 GMT+08:00 Bin Wang <wbi...@gmail.com>:
>>>>>
>>>>>> I've read the source code and it seems to be impossible, but I'd like
>>>>>> to confirm it.
>>>>>>
>>>>>> It would be a very useful feature. For example, I need to store the state
>>>>>> of the DStream into my database, in order to recover it after the next
>>>>>> redeploy. But I only need to save the updated ones. Saving all keys into
>>>>>> the database is a lot of waste.
>>>>>>
>>>>>> From the source code, I think it could be added easily: StateDStream
>>>>>> can get prevStateRDD, so it can compute a diff. Is there any chance to
>>>>>> add this as an API of StateDStream? If so, I can work on this feature.
>>>>>>
>>>>>> If not, is there any workaround or hack to do this myself?
>>>>>>
>>>>>
>>>>>
>>>
>
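Filling in the pseudocode above, a sketch of per-task connection handling with a plain JDBC connection; the JDBC URL, table name, and SQL statement are placeholders, and the connection is opened once per task and closed by a task-completion listener:

import java.sql.DriverManager
import org.apache.spark.TaskContext

// Sketch only: one JDBC connection per task, closed when the task finishes.
val updateFunc = (it: Iterator[(String, Seq[Int], Option[Int])]) => {
  val conn = DriverManager.getConnection("jdbc:postgresql://db-host/state") // assumed URL
  TaskContext.get().addTaskCompletionListener(_ => conn.close())
  it.map { case (key, values, state) =>
    val newCount = state.getOrElse(0) + values.sum
    if (values.nonEmpty) {
      // Only keys that received new values in this batch reach the database.
      val st = conn.prepareStatement("UPDATE key_state SET cnt = ? WHERE k = ?") // assumed schema
      st.setInt(1, newCount)
      st.setString(2, key)
      st.executeUpdate()
      st.close()
    }
    (key, newCount)
  }
}

Because the iterator is consumed lazily while the task runs, the connection stays open for the whole partition and is released by the completion listener rather than inside the loop.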


Checkpoint directory structure

2015-09-23 Thread Bin Wang
I find the checkpoint directory structure is like this:

-rw-r--r--   1 root root 134820 2015-09-23 16:55 /user/root/checkpoint/checkpoint-144299850
-rw-r--r--   1 root root 134768 2015-09-23 17:00 /user/root/checkpoint/checkpoint-144299880
-rw-r--r--   1 root root 134895 2015-09-23 17:05 /user/root/checkpoint/checkpoint-144299910
-rw-r--r--   1 root root 134899 2015-09-23 17:10 /user/root/checkpoint/checkpoint-144299940
-rw-r--r--   1 root root 134913 2015-09-23 17:15 /user/root/checkpoint/checkpoint-144299970
-rw-r--r--   1 root root 134928 2015-09-23 17:20 /user/root/checkpoint/checkpoint-14430
-rw-r--r--   1 root root 134987 2015-09-23 17:25 /user/root/checkpoint/checkpoint-144300030
-rw-r--r--   1 root root 134944 2015-09-23 17:30 /user/root/checkpoint/checkpoint-144300060
-rw-r--r--   1 root root 134956 2015-09-23 17:35 /user/root/checkpoint/checkpoint-144300090
-rw-r--r--   1 root root 135244 2015-09-23 17:40 /user/root/checkpoint/checkpoint-144300120
drwxr-xr-x   - root root      0 2015-09-23 18:48 /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2
drwxr-xr-x   - root root      0 2015-09-23 17:44 /user/root/checkpoint/receivedBlockMetadata


I restarted Spark and it reads from
/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2. But it seems
that some RDDs in it are lost, so it is not able to recover. Meanwhile I
see other entries in checkpoint/, like
/user/root/checkpoint/checkpoint-144300120. What are those used for?
Can I recover my data from them?
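For context, the usual recovery pattern around such a directory is StreamingContext.getOrCreate. A minimal sketch, assuming the hdfs:///user/root/checkpoint path and the 5-minute batch interval visible in the listing, with the DStream graph left as a stub; typically the checkpoint-<timestamp> files hold serialized streaming metadata while the UUID-named subdirectory holds checkpointed RDD data:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: recover from the checkpoint if it exists, otherwise build a fresh context.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("checkpoint-demo") // app name is a placeholder
  val ssc = new StreamingContext(conf, Seconds(300))       // 5-minute batches, as in the listing
  ssc.checkpoint("hdfs:///user/root/checkpoint")
  // ... define input streams and the DStream graph here ...
  ssc
}

val ssc = StreamingContext.getOrCreate("hdfs:///user/root/checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()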


Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
I've attached the full log. The error is like this:

15/09/23 17:47:39 ERROR yarn.ApplicationMaster: User class threw exception:
java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist:
hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909
java.lang.IllegalArgumentException: requirement failed: Checkpoint directory does not exist:
hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.rdd.ReliableCheckpointRDD.<init>(ReliableCheckpointRDD.scala:45)
    at org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1227)
    at org.apache.spark.SparkContext$$anonfun$checkpointFile$1.apply(SparkContext.scala:1227)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
    at org.apache.spark.SparkContext.checkpointFile(SparkContext.scala:1226)
    at org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:112)
    at org.apache.spark.streaming.dstream.DStreamCheckpointData$$anonfun$restore$1.apply(DStreamCheckpointData.scala:109)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at org.apache.spark.streaming.dstream.DStreamCheckpointData.restore(DStreamCheckpointData.scala:109)
    at org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:487)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$restoreCheckpointData$2.apply(DStream.scala:488)
    at scala.collection.immutable.List.foreach(List.scala:318)
    at org.apache.spark.streaming.dstream.DStream.restoreCheckpointData(DStream.scala:488)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
    at org.apache.spark.streaming.DStreamGraph$$anonfun$restoreCheckpointData$2.apply(DStreamGraph.scala:153)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.streaming.DStreamGraph.restoreCheckpointData(DStreamGraph.scala:153)
    at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:158)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
    at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:837)
    at scala.Option.map(Option.scala:145)
    at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:837)
    at com.appadhoc.data.main.StatCounter$.main(StatCounter.scala:51)
    at com.appadhoc.data.main.StatCounter.main(StatCounter.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)
15/09/23 17:47:39 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 15,
(reason: User class threw exception: java.lang.IllegalArgumentException: requirement failed:
Checkpoint directory does not exist:
hdfs://szq2.appadhoc.com:8020/user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2/rdd-26909)
15/09/23 17:47:39 INFO spark.SparkContext: Invoking stop() from shutdown hook


Tathagata Das <tathagata.das1...@gmail.com>于2015年9月24日周四 上午9:45写道:

> Could you provide the logs on when and how you are seeing this error?
>
> On Wed, Sep 23, 2015 at 6:32 PM, Bin Wang <wbi...@gmail.com> wrote:
>
>> BTW, I just ki

Re: Checkpoint directory structure

2015-09-23 Thread Bin Wang
BTW, I just killed the application and restarted it. Then the application
could not recover from the checkpoint because some RDDs were lost. So I'm
wondering: if there is some failure in the application, is it possible that
it won't be able to recover from the checkpoint?

Bin Wang <wbi...@gmail.com>于2015年9月23日周三 下午6:58写道:

> I find the checkpoint directory structure is like this:
>
> -rw-r--r--   1 root root 134820 2015-09-23 16:55 /user/root/checkpoint/checkpoint-144299850
> -rw-r--r--   1 root root 134768 2015-09-23 17:00 /user/root/checkpoint/checkpoint-144299880
> -rw-r--r--   1 root root 134895 2015-09-23 17:05 /user/root/checkpoint/checkpoint-144299910
> -rw-r--r--   1 root root 134899 2015-09-23 17:10 /user/root/checkpoint/checkpoint-144299940
> -rw-r--r--   1 root root 134913 2015-09-23 17:15 /user/root/checkpoint/checkpoint-144299970
> -rw-r--r--   1 root root 134928 2015-09-23 17:20 /user/root/checkpoint/checkpoint-14430
> -rw-r--r--   1 root root 134987 2015-09-23 17:25 /user/root/checkpoint/checkpoint-144300030
> -rw-r--r--   1 root root 134944 2015-09-23 17:30 /user/root/checkpoint/checkpoint-144300060
> -rw-r--r--   1 root root 134956 2015-09-23 17:35 /user/root/checkpoint/checkpoint-144300090
> -rw-r--r--   1 root root 135244 2015-09-23 17:40 /user/root/checkpoint/checkpoint-144300120
> drwxr-xr-x   - root root      0 2015-09-23 18:48 /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2
> drwxr-xr-x   - root root      0 2015-09-23 17:44 /user/root/checkpoint/receivedBlockMetadata
>
>
> I restarted Spark and it reads from
> /user/root/checkpoint/d3714249-e03a-45c7-a0d5-1dc870b7d9f2. But it seems
> that some RDDs in it are lost, so it is not able to recover. Meanwhile I
> see other entries in checkpoint/, like
> /user/root/checkpoint/checkpoint-144300120. What are those used for?
> Can I recover my data from them?
>


Get only updated RDDs from or after updateStateBykey

2015-09-23 Thread Bin Wang
I've read the source code and it seems to be impossible, but I'd like to
confirm it.

It would be a very useful feature. For example, I need to store the state of
the DStream into my database, in order to recover it after the next redeploy.
But I only need to save the updated ones. Saving all keys into the database
is a lot of waste.

From the source code, I think it could be added easily: StateDStream can
get prevStateRDD, so it can compute a diff. Is there any chance to add
this as an API of StateDStream? If so, I can work on this feature.

If not, is there any workaround or hack to do this myself?
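Absent such an API, one rough way to approximate the diff from outside is to snapshot the state stream each batch and compare consecutive snapshots. A minimal sketch, assuming String keys, Int state, and a hypothetical saveToDb helper; it is not equivalent to reading prevStateRDD inside StateDStream, and in practice the snapshots would need caching or checkpointing to keep lineage short:

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Hypothetical persistence helper.
def saveToDb(key: String, value: Int): Unit = ()

def saveOnlyChanged(state: DStream[(String, Int)]): Unit = {
  // The previous batch's state, tracked on the driver as an RDD reference.
  var previous: Option[RDD[(String, Int)]] = None

  state.foreachRDD { current =>
    val changed = previous match {
      case None => current // first batch: every key counts as updated
      case Some(prev) =>
        // Keep keys that are new or whose value differs from the previous snapshot.
        current.leftOuterJoin(prev).collect {
          case (k, (now, old)) if !old.exists(_ == now) => (k, now)
        }
    }
    changed.foreachPartition(_.foreach { case (k, v) => saveToDb(k, v) })
    previous = Some(current)
  }
}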


Re: Why there is no snapshots for 1.5 branch?

2015-09-22 Thread Bin Wang
Thanks. I've solved it. I modified pom.xml, added my own repo to it, and then
used "mvn deploy".

Fengdong Yu <fengdo...@everstring.com>于2015年9月22日周二 下午2:08写道:

> Basically, you can build a snapshot yourself.
>
> Just clone the source code, and then 'mvn package/deploy/install…'.
>
>
> Azuryy Yu
>
>
>
> On Sep 22, 2015, at 13:36, Bin Wang <wbi...@gmail.com> wrote:
>
> However, I found some scripts in dev/audit-release; can I use them?
>
> Bin Wang <wbi...@gmail.com>于2015年9月22日周二 下午1:34写道:
>
>> No, I mean pushing Spark to my private repository. Spark doesn't have a
>> build.sbt as far as I can see.
>>
>> Fengdong Yu <fengdo...@everstring.com>于2015年9月22日周二 下午1:29写道:
>>
>>> Do you mean you want to publish the artifact to your private repository?
>>>
>>> If so, please use ‘sbt publish’.
>>>
>>> Add the following to your build.sbt:
>>>
>>> publishTo := {
>>>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>>>   if (version.value.endsWith("SNAPSHOT"))
>>> Some("snapshots" at nexus + "content/repositories/snapshots")
>>>   else
>>> Some("releases"  at nexus + "content/repositories/releases")
>>>
>>> }
>>>
>>>
>>>
>>> On Sep 22, 2015, at 13:26, Bin Wang <wbi...@gmail.com> wrote:
>>>
>>> My project uses sbt (or Maven), which needs to download dependencies
>>> from a Maven repo. I have my own private Maven repo with Nexus, but I don't
>>> know how to push my own build to it. Can you give me a hint?
>>>
>>> Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午1:25写道:
>>>
>>>> Yeah, whoever is maintaining the scripts and snapshot builds has fallen
>>>> down on the job -- but there is nothing preventing you from checking out
>>>> branch-1.5 and creating your own build, which is arguably a smarter thing
>>>> to do anyway.  If I'm going to use a non-release build, then I want the
>>>> full git commit history of exactly what is in that build readily available,
>>>> not just somewhat arbitrary JARs.
>>>>
>>>> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>>
>>>>> But I cannot find 1.5.1-SNAPSHOT either at
>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>>>>>
>>>>> Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午12:55写道:
>>>>>
>>>>>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
>>>>>> The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1
>>>>>> release candidates and then the 1.5.1 release.
>>>>>>
>>>>>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>>>>
>>>>>>> I'd like to use some important bug fixes in the 1.5 branch, and I looked
>>>>>>> on the Apache Maven host, but couldn't find any snapshot for the 1.5 branch.
>>>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>>>>>
>>>>>>> I can find 1.4.X and 1.6.0 versions; why is there no snapshot for
>>>>>>> 1.5.X?
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>
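For anyone taking the same route, a minimal build.sbt sketch of consuming such a privately deployed snapshot; the repository name, URL, and exact snapshot version are placeholders, not something confirmed in the thread:

// build.sbt (sketch): resolve the privately deployed Spark snapshot from Nexus.
resolvers += "private-nexus-snapshots" at
  "https://YOUR_PRIVATE_REPO_HOSTS/content/repositories/snapshots"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1-SNAPSHOT"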


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
But I cannot find 1.5.1-SNAPSHOT either at
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/

Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午12:55写道:

> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The
> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release
> candidates and then the 1.5.1 release.
>
> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang <wbi...@gmail.com> wrote:
>
>> I'd like to use some important bug fixes in the 1.5 branch, and I looked on
>> the Apache Maven host, but couldn't find any snapshot for the 1.5 branch.
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>
>> I can find 1.4.X and 1.6.0 versions; why is there no snapshot for 1.5.X?
>>
>
>


Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
I'd like to use some important bug fixes in the 1.5 branch, and I looked on the
Apache Maven host, but couldn't find any snapshot for the 1.5 branch.
https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/

I can find 1.4.X and 1.6.0 versions; why is there no snapshot for 1.5.X?


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
My project uses sbt (or Maven), which needs to download dependencies from
a Maven repo. I have my own private Maven repo with Nexus, but I don't know
how to push my own build to it. Can you give me a hint?

Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午1:25写道:

> Yeah, whoever is maintaining the scripts and snapshot builds has fallen
> down on the job -- but there is nothing preventing you from checking out
> branch-1.5 and creating your own build, which is arguably a smarter thing
> to do anyway.  If I'm going to use a non-release build, then I want the
> full git commit history of exactly what is in that build readily available,
> not just somewhat arbitrary JARs.
>
> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang <wbi...@gmail.com> wrote:
>
>> But I cannot find 1.5.1-SNAPSHOT either at
>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>>
>> Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午12:55写道:
>>
>>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.  The
>>> current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1 release
>>> candidates and then the 1.5.1 release.
>>>
>>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>
>>>> I'd like to use some important bug fixes in the 1.5 branch, and I looked
>>>> on the Apache Maven host, but couldn't find any snapshot for the 1.5 branch.
>>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>>
>>>> I can find 1.4.X and 1.6.0 versions; why is there no snapshot for 1.5.X?
>>>>
>>>
>>>
>


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
However, I found some scripts in dev/audit-release; can I use them?

Bin Wang <wbi...@gmail.com>于2015年9月22日周二 下午1:34写道:

> No, I mean pushing Spark to my private repository. Spark doesn't have a
> build.sbt as far as I can see.
>
> Fengdong Yu <fengdo...@everstring.com>于2015年9月22日周二 下午1:29写道:
>
>> Do you mean you want to publish the artifact to your private repository?
>>
>> If so, please use ‘sbt publish’.
>>
>> Add the following to your build.sbt:
>>
>> publishTo := {
>>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>>   if (version.value.endsWith("SNAPSHOT"))
>> Some("snapshots" at nexus + "content/repositories/snapshots")
>>   else
>> Some("releases"  at nexus + "content/repositories/releases")
>>
>> }
>>
>>
>>
>> On Sep 22, 2015, at 13:26, Bin Wang <wbi...@gmail.com> wrote:
>>
>> My project uses sbt (or Maven), which needs to download dependencies
>> from a Maven repo. I have my own private Maven repo with Nexus, but I don't
>> know how to push my own build to it. Can you give me a hint?
>>
>> Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午1:25写道:
>>
>>> Yeah, whoever is maintaining the scripts and snapshot builds has fallen
>>> down on the job -- but there is nothing preventing you from checking out
>>> branch-1.5 and creating your own build, which is arguably a smarter thing
>>> to do anyway.  If I'm going to use a non-release build, then I want the
>>> full git commit history of exactly what is in that build readily available,
>>> not just somewhat arbitrary JARs.
>>>
>>> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>
>>>> But I cannot find 1.5.1-SNAPSHOT either at
>>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>>>>
>>>> Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午12:55写道:
>>>>
>>>>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
>>>>> The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1
>>>>> release candidates and then the 1.5.1 release.
>>>>>
>>>>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>>>
>>>>>> I'd like to use some important bug fixes in the 1.5 branch, and I looked
>>>>>> on the Apache Maven host, but couldn't find any snapshot for the 1.5 branch.
>>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>>>>
>>>>>> I can find 1.4.X and 1.6.0 versions; why is there no snapshot for
>>>>>> 1.5.X?
>>>>>>
>>>>>
>>>>>
>>>
>>


Re: Why there is no snapshots for 1.5 branch?

2015-09-21 Thread Bin Wang
No, I mean pushing Spark to my private repository. Spark doesn't have a
build.sbt as far as I can see.

Fengdong Yu <fengdo...@everstring.com>于2015年9月22日周二 下午1:29写道:

> Do you mean you want to publish the artifact to your private repository?
>
> If so, please use ‘sbt publish’.
>
> Add the following to your build.sbt:
>
> publishTo := {
>   val nexus = "https://YOUR_PRIVATE_REPO_HOSTS/"
>   if (version.value.endsWith("SNAPSHOT"))
> Some("snapshots" at nexus + "content/repositories/snapshots")
>   else
> Some("releases"  at nexus + "content/repositories/releases")
>
> }
>
>
>
> On Sep 22, 2015, at 13:26, Bin Wang <wbi...@gmail.com> wrote:
>
> My project uses sbt (or Maven), which needs to download dependencies from
> a Maven repo. I have my own private Maven repo with Nexus, but I don't know
> how to push my own build to it. Can you give me a hint?
>
> Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午1:25写道:
>
>> Yeah, whoever is maintaining the scripts and snapshot builds has fallen
>> down on the job -- but there is nothing preventing you from checking out
>> branch-1.5 and creating your own build, which is arguably a smarter thing
>> to do anyway.  If I'm going to use a non-release build, then I want the
>> full git commit history of exactly what is in that build readily available,
>> not just somewhat arbitrary JARs.
>>
>> On Mon, Sep 21, 2015 at 9:57 PM, Bin Wang <wbi...@gmail.com> wrote:
>>
>>> But I cannot find 1.5.1-SNAPSHOT either at
>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/
>>>
>>> Mark Hamstra <m...@clearstorydata.com>于2015年9月22日周二 下午12:55写道:
>>>
>>>> There is no 1.5.0-SNAPSHOT because 1.5.0 has already been released.
>>>> The current head of branch-1.5 is 1.5.1-SNAPSHOT -- soon to be 1.5.1
>>>> release candidates and then the 1.5.1 release.
>>>>
>>>> On Mon, Sep 21, 2015 at 9:51 PM, Bin Wang <wbi...@gmail.com> wrote:
>>>>
>>>>> I'd like to use some important bug fixes in the 1.5 branch, and I looked
>>>>> on the Apache Maven host, but couldn't find any snapshot for the 1.5 branch.
>>>>> https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.10/1.5.0-SNAPSHOT/
>>>>>
>>>>> I can find 1.4.X and 1.6.0 versions; why is there no snapshot for
>>>>> 1.5.X?
>>>>>
>>>>
>>>>
>>
>


Re: QueueStream doesn't support checkpoint makes it difficult to do unit test

2015-09-17 Thread Bin Wang
Never mind. I've found a PR and it has been merged:
https://github.com/apache/spark/pull/8624/commits

Bin Wang <wbi...@gmail.com>于2015年9月17日周四 下午4:50写道:

> I'm using Spark Streaming with updateStateByKey, which forces me to use
> checkpointing. In my unit tests, I create a queueStream to test with. But in
> Spark 1.5, QueueStream throws an exception when used with checkpointing, which
> makes unit testing difficult. Is there an option to disable this? I know it
> will fail to recover from the checkpoint, but since it is a test I don't care.
>
> I've found the git commit here
> https://mail-archives.apache.org/mod_mbox/incubator-spark-commits/201506.mbox/%3c8efccd27016447fb8d1e0b3d9582b...@git.apache.org%3E
>
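For reference, a minimal sketch of the kind of test setup being described: a local StreamingContext driving updateStateByKey from a queueStream. The local master, one-second batch interval, and throwaway checkpoint directory are assumptions, and whether checkpointing such a stream throws or merely warns depends on the Spark version, per the PR linked above:

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch only: feed batches by hand through a queueStream in a unit test.
val conf = new SparkConf().setMaster("local[2]").setAppName("queue-stream-test")
val ssc = new StreamingContext(conf, Seconds(1))
ssc.checkpoint("/tmp/queue-stream-test-checkpoint") // throwaway directory

val queue = mutable.Queue.empty[RDD[(String, Int)]]
val counts = ssc.queueStream(queue)
  .updateStateByKey[Int]((values, state) => Some(state.getOrElse(0) + values.sum))
counts.print()

queue += ssc.sparkContext.parallelize(Seq(("spark", 1), ("streaming", 2)))
ssc.start()
// ... let a few batches run and assert on the output ...
ssc.stop(stopSparkContext = true, stopGracefully = false)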


Optimize the first map reduce of DStream

2015-03-24 Thread Bin Wang
Hi,

I'm learning Spark, and I think there could be an optimization in the current
streaming implementation. Correct me if I'm wrong.

The current streaming implementation puts the data of one batch into memory
(as an RDD). But that seems unnecessary.

For example, if I want to count the lines that contain the word "Spark", I
just need to map every line to check whether it contains the word, then reduce
with a sum. After that, there is no reason to keep the line in memory.

That is to say, if the DStream only has one map and/or reduce operation on it,
it is not necessary to keep all the batch data in memory. Something like a
pipeline should be OK.

Is it difficult to implement on top of the current implementation?

Thanks.

---
Bin Wang
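
For concreteness, the per-batch count described in the message might look like this; lines is an assumed DStream[String] from some input source:

import org.apache.spark.streaming.dstream.DStream

// Sketch of the example above: per batch, count the lines containing "Spark".
def countSparkLines(lines: DStream[String]): Unit = {
  val flags = lines.map(line => if (line.contains("Spark")) 1L else 0L)
  val perBatchCount = flags.reduce(_ + _) // one number per batch interval
  perBatchCount.print()
}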