Re: Is there any way to stop a jenkins build

2015-12-29 Thread Josh Rosen
Yeah, I thought that my quick fix might address the
HiveThriftBinaryServerSuite hanging issue, but it looks like it didn't work,
so I'll now have to do the more principled fix of using a UDF that sleeps
for some amount of time.

In order to stop builds, you need to have a Jenkins account with the proper
permissions. I believe that it's generally only Spark committers and AMPLab
members who have accounts + Jenkins SSH access.

I've gone ahead and killed the build for you. It looks like someone had
configured the pull request builder timeout to be 300 minutes (5 hours),
but I think we should consider decreasing that to match the timeout used by
the Spark full test jobs.
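
(For reference, a minimal sketch of what a sleeping UDF for this kind of
cancellation test could look like -- an illustrative assumption, not the
actual patch; it assumes a spark-shell-style sqlContext in scope and a
hypothetical test table:)

  // Register a UDF that sleeps per row, so a JDBC query is guaranteed to
  // still be running when the test cancels the statement.
  sqlContext.udf.register("sleep_ms", (ms: Long) => { Thread.sleep(ms); ms })

  // The JDBC test can then run something like the query below and cancel it
  // while the UDF keeps every task busy (the query shape is hypothetical):
  //   SELECT sleep_ms(10000), key FROM test_table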

On Tue, Dec 29, 2015 at 10:04 AM, Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Thanks. I'll merge the most recent master...
>
> Still curious if we can stop a build.
>
> Kind regards,
>
> Herman van Hövell tot Westerflier
>
> 2015-12-29 18:59 GMT+01:00 Ted Yu :
>
>> HiveThriftBinaryServerSuite got stuck.
>>
>> I thought Josh had fixed this issue:
>>
>> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
>> HiveThriftBinaryServerSuite
>>
>> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier <
>> hvanhov...@questtec.nl> wrote:
>>
>>> My AMPLAB jenkins build has been stuck for a few hours now:
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>>
>>> Is there a way for me to stop the build?
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
>>>
>>>
>>
>


Re: Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
Thanks. I'll merge the most recent master...

Still curious if we can stop a build.

Kind regards,

Herman van Hövell tot Westerflier

2015-12-29 18:59 GMT+01:00 Ted Yu :

> HiveThriftBinaryServerSuite got stuck.
>
> I thought Josh had fixed this issue:
>
> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
> HiveThriftBinaryServerSuite
>
> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> My AMPLAB jenkins build has been stuck for a few hours now:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>
>> Is there a way for me to stop the build?
>>
>> Kind regards,
>>
>> Herman van Hövell
>>
>>
>


Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState

2015-12-29 Thread Shixiong Zhu
Could you create a JIRA? We can continue the discussion there. Thanks!

Best Regards,
Shixiong Zhu

2015-12-29 3:42 GMT-08:00 Jan Uyttenhove :

> Hi guys,
>
> I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
> mapWithState API, after previously reporting issue SPARK-11932 (
> https://issues.apache.org/jira/browse/SPARK-11932).
>
> My Spark streaming job involves reading data from a Kafka topic (using
> KafkaUtils.createDirectStream), stateful processing (using checkpointing
> & mapWithState) & publishing the results back to Kafka.
>
> I'm now facing the NullPointerException below when restoring from a
> checkpoint in the following scenario:
> 1/ run job (with local[2]), process data from Kafka while creating &
> keeping state
> 2/ stop the job
> 3/ generate some extra messages on the input Kafka topic
> 4/ start the job again (and restore offsets & state from the checkpoints)
>
> The problem is caused by (or at least related to) step 3, i.e. publishing
> data to the input topic while the job is stopped.
> The above scenario has been tested successfully when:
> - step 3 is excluded, so restoring state from a checkpoint is successful
> when no messages are added when the job is stopped
> - after step 2, the checkpoints are deleted
>
> Any clues? Am I doing something wrong here, or is there still a problem
> with the mapWithState impl?
>
> Thanx,
>
> Jan
>
>
>
> 15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage
> 3.0 (TID 24)
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
> at
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_25_1 in memory
> on localhost:10003 (size: 1024.0 B, free: 511.1 MB)
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
> non-empty blocks out of 8 blocks
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0
> remote fetches in 0 ms
> 15/12/29 11:56:12 INFO storage.MemoryStore: Block rdd_29_1 stored as
> values in memory (estimated size 1824.0 B, free 488.0 KB)
> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_29_1 in memory
> on localhost:10003 (size: 1824.0 B, free: 511.1 MB)
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
> non-empty blocks out of 8 blocks
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0
> remote fetches in 0 ms
> 15/12/29 11:56:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage
> 3.0 (TID 24, localhost): java.lang.NullPointerException
> at
> 

Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
My AMPLAB jenkins build has been stuck for a few hours now:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull

Is there a way for me to stop the build?

Kind regards,

Herman van Hövell


Re: Is there any way to stop a jenkins build

2015-12-29 Thread Herman van Hövell tot Westerflier
Hi Josh,

Your HiveThriftBinaryServerSuite fix wasn't in the build I was running (I
forgot to merge the latest master). So it might actually work.

As for stopping the build, it is understandable that you cannot do that
without the proper permissions. It would still be cool to be able to issue
a 'stop build' command from github though.

Kind regards,

Herman

2015-12-29 19:19 GMT+01:00 Josh Rosen :

> Yeah, I thought that my quick fix might address the
> HiveThriftBinaryServerSuite hanging issue, but it looks like it didn't work,
> so I'll now have to do the more principled fix of using a UDF that sleeps
> for some amount of time.
>
> In order to stop builds, you need to have a Jenkins account with the
> proper permissions. I believe that it's generally only Spark committers and
> AMPLab members who have accounts + Jenkins SSH access.
>
> I've gone ahead and killed the build for you. It looks like someone had
> configured the pull request builder timeout to be 300 minutes (5 hours),
> but I think we should consider decreasing that to match the timeout used by
> the Spark full test jobs.
>
> On Tue, Dec 29, 2015 at 10:04 AM, Herman van Hövell tot Westerflier <
> hvanhov...@questtec.nl> wrote:
>
>> Thanks. I'll merge the most recent master...
>>
>> Still curious if we can stop a build.
>>
>> Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>> 2015-12-29 18:59 GMT+01:00 Ted Yu :
>>
>>> HiveThriftBinaryServerSuite got stuck.
>>>
>>> I thought Josh had fixed this issue:
>>>
>>> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
>>> HiveThriftBinaryServerSuite
>>>
>>> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier <
>>> hvanhov...@questtec.nl> wrote:
>>>
 My AMPLAB jenkins build has been stuck for a few hours now:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull

 Is there a way for me to stop the build?

 Kind regards,

 Herman van Hövell


>>>
>>
>


Re: RDD[Vector] Immutability issue

2015-12-29 Thread ai he
Hi salexln,

An RDD's immutability depends on the underlying data structure. Consider the
following example.

--
scala> val m = Array.fill(2, 2)(0)
m: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))


scala> val rdd = sc.parallelize(m)
rdd: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[1]
at parallelize at <console>:23


scala> rdd.collect()
res6: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))


scala> m(0)(1) = 2


scala> rdd.collect()
res8: Array[Array[Int]] = Array(Array(0, 2), Array(0, 0))
--

You can see that the contents of rdd change when its underlying array
changes. Hopefully this helps.

Best,
Ai

On Mon, Dec 28, 2015 at 12:36 PM, salexln  wrote:
> Hi guys,
> I know that RDDs are immutable and therefore their values cannot be changed,
> but I see the following behaviour:
> I wrote an implementation of the FuzzyCMeans algorithm and now I'm testing it,
> so I ran the following example:
>
> import org.apache.spark.mllib.clustering.FuzzyCMeans
> import org.apache.spark.mllib.linalg.Vectors
>
> val data =
> sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt")
> val parsedData = data.map(s => Vectors.dense(s.split('
> ').map(_.toDouble))).cache()
>> parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
>> = MapPartitionsRDD[2] at map at <console>:31
>
> val numClusters = 2
> val numIterations = 20
>
> parsedData.foreach{ point => println(point) }
>> [0.0,-8.0]
> [-3.0,-2.0]
> [-3.0,0.0]
> [-3.0,2.0]
> [-2.0,-1.0]
> [-2.0,0.0]
> [-2.0,1.0]
> [-1.0,0.0]
> [0.0,0.0]
> [1.0,0.0]
> [2.0,-1.0]
> [2.0,0.0]
> [2.0,1.0]
> [3.0,-2.0]
> [3.0,0.0]
> [3.0,2.0]
> [0.0,8.0]
>
> val clusters = FuzzyCMeans.train(parsedData, numClusters, numIterations)
> parsedData.foreach{ point => println(point) }
>>
> [0.0,-0.480185624595]
> [-0.1811743096972924,-0.12078287313152826]
> [-0.06638890786148487,0.0]
> [-0.04005925925925929,0.02670617283950619]
> [-0.12193263222069807,-0.060966316110349035]
> [-0.0512,0.0]
> [NaN,NaN]
> [-0.049382716049382706,0.0]
> [NaN,NaN]
> [0.006830134553650707,0.0]
> [0.05122,-0.02561]
> [0.04755220304297078,0.0]
> [0.06581619798335057,0.03290809899167529]
> [0.12010867103812725,-0.0800724473587515]
> [0.10946638900458144,0.0]
> [0.14814814814814817,0.09876543209876545]
> [0.0,0.49119985188436205]
>
>
>
> But how can it be that my method changes the immutable RDD?
>
> BTW, the signature of the train method is the following:
>
> train( data: RDD[Vector], clusters: Int, maxIterations: Int)
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Vector-Immutability-issue-tp15827.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>



-- 
Best
Ai

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: RDD[Vector] Immutability issue

2015-12-29 Thread ai he
Same thing.

Say your underlying structure is something like Array(ArrayBuffer(1, 2),
ArrayBuffer(3, 4)).

Then you can add/remove data in the ArrayBuffers, and the change will
be reflected in the RDD.
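
A quick sketch of that, in the same spirit as the Array example earlier in
the thread (assuming a spark-shell with sc in scope, local mode, no caching):

  import scala.collection.mutable.ArrayBuffer

  val m = Array(ArrayBuffer(1, 2), ArrayBuffer(3, 4))
  val rdd = sc.parallelize(m)

  rdd.collect()   // Array(ArrayBuffer(1, 2), ArrayBuffer(3, 4))
  m(0) += 5       // mutate the driver-side buffer that backs the RDD
  rdd.collect()   // Array(ArrayBuffer(1, 2, 5), ArrayBuffer(3, 4))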



On Tue, Dec 29, 2015 at 11:19 AM, salexln  wrote:
> I see, so in order for the RDD to be completely immutable, its content should be
> immutable as well.
>
> And if the content is not immutable, we can change its content, but cannot
> add / remove data?
>
>
>
>
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Vector-Immutability-issue-tp15827p15841.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>



-- 
Best
Ai

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: RDD[Vector] Immutability issue

2015-12-29 Thread Mark Hamstra
You can, but you shouldn't.  Using backdoors to mutate the data in an RDD
is a good way to produce confusing and inconsistent results when, e.g., an
RDD's lineage needs to be recomputed or a Task is resubmitted on fetch
failure.
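
A hedged sketch of the safer pattern for the ArrayBuffer scenario in this
thread (assuming sc in scope): derive a new RDD through a transformation
instead of mutating the backing collection, so the lineage records the change.

  import scala.collection.mutable.ArrayBuffer

  // Build a new RDD instead of mutating the buffers behind the old one;
  // recomputation and task retries then stay consistent.
  val rdd = sc.parallelize(Array(ArrayBuffer(1, 2), ArrayBuffer(3, 4)))
  val updated = rdd.map(buf => buf :+ 5)   // new collection per element; rdd is untouched
  updated.cache()                          // optional: avoid recomputing the transformation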

On Tue, Dec 29, 2015 at 11:24 AM, ai he  wrote:

> Same thing.
>
> Say, your underlying structure is like Array(ArrayBuffer(1, 2),
> ArrayBuffer(3, 4)).
>
> Then you can add/remove data in ArrayBuffers and then the change will
> be reflected in the rdd.
>
>
>
> On Tue, Dec 29, 2015 at 11:19 AM, salexln  wrote:
> > I see, so in order for the RDD to be completely immutable, its content
> should be
> > immutable as well.
> >
> > And if the content is not immutable, we can change its content, but
> cannot
> > add / remove data?
> >
> >
> >
> >
> > --
> > View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Vector-Immutability-issue-tp15827p15841.html
> > Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
> >
> > -
> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> > For additional commands, e-mail: dev-h...@spark.apache.org
> >
>
>
>
> --
> Best
> Ai
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: RDD[Vector] Immutability issue

2015-12-29 Thread Vivekananda Venkateswaran
An RDD is a collection of objects, and if these objects are mutable and are
changed, the changes will be reflected in the RDD; for immutable objects they
will not. Changing mutable objects that are inside an RDD is not good practice.

The RDD is immutable in the sense that any transformation on the RDD will
result in a new RDD object.


On Tue, Dec 29, 2015 at 2:50 PM, Mark Hamstra 
wrote:

> You can, but you shouldn't.  Using backdoors to mutate the data in an RDD
> is a good way to produce confusing and inconsistent results when, e.g., an
> RDD's lineage needs to be recomputed or a Task is resubmitted on fetch
> failure.
>
> On Tue, Dec 29, 2015 at 11:24 AM, ai he  wrote:
>
>> Same thing.
>>
>> Say, your underlying structure is like Array(ArrayBuffer(1, 2),
>> ArrayBuffer(3, 4)).
>>
>> Then you can add/remove data in ArrayBuffers and then the change will
>> be reflected in the rdd.
>>
>>
>>
>> On Tue, Dec 29, 2015 at 11:19 AM, salexln  wrote:
>> > I see, so in order for the RDD to be completely immutable, its content
>> should be
>> > immutable as well.
>> >
>> > And if the content is not immutable, we can change its content, but
>> cannot
>> > add / remove data?
>> >
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Vector-Immutability-issue-tp15827p15841.html
>> > Sent from the Apache Spark Developers List mailing list archive at
>> Nabble.com.
>> >
>> > -
>> > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: dev-h...@spark.apache.org
>> >
>>
>>
>>
>> --
>> Best
>> Ai
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>


Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState

2015-12-29 Thread Shixiong(Ryan) Zhu
Hi Jan, could you post your code? I could not reproduce this issue in my
environment.

Best Regards,
Shixiong Zhu

2015-12-29 10:22 GMT-08:00 Shixiong Zhu :

> Could you create a JIRA? We can continue the discussion there. Thanks!
>
> Best Regards,
> Shixiong Zhu
>
> 2015-12-29 3:42 GMT-08:00 Jan Uyttenhove :
>
>> Hi guys,
>>
>> I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
>> mapWithState API, after previously reporting issue SPARK-11932 (
>> https://issues.apache.org/jira/browse/SPARK-11932).
>>
>> My Spark streaming job involves reading data from a Kafka topic
>> (using KafkaUtils.createDirectStream), stateful processing (using
>> checkpointing & mapWithState) & publishing the results back to Kafka.
>>
>> I'm now facing the NullPointerException below when restoring from a
>> checkpoint in the following scenario:
>> 1/ run job (with local[2]), process data from Kafka while creating &
>> keeping state
>> 2/ stop the job
>> 3/ generate some extra messages on the input Kafka topic
>> 4/ start the job again (and restore offsets & state from the checkpoints)
>>
>> The problem is caused by (or at least related to) step 3, i.e. publishing
>> data to the input topic while the job is stopped.
>> The above scenario has been tested successfully when:
>> - step 3 is excluded, so restoring state from a checkpoint is successful
>> when no messages are added when the job is stopped
>> - after step 2, the checkpoints are deleted
>>
>> Any clues? Am I doing something wrong here, or is there still a problem
>> with the mapWithState impl?
>>
>> Thanx,
>>
>> Jan
>>
>>
>>
>> 15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage
>> 3.0 (TID 24)
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at
>> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_25_1 in memory
>> on localhost:10003 (size: 1024.0 B, free: 511.1 MB)
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
>> non-empty blocks out of 8 blocks
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0
>> remote fetches in 0 ms
>> 15/12/29 11:56:12 INFO storage.MemoryStore: Block rdd_29_1 stored as
>> values in memory (estimated size 1824.0 B, free 488.0 KB)
>> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_29_1 in memory
>> on localhost:10003 (size: 1824.0 B, free: 511.1 MB)
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
>> non-empty blocks out of 8 blocks
>> 15/12/29 

Re: running lda in spark throws exception

2015-12-29 Thread Joseph Bradley
Hi Li,

I'm wondering if you're running into the same bug reported here:
https://issues.apache.org/jira/browse/SPARK-12488

I haven't figured out yet what is causing it.  Do you have a small corpus
which reproduces this error, and which you can share on the JIRA?  If so,
that would help a lot in debugging this failure.

Thanks!
Joseph

On Sun, Dec 27, 2015 at 7:26 PM, Li Li  wrote:

> I ran my LDA example on a YARN 2.6.2 cluster with Spark 1.5.2.
> It throws an exception at the line:   Matrix topics = ldaModel.topicsMatrix();
> But in the YARN job history UI, it's successful. What's wrong with it?
> I submit the job with:
> ./bin/spark-submit --class Myclass \
> --master yarn-client \
> --num-executors 2 \
> --driver-memory 4g \
> --executor-memory 4g \
> --executor-cores 1 \
>
>
> My codes:
>
>corpus.cache();
>
>
> // Cluster the documents into three topics using LDA
>
> DistributedLDAModel ldaModel = (DistributedLDAModel) new
>
> LDA().setOptimizer("em").setMaxIterations(iterNumber).setK(topicNumber).run(corpus);
>
>
> // Output topics. Each is a distribution over words (matching word
> count vectors)
>
> System.out.println("Learned topics (as distributions over vocab of
> " + ldaModel.vocabSize()
>
> + " words):");
>
>//Line81, exception here:Matrix topics = ldaModel.topicsMatrix();
>
> for (int topic = 0; topic < topicNumber; topic++) {
>
>   System.out.print("Topic " + topic + ":");
>
>   for (int word = 0; word < ldaModel.vocabSize(); word++) {
>
> System.out.print(" " + topics.apply(word, topic));
>
>   }
>
>   System.out.println();
>
> }
>
>
> ldaModel.save(sc.sc(), modelPath);
>
>
> Exception in thread "main" java.lang.IndexOutOfBoundsException:
> (1025,0) not in [-58,58) x [-100,100)
>
> at
> breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:534)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:531)
>
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
>
> at
> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
>
> at
> com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:81)
>
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:606)
>
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>
> at
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>
> at
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
> 15/12/23 00:01:16 INFO spark.SparkContext: Invoking stop() from shutdown
> hook
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


IndentationCheck of checkstyle

2015-12-29 Thread Ted Yu
Hi,
I noticed that there are a lot of checkstyle warnings in the following form:

  source="com.puppycrawl.tools.checkstyle.checks.indentation.IndentationCheck"/>

To my knowledge, we use two spaces for each tab. Not sure why all of a
sudden we have so many IndentationCheck warnings:

grep 'hild have incorrect indentati' trunkCheckstyle.xml | wc
3133   52645  678294

If there is no objection, I would create a JIRA and relax the
IndentationCheck warning.

Cheers


problem with reading source code-pull out nondeterministic expresssions

2015-12-29 Thread 汪洋
Hi fellas,
I am new to Spark and I have a newbie question. I am currently reading the
source code of the Spark SQL Catalyst analyzer. I do not quite understand the
partial function in PullOutNondeterministic. What does "pull out" mean? Why do
we have to do the "pulling out"?
I would really appreciate it if somebody could explain it to me.
Thanks.

Re: IndentationCheck of checkstyle

2015-12-29 Thread Reynold Xin
OK to close the loop - this thread has nothing to do with Spark?


On Tue, Dec 29, 2015 at 9:55 PM, Ted Yu  wrote:

> Oops, wrong list :-)
>
> On Dec 29, 2015, at 9:48 PM, Reynold Xin  wrote:
>
> +Herman
>
> Is this coming from the newly merged Hive parser?
>
>
>
> On Tue, Dec 29, 2015 at 9:46 PM, Allen Zhang 
> wrote:
>
>>
>>
>> format issue I think, go ahead
>>
>>
>>
>>
>> At 2015-12-30 13:36:05, "Ted Yu"  wrote:
>>
>> Hi,
>> I noticed that there are a lot of checkstyle warnings in the following
>> form:
>>
>> > source="com.puppycrawl.tools.checkstyle.
>> checks.indentation.IndentationCheck"/>
>>
>> To my knowledge, we use two spaces for each tab. Not sure why all of a
>> sudden we have so many IndentationCheck warnings:
>>
>> grep 'hild have incorrect indentati' trunkCheckstyle.xml | wc
>> 3133   52645  678294
>>
>> If there is no objection, I would create a JIRA and
>> relax IndentationCheck warning.
>>
>> Cheers
>>
>>
>>
>>
>>
>
>


Re: IndentationCheck of checkstyle

2015-12-29 Thread Ted Yu
Oops, wrong list :-)

> On Dec 29, 2015, at 9:48 PM, Reynold Xin  wrote:
> 
> +Herman
> 
> Is this coming from the newly merged Hive parser?
> 
> 
> 
>> On Tue, Dec 29, 2015 at 9:46 PM, Allen Zhang  wrote:
>> 
>> 
>> format issue I think, go ahead
>> 
>> 
>> 
>> 
>> At 2015-12-30 13:36:05, "Ted Yu"  wrote:
>> Hi,
>> I noticed that there are a lot of checkstyle warnings in the following form:
>> 
>> > source="com.puppycrawl.tools.checkstyle.   
>> checks.indentation.IndentationCheck"/>
>> 
>> To my knowledge, we use two spaces for each tab. Not sure why all of a 
>> sudden we have so many IndentationCheck warnings:
>> 
>> grep 'hild have incorrect indentati' trunkCheckstyle.xml | wc
>> 3133   52645  678294
>> 
>> If there is no objection, I would create a JIRA and relax IndentationCheck 
>> warning.
>> 
>> Cheers
> 


Re: running lda in spark throws exception

2015-12-29 Thread Li Li
I will use a portion of the data and try. Will the HDFS block affect
Spark? (If so, it's hard to reproduce.)

On Wed, Dec 30, 2015 at 3:22 AM, Joseph Bradley  wrote:
> Hi Li,
>
> I'm wondering if you're running into the same bug reported here:
> https://issues.apache.org/jira/browse/SPARK-12488
>
> I haven't figured out yet what is causing it.  Do you have a small corpus
> which reproduces this error, and which you can share on the JIRA?  If so,
> that would help a lot in debugging this failure.
>
> Thanks!
> Joseph
>
> On Sun, Dec 27, 2015 at 7:26 PM, Li Li  wrote:
>>
>> I ran my LDA example on a YARN 2.6.2 cluster with Spark 1.5.2.
>> It throws an exception at the line:   Matrix topics = ldaModel.topicsMatrix();
>> But in the YARN job history UI, it's successful. What's wrong with it?
>> I submit the job with:
>> ./bin/spark-submit --class Myclass \
>> --master yarn-client \
>> --num-executors 2 \
>> --driver-memory 4g \
>> --executor-memory 4g \
>> --executor-cores 1 \
>>
>>
>> My codes:
>>
>>corpus.cache();
>>
>>
>> // Cluster the documents into three topics using LDA
>>
>> DistributedLDAModel ldaModel = (DistributedLDAModel) new
>>
>> LDA().setOptimizer("em").setMaxIterations(iterNumber).setK(topicNumber).run(corpus);
>>
>>
>> // Output topics. Each is a distribution over words (matching word
>> count vectors)
>>
>> System.out.println("Learned topics (as distributions over vocab of
>> " + ldaModel.vocabSize()
>>
>> + " words):");
>>
>>//Line81, exception here:Matrix topics = ldaModel.topicsMatrix();
>>
>> for (int topic = 0; topic < topicNumber; topic++) {
>>
>>   System.out.print("Topic " + topic + ":");
>>
>>   for (int word = 0; word < ldaModel.vocabSize(); word++) {
>>
>> System.out.print(" " + topics.apply(word, topic));
>>
>>   }
>>
>>   System.out.println();
>>
>> }
>>
>>
>> ldaModel.save(sc.sc(), modelPath);
>>
>>
>> Exception in thread "main" java.lang.IndexOutOfBoundsException:
>> (1025,0) not in [-58,58) x [-100,100)
>>
>> at
>> breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
>>
>> at
>> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:534)
>>
>> at
>> org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:531)
>>
>> at
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>
>> at
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>
>> at
>> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
>>
>> at
>> org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
>>
>> at
>> com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:81)
>>
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:606)
>>
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>>
>> at
>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>>
>> at
>> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>>
>> at
>> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>> 15/12/23 00:01:16 INFO spark.SparkContext: Invoking stop() from shutdown
>> hook
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Partitioning of RDD across worker machines

2015-12-29 Thread Disha Shrivastava
Hi,

Suppose I have a file locally on my master machine and the same file is
also present at the same path on all the worker machines, say
/home/user_name/Desktop. I wanted to know: when we partition the data
using sc.parallelize, does Spark actually broadcast parts of the RDD to all
the worker machines, or does each worker read the corresponding segment
locally from its own memory?

How do I avoid movement of this data? Will it help if I store the file in
HDFS?
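
For comparison, a rough sketch of the two approaches in question (paths and
partition counts are placeholders):

  // sc.parallelize: the collection lives on the driver and its partitions are
  // shipped to the executors inside the tasks; the copies already sitting on
  // the workers are not used.
  val lines = scala.io.Source.fromFile("/home/user_name/Desktop/data.txt").getLines().toSeq
  val rdd1 = sc.parallelize(lines, 8)

  // sc.textFile on HDFS: executors read the file's blocks directly, and Spark
  // tries to schedule each task on a node that holds the corresponding block.
  val rdd2 = sc.textFile("hdfs:///user/user_name/data.txt")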

Thanks and Regards,
Disha


Spark streaming 1.6.0-RC4 NullPointerException using mapWithState

2015-12-29 Thread Jan Uyttenhove
Hi guys,

I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
mapWithState API, after previously reporting issue SPARK-11932 (
https://issues.apache.org/jira/browse/SPARK-11932).

My Spark streaming job involves reading data from a Kafka topic (using
KafkaUtils.createDirectStream), stateful processing (using checkpointing &
mapWithState) & publishing the results back to Kafka.

I'm now facing the NullPointerException below when restoring from a
checkpoint in the following scenario:
1/ run job (with local[2]), process data from Kafka while creating &
keeping state
2/ stop the job
3/ generate some extra messages on the input Kafka topic
4/ start the job again (and restore offsets & state from the checkpoints)

The problem is caused by (or at least related to) step 3, i.e. publishing
data to the input topic while the job is stopped.
The above scenario has been tested successfully when:
- step 3 is excluded, so restoring state from a checkpoint is successful
when no messages are added while the job is stopped
- after step 2, the checkpoints are deleted

Any clues? Am I doing something wrong here, or is there still a problem
with the mapWithState impl?
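
For context, a stripped-down sketch of the shape of the job (broker, topics,
state type and update function below are illustrative placeholders, not the
actual code):

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  val checkpointDir = "/tmp/checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("mapWithState-test")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("input-topic"))

    // Illustrative state: count messages per key.
    val updateFunc = (key: String, value: Option[String], state: State[Long]) => {
      val count = state.getOption.getOrElse(0L) + 1L
      state.update(count)
      (key, count)
    }

    stream.mapWithState(StateSpec.function(updateFunc))
          .foreachRDD(rdd => rdd.foreach(println))  // stand-in for publishing back to Kafka
    ssc
  }

  val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
  ssc.start()
  ssc.awaitTermination()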

Thanx,

Jan



15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage
3.0 (TID 24)
java.lang.NullPointerException
at
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_25_1 in memory
on localhost:10003 (size: 1024.0 B, free: 511.1 MB)
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
non-empty blocks out of 8 blocks
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0
remote fetches in 0 ms
15/12/29 11:56:12 INFO storage.MemoryStore: Block rdd_29_1 stored as values
in memory (estimated size 1824.0 B, free 488.0 KB)
15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_29_1 in memory
on localhost:10003 (size: 1824.0 B, free: 511.1 MB)
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0
non-empty blocks out of 8 blocks
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0
remote fetches in 0 ms
15/12/29 11:56:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0
(TID 24, localhost): java.lang.NullPointerException
at
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at
org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at