Re: getting different results from same line of code repeated

2015-11-20 Thread Walrus theCat
I'm running into all kinds of problems with Spark 1.5.1 -- does anyone have
a version that's working smoothly for them?

On Fri, Nov 20, 2015 at 10:50 AM, Dean Wampler 
wrote:

> I didn't expect that to fail. I would call it a bug for sure, since it's
> practically useless if this method doesn't work.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Fri, Nov 20, 2015 at 12:45 PM, Walrus theCat 
> wrote:
>
>> Dean,
>>
>> What's the point of Scala without magic? :-)
>>
>> Thanks for your help.  It's still giving me unreliable results.  There
>> just has to be a way to do this in Spark.  It's a pretty fundamental thing.
>>
>> scala> targets.takeOrdered(1) // imported as implicit here
>> res23: Array[(String, Int)] = Array()
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res24: Array[(String, Int)] = Array((\bmurders?\b,717))
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res25: Array[(String, Int)] = Array((\bmurders?\b,717))
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res26: Array[(String, Int)] = Array((\bguns?\b,1253))
>>
>> scala> targets.takeOrdered(1)(CountOrdering)
>> res27: Array[(String, Int)] = Array((\bmurders?\b,717))
>>
>>
>>
>> On Wed, Nov 18, 2015 at 6:20 PM, Dean Wampler 
>> wrote:
>>
>>> You don't have to use sortBy (although that would be better...). You
>>> have to define an Ordering object and pass it as the second argument list
>>> to takeOrdered()(), or declare it "implicitly". This is more fancy Scala
>>> than Spark should require here. Here's an example I've used:
>>>
>>>   // schema with (String,Int). Order by the Int descending
>>>   object CountOrdering extends Ordering[(String,Int)] {
>>> def compare(a:(String,Int), b:(String,Int)) =
>>>   -(a._2 compare b._2)  // - so that it sorts descending
>>>   }
>>>
>>>   myRDD.takeOrdered(100)(CountOrdering)
>>>
>>>
>>> Or, if you add the keyword "implicit" before "object CountOrdering
>>> {...}", then you can omit the second argument list. That's more magic than
>>> is justified. ;)
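For reference, a minimal sketch of the implicit-Ordering variant Dean describes,
using Ordering.by instead of a hand-written compare (it assumes the
RDD[(String, Int)] named targets from this thread):

  // Sketch only: a descending-by-count Ordering. Declared implicit, it lets
  // takeOrdered(1) return the pair with the highest count without the second
  // argument list; it can also be passed explicitly, like CountOrdering above.
  implicit val byCountDesc: Ordering[(String, Int)] =
    Ordering.by[(String, Int), Int](_._2).reverse

  targets.takeOrdered(1)               // picks up the implicit ordering
  targets.takeOrdered(1)(byCountDesc)  // or pass it explicitly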
>>>
>>> HTH,
>>>
>>> dean
>>>
>>>
>>> Dean Wampler, Ph.D.
>>> Author: Programming Scala, 2nd Edition
>>> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
>>> Typesafe <http://typesafe.com>
>>> @deanwampler <http://twitter.com/deanwampler>
>>> http://polyglotprogramming.com
>>>
>>> On Wed, Nov 18, 2015 at 6:37 PM, Walrus theCat 
>>> wrote:
>>>
>>>> Dean,
>>>>
>>>> Thanks a lot.  Very helpful.  How would I use takeOrdered to order by
>>>> the second member of the tuple, as I am attempting to do with
>>>> rdd.sortBy(_._2).first?
>>>>
>>>> On Wed, Nov 18, 2015 at 4:24 PM, Dean Wampler 
>>>> wrote:
>>>>
>>>>> Someone please correct me if I'm wrong, but I think the answer is
>>>>> actually "it's not implemented that way" in the sort methods, and it 
>>>>> should
>>>>> either be documented more explicitly or fixed.
>>>>>
>>>>> Reading the Spark source code, it looks like each partition is sorted
>>>>> internally, and each partition holds a contiguous range of keys in the 
>>>>> RDD.
>>>>> So, if you know which order the partitions should be in, you can produce a
>>>>> total order and hence allow take(n) to do what you expect.
>>>>>
>>>>> The take(n) appears to walk the list of partitions in order, but it's
>>>>> that list that's not deterministic. I can't find any evidence that the RDD
>>>>> output by sortBy has this list of partitions in the correct order. So, 
>>>>> each
>>>>> time you ran your job, the "targets" RDD had sorted partitions, but the
>>>>> list of partitions itself was not properly ordered globally. When you got
>>>>> an exception, probably the first partition happened to be empty.
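A quick way to check that hypothesis from the shell (a sketch; targets is the
sorted RDD from the transcript above):

  // Count the elements in each partition of the sorted RDD; a leading 0 would
  // mean the first partition is empty, matching the explanation above.
  targets.glom().map(_.length).collect()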
>>>>>
>>>>> Now, you could argue that take(n) i

getting different results from same line of code repeated

2015-11-18 Thread Walrus theCat
Hi,

I'm launching a Spark cluster with the spark-ec2 script and playing around
in spark-shell. I'm running the same line of code over and over again, and
getting different results, and sometimes exceptions.  Towards the end,
after I cache the first RDD, it gives me the correct result multiple times
in a row before throwing an exception.  How can I get correct behavior out
of these operations on these RDDs?

scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[116] at
sortBy at :36

scala> targets.first
res26: (String, Int) = (\bguns?\b,1253)

scala> val targets = data map {_.REGEX} groupBy{identity} map {
Function.tupled(_->_.size)} sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[125] at
sortBy at :36

scala> targets.first
res27: (String, Int) = (nika,7)


scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[134] at
sortBy at :36

scala> targets.first
res28: (String, Int) = (\bcalientes?\b,6)

scala> targets.sortBy(_._2,false).first
java.lang.UnsupportedOperationException: empty collection

scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false).cache
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[283] at
sortBy at :36

scala> targets.first
res46: (String, Int) = (\bhurting\ yous?\b,8)

scala> val targets =
data.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false).cache
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[292] at
sortBy at :36

scala> targets.first
java.lang.UnsupportedOperationException: empty collection

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[301] at
sortBy at :36

scala> targets.first
res48: (String, Int) = (\bguns?\b,1253)

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[310] at
sortBy at :36

scala> targets.first
res49: (String, Int) = (\bguns?\b,1253)

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[319] at
sortBy at :36

scala> targets.first
res50: (String, Int) = (\bguns?\b,1253)

scala> val targets =
data.cache.map(_.REGEX).groupBy(identity).map(Function.tupled(_->_.size)).sortBy(_._2,false)
targets: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[328] at
sortBy at :36

scala> targets.first
java.lang.UnsupportedOperationException: empty collection
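One hedged workaround sketch (not from the thread): compute the top pair with a
single reduce, so the result does not depend on any global ordering of
partitions; reduceByKey also replaces the groupBy-then-size step. This assumes
the upstream data RDD is itself deterministic.

  val counts = data.map(_.REGEX).map(r => (r, 1)).reduceByKey(_ + _)
  // Keep whichever pair has the larger count; no sort or partition order needed.
  val top = counts.reduce((a, b) => if (a._2 >= b._2) a else b)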


Re: send transformed RDD to s3 from slaves

2015-11-16 Thread Walrus theCat
Update:

You can now answer this on stackoverflow for 100 bounty:

http://stackoverflow.com/questions/33704073/how-to-send-transformed-data-from-partitions-to-s3

On Fri, Nov 13, 2015 at 4:56 PM, Walrus theCat 
wrote:

> Hi,
>
> I have an RDD which crashes the driver when being collected.  I want to
> send the data on its partitions out to S3 without bringing it back to the
> driver. I try calling rdd.foreachPartition, but the data that gets sent has
> not gone through the chain of transformations that I need.  It's the data
> as it was ingested initially.  After specifying my chain of
> transformations, but before calling foreachPartition, I call rdd.count in
> order to force the RDD to transform.  The data it sends out is still not
> transformed.  How do I get the RDD to send out transformed data when
> calling foreachPartition?
>
> Thanks
>


send transformed RDD to s3 from slaves

2015-11-13 Thread Walrus theCat
Hi,

I have an RDD which crashes the driver when being collected.  I want to
send the data on its partitions out to S3 without bringing it back to the
driver. I try calling rdd.foreachPartition, but the data that gets sent has
not gone through the chain of transformations that I need.  It's the data
as it was ingested initially.  After specifying my chain of
transformations, but before calling foreachPartition, I call rdd.count in
order to force the RDD to transform.  The data it sends out is still not
transformed.  How do I get the RDD to send out transformed data when
calling foreachPartition?

Thanks
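A hedged sketch of the shape that sends transformed data (not the poster's
actual code): foreachPartition sees the data of whatever RDD reference it is
called on, so it has to be invoked on the transformed RDD, not the original
ingested one. The source path, the transformations, and uploadToS3 are
illustrative placeholders.

  def uploadToS3(bucket: String, key: String, body: String): Unit = ???  // stand-in for your S3 client

  val ingested    = sc.textFile("s3n://some-bucket/input")          // illustrative source
  val transformed = ingested.map(_.toUpperCase).filter(_.nonEmpty)  // illustrative transformations
  transformed.foreachPartition { iter =>
    // Runs on the executor holding this partition; the driver never collects the data.
    val body = iter.mkString("\n")
    uploadToS3("some-bucket", s"output/part-${java.util.UUID.randomUUID}", body)
  }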


Re: SparkSQL 1.2.0 sources API error

2015-01-17 Thread Walrus theCat
I'm getting this also, with Scala 2.11 and Scala 2.10:

15/01/18 07:34:51 INFO slf4j.Slf4jLogger: Slf4jLogger started
15/01/18 07:34:51 INFO Remoting: Starting remoting
15/01/18 07:34:51 ERROR actor.ActorSystemImpl: Uncaught fatal error from
thread [sparkDriver-akka.remote.default-remote-dispatcher-7] shutting down
ActorSystem [sparkDriver]
java.lang.NoSuchMethodError:
org.jboss.netty.channel.socket.nio.NioWorkerPool.(Ljava/util/concurrent/Executor;I)V
at
akka.remote.transport.netty.NettyTransport.(NettyTransport.scala:283)
at
akka.remote.transport.netty.NettyTransport.(NettyTransport.scala:240)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:161)
at
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at scala.util.Success.flatMap(Try.scala:200)
at
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:692)
at akka.remote.EndpointManager$$anonfun$9.apply(Remoting.scala:684)
at
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at
akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:684)
at
akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:492)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
at akka.remote.EndpointManager.aroundReceive(Remoting.scala:395)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
at akka.actor.ActorCell.invoke(ActorCell.scala:487)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
at akka.dispatch.Mailbox.run(Mailbox.scala:220)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Shutting down remote daemon.
15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remote daemon shut down; proceeding with flushing remote transports.
15/01/18 07:34:51 INFO remote.RemoteActorRefProvider$RemotingTerminator:
Remoting shut down.
Exception in thread "main" java.util.concurrent.TimeoutException: Futures
timed out after [1 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at akka.remote.Remoting.start(Remoting.scala:180)
at
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl.liftedTree2$1(ActorSystem.scala:618)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:615)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:632)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
at
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:121)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:53)
at
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1676)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1667)
at
org.apache.spark.util

maven doesn't build dependencies with Scala 2.11

2015-01-17 Thread Walrus theCat
Hi,

When I run this:

dev/change-version-to-2.11.sh
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package

as per here, maven doesn't build Spark's dependencies.

Only when I run:

dev/change-version-to-2.11.sh
sbt/sbt -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests  clean package

as gathered from here, do I get Spark's dependencies built without any
cross-compilation errors.

*Question*:

- How can I make maven do this?

- How can I specify the use of Scala 2.11 in my own .pom files?

Thanks


Re: using multiple dstreams together (spark streaming)

2014-07-17 Thread Walrus theCat
Thanks!


On Wed, Jul 16, 2014 at 6:34 PM, Tathagata Das 
wrote:

> Have you taken a look at DStream.transformWith( ... ) . That allows you
> apply arbitrary transformation between RDDs (of the same timestamp) of two
> different streams.
>
> So you can do something like this.
>
> 2s-window-stream.transformWith(1s-window-stream, (rdd1: RDD[...], rdd2:
> RDD[...]) => {
>  ...
>   // return a new RDD
> })
>
>
> And streamingContext.transform() extends it to N DStreams. :)
>
> Hope this helps!
>
> TD
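A slightly more concrete sketch of TD's suggestion (the stream names, types,
and the delta of 10 are illustrative):

  import org.apache.spark.SparkContext._
  import org.apache.spark.rdd.RDD
  import org.apache.spark.streaming.dstream.DStream

  // Join the per-key counts of the 1s-window and 2s-window streams batch by
  // batch and keep the keys whose counts differ by more than some delta.
  def bigDeltas(oneSec: DStream[(String, Int)],
                twoSec: DStream[(String, Int)]): DStream[(String, (Int, Int))] =
    oneSec.transformWith(twoSec, (rdd1: RDD[(String, Int)], rdd2: RDD[(String, Int)]) =>
      rdd1.join(rdd2).filter { case (_, (c1, c2)) => math.abs(c1 - c2) > 10 })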
>
>
>
>
> On Wed, Jul 16, 2014 at 10:42 AM, Walrus theCat 
> wrote:
>
>> hey at least it's something (thanks!) ... not sure what i'm going to do
>> if i can't find a solution (other than not use spark) as i really need
>> these capabilities.  anyone got anything else?
>>
>>
>> On Wed, Jul 16, 2014 at 10:34 AM, Luis Ángel Vicente Sánchez <
>> langel.gro...@gmail.com> wrote:
>>
>>> hum... maybe consuming all streams at the same time with an actor that
>>> would act as a new DStream source... but this is just a random idea... I
>>> don't really know if that would be a good idea or even possible.
>>>
>>>
>>> 2014-07-16 18:30 GMT+01:00 Walrus theCat :
>>>
>>> Yeah -- I tried the .union operation and it didn't work for that
>>>> reason.  Surely there has to be a way to do this, as I imagine this is a
>>>> commonly desired goal in streaming applications?
>>>>
>>>>
>>>> On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez <
>>>> langel.gro...@gmail.com> wrote:
>>>>
>>>>> I'm joining several kafka dstreams using the join operation but you
>>>>> have the limitation that the duration of the batch has to be same,i.e. 1
>>>>> second window for all dstreams... so it would not work for you.
>>>>>
>>>>>
>>>>> 2014-07-16 18:08 GMT+01:00 Walrus theCat :
>>>>>
>>>>> Hi,
>>>>>>
>>>>>> My application has multiple dstreams on the same inputstream:
>>>>>>
>>>>>> dstream1 // 1 second window
>>>>>> dstream2 // 2 second window
>>>>>> dstream3 // 5 minute window
>>>>>>
>>>>>>
>>>>>> I want to write logic that deals with all three windows (e.g. when
>>>>>> the 1 second window differs from the 2 second window by some delta ...)
>>>>>>
>>>>>> I've found some examples online (there's not much out there!), and I
>>>>>> can only see people transforming a single dstream.  In conventional 
>>>>>> spark,
>>>>>> we'd do this sort of thing with a cartesian on RDDs.
>>>>>>
>>>>>> How can I deal with multiple Dstreams at once?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread Walrus theCat
I did not!


On Wed, Jul 16, 2014 at 12:31 PM, aaronjosephs  wrote:

> The only other thing to keep in mind is that window duration and slide
> duration have to be multiples of batch duration, IDK if you made that fully
> clear
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Difference-among-batchDuration-windowDuration-slideDuration-tp9966p9973.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Difference among batchDuration, windowDuration, slideDuration

2014-07-16 Thread Walrus theCat
Here's what I understand:


batchDuration:  How often should the streaming context update?  How many
seconds of data should each DStream batch contain?

windowDuration:  What size windows are you looking for from this DStream?

slideDuration:  Once I've given you that slice, how many units forward do
you want me to move to give you the next one?
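A minimal sketch tying the three together (master string, port, and durations
are illustrative):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf  = new SparkConf().setMaster("local[2]").setAppName("durations")
  val ssc   = new StreamingContext(conf, Seconds(1))      // batchDuration: a new batch every 1 second
  val lines = ssc.socketTextStream("localhost", 9999)
  // windowDuration: each window covers 10 seconds of data;
  // slideDuration: a new window is produced every 2 seconds.
  // Both must be multiples of the 1-second batchDuration.
  val windowed = lines.window(Seconds(10), Seconds(2))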





On Wed, Jul 16, 2014 at 11:28 AM, hsy...@gmail.com  wrote:

> When I'm reading the API of spark streaming, I'm confused by the 3
> different durations
>
> StreamingContext(conf: SparkConf, batchDuration: Duration)
>
> DStream window(windowDuration: Duration, slideDuration: Duration): DStream[T]
>
>
> Can anyone please explain these 3 different durations
>
>
> Best,
> Siyuan
>


Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
hey at least it's something (thanks!) ... not sure what i'm going to do if
i can't find a solution (other than not use spark) as i really need these
capabilities.  anyone got anything else?


On Wed, Jul 16, 2014 at 10:34 AM, Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com> wrote:

> hum... maybe consuming all streams at the same time with an actor that
> would act as a new DStream source... but this is just a random idea... I
> don't really know if that would be a good idea or even possible.
>
>
> 2014-07-16 18:30 GMT+01:00 Walrus theCat :
>
> Yeah -- I tried the .union operation and it didn't work for that reason.
>> Surely there has to be a way to do this, as I imagine this is a commonly
>> desired goal in streaming applications?
>>
>>
>> On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez <
>> langel.gro...@gmail.com> wrote:
>>
>>> I'm joining several kafka dstreams using the join operation but you have
>>> the limitation that the duration of the batch has to be same,i.e. 1 second
>>> window for all dstreams... so it would not work for you.
>>>
>>>
>>> 2014-07-16 18:08 GMT+01:00 Walrus theCat :
>>>
>>> Hi,
>>>>
>>>> My application has multiple dstreams on the same inputstream:
>>>>
>>>> dstream1 // 1 second window
>>>> dstream2 // 2 second window
>>>> dstream3 // 5 minute window
>>>>
>>>>
>>>> I want to write logic that deals with all three windows (e.g. when the
>>>> 1 second window differs from the 2 second window by some delta ...)
>>>>
>>>> I've found some examples online (there's not much out there!), and I
>>>> can only see people transforming a single dstream.  In conventional spark,
>>>> we'd do this sort of thing with a cartesian on RDDs.
>>>>
>>>> How can I deal with multiple Dstreams at once?
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>


Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Or, if not, is there a way to do this in terms of a single dstream?  Keep
in mind that dstream1, dstream2, and dstream3 have already had
transformations applied.  I tried creating the dstreams by calling .window
on the first one, but that ends up with me having ... 3 dstreams... which
is the same problem.


On Wed, Jul 16, 2014 at 10:30 AM, Walrus theCat 
wrote:

> Yeah -- I tried the .union operation and it didn't work for that reason.
> Surely there has to be a way to do this, as I imagine this is a commonly
> desired goal in streaming applications?
>
>
> On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez <
> langel.gro...@gmail.com> wrote:
>
>> I'm joining several kafka dstreams using the join operation but you have
>> the limitation that the duration of the batch has to be same,i.e. 1 second
>> window for all dstreams... so it would not work for you.
>>
>>
>> 2014-07-16 18:08 GMT+01:00 Walrus theCat :
>>
>> Hi,
>>>
>>> My application has multiple dstreams on the same inputstream:
>>>
>>> dstream1 // 1 second window
>>> dstream2 // 2 second window
>>> dstream3 // 5 minute window
>>>
>>>
>>> I want to write logic that deals with all three windows (e.g. when the 1
>>> second window differs from the 2 second window by some delta ...)
>>>
>>> I've found some examples online (there's not much out there!), and I can
>>> only see people transforming a single dstream.  In conventional spark, we'd
>>> do this sort of thing with a cartesian on RDDs.
>>>
>>> How can I deal with multiple Dstreams at once?
>>>
>>> Thanks
>>>
>>
>>
>


Re: using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Yeah -- I tried the .union operation and it didn't work for that reason.
Surely there has to be a way to do this, as I imagine this is a commonly
desired goal in streaming applications?


On Wed, Jul 16, 2014 at 10:10 AM, Luis Ángel Vicente Sánchez <
langel.gro...@gmail.com> wrote:

> I'm joining several kafka dstreams using the join operation but you have
> the limitation that the duration of the batch has to be same,i.e. 1 second
> window for all dstreams... so it would not work for you.
>
>
> 2014-07-16 18:08 GMT+01:00 Walrus theCat :
>
> Hi,
>>
>> My application has multiple dstreams on the same inputstream:
>>
>> dstream1 // 1 second window
>> dstream2 // 2 second window
>> dstream3 // 5 minute window
>>
>>
>> I want to write logic that deals with all three windows (e.g. when the 1
>> second window differs from the 2 second window by some delta ...)
>>
>> I've found some examples online (there's not much out there!), and I can
>> only see people transforming a single dstream.  In conventional spark, we'd
>> do this sort of thing with a cartesian on RDDs.
>>
>> How can I deal with multiple Dstreams at once?
>>
>> Thanks
>>
>
>


using multiple dstreams together (spark streaming)

2014-07-16 Thread Walrus theCat
Hi,

My application has multiple dstreams on the same inputstream:

dstream1 // 1 second window
dstream2 // 2 second window
dstream3 // 5 minute window


I want to write logic that deals with all three windows (e.g. when the 1
second window differs from the 2 second window by some delta ...)

I've found some examples online (there's not much out there!), and I can
only see people transforming a single dstream.  In conventional spark, we'd
do this sort of thing with a cartesian on RDDs.

How can I deal with multiple Dstreams at once?

Thanks


Re: Spark 1.0.1 akka connection refused

2014-07-15 Thread Walrus theCat
I'm getting similar errors on spark streaming -- but at this point in my
project I don't need a cluster and can develop locally.  Will write it up
later, though, if it persists.


On Tue, Jul 15, 2014 at 7:44 PM, Kevin Jung  wrote:

> Hi,
> I recently upgrade my spark 1.0.0 cluster to 1.0.1.
> But it gives me "ERROR remote.EndpointWriter: AssociationError" when I run
> simple SparkSQL job in spark-shell.
>
> here is code,
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext._
> case class Person(name:String, Age:Int, Gender:String, Birth:String)
> val peopleRDD = sc.textFile("/sample/sample.csv").map(_.split(",")).map(p
> =>
> Person(p(0).toString, p(1).toInt, p(2).toString, p(3).toString))
> peopleRDD.collect().foreach(println)
>
> and here is error message on worker node.
> 14/07/16 10:58:04 ERROR remote.EndpointWriter: AssociationError
> [akka.tcp://sparkwor...@my.test6.com:38030] ->
> [akka.tcp://sparkexecu...@my.test6.com:35534]: Error [Association failed
> with [akka.tcp://sparkexecu...@my.tes$
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkexecu...@my.test6.com:35534]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection
> refused: my.test6.com/
>
> and here is error message in shell
> 14/07/16 11:33:15 WARN scheduler.TaskSetManager: Loss was due to
> java.lang.NoClassDefFoundError
> java.lang.NoClassDefFoundError: Could not initialize class $line10.$read$
> at
> $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
> at
> $line14.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(:19)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at scala.collection.AbstractIterator.to(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:717)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1083)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:722)
>
> I tested it on spark 1.0.0(same machine) and it was fine.
> It seems like Worker cannot find Executor akka endpoint.
> Do you have any ideas?
>
> Best regards
> Kevin
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-1-0-1-akka-connection-refused-tp9864.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Tathagata determined that the reason it was failing was the accidental
creation of multiple input streams.  Thanks!
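For anyone hitting the same thing, a hedged sketch of the shape that avoids it:
a single socketTextStream (one receiver, which permanently occupies one of the
local[n] slots) with all three windows derived from it. Host, port, and app
name are illustrative.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  val conf  = new SparkConf().setMaster("local[4]").setAppName("AppName")
  val ssc   = new StreamingContext(conf, Seconds(1))
  val lines = ssc.socketTextStream("localhost", 9999)  // the one and only input stream
  val dstream1 = lines.window(Seconds(1))
  val dstream2 = lines.window(Seconds(2))
  val dstream3 = lines.window(Minutes(5))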


On Tue, Jul 15, 2014 at 1:09 PM, Walrus theCat 
wrote:

> Will do.
>
>
> On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> This sounds really really weird. Can you give me a piece of code that I
>> can run to reproduce this issue myself?
>>
>> TD
>>
>>
>> On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat 
>> wrote:
>>
>>> This is (obviously) spark streaming, by the way.
>>>
>>>
>>> On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've got a socketTextStream through which I'm reading input.  I have
>>>> three Dstreams, all of which are the same window operation over that
>>>> socketTextStream.  I have a four core machine.  As we've been covering
>>>> lately, I have to give a "cores" parameter to my StreamingSparkContext:
>>>>
>>>> ssc = new StreamingContext("local[4]" /**TODO change once a cluster is
>>>> up **/,
>>>>   "AppName", Seconds(1))
>>>>
>>>> Now, I have three dstreams, and all I ask them to do is print or
>>>> count.  I should preface this with the statement that they all work on
>>>> their own.
>>>>
>>>> dstream1 // 1 second window
>>>> dstream2 // 2 second window
>>>> dstream3 // 5 minute window
>>>>
>>>>
>>>> If I construct the ssc with "local[8]", and put these statements in
>>>> this order, I get prints on the first one, and zero counts on the second
>>>> one:
>>>>
>>>> ssc(local[8])  // hyperthread dat sheezy
>>>> dstream1.print // works
>>>> dstream2.count.print // always prints 0
>>>>
>>>>
>>>>
>>>> If I do this, this happens:
>>>> ssc(local[4])
>>>> dstream1.print // doesn't work, just gives me the Time:  ms message
>>>> dstream2.count.print // doesn't work, prints 0
>>>>
>>>> ssc(local[6])
>>>> dstream1.print // doesn't work, just gives me the Time:  ms message
>>>> dstream2.count.print // works, prints 1
>>>>
>>>> Sometimes these results switch up, seemingly at random. How can I get
>>>> things to the point where I can develop and test my application locally?
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>


Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
Will do.


On Tue, Jul 15, 2014 at 12:56 PM, Tathagata Das  wrote:

> This sounds really really weird. Can you give me a piece of code that I
> can run to reproduce this issue myself?
>
> TD
>
>
> On Tue, Jul 15, 2014 at 12:02 AM, Walrus theCat 
> wrote:
>
>> This is (obviously) spark streaming, by the way.
>>
>>
>> On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat 
>> wrote:
>>
>>> Hi,
>>>
>>> I've got a socketTextStream through which I'm reading input.  I have
>>> three Dstreams, all of which are the same window operation over that
>>> socketTextStream.  I have a four core machine.  As we've been covering
>>> lately, I have to give a "cores" parameter to my StreamingSparkContext:
>>>
>>> ssc = new StreamingContext("local[4]" /**TODO change once a cluster is
>>> up **/,
>>>   "AppName", Seconds(1))
>>>
>>> Now, I have three dstreams, and all I ask them to do is print or count.
>>> I should preface this with the statement that they all work on their own.
>>>
>>> dstream1 // 1 second window
>>> dstream2 // 2 second window
>>> dstream3 // 5 minute window
>>>
>>>
>>> If I construct the ssc with "local[8]", and put these statements in this
>>> order, I get prints on the first one, and zero counts on the second one:
>>>
>>> ssc(local[8])  // hyperthread dat sheezy
>>> dstream1.print // works
>>> dstream2.count.print // always prints 0
>>>
>>>
>>>
>>> If I do this, this happens:
>>> ssc(local[4])
>>> dstream1.print // doesn't work, just gives me the Time:  ms message
>>> dstream2.count.print // doesn't work, prints 0
>>>
>>> ssc(local[6])
>>> dstream1.print // doesn't work, just gives me the Time:  ms message
>>> dstream2.count.print // works, prints 1
>>>
>>> Sometimes these results switch up, seemingly at random. How can I get
>>> things to the point where I can develop and test my application locally?
>>>
>>> Thanks
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


Re: truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-15 Thread Walrus theCat
This is (obviously) spark streaming, by the way.


On Mon, Jul 14, 2014 at 8:27 PM, Walrus theCat 
wrote:

> Hi,
>
> I've got a socketTextStream through which I'm reading input.  I have three
> Dstreams, all of which are the same window operation over that
> socketTextStream.  I have a four core machine.  As we've been covering
> lately, I have to give a "cores" parameter to my StreamingSparkContext:
>
> ssc = new StreamingContext("local[4]" /**TODO change once a cluster is up
> **/,
>   "AppName", Seconds(1))
>
> Now, I have three dstreams, and all I ask them to do is print or count.  I
> should preface this with the statement that they all work on their own.
>
> dstream1 // 1 second window
> dstream2 // 2 second window
> dstream3 // 5 minute window
>
>
> If I construct the ssc with "local[8]", and put these statements in this
> order, I get prints on the first one, and zero counts on the second one:
>
> ssc(local[8])  // hyperthread dat sheezy
> dstream1.print // works
> dstream2.count.print // always prints 0
>
>
>
> If I do this, this happens:
> ssc(local[4])
> dstream1.print // doesn't work, just gives me the Time:  ms message
> dstream2.count.print // doesn't work, prints 0
>
> ssc(local[6])
> dstream1.print // doesn't work, just gives me the Time:  ms message
> dstream2.count.print // works, prints 1
>
> Sometimes these results switch up, seemingly at random. How can I get
> things to the point where I can develop and test my application locally?
>
> Thanks
>
>
>
>
>
>
>


truly bizarre behavior with local[n] on Spark 1.0.1

2014-07-14 Thread Walrus theCat
Hi,

I've got a socketTextStream through which I'm reading input.  I have three
Dstreams, all of which are the same window operation over that
socketTextStream.  I have a four core machine.  As we've been covering
lately, I have to give a "cores" parameter to my StreamingSparkContext:

ssc = new StreamingContext("local[4]" /**TODO change once a cluster is up
**/,
  "AppName", Seconds(1))

Now, I have three dstreams, and all I ask them to do is print or count.  I
should preface this with the statement that they all work on their own.

dstream1 // 1 second window
dstream2 // 2 second window
dstream3 // 5 minute window


If I construct the ssc with "local[8]", and put these statements in this
order, I get prints on the first one, and zero counts on the second one:

ssc(local[8])  // hyperthread dat sheezy
dstream1.print // works
dstream2.count.print // always prints 0



If I do this, this happens:
ssc(local[4])
dstream1.print // doesn't work, just gives me the Time:  ms message
dstream2.count.print // doesn't work, prints 0

ssc(local[6])
dstream1.print // doesn't work, just gives me the Time:  ms message
dstream2.count.print // works, prints 1

Sometimes these results switch up, seemingly at random. How can I get
things to the point where I can develop and test my application locally?

Thanks


Re: not getting output from socket connection

2014-07-13 Thread Walrus theCat
Hah, thanks for tidying up the paper trail here, but I was the OP (and
solver) of the recent "reduce" thread that ended in this solution.


On Sun, Jul 13, 2014 at 4:26 PM, Michael Campbell <
michael.campb...@gmail.com> wrote:

> Make sure you use "local[n]" (where n > 1) in your context setup too, (if
> you're running locally), or you won't get output.
>
>
> On Sat, Jul 12, 2014 at 11:36 PM, Walrus theCat 
> wrote:
>
>> Thanks!
>>
>> I thought it would get "passed through" netcat, but given your email, I
>> was able to follow this tutorial and get it to work:
>>
>>
>> http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html
>>
>>
>>
>>
>> On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen  wrote:
>>
>>> netcat is listening for a connection on port . It is echoing what
>>> you type to its console to anything that connects to  and reads.
>>> That is what Spark streaming does.
>>>
>>> If you yourself connect to  and write, nothing happens except that
>>> netcat echoes it. This does not cause Spark to somehow get that data.
>>> nc is only echoing input from the console.
>>>
>>> On Fri, Jul 11, 2014 at 9:25 PM, Walrus theCat 
>>> wrote:
>>> > Hi,
>>> >
>>> > I have a java application that is outputting a string every second.
>>>  I'm
>>> > running the wordcount example that comes with Spark 1.0, and running
>>> nc -lk
>>> > . When I type words into the terminal running netcat, I get counts.
>>> > However, when I write the String onto a socket on port , I don't
>>> get
>>> > counts.  I can see the strings showing up in the netcat terminal, but
>>> no
>>> > counts from Spark.  If I paste in the string, I get counts.
>>> >
>>> > Any ideas?
>>> >
>>> > Thanks
>>>
>>
>>
>


Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Great success!

I was able to get output to the driver console by changing the construction
of the Streaming Spark Context from:

 val ssc = new StreamingContext("local" /**TODO change once a cluster is up
**/,
"AppName", Seconds(1))


to:

val ssc = new StreamingContext("local[2]" /**TODO change once a cluster is
up **/,
"AppName", Seconds(1))


I found something that tipped me off that this might work by digging
through this mailing list.
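The likely reason, per the Spark Streaming documentation: socketTextStream runs
a receiver that occupies one task slot, so "local" (a single slot) leaves
nothing free to run the reduce/print jobs. A minimal sketch of the rule of
thumb:

  // Use local[n] with n strictly greater than the number of receivers.
  val ssc = new StreamingContext("local[2]" /* 1 receiver + at least 1 worker */,
    "AppName", Seconds(1))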


On Sun, Jul 13, 2014 at 2:00 PM, Walrus theCat 
wrote:

> More strange behavior:
>
> lines.foreachRDD(x => println(x.first)) // works
> lines.foreachRDD(x => println((x.count,x.first))) // no output is printed
> to driver console
>
>
>
>
> On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat 
> wrote:
>
>>
>> Thanks for your interest.
>>
>> lines.foreachRDD(x => println(x.count))
>>
>> And I got 0 every once in a while (which I think is strange, because
>> lines.print prints the input I'm giving it over the socket.)
>>
>>
>> When I tried:
>>
>> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>>
>> I got no count.
>>
>> Thanks
>>
>>
>> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> Try doing DStream.foreachRDD and then printing the RDD count and further
>>> inspecting the RDD.
>>>  On Jul 13, 2014 1:03 AM, "Walrus theCat" 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I have a DStream that works just fine when I say:
>>>>
>>>> dstream.print
>>>>
>>>> If I say:
>>>>
>>>> dstream.map(_->1).print
>>>>
>>>> that works, too.  However, if I do the following:
>>>>
>>>> dstream.reduce{case(x,y) => x}.print
>>>>
>>>> I don't get anything on my console.  What's going on?
>>>>
>>>> Thanks
>>>>
>>>
>>
>


Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
More strange behavior:

lines.foreachRDD(x => println(x.first)) // works
lines.foreachRDD(x => println((x.count,x.first))) // no output is printed
to driver console




On Sun, Jul 13, 2014 at 11:47 AM, Walrus theCat 
wrote:

>
> Thanks for your interest.
>
> lines.foreachRDD(x => println(x.count))
>
> And I got 0 every once in a while (which I think is strange, because
> lines.print prints the input I'm giving it over the socket.)
>
>
> When I tried:
>
> lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))
>
> I got no count.
>
> Thanks
>
>
> On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> Try doing DStream.foreachRDD and then printing the RDD count and further
>> inspecting the RDD.
>>  On Jul 13, 2014 1:03 AM, "Walrus theCat"  wrote:
>>
>>> Hi,
>>>
>>> I have a DStream that works just fine when I say:
>>>
>>> dstream.print
>>>
>>> If I say:
>>>
>>> dstream.map(_->1).print
>>>
>>> that works, too.  However, if I do the following:
>>>
>>> dstream.reduce{case(x,y) => x}.print
>>>
>>> I don't get anything on my console.  What's going on?
>>>
>>> Thanks
>>>
>>
>


Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Thanks for your interest.

lines.foreachRDD(x => println(x.count))

And I got 0 every once in a while (which I think is strange, because
lines.print prints the input I'm giving it over the socket.)


When I tried:

lines.map(_->1).reduceByKey(_+_).foreachRDD(x => println(x.count))

I got no count.

Thanks


On Sun, Jul 13, 2014 at 11:34 AM, Tathagata Das  wrote:

> Try doing DStream.foreachRDD and then printing the RDD count and further
> inspecting the RDD.
> On Jul 13, 2014 1:03 AM, "Walrus theCat"  wrote:
>
>> Hi,
>>
>> I have a DStream that works just fine when I say:
>>
>> dstream.print
>>
>> If I say:
>>
>> dstream.map(_->1).print
>>
>> that works, too.  However, if I do the following:
>>
>> dstream.reduce{case(x,y) => x}.print
>>
>> I don't get anything on my console.  What's going on?
>>
>> Thanks
>>
>


Re: can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Update on this:

val lines = ssc.socketTextStream("localhost", )

lines.print // works

lines.map(_->1).print // works

lines.map(_->1).reduceByKey(_+_).print // nothing printed to driver console

Just lots of:

14/07/13 11:37:40 INFO receiver.BlockGenerator: Pushed block
input-0-1405276660400
14/07/13 11:37:41 INFO scheduler.ReceiverTracker: Stream 0 received 1 blocks
14/07/13 11:37:41 INFO scheduler.JobScheduler: Added jobs for time
1405276661000 ms
14/07/13 11:37:41 INFO storage.MemoryStore: ensureFreeSpace(60) called with
curMem=1275, maxMem=98539929
14/07/13 11:37:41 INFO storage.MemoryStore: Block input-0-1405276661400
stored as bytes to memory (size 60.0 B, free 94.0 MB)
14/07/13 11:37:41 INFO storage.BlockManagerInfo: Added
input-0-1405276661400 in memory on 25.17.218.118:55820 (size: 60.0 B, free:
94.0 MB)
14/07/13 11:37:41 INFO storage.BlockManagerMaster: Updated info of block
input-0-1405276661400


Any insight?

Thanks


On Sun, Jul 13, 2014 at 1:03 AM, Walrus theCat 
wrote:

> Hi,
>
> I have a DStream that works just fine when I say:
>
> dstream.print
>
> If I say:
>
> dstream.map(_->1).print
>
> that works, too.  However, if I do the following:
>
> dstream.reduce{case(x,y) => x}.print
>
> I don't get anything on my console.  What's going on?
>
> Thanks
>


can't print DStream after reduce

2014-07-13 Thread Walrus theCat
Hi,

I have a DStream that works just fine when I say:

dstream.print

If I say:

dstream.map(_->1).print

that works, too.  However, if I do the following:

dstream.reduce{case(x,y) => x}.print

I don't get anything on my console.  What's going on?

Thanks


Re: not getting output from socket connection

2014-07-12 Thread Walrus theCat
Thanks!

I thought it would get "passed through" netcat, but given your email, I was
able to follow this tutorial and get it to work:

http://docs.oracle.com/javase/tutorial/networking/sockets/clientServer.html
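A minimal Scala sketch of that client/server approach (port and message are
illustrative): listen on a port and write a line periodically, with Spark's
socketTextStream as the client that connects and reads.

  import java.io.PrintWriter
  import java.net.ServerSocket

  val server = new ServerSocket(9999)
  val socket = server.accept()                               // blocks until Spark connects
  val out    = new PrintWriter(socket.getOutputStream, true) // autoflush on each println
  while (true) {
    out.println(s"hello ${System.currentTimeMillis}")        // one line per second
    Thread.sleep(1000)
  }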




On Fri, Jul 11, 2014 at 1:31 PM, Sean Owen  wrote:

> netcat is listening for a connection on port . It is echoing what
> you type to its console to anything that connects to  and reads.
> That is what Spark streaming does.
>
> If you yourself connect to  and write, nothing happens except that
> netcat echoes it. This does not cause Spark to somehow get that data.
> nc is only echoing input from the console.
>
> On Fri, Jul 11, 2014 at 9:25 PM, Walrus theCat 
> wrote:
> > Hi,
> >
> > I have a java application that is outputting a string every second.  I'm
> > running the wordcount example that comes with Spark 1.0, and running nc
> -lk
> > . When I type words into the terminal running netcat, I get counts.
> > However, when I write the String onto a socket on port , I don't get
> > counts.  I can see the strings showing up in the netcat terminal, but no
> > counts from Spark.  If I paste in the string, I get counts.
> >
> > Any ideas?
> >
> > Thanks
>


Re: not getting output from socket connection

2014-07-11 Thread Walrus theCat
I forgot to add that I get the same behavior if I tail -f | nc localhost
 on a log file.


On Fri, Jul 11, 2014 at 1:25 PM, Walrus theCat 
wrote:

> Hi,
>
> I have a java application that is outputting a string every second.  I'm
> running the wordcount example that comes with Spark 1.0, and running nc -lk
> . When I type words into the terminal running netcat, I get counts.
> However, when I write the String onto a socket on port , I don't get
> counts.  I can see the strings showing up in the netcat terminal, but no
> counts from Spark.  If I paste in the string, I get counts.
>
> Any ideas?
>
> Thanks
>


not getting output from socket connection

2014-07-11 Thread Walrus theCat
Hi,

I have a java application that is outputting a string every second.  I'm
running the wordcount example that comes with Spark 1.0, and running nc -lk
. When I type words into the terminal running netcat, I get counts.
However, when I write the String onto a socket on port , I don't get
counts.  I can see the strings showing up in the netcat terminal, but no
counts from Spark.  If I paste in the string, I get counts.

Any ideas?

Thanks


Re: How can adding a random count() change the behavior of my program?

2014-05-11 Thread Walrus theCat
Nick,

I have encountered strange things like this before (usually when
programming with mutable structures and side effects), and for me, the
answer was that, until .count (or .first, or similar) is called, your
variable 'a' refers to a set of instructions that only gets executed to
form the object you expect once you ask something of it.  Back before I
was using side-effect-free techniques on immutable data structures, I had
to call .first or .count or similar to get the behavior I wanted.  There
are still special cases where I have to purposefully "collapse" the RDD for
some reason or another.  This may not be new information to you, but I've
encountered similar behavior before and highly suspect this is playing a
role here.
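A tiny illustration of that laziness (a Scala sketch with an illustrative path;
the thread above is PySpark, but the evaluation model is the same):

  val a = sc.textFile("s3n://some-bucket/some-path")  // nothing is read yet
  val b = a.map(_.toUpperCase)                        // still just a recipe for an RDD
  b.count()                                           // an action: only now does the chain run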


On Mon, May 5, 2014 at 5:52 PM, Nicholas Chammas  wrote:

> I’m running into something very strange today. I’m getting an error on the
> following innocuous operations.
>
> a = sc.textFile('s3n://...')
> a = a.repartition(8)
> a = a.map(...)
> c = a.countByKey() # ERRORs out on this action. See below for traceback. [1]
>
> If I add a count() right after the repartition(), this error magically
> goes away.
>
> a = sc.textFile('s3n://...')
> a = a.repartition(8)
> print a.count()
> a = a.map(...)
> c = a.countByKey() # A-OK! WTF?
>
> To top it off, this “fix” is inconsistent. Sometimes, I still get this
> error.
>
> This is strange. How do I get to the bottom of this?
>
> Nick
>
> [1] Here’s the traceback:
>
> Traceback (most recent call last):
>   File "", line 7, in 
>   File "file.py", line 187, in function_blah
> c = a.countByKey()
>   File "/root/spark/python/pyspark/rdd.py", line 778, in countByKey
> return self.map(lambda x: x[0]).countByValue()
>   File "/root/spark/python/pyspark/rdd.py", line 624, in countByValue
> return self.mapPartitions(countPartition).reduce(mergeMaps)
>   File "/root/spark/python/pyspark/rdd.py", line 505, in reduce
> vals = self.mapPartitions(func).collect()
>   File "/root/spark/python/pyspark/rdd.py", line 469, in collect
> bytesInJava = self._jrdd.collect().iterator()
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 
> 537, in __call__
>   File "/root/spark/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 
> 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o46.collect.
>
>
> --
> View this message in context: How can adding a random count() change the
> behavior of my program?
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: can't sc.parallelize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
Thank you!


On Tue, Apr 15, 2014 at 11:29 AM, Aaron Davidson  wrote:

> This is probably related to the Scala bug that :cp does not work:
> https://issues.scala-lang.org/browse/SI-6502
>
>
> On Tue, Apr 15, 2014 at 11:21 AM, Walrus theCat wrote:
>
>> Actually altering the classpath in the REPL causes the provided
>> SparkContext to disappear:
>>
>> scala> sc.parallelize(List(1,2,3))
>> res0: spark.RDD[Int] = ParallelCollectionRDD[0] at parallelize at
>> :13
>>
>> scala> :cp /root
>> Added '/root'.  Your new classpath is:
>>
>> ":/root/jars/aspectjrt.jar:/root/jars/aspectjweaver.jar:/root/jars/aws-java-sdk-1.4.5.jar:/root/jars/aws-java-sdk-1.4.5-javadoc.jar:/root/jars/aws-java-sdk-1.4.5-sources.jar:/root/jars/aws-java-sdk-flow-build-tools-1.4.5.jar:/root/jars/commons-codec-1.3.jar:/root/jars/commons-logging-1.1.1.jar:/root/jars/freemarker-2.3.18.jar:/root/jars/httpclient-4.1.1.jar:/root/jars/httpcore-4.1.jar:/root/jars/jackson-core-asl-1.8.7.jar:/root/jars/mail-1.4.3.jar:/root/jars/spring-beans-3.0.7.jar:/root/jars/spring-context-3.0.7.jar:/root/jars/spring-core-3.0.7.jar:/root/jars/stax-1.2.0.jar:/root/jars/stax-api-1.0.1.jar:/root/spark/conf:/root/spark/core/target/scala-2.9.3/classes:/root/spark/core/src/main/resources:/root/spark/repl/target/scala-2.9.3/classes:/root/spark/examples/target/scala-2.9.3/classes:/root/spark/streaming/target/scala-2.9.3/classes:/root/spark/streaming/lib/org/apache/kafka/kafka/0.7.2-spark/*:/root/spark/lib_managed/jars/*:/root/spark/lib_managed/bundles/*:/root/spark/repl/lib/*:/root/spark/bagel/target/scala-2.9.3/classes:/root/spark/python/lib/py4j0.7.jar:/root"
>> 14/04/15 18:19:37 INFO server.Server: jetty-7.6.8.v20121106
>> 14/04/15 18:19:37 INFO server.AbstractConnector: Started
>> SocketConnector@0.0.0.0:48978
>> Replaying: sc.parallelize(List(1,2,3))
>> :8: error: not found: value sc
>>sc.parallelize(List(1,2,3))
>>
>>
>>
>> On Mon, Apr 14, 2014 at 7:51 PM, Walrus theCat wrote:
>>
>>> Nevermind -- I'm like 90% sure the problem is that I'm importing stuff
>>> that declares a SparkContext as sc.  If it's not, I'll report back.
>>>
>>>
>>> On Mon, Apr 14, 2014 at 2:55 PM, Walrus theCat 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Using the spark-shell, I can't sc.parallelize to get an RDD.
>>>>
>>>> Looks like a bug.
>>>>
>>>> scala> sc.parallelize(Array("a","s","d"))
>>>> java.lang.NullPointerException
>>>> at (:17)
>>>> at (:22)
>>>> at (:24)
>>>> at (:26)
>>>> at (:28)
>>>> at (:30)
>>>> at (:32)
>>>> at (:34)
>>>> at (:36)
>>>> at .(:40)
>>>> at .()
>>>> at .(:11)
>>>> at .()
>>>> at $export()
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:606)
>>>> at spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
>>>> at
>>>> spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:890)
>>>> at
>>>> scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
>>>> at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
>>>> at java.lang.Thread.run(Thread.java:744)
>>>>
>>>
>>>
>>
>


Re: can't sc.parallelize in Spark 0.7.3 spark-shell

2014-04-15 Thread Walrus theCat
Actually altering the classpath in the REPL causes the provided
SparkContext to disappear:

scala> sc.parallelize(List(1,2,3))
res0: spark.RDD[Int] = ParallelCollectionRDD[0] at parallelize at
:13

scala> :cp /root
Added '/root'.  Your new classpath is:
":/root/jars/aspectjrt.jar:/root/jars/aspectjweaver.jar:/root/jars/aws-java-sdk-1.4.5.jar:/root/jars/aws-java-sdk-1.4.5-javadoc.jar:/root/jars/aws-java-sdk-1.4.5-sources.jar:/root/jars/aws-java-sdk-flow-build-tools-1.4.5.jar:/root/jars/commons-codec-1.3.jar:/root/jars/commons-logging-1.1.1.jar:/root/jars/freemarker-2.3.18.jar:/root/jars/httpclient-4.1.1.jar:/root/jars/httpcore-4.1.jar:/root/jars/jackson-core-asl-1.8.7.jar:/root/jars/mail-1.4.3.jar:/root/jars/spring-beans-3.0.7.jar:/root/jars/spring-context-3.0.7.jar:/root/jars/spring-core-3.0.7.jar:/root/jars/stax-1.2.0.jar:/root/jars/stax-api-1.0.1.jar:/root/spark/conf:/root/spark/core/target/scala-2.9.3/classes:/root/spark/core/src/main/resources:/root/spark/repl/target/scala-2.9.3/classes:/root/spark/examples/target/scala-2.9.3/classes:/root/spark/streaming/target/scala-2.9.3/classes:/root/spark/streaming/lib/org/apache/kafka/kafka/0.7.2-spark/*:/root/spark/lib_managed/jars/*:/root/spark/lib_managed/bundles/*:/root/spark/repl/lib/*:/root/spark/bagel/target/scala-2.9.3/classes:/root/spark/python/lib/py4j0.7.jar:/root"
14/04/15 18:19:37 INFO server.Server: jetty-7.6.8.v20121106
14/04/15 18:19:37 INFO server.AbstractConnector: Started
SocketConnector@0.0.0.0:48978
Replaying: sc.parallelize(List(1,2,3))
:8: error: not found: value sc
   sc.parallelize(List(1,2,3))



On Mon, Apr 14, 2014 at 7:51 PM, Walrus theCat wrote:

> Nevermind -- I'm like 90% sure the problem is that I'm importing stuff
> that declares a SparkContext as sc.  If it's not, I'll report back.
>
>
> On Mon, Apr 14, 2014 at 2:55 PM, Walrus theCat wrote:
>
>> Hi,
>>
>> Using the spark-shell, I can't sc.parallelize to get an RDD.
>>
>> Looks like a bug.
>>
>> scala> sc.parallelize(Array("a","s","d"))
>> java.lang.NullPointerException
>> at (:17)
>> at (:22)
>> at (:24)
>> at (:26)
>> at (:28)
>> at (:30)
>> at (:32)
>> at (:34)
>> at (:36)
>> at .(:40)
>> at .()
>> at .(:11)
>> at .()
>> at $export()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
>> at
>> spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:890)
>> at
>> scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
>> at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
>> at java.lang.Thread.run(Thread.java:744)
>>
>
>


Re: can't sc.parallelize in Spark 0.7.3 spark-shell

2014-04-14 Thread Walrus theCat
Nevermind -- I'm like 90% sure the problem is that I'm importing stuff that
declares a SparkContext as sc.  If it's not, I'll report back.


On Mon, Apr 14, 2014 at 2:55 PM, Walrus theCat wrote:

> Hi,
>
> Using the spark-shell, I can't sc.parallelize to get an RDD.
>
> Looks like a bug.
>
> scala> sc.parallelize(Array("a","s","d"))
> java.lang.NullPointerException
> at (:17)
> at (:22)
> at (:24)
> at (:26)
> at (:28)
> at (:30)
> at (:32)
> at (:34)
> at (:36)
> at .(:40)
> at .()
> at .(:11)
> at .()
> at $export()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
> at
> spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:890)
> at
> scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
> at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
> at java.lang.Thread.run(Thread.java:744)
>


can't sc.parallelize in Spark 0.7.3 spark-shell

2014-04-14 Thread Walrus theCat
Hi,

Using the spark-shell, I can't sc.parallelize to get an RDD.

Looks like a bug.

scala> sc.parallelize(Array("a","s","d"))
java.lang.NullPointerException
at (:17)
at (:22)
at (:24)
at (:26)
at (:28)
at (:30)
at (:32)
at (:34)
at (:36)
at .(:40)
at .()
at .(:11)
at .()
at $export()
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:629)
at spark.repl.SparkIMain$Request$$anonfun$10.apply(SparkIMain.scala:890)
at
scala.tools.nsc.interpreter.Line$$anonfun$1.apply$mcV$sp(Line.scala:43)
at scala.tools.nsc.io.package$$anon$2.run(package.scala:25)
at java.lang.Thread.run(Thread.java:744)


RDD[Array] question

2014-03-27 Thread Walrus theCat
Sup y'all,

If I have an RDD[Array] and I do some operation on the RDD, then each Array
is going to get instantiated on some individual machine, correct (or does
it spread it out?)

Thanks


Re: coalescing RDD into equally sized partitions

2014-03-26 Thread Walrus theCat
For the record, I tried this, and it worked.


On Wed, Mar 26, 2014 at 10:51 AM, Walrus theCat wrote:

> Oh so if I had something more reasonable, like RDD's full of tuples of
> say, (Int,Set,Set), I could expect a more uniform distribution?
>
> Thanks
>
>
> On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia 
> wrote:
>
>> This happened because they were integers equal to 0 mod 5, and we used
>> the default hashCode implementation for integers, which will map them all
>> to 0. There's no API method that will look at the resulting partition sizes
>> and rebalance them, but you could use another hash function.
>>
>> Matei
>>
>> On Mar 24, 2014, at 5:20 PM, Walrus theCat 
>> wrote:
>>
>> > Hi,
>> >
>> > sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0
>> ).coalesce(5,true).glom.collect  yields
>> >
>> > Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
>> Array(), Array())
>> >
>> > How do I get something more like:
>> >
>> >  Array(Array(0), Array(20), Array(40), Array(60), Array(80))
>> >
>> > Thanks
>>
>>
>


Re: interleave partitions?

2014-03-26 Thread Walrus theCat
Answering my own question here.  This may not be efficient, but this is
what I came up with:

rdd1.coalesce(N).glom.zip(rdd2.coalesce(N).glom).map { case (x, y) => x ++ y }
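
For what it's worth, an alternative sketch (hedged: it assumes a newer Spark
release that has RDD.zipPartitions): it pairs up corresponding partitions as
iterators, so neither side has to be materialized as an array the way glom
does.

  val N = 4
  val rdd1 = sc.parallelize(1 to 10)
  val rdd2 = sc.parallelize(101 to 110)

  // Both sides must end up with the same number of partitions for the zip to line up.
  val interleaved = rdd1.coalesce(N).zipPartitions(rdd2.coalesce(N)) {
    (left, right) => left ++ right        // concatenate the two partition iterators
  }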


On Wed, Mar 26, 2014 at 11:11 AM, Walrus theCat wrote:

> Hi,
>
> I want to do something like this:
>
> rdd3 = rdd1.coalesce(N).partitions.zip(rdd2.coalesce(N).partitions)
>
> I realize the above will get me something like
> Array[(partition,partition)].
>
> I hope you see what I'm going for here -- any tips on how to accomplish
> this?
>
> Thanks
>


interleave partitions?

2014-03-26 Thread Walrus theCat
Hi,

I want to do something like this:

rdd3 = rdd1.coalesce(N).partitions.zip(rdd2.coalesce(N).partitions)

I realize the above will get me something like Array[(partition,partition)].

I hope you see what I'm going for here -- any tips on how to accomplish
this?

Thanks


Re: coalescing RDD into equally sized partitions

2014-03-26 Thread Walrus theCat
Oh, so if I had something more reasonable, like RDDs full of tuples of,
say, (Int, Set, Set), I could expect a more uniform distribution?

Thanks


On Mon, Mar 24, 2014 at 11:11 PM, Matei Zaharia wrote:

> This happened because they were integers equal to 0 mod 5, and we used the
> default hashCode implementation for integers, which will map them all to 0.
> There's no API method that will look at the resulting partition sizes and
> rebalance them, but you could use another hash function.
>
> Matei
>
> On Mar 24, 2014, at 5:20 PM, Walrus theCat  wrote:
>
> > Hi,
> >
> > sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0
> ).coalesce(5,true).glom.collect  yields
> >
> > Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
> Array(), Array())
> >
> > How do I get something more like:
> >
> >  Array(Array(0), Array(20), Array(40), Array(60), Array(80))
> >
> > Thanks
>
>


coalescing RDD into equally sized partitions

2014-03-24 Thread Walrus theCat
Hi,

sc.parallelize(Array.tabulate(100)(i=>i)).filter( _ % 20 == 0
).coalesce(5,true).glom.collect  yields

Array[Array[Int]] = Array(Array(0, 20, 40, 60, 80), Array(), Array(),
Array(), Array())

How do I get something more like:

 Array(Array(0), Array(20), Array(40), Array(60), Array(80))

Thanks


Re: Splitting RDD and Grouping together to perform computation

2014-03-24 Thread Walrus theCat
I'm also interested in this.


On Mon, Mar 24, 2014 at 4:59 PM, yh18190  wrote:

> Hi, I have a large data set of numbers (an RDD) and want to perform a
> computation only on groups of two values at a time. For example, if
> 1,2,3,4,5,6,7... is an RDD, can I group it into (1,2),(3,4),(5,6)... and
> perform the respective computations in an efficient manner? Since we don't
> have a way to index elements directly (with a for loop over (i, i+1), say),
> is there a way to solve this problem? Any suggestions would be much
> appreciated.
>
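
One way to attack the question above, sketched under some assumptions (a newer
Spark release with zipWithIndex, and the RDD's element order being the order
you care about): give consecutive elements the same key and compute per key.

  val nums = sc.parallelize(1 to 8)

  val pairs = nums
    .zipWithIndex()                     // (1,0L), (2,1L), (3,2L), ...
    .map { case (v, i) => (i / 2, v) }  // elements 2k and 2k+1 share key k
    .groupByKey()
    .mapValues(_.toList)                // e.g. key 0 -> List(1, 2)

  // Example computation on each pair: just sum it.
  pairs.mapValues(_.sum).sortByKey().values.collect()
  // expect: Array(3, 7, 11, 15)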


Re: N-Fold validation and RDD partitions

2014-03-24 Thread Walrus theCat
If someone wanted / needed to implement this themselves, are partitions the
correct way to go?  Any tips on how to get started (say, dividing an RDD
into 5 parts)?
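
If it helps, here is a sketch of one way to get started (my own, hedged: it
assumes a newer Spark release where RDD.randomSplit exists; later MLlib
versions also ship MLUtils.kFold):

  val data = sc.parallelize(1 to 1000)

  // Five roughly equal, randomly assigned folds.
  val folds = data.randomSplit(Array.fill(5)(1.0), seed = 42L)

  val metrics = folds.indices.map { i =>
    val test     = folds(i)
    val training = folds.indices.filter(_ != i).map(j => folds(j)).reduce(_ union _)
    // ... train on `training`, evaluate on `test`, return a metric ...
    test.count().toDouble               // placeholder "metric"
  }

  val mean = metrics.sum / metrics.size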



On Fri, Mar 21, 2014 at 9:51 AM, Jaonary Rabarisoa wrote:

> Thank you Hai-Anh. Are the files CrossValidation.scala and
> RandomSplitRDD.scala enough to use it? I'm currently using Spark 0.9.0 and
> I want to avoid rebuilding everything.
>
>
>
>
> On Fri, Mar 21, 2014 at 4:58 PM, Hai-Anh Trinh  wrote:
>
>> Hi Jaonary,
>>
>> You can find the code for k-fold CV in
>> https://github.com/apache/incubator-spark/pull/448. I have not found the
>> time to resubmit the pull request against the latest master.
>>
>>
>> On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani wrote:
>>
>>> Hi Jaonary,
>>>
>>> I believe the n folds should be mapped to n keys in Spark using a map
>>> function. You can reduce the returned PairRDD and you should get your
>>> metric.
>>> I don't understand partitions fully, but from whatever I understand of
>>> it, they aren't required in your scenario.
>>>
>>> Regards,
>>> Sanjay
>>>
>>>
>>> On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa wrote:
>>> Hi
>>>
>>> I need to partition my data, represented as an RDD, into n folds, run
>>> metrics computation in each fold, and finally compute the mean of my
>>> metrics over all the folds.
>>> Can Spark do this data partitioning out of the box, or do I need to
>>> implement it myself? I know that RDD has a partitions method and
>>> mapPartitions, but I really don't understand the purpose and meaning of
>>> a partition here.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Jaonary
>>>
>>>
>>>
>>
>>
>>  --
>> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
>> http://www.linkedin.com/in/haianh
>>
>>
>


Re: question about partitions

2014-03-24 Thread Walrus theCat
Syed,

Thanks for the tip.  I'm not sure if coalesce is doing what I'm intending
to do, which is, in effect, to subdivide the RDD into N parts (by calling
coalesce and doing operations on the partitions).  It sounds, however, like
this won't bottleneck my processing power.  If this sets off any alarms for
anyone, feel free to chime in.


On Mon, Mar 24, 2014 at 2:50 PM, Syed A. Hashmi wrote:

> RDD.coalesce should be fine for rebalancing data across all RDD
> partitions. Coalesce is pretty handy in situations where you have sparse
> data and want to compact it (e.g. data after applying a strict filter) OR
> you know the magic number of partitions according to your cluster which
> will be optimal.
>
> One point to watch out for, though, is that if N is greater than your
> current partitions, you need to pass shuffle=true to coalesce. If N is less
> than your current partitions (i.e. you are shrinking partitions), do not set
> shuffle=true; otherwise it will cause additional, unnecessary shuffle
> overhead.
>
>
> On Mon, Mar 24, 2014 at 2:32 PM, Walrus theCat wrote:
>
>> For instance, I need to work with an RDD in terms of N parts.  Will
>> calling RDD.coalesce(N) possibly cause processing bottlenecks?
>>
>>
>> On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote:
>>
>>> Hi,
>>>
>>> Quick question about partitions.  If my RDD is partitioned into 5
>>> partitions, does that mean that I am constraining it to exist on at most 5
>>> machines?
>>>
>>> Thanks
>>>
>>
>>
>
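
To make the shuffle=true point quoted above concrete, a small sketch
(assumptions: a spark-shell with sc defined and a recent Spark API, where
repartition(N) is simply shorthand for coalesce(N, shuffle = true)):

  val rdd = sc.parallelize(1 to 100, 4)

  val shrunk = rdd.coalesce(2)                  // 4 -> 2 partitions, no shuffle needed
  val grown  = rdd.coalesce(8, shuffle = true)  // growing past 4 requires the shuffle
  val noOp   = rdd.coalesce(8)                  // without shuffle this silently stays at 4

  Seq(shrunk, grown, noOp).map(_.partitions.length)   // expect List(2, 8, 4)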


Re: question about partitions

2014-03-24 Thread Walrus theCat
For instance, I need to work with an RDD in terms of N parts.  Will calling
RDD.coalesce(N) possibly cause processing bottlenecks?


On Mon, Mar 24, 2014 at 1:28 PM, Walrus theCat wrote:

> Hi,
>
> Quick question about partitions.  If my RDD is partitioned into 5
> partitions, does that mean that I am constraining it to exist on at most 5
> machines?
>
> Thanks
>


question about partitions

2014-03-24 Thread Walrus theCat
Hi,

Quick question about partitions.  If my RDD is partitioned into 5
partitions, does that mean that I am constraining it to exist on at most 5
machines?

Thanks


Re: inexplicable exceptions in Spark 0.7.3

2014-03-18 Thread Walrus theCat
Hi Andrew,

Thanks for your interest.  This is a standalone job.


On Mon, Mar 17, 2014 at 4:30 PM, Andrew Ash  wrote:

> Are you running from the spark shell or from a standalone job?
>
>
> On Mon, Mar 17, 2014 at 4:17 PM, Walrus theCat wrote:
>
>> Hi,
>>
>> I'm getting this stack trace, using Spark 0.7.3.  No references to
>> anything in my code, never experienced anything like this before.  Any
>> ideas what is going on?
>>
>> java.lang.ClassCastException: spark.SparkContext$$anonfun$9 cannot be
>> cast to scala.Function2
>> at spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:43)
>> at spark.scheduler.ResultTask.readExternal(ResultTask.scala:106)
>> at
>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>> at spark.JavaDeserializationStream.readObject(JavaSerializer.scala:23)
>> at spark.JavaSerializerInstance.deserialize(JavaSerializer.scala:45)
>> at spark.executor.Executor$TaskRunner.run(Executor.scala:96)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>>
>
>


inexplicable exceptions in Spark 0.7.3

2014-03-17 Thread Walrus theCat
Hi,

I'm getting this stack trace, using Spark 0.7.3.  No references to anything
in my code, never experienced anything like this before.  Any ideas what is
going on?

java.lang.ClassCastException: spark.SparkContext$$anonfun$9 cannot be cast
to scala.Function2
at spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:43)
at spark.scheduler.ResultTask.readExternal(ResultTask.scala:106)
at
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at spark.JavaDeserializationStream.readObject(JavaSerializer.scala:23)
at spark.JavaSerializerInstance.deserialize(JavaSerializer.scala:45)
at spark.executor.Executor$TaskRunner.run(Executor.scala:96)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


Re: links for the old versions are broken

2014-03-17 Thread Walrus theCat



On Thu, Mar 13, 2014 at 11:05 AM, Aaron Davidson  wrote:

> Looks like everything from 0.8.0 and before errors similarly (though
> "Spark 0.3 for Scala 2.9" has a malformed link as well).
>
>
> On Thu, Mar 13, 2014 at 10:52 AM, Walrus theCat wrote:
>
>> Sup,
>>
>> Where can I get Spark 0.7.3?  It's 404 here:
>>
>> http://spark.apache.org/downloads.html
>>
>> Thanks
>>
>
>


links for the old versions are broken

2014-03-13 Thread Walrus theCat
Sup,

Where can I get Spark 0.7.3?  It's 404 here:

http://spark.apache.org/downloads.html

Thanks