cached rdd in memory eviction

2014-02-24 Thread Koert Kuipers
i was under the impression that running jobs could not evict cached rdds
from memory as long as they are below spark.storage.memoryFraction. however
what i observe seems to indicate the opposite. did anything change?

thanks! koert


Re: OutOfMemoryError with basic kmeans

2014-02-17 Thread Koert Kuipers
looks like it could be kryo related?

i am only guessing here, but you can configure kryo buffers separately...
see:
spark.kryoserializer.buffer.mb


On Mon, Feb 17, 2014 at 7:49 PM, agg  wrote:

> Hi guys,
>
> I'm trying to run a basic version of kmeans (not the mllib version), on
> 250gb of data on 8 machines (each with 8 cores, and 60gb of ram).  I've
> tried many configurations, but keep getting an OutOfMemory error (at the
> bottom).  I've tried the following settings with
> persist(MEMORY_AND_DISK_SER) and Kyro Serialization:
>
> System.setProperty("spark.executor.memory", "55g")
> System.setProperty("spark.storage.memoryFraction", ".5")
> System.setProperty("spark.default.parallelism", "5000")
>
>
> OutOfMemory Error:
>
> 14/02/18 00:34:07 WARN cluster.ClusterTaskSetManager: Loss was due to
> java.lang.OutOfMemoryError
> java.lang.OutOfMemoryError: Java heap space
> at it.unimi.dsi.fastutil.bytes.ByteArrays.grow(ByteArrays.java:170)
> at
>
> it.unimi.dsi.fastutil.io.FastByteArrayOutputStream.write(FastByteArrayOutputStream.java:97)
> at
>
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.dumpBuffer(FastBufferedOutputStream.java:120)
> at
>
> it.unimi.dsi.fastutil.io.FastBufferedOutputStream.write(FastBufferedOutputStream.java:150)
> at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
> at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
> at
> com.esotericsoftware.kryo.io.Output.writeAscii_slow(Output.java:446)
> at com.esotericsoftware.kryo.io.Output.writeString(Output.java:306)
> at
>
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeName(DefaultClassResolver.java:105)
> at
>
> com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:81)
> at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
> at
> com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
> at
>
> org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:88)
> at
>
> org.apache.spark.serializer.SerializationStream$class.writeAll(Serializer.scala:80)
> at
>
> org.apache.spark.serializer.KryoSerializationStream.writeAll(KryoSerializer.scala:84)
> at
>
> org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:815)
> at
> org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:824)
> at
> org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:78)
> at
> org.apache.spark.storage.BlockManager.liftedTree1$1(BlockManager.scala:552)
> at
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:546)
> at
> org.apache.spark.storage.BlockManager.put(BlockManager.scala:477)
> at
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:76)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:224)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:29)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:34)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:237)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:226)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:159)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:100)
> at org.apache.spark.scheduler.Task.run(Task.scala:53)
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/OutOfMemoryError-with-basic-kmeans-tp1651.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>


Re: Shared access to RDD

2014-02-17 Thread Koert Kuipers
it is possible to run multiple queries using a shared SparkContext (which
holds the shared RDD). however this is not easily available in spark-shell
i believe.

alternatively tachyon can be used to share (serialized) RDDs


On Mon, Feb 17, 2014 at 11:41 AM, David Thomas  wrote:

> Is it possible for multiple users to access the same RDD, say, from their
> respective spark-shells?
>


yarn documentation

2014-02-17 Thread Koert Kuipers
in the documentation for running spark on yarn it states:

"We do not requesting container resources based on the number of cores.
Thus the numbers of cores given via command line arguments cannot be
guaranteed."

can someone explain this a bit more?
is it simply a reflection of the fact that yarn could come back with the
requested number of containers, but those could have less cores than
requested? (and if so, spark will not ask for more containers to get the
required number of cores)


Re: default parallelism in trunk

2014-02-02 Thread Koert Kuipers
After the upgrade spark-shell still behaved properly. But a scala program
that defined it's own SparkContext and did not set
spark.default.parallelism suddenly was stuck with a parallelism of 2. I
"fixed it" by setting a desired spark.default.parallelism system property
for now, and no longer relying on the default.


On Sun, Feb 2, 2014 at 7:48 PM, Aaron Davidson  wrote:

> Sorry, I meant to say we will use the maximum between (the total number of
> cores in the cluster) and (2) if spark.default.parallelism is not set. So
> this should not be causing your problem unless your cluster thinks it has
> less than 2 cores.
>
>
> On Sun, Feb 2, 2014 at 4:46 PM, Aaron Davidson  wrote:
>
>> Could you give an example where default parallelism is set to 2 where it
>> didn't used to be?
>>
>> Here is the relevant section for the spark standalone mode:
>> CoarseGrainedSchedulerBackend.scala#L211<https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L211>.
>> If spark.default.parallelism is set, it will override anything else. If it
>> is not set, we will use the total number of cores in the cluster and 2,
>> which is the same logic that has been used since 
>> spark-0.7<https://github.com/apache/incubator-spark/blob/branch-0.7/core/src/main/scala/spark/scheduler/cluster/StandaloneSchedulerBackend.scala#L156>
>> .
>>
>> Simplest possibility is that you're setting spark.default.parallelism,
>> otherwise there may be a bug introduced somewhere that isn't defaulting
>> correctly anymore.
>>
>>
>> On Sat, Feb 1, 2014 at 12:30 AM, Koert Kuipers  wrote:
>>
>>> i just managed to upgrade my 0.9-SNAPSHOT from the last scala 2.9.x
>>> version to the latest.
>>>
>>>
>>> everything seems good except that my default parallelism is now set to 2
>>> for jobs instead of some smart number based on the number of cores (i think
>>> that is what it used to do). it this change on purpose?
>>>
>>> i am running spark standalone.
>>>
>>> thx, koert
>>>
>>
>>
>


default parallelism in trunk

2014-02-01 Thread Koert Kuipers
i just managed to upgrade my 0.9-SNAPSHOT from the last scala 2.9.x version
to the latest.


everything seems good except that my default parallelism is now set to 2
for jobs instead of some smart number based on the number of cores (i think
that is what it used to do). it this change on purpose?

i am running spark standalone.

thx, koert


graphx merge for scala 2.9

2013-12-27 Thread Koert Kuipers
since we are still on scala 2.9.x and trunk migrated to 2.10.x i hope
graphx will get merged into the 0.8.x series at some point, and not just
0.9.x (which is now scala 2.10), since that would make it hard for us to
use in the near future.
best, koert


how to detect a disconnect

2013-12-21 Thread Koert Kuipers
with long running apps i see this at times:

13/12/21 12:57:59 INFO scheduler.Stage: Stage 1 is now unavailable on
executor 10 (0/66, false)
13/12/21 12:58:19 WARN storage.BlockManagerMasterActor: Removing
BlockManager BlockManagerId(1, node10, 33734, 0) with no recent heart
beats: 50227ms exceeds 45000ms

typically this would be because of a spark service restart. is there a way
to detect this programmatically so that the client can take the correct
steps to recover?


Re: writing to HDFS with a given username

2013-12-13 Thread Koert Kuipers
thats great. didn't realize this was in master already.


On Thu, Dec 12, 2013 at 8:10 PM, Shao, Saisai  wrote:

>  Hi Koert,
>
>
>
> Spark with multi-user support has been merged in master branch with patch (
> https://github.com/apache/incubator-spark/pull/23), you can check out the
> master branch. These patch can support to access hdfs with the username you
> start the Spark application, not the one who starts Spark service.
>
>
>
> Thanks
>
> Jerry
>
> *From:* Koert Kuipers [mailto:ko...@tresata.com]
> *Sent:* Friday, December 13, 2013 8:39 AM
> *To:* user@spark.incubator.apache.org
> *Subject:* Re: writing to HDFS with a given username
>
>
>
> Hey Philip,
> how do you get spark to write to hdfs with your user name? When i use
> spark it writes to hdfs as the user that runs the spark services... i wish
> it read and wrote as me.
>
>
>
> On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren 
> wrote:
>
> When I call rdd.saveAsTextFile("hdfs://...") it uses my username to write
> to the HDFS drive.  If I try to write to an HDFS directory that I do not
> have permissions to, then I get an error like this:
>
> Permission denied: user=me, access=WRITE,
> inode="/user/you/":you:us:drwxr-xr-x
>
> I can obviously get around this by changing the permissions on the
> directory /user/you.  However, is it possible to call rdd.saveAsText with
> an alternate username and password?
>
> Thanks,
> Philip
>
>
>


Re: writing to HDFS with a given username

2013-12-12 Thread Koert Kuipers
Hey Philip,
how do you get spark to write to hdfs with your user name? When i use spark
it writes to hdfs as the user that runs the spark services... i wish it
read and wrote as me.


On Thu, Dec 12, 2013 at 6:37 PM, Philip Ogren wrote:

> When I call rdd.saveAsTextFile("hdfs://...") it uses my username to write
> to the HDFS drive.  If I try to write to an HDFS directory that I do not
> have permissions to, then I get an error like this:
>
> Permission denied: user=me, access=WRITE, inode="/user/you/":you:us:
> drwxr-xr-x
>
> I can obviously get around this by changing the permissions on the
> directory /user/you.  However, is it possible to call rdd.saveAsText with
> an alternate username and password?
>
> Thanks,
> Philip
>


Re: 0.9-SNAPSHOT StageInfo

2013-11-29 Thread Koert Kuipers
i use a SparkListener to collect info about failures in task related to my
RDD.

to do so for every stage submitted i verify if the stage is for an RDD that
is a dependency of my target target RDD (including the target RDD itself).

then for every task ending i check if the task is for a stage i care about,
after which i collect any errors for the task (for which i already have to
break the spark API, since i currently cannot pattern match on
taskEnd.reason due to the private nature of ExceptionFailure and friends.

all of this simply to be able to provide the user with a useful error
message as to why the calculation failed (as opposed to: fetch failed more
than 4 times).


On Fri, Nov 29, 2013 at 3:09 PM, Koert Kuipers  wrote:

> in 0.9-SNAPSHOT StageInfo has been changed to make the stage itself no
> longer accessible.
>
> however the stage contains the rdd, which is necessary to tie this
> StageInfo to an RDD. now all we have is the rddName. is the rddName
> guaranteed to be unique, and can it be relied upon to identify RDDs?
>


0.9-SNAPSHOT StageInfo

2013-11-29 Thread Koert Kuipers
in 0.9-SNAPSHOT StageInfo has been changed to make the stage itself no
longer accessible.

however the stage contains the rdd, which is necessary to tie this
StageInfo to an RDD. now all we have is the rddName. is the rddName
guaranteed to be unique, and can it be relied upon to identify RDDs?


Re: interesting question on quora

2013-11-18 Thread Koert Kuipers
the core of hadoop is currently hdfs + mapreduce. the more appropriate
question is if it will become hdfs + spark. so will spark overtake
mapreduce as the dominant computational engine? its a very serious
candidate for that i think. it can do many things mapreduce cannot do, and
has an awesome api.

it's missing a few things to truly replace mapreduce:
* handling data that does not fit in memory per key/reducer
* security support (integrate with hdfs authorization/authentication)
* scalability??? (has spark been tested on 1000 machines)


On Mon, Nov 18, 2013 at 12:38 AM, jamal sasha  wrote:

> I found this interesting question on quora.. and thought of sharing here.
> https://www.quora.com/Apache-Hadoop/Will-spark-ever-overtake-hadoop
> So.. is spark missing any capabilty?
>
>


Re: Does spark RDD has a partitionedByKey

2013-11-16 Thread Koert Kuipers
in fact co-partitioning was one of the main reason we started using spark.
in map-reduce its a giant pain to implement


On Sat, Nov 16, 2013 at 3:05 PM, Koert Kuipers  wrote:

> we use PartitionBy a lot to keep multiple datasets co-partitioned before
> caching.
> it works well.
>
>
> On Sat, Nov 16, 2013 at 5:10 AM, guojc  wrote:
>
>> After looking at the api more carefully, I just found  I overlooked the
>> partitionBy function on PairRDDFunction.  It's the function I need. Sorry
>> for the confusion.
>>
>> Best Regards,
>> Jiacheng Guo
>>
>>
>> On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen wrote:
>>
>>> Jiacheng, if you're OK with using the Shark layer above Spark (and I
>>> think for many use cases the answer would be "yes"), then you can take
>>> advantage of Shark's co-partitioning. Or do something like
>>> https://github.com/amplab/shark/pull/100/commits
>>>
>>> Sent while mobile. Pls excuse typos etc.
>>> On Nov 16, 2013 2:48 AM, "guojc"  wrote:
>>>
>>>> Hi Meisam,
>>>>  What I want to achieve here is a bit tricky. Basically, I'm try to
>>>> implement PerSplit SemiJoin in a iterative algorithm with Spark. It's a
>>>> very efficient join strategy for high in-balanced data set and provide huge
>>>> gain against normal join in that situation.,
>>>>
>>>>  Let's say we have two large rdd, a: RDD[X] and b: RDD[Y,Z] and
>>>> both of them load directly from hdfs. So both of them will has a
>>>> partitioner of Nothing. And X is a large complicate struct contain a set of
>>>> join key Y.  First for each partition of a , I extract join key Y from
>>>> every ins of X in that parition and construct a hash set of join key Y and
>>>> paritionID. Now I have a new rdd c :RDD[Y,PartionID ] and join it with b on
>>>> Y and then construct a rdd d:RDD[PartitionID,Map[Y,Z]] by groupby on
>>>> PartitionID and constructing map of Y and Z.  As for each partition of a, I
>>>> want to repartiion it according to its partition id, and it becomes a rdd
>>>>  e:RDD[PartitionID,X]. As both d and e will same partitioner and same key,
>>>> they will be joined very efficiently.
>>>>
>>>> The key ability I want to have here is the ability to cache rdd c
>>>> with same partitioner of rdd b and cache e. So later join with b and d will
>>>> be efficient, because the value of b will be updated from time to time and
>>>> d's content will change accordingly. And It will be nice to have the
>>>> ability to repartition a with its original paritionid without actually
>>>> shuffle across network.
>>>>
>>>> You can refer to
>>>> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf 
>>>> for
>>>> PerSplit SemiJoin's details.
>>>>
>>>> Best Regards,
>>>> Jiacheng Guo
>>>>
>>>>
>>>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi 
>>>> wrote:
>>>>
>>>>> Hi guojc,
>>>>>
>>>>> It is not cleat for me what problem you are trying to solve. What do
>>>>> you want to do with the result of your
>>>>> groupByKey(myPartitioner).flatMapValues( x=>x)? Do you want to use it
>>>>> in a join? Do you want to save it to your file system? Or do you want
>>>>> to do something else with it?
>>>>>
>>>>> Thanks,
>>>>> Meisam
>>>>>
>>>>> On Fri, Nov 15, 2013 at 12:56 PM, guojc  wrote:
>>>>> > Hi Meisam,
>>>>> > Thank you for response. I know each rdd has a partitioner. What
>>>>> I want
>>>>> > to achieved here is re-partition a piece of data according to my
>>>>> custom
>>>>> > partitioner. Currently I do that by
>>>>> groupByKey(myPartitioner).flatMapValues(
>>>>> > x=>x). But I'm a bit worried whether this will create additional
>>>>> temp object
>>>>> > collection, as result is first made into Seq the an collection of
>>>>> tupples.
>>>>> > Any suggestion?
>>>>> >
>>>>> > Best Regards,
>>>>> > Jiahcheng Guo
>>>>> >
>>>>> >
>>>>> > On Sat, Nov 16, 2013 at 12

Re: Does spark RDD has a partitionedByKey

2013-11-16 Thread Koert Kuipers
we use PartitionBy a lot to keep multiple datasets co-partitioned before
caching.
it works well.


On Sat, Nov 16, 2013 at 5:10 AM, guojc  wrote:

> After looking at the api more carefully, I just found  I overlooked the
> partitionBy function on PairRDDFunction.  It's the function I need. Sorry
> for the confusion.
>
> Best Regards,
> Jiacheng Guo
>
>
> On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen wrote:
>
>> Jiacheng, if you're OK with using the Shark layer above Spark (and I
>> think for many use cases the answer would be "yes"), then you can take
>> advantage of Shark's co-partitioning. Or do something like
>> https://github.com/amplab/shark/pull/100/commits
>>
>> Sent while mobile. Pls excuse typos etc.
>> On Nov 16, 2013 2:48 AM, "guojc"  wrote:
>>
>>> Hi Meisam,
>>>  What I want to achieve here is a bit tricky. Basically, I'm try to
>>> implement PerSplit SemiJoin in a iterative algorithm with Spark. It's a
>>> very efficient join strategy for high in-balanced data set and provide huge
>>> gain against normal join in that situation.,
>>>
>>>  Let's say we have two large rdd, a: RDD[X] and b: RDD[Y,Z] and both
>>> of them load directly from hdfs. So both of them will has a partitioner of
>>> Nothing. And X is a large complicate struct contain a set of join key Y.
>>>  First for each partition of a , I extract join key Y from every ins of X
>>> in that parition and construct a hash set of join key Y and paritionID. Now
>>> I have a new rdd c :RDD[Y,PartionID ] and join it with b on Y and then
>>> construct a rdd d:RDD[PartitionID,Map[Y,Z]] by groupby on PartitionID and
>>> constructing map of Y and Z.  As for each partition of a, I want to
>>> repartiion it according to its partition id, and it becomes a rdd
>>>  e:RDD[PartitionID,X]. As both d and e will same partitioner and same key,
>>> they will be joined very efficiently.
>>>
>>> The key ability I want to have here is the ability to cache rdd c
>>> with same partitioner of rdd b and cache e. So later join with b and d will
>>> be efficient, because the value of b will be updated from time to time and
>>> d's content will change accordingly. And It will be nice to have the
>>> ability to repartition a with its original paritionid without actually
>>> shuffle across network.
>>>
>>> You can refer to
>>> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf 
>>> for
>>> PerSplit SemiJoin's details.
>>>
>>> Best Regards,
>>> Jiacheng Guo
>>>
>>>
>>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi wrote:
>>>
 Hi guojc,

 It is not cleat for me what problem you are trying to solve. What do
 you want to do with the result of your
 groupByKey(myPartitioner).flatMapValues( x=>x)? Do you want to use it
 in a join? Do you want to save it to your file system? Or do you want
 to do something else with it?

 Thanks,
 Meisam

 On Fri, Nov 15, 2013 at 12:56 PM, guojc  wrote:
 > Hi Meisam,
 > Thank you for response. I know each rdd has a partitioner. What I
 want
 > to achieved here is re-partition a piece of data according to my
 custom
 > partitioner. Currently I do that by
 groupByKey(myPartitioner).flatMapValues(
 > x=>x). But I'm a bit worried whether this will create additional temp
 object
 > collection, as result is first made into Seq the an collection of
 tupples.
 > Any suggestion?
 >
 > Best Regards,
 > Jiahcheng Guo
 >
 >
 > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <
 meisam.fa...@gmail.com>
 > wrote:
 >>
 >> Hi Jiacheng,
 >>
 >> Each RDD has a partitioner. You can define your own partitioner if
 the
 >> default partitioner does not suit your purpose.
 >> You can take a look at this
 >>
 >>
 http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
 .
 >>
 >> Thanks,
 >> Meisam
 >>
 >> On Fri, Nov 15, 2013 at 6:54 AM, guojc  wrote:
 >> > Hi,
 >> >   I'm wondering whether spark rdd can has a partitionedByKey
 function?
 >> > The
 >> > use of this function is to have a rdd distributed by according to a
 >> > cerntain
 >> > paritioner and cache it. And then further join performance by rdd
 with
 >> > same
 >> > partitoner will a great speed up. Currently, we only have a
 >> > groupByKeyFunction and generate a Seq of desired type , which is
 not
 >> > very
 >> > convenient.
 >> >
 >> > Btw, Sorry for last empty body email. I mistakenly hit the send
 >> > shortcut.
 >> >
 >> >
 >> > Best Regards,
 >> > Jiacheng Guo
 >
 >

>>>
>>>
>


Re: compare/contrast Spark with Cascading

2013-10-29 Thread Koert Kuipers
Hey Prashant,
I assume you mean steps to reproduce the OOM. I do not currently. I just
ran into them when porting some jobs from map-red. I never turned it into a
reproducible test, and i do not exclude that it was my poor programming
that caused it. However it happened with a bunch of jobs, and then i asked
on the message boards about the OOM, and people pointed me to the
assumption about reducer input having to fit in memory. At that point i
felt like that was too much of a limitation for the jobs i was trying to
port and i gave up.


On Tue, Oct 29, 2013 at 1:12 AM, Prashant Sharma wrote:

> Hey Koert,
>
> Can you give me steps to reproduce this ?
>
>
> On Tue, Oct 29, 2013 at 10:06 AM, Koert Kuipers  wrote:
>
>> Matei,
>> We have some jobs where even the input for a single key in a groupBy
>> would not fit in the the tasks memory. We rely on mapred to stream from
>> disk to disk as it reduces.
>> I think spark should be able to handle that situation to truly be able to
>> claim it can replace map-red (or not?).
>> Best, Koert
>>
>>
>> On Mon, Oct 28, 2013 at 8:51 PM, Matei Zaharia 
>> wrote:
>>
>>> FWIW, the only thing that Spark expects to fit in memory if you use
>>> DISK_ONLY caching is the input to each reduce task. Those currently don't
>>> spill to disk. The solution if datasets are large is to add more reduce
>>> tasks, whereas Hadoop would run along with a small number of tasks that do
>>> lots of disk IO. But this is something we will likely change soon. Other
>>> than that, everything runs in a streaming fashion and there's no need for
>>> the data to fit in memory. Our goal is certainly to work on any size
>>> datasets, and some of our current users are explicitly using Spark to
>>> replace things like Hadoop Streaming in just batch jobs (see e.g. Yahoo!'s
>>> presentation from http://ampcamp.berkeley.edu/3/). If you run into
>>> trouble with these, let us know, since it is an explicit goal of the
>>> project to support it.
>>>
>>> Matei
>>>
>>> On Oct 28, 2013, at 5:32 PM, Koert Kuipers  wrote:
>>>
>>> no problem :) i am actually not familiar with what oscar has said on
>>> this. can you share or point me to the conversation thread?
>>>
>>> it is my opinion based on the little experimenting i have done. but i am
>>> willing to be convinced otherwise.
>>> one the very first things i did when we started using spark is run jobs
>>> with DISK_ONLY, and see if it could some of the jobs that map-reduce does
>>> for us. however i ran into OOMs, presumably because spark makes assumptions
>>> that some things should fit in memory. i have to admit i didn't try too
>>> hard after the first OOMs.
>>>
>>> if spark were able to scale from the quick in-memory query to the
>>> overnight disk-only giant batch query, i would love it! spark has a much
>>> nicer api than map-red, and one could use a single set of algos for
>>> everything from quick/realtime queries to giant batch jobs. as far as i am
>>> concerned map-red would be done. our clusters of the future would be hdfs +
>>> spark.
>>>
>>>
>>> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra 
>>> wrote:
>>>
>>>> And I didn't mean to skip over you, Koert.  I'm just more familiar with
>>>> what Oscar said on the subject than with your opinion.
>>>>
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra 
>>>> wrote:
>>>>
>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>>> large datasets but not for very large datasets.
>>>>>
>>>>>
>>>>> It is in the opinion of some at Twitter.  That doesn't make it true or
>>>>> a universally held opinion.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole wrote:
>>>>>
>>>>>> Hmmm... I was unaware of this concept that Spark is for medium to
>>>>>> large datasets but not for very large datasets. What size is very large?
>>>>>>
>>>>>> Can someone please elaborate on why this would be the case and what
>>>>>> stops Spark, as it is today, to be successfully run on very large 
>>>>>> datasets?
>>>>>> I'll appreciate it.
>>>>>>
>>>>>> I would think that Spark should be able to pull 

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
Matei,
We have some jobs where even the input for a single key in a groupBy would
not fit in the the tasks memory. We rely on mapred to stream from disk to
disk as it reduces.
I think spark should be able to handle that situation to truly be able to
claim it can replace map-red (or not?).
Best, Koert


On Mon, Oct 28, 2013 at 8:51 PM, Matei Zaharia wrote:

> FWIW, the only thing that Spark expects to fit in memory if you use
> DISK_ONLY caching is the input to each reduce task. Those currently don't
> spill to disk. The solution if datasets are large is to add more reduce
> tasks, whereas Hadoop would run along with a small number of tasks that do
> lots of disk IO. But this is something we will likely change soon. Other
> than that, everything runs in a streaming fashion and there's no need for
> the data to fit in memory. Our goal is certainly to work on any size
> datasets, and some of our current users are explicitly using Spark to
> replace things like Hadoop Streaming in just batch jobs (see e.g. Yahoo!'s
> presentation from http://ampcamp.berkeley.edu/3/). If you run into
> trouble with these, let us know, since it is an explicit goal of the
> project to support it.
>
> Matei
>
> On Oct 28, 2013, at 5:32 PM, Koert Kuipers  wrote:
>
> no problem :) i am actually not familiar with what oscar has said on this.
> can you share or point me to the conversation thread?
>
> it is my opinion based on the little experimenting i have done. but i am
> willing to be convinced otherwise.
> one the very first things i did when we started using spark is run jobs
> with DISK_ONLY, and see if it could some of the jobs that map-reduce does
> for us. however i ran into OOMs, presumably because spark makes assumptions
> that some things should fit in memory. i have to admit i didn't try too
> hard after the first OOMs.
>
> if spark were able to scale from the quick in-memory query to the
> overnight disk-only giant batch query, i would love it! spark has a much
> nicer api than map-red, and one could use a single set of algos for
> everything from quick/realtime queries to giant batch jobs. as far as i am
> concerned map-red would be done. our clusters of the future would be hdfs +
> spark.
>
>
> On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra wrote:
>
>> And I didn't mean to skip over you, Koert.  I'm just more familiar with
>> what Oscar said on the subject than with your opinion.
>>
>>
>>
>> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra wrote:
>>
>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>>> datasets but not for very large datasets.
>>>
>>>
>>> It is in the opinion of some at Twitter.  That doesn't make it true or a
>>> universally held opinion.
>>>
>>>
>>>
>>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole wrote:
>>>
>>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>>> datasets but not for very large datasets. What size is very large?
>>>>
>>>> Can someone please elaborate on why this would be the case and what
>>>> stops Spark, as it is today, to be successfully run on very large datasets?
>>>> I'll appreciate it.
>>>>
>>>> I would think that Spark should be able to pull off Hadoop level
>>>> throughput in worst case with DISK_ONLY caching.
>>>>
>>>> Thanks
>>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers"  wrote:
>>>>
>>>>> i would say scaling (cascading + DSL for scala) offers similar
>>>>> functionality to spark, and a similar syntax.
>>>>> the main difference between spark and scalding is target jobs:
>>>>> scalding is for long running jobs on very large data. the data is read
>>>>> from and written to disk between steps. jobs run from minutes to days.
>>>>> spark is for faster jobs on medium to large data. the data is
>>>>> primarily held in memory. jobs run from a few seconds to a few hours.
>>>>> although spark can work with data on disks it still makes assumptions that
>>>>> data needs to fit in memory for certain steps (although less and less with
>>>>> every release). spark also makes iterative designs much easier.
>>>>>
>>>>> i have found them both great to program in and complimentary. we use
>>>>> scalding for overnight batch processes and spark for more realtime
>>>>> processes. at this point i would trust scalding a lot more due to the
>>>>> robustness of the sta

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
no problem :) i am actually not familiar with what oscar has said on this.
can you share or point me to the conversation thread?

it is my opinion based on the little experimenting i have done. but i am
willing to be convinced otherwise.
one the very first things i did when we started using spark is run jobs
with DISK_ONLY, and see if it could some of the jobs that map-reduce does
for us. however i ran into OOMs, presumably because spark makes assumptions
that some things should fit in memory. i have to admit i didn't try too
hard after the first OOMs.

if spark were able to scale from the quick in-memory query to the overnight
disk-only giant batch query, i would love it! spark has a much nicer api
than map-red, and one could use a single set of algos for everything from
quick/realtime queries to giant batch jobs. as far as i am concerned
map-red would be done. our clusters of the future would be hdfs + spark.


On Mon, Oct 28, 2013 at 8:16 PM, Mark Hamstra wrote:

> And I didn't mean to skip over you, Koert.  I'm just more familiar with
> what Oscar said on the subject than with your opinion.
>
>
>
> On Mon, Oct 28, 2013 at 5:13 PM, Mark Hamstra wrote:
>
>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>> datasets but not for very large datasets.
>>
>>
>> It is in the opinion of some at Twitter.  That doesn't make it true or a
>> universally held opinion.
>>
>>
>>
>> On Mon, Oct 28, 2013 at 5:08 PM, Ashish Rangole wrote:
>>
>>> Hmmm... I was unaware of this concept that Spark is for medium to large
>>> datasets but not for very large datasets. What size is very large?
>>>
>>> Can someone please elaborate on why this would be the case and what
>>> stops Spark, as it is today, to be successfully run on very large datasets?
>>> I'll appreciate it.
>>>
>>> I would think that Spark should be able to pull off Hadoop level
>>> throughput in worst case with DISK_ONLY caching.
>>>
>>> Thanks
>>> On Oct 28, 2013 1:37 PM, "Koert Kuipers"  wrote:
>>>
>>>> i would say scaling (cascading + DSL for scala) offers similar
>>>> functionality to spark, and a similar syntax.
>>>> the main difference between spark and scalding is target jobs:
>>>> scalding is for long running jobs on very large data. the data is read
>>>> from and written to disk between steps. jobs run from minutes to days.
>>>> spark is for faster jobs on medium to large data. the data is primarily
>>>> held in memory. jobs run from a few seconds to a few hours. although spark
>>>> can work with data on disks it still makes assumptions that data needs to
>>>> fit in memory for certain steps (although less and less with every
>>>> release). spark also makes iterative designs much easier.
>>>>
>>>> i have found them both great to program in and complimentary. we use
>>>> scalding for overnight batch processes and spark for more realtime
>>>> processes. at this point i would trust scalding a lot more due to the
>>>> robustness of the stack, but spark is getting better every day.
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan  wrote:
>>>>
>>>>> Hi Philip,
>>>>>
>>>>> Cascading is relatively agnostic about the distributed topology
>>>>> underneath it, especially as of the 2.0 release over a year ago. There's
>>>>> been some discussion about writing a flow planner for Spark -- e.g., which
>>>>> would replace the Hadoop flow planner. Not sure if there's active work on
>>>>> that yet.
>>>>>
>>>>> There are a few commercial workflow abstraction layers (probably what
>>>>> was meant by "application layer" ?), in terms of the Cascading family
>>>>> (incl. Cascalog, Scalding), and also Actian's integration of
>>>>> Hadoop/Knime/etc., and also the work by Continuum, ODG, and others in the
>>>>> Py data stack.
>>>>>
>>>>> Spark would not be at the same level of abstraction as Cascading
>>>>> (business logic, effectively); however, something like MLbase is 
>>>>> ostensibly
>>>>> intended for that http://www.mlbase.org/
>>>>>
>>>>> With respect to Spark, two other things to watch... One would
>>>>> definitely be the Py data stack and ability to integrate with PySpark,
>>>>> which is 

Re: compare/contrast Spark with Cascading

2013-10-28 Thread Koert Kuipers
i would say scaling (cascading + DSL for scala) offers similar
functionality to spark, and a similar syntax.
the main difference between spark and scalding is target jobs:
scalding is for long running jobs on very large data. the data is read from
and written to disk between steps. jobs run from minutes to days.
spark is for faster jobs on medium to large data. the data is primarily
held in memory. jobs run from a few seconds to a few hours. although spark
can work with data on disks it still makes assumptions that data needs to
fit in memory for certain steps (although less and less with every
release). spark also makes iterative designs much easier.

i have found them both great to program in and complimentary. we use
scalding for overnight batch processes and spark for more realtime
processes. at this point i would trust scalding a lot more due to the
robustness of the stack, but spark is getting better every day.




On Mon, Oct 28, 2013 at 3:00 PM, Paco Nathan  wrote:

> Hi Philip,
>
> Cascading is relatively agnostic about the distributed topology underneath
> it, especially as of the 2.0 release over a year ago. There's been some
> discussion about writing a flow planner for Spark -- e.g., which would
> replace the Hadoop flow planner. Not sure if there's active work on that
> yet.
>
> There are a few commercial workflow abstraction layers (probably what was
> meant by "application layer" ?), in terms of the Cascading family (incl.
> Cascalog, Scalding), and also Actian's integration of Hadoop/Knime/etc.,
> and also the work by Continuum, ODG, and others in the Py data stack.
>
> Spark would not be at the same level of abstraction as Cascading (business
> logic, effectively); however, something like MLbase is ostensibly intended
> for that http://www.mlbase.org/
>
> With respect to Spark, two other things to watch... One would definitely
> be the Py data stack and ability to integrate with PySpark, which is
> turning out to be very power abstraction -- quite close to a large segment
> of industry needs.  The other project to watch, on the Scala side, is
> Summingbird and it's evolution at Twitter:
> https://blog.twitter.com/2013/streaming-mapreduce-with-summingbird
>
> Paco
> http://amazon.com/dp/1449358721/
>
>
> On Mon, Oct 28, 2013 at 10:11 AM, Philip Ogren wrote:
>
>>
>> My team is investigating a number of technologies in the Big Data space.
>> A team member recently got turned on to 
>> Cascadingas an application layer 
>> for orchestrating complex workflows/scenarios.  He
>> asked me if Spark had an "application layer"?  My initial reaction is "no"
>> that Spark would not have a separate orchestration/application layer.
>> Instead, the core Spark API (along with Streaming) would compete directly
>> with Cascading for this kind of functionality and that the two would not
>> likely be all that complementary.  I realize that I am exposing my
>> ignorance here and could be way off.  Is there anyone who knows a bit about
>> both of these technologies who could speak to this in broad strokes?
>>
>> Thanks!
>> Philip
>>
>>
>


Re: snappy

2013-10-18 Thread Koert Kuipers
for now i simply rebuild spark 0.8 but changed the snappy dependency, and
this fixed it.

but in general it should be possible for a user to include jars in jobs
that "override" the jars included with spark i would say. however this does
not seem to work.


On Fri, Oct 18, 2013 at 7:39 PM, Koert Kuipers  wrote:

> the issue with snappy is supposedly fixed in snappy-java version 1.1.0-M4.
> so i tried to include that with my job, but no luck. i still get error
> related to "snappy-1.0.5-libsnappyjava.so", which to me seems to indicate
> that the snappy included with spark is on the classpath ahead of the one i
> included with my job. is this the case? how do i put my jars first on the
> spark claspath? this should generally be possible as people can include
> newer versions of jars...
>
>
> On Fri, Oct 18, 2013 at 7:30 PM, Koert Kuipers  wrote:
>
>> woops wrong mailing list
>>
>> -- Forwarded message --
>> From: Koert Kuipers 
>> Date: Fri, Oct 18, 2013 at 7:29 PM
>> Subject: snappy
>> To: spark-us...@googlegroups.com
>>
>>
>> the snappy bundled with spark 0.8 is causing trouble on CentOS 5:
>>
>>  java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5-libsnappyjava.so:
>> /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by
>> /tmp/snappy-1.0.5-libsn
>> appyjava.so)
>>
>>
>>
>


Re: snappy

2013-10-18 Thread Koert Kuipers
the issue with snappy is supposedly fixed in snappy-java version 1.1.0-M4.
so i tried to include that with my job, but no luck. i still get error
related to "snappy-1.0.5-libsnappyjava.so", which to me seems to indicate
that the snappy included with spark is on the classpath ahead of the one i
included with my job. is this the case? how do i put my jars first on the
spark claspath? this should generally be possible as people can include
newer versions of jars...


On Fri, Oct 18, 2013 at 7:30 PM, Koert Kuipers  wrote:

> woops wrong mailing list
>
> -- Forwarded message --
> From: Koert Kuipers 
> Date: Fri, Oct 18, 2013 at 7:29 PM
> Subject: snappy
> To: spark-us...@googlegroups.com
>
>
> the snappy bundled with spark 0.8 is causing trouble on CentOS 5:
>
>  java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5-libsnappyjava.so:
> /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by
> /tmp/snappy-1.0.5-libsn
> appyjava.so)
>
>
>


Fwd: snappy

2013-10-18 Thread Koert Kuipers
woops wrong mailing list

-- Forwarded message --
From: Koert Kuipers 
Date: Fri, Oct 18, 2013 at 7:29 PM
Subject: snappy
To: spark-us...@googlegroups.com


the snappy bundled with spark 0.8 is causing trouble on CentOS 5:

 java.lang.UnsatisfiedLinkError: /tmp/snappy-1.0.5-libsnappyjava.so:
/usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.9' not found (required by
/tmp/snappy-1.0.5-libsn
appyjava.so)


Re: spark 0.8

2013-10-18 Thread Koert Kuipers
OK it turned out setting
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer in
SPARK_JAVA_OPTS on the workers/slaves caused all this. not sure why. this
used to work fine in previous spark. but when i removed it the errors went
away.


On Fri, Oct 18, 2013 at 2:59 PM, Koert Kuipers  wrote:

> i installed the plain vanilla spark 0.8 on our cluster, downloaded from
> here:
> http://spark-project.org/download/spark-0.8.0-incubating-bin-hadoop1.tgz
> after a restart of all spark daemons i still see the same issue for every
> task:
>
> java.io.StreamCorruptedException: invalid type code: 00
>
> so now i am guessing it must be something in my configuration. i guess
> this is progress...
>
> looking at the logs of a worker, i see the task gets launched like this:
>
> Spark Executor Command: "java" "-cp"
> ":/usr/local/lib/spark/conf:/usr/local/lib/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
> "-Dspar
> k.worker.timeout=3" "-Dspark.akka.timeout=3"
> "-Dspark.storage.blockManagerHeartBeatMs=12"
> "-Dspark.storage.blockManagerTimeoutIntervalMs=12" "-Dspark.akka.retry
> .wait=3" "-Dspark.akka.frameSize=1"
> "-Dspark.akka.logLifecycleEvents=true"
> "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer"
> "-Dspark.worker.timeout=3
> " "-Dspark.akka.timeout=3"
> "-Dspark.storage.blockManagerHeartBeatMs=12"
> "-Dspark.storage.blockManagerTimeoutIntervalMs=12"
> "-Dspark.akka.retry.wait=3" "-Dspark.
> akka.frameSize=1" "-Dspark.akka.logLifecycleEvents=true"
> "-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Xms512M"
> "-Xmx512M" "org.apache.spark.executor.St
> andaloneExecutorBackend" "akka://
> spark@192.168.3.171:38472/user/StandaloneScheduler" "1" "node02" "7"
>
>
> and finally this is my spark-env.sh:
>
> export SCALA_HOME=/usr/local/lib/scala-2.9.3
> export SPARK_MASTER_IP=node01
> export SPARK_MASTER_PORT=7077
> export SPARK_MASTER_WEBUI_PORT=8080
> export SPARK_WORKER_CORES=7
> export SPARK_WORKER_MEMORY=14G
> export SPARK_WORKER_PORT=7078
> export SPARK_WORKER_WEBUI_PORT=8081
> export SPARK_WORKER_DIR=/var/lib/spark
> export SPARK_CLASSPATH=$SPARK_USER_CLASSPATH
> export SPARK_JAVA_OPTS="-Dspark.worker.timeout=3
> -Dspark.akka.timeout=3 -Dspark.storage.blockManagerHeartBeatMs=120000
> -Dspark.storage.blockManagerTimeoutIntervalMs=120
> 000 -Dspark.akka.retry.wait=3 -Dspark.akka.frameSize=1
> -Dspark.akka.logLifecycleEvents=true
> -Dspark.serializer=org.apache.spark.serializer.KryoSerializer $SPARK_JAVA_OP
> TS"
> export
> SPARK_WORKER_OPTS="-Dspark.local.dir=/data/0/tmp,/data/1/tmp,/data/2/tmp,/data/3/tmp,/data/4/tmp,/data/5/tmp"
>
>
>
>
>
> On Fri, Oct 18, 2013 at 2:02 PM, Koert Kuipers  wrote:
>
>> i checked out the v0.8.0-incubating tag again, changed the settings to
>> build against correct version of hadoop for our cluster, ran sbt-assembly,
>> build tarball, installed it on cluster, restarted spark... same errors
>>
>>
>> On Fri, Oct 18, 2013 at 12:49 PM, Koert Kuipers wrote:
>>
>>> at this point i feel like it must be some sort of version mismatch? i am
>>> gonna check the spark build that i deployed on the cluster
>>>
>>>
>>> On Fri, Oct 18, 2013 at 12:46 PM, Koert Kuipers wrote:
>>>
>>>> name := "Simple Project"
>>>>
>>>> version := "1.0"
>>>>
>>>> scalaVersion := "2.9.3"
>>>>
>>>> libraryDependencies += "org.apache.spark" %% "spark-core" %
>>>> "0.8.0-incubating"
>>>>
>>>> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>>>>
>>>> resolvers += "Cloudera Repository" at "
>>>> https://repository.cloudera.com/artifactory/cloudera-repos/";
>>>>
>>>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" %
>>>> "2.0.0-mr1-cdh4.3.0"
>>>>
>>>>
>>>>
>>>>
>>>> On Fri, Oct 18, 2013 at 12:34 PM, Matei Zaharia <
>>>> matei.zaha...@gmail.com> wrote:
>>>>
>>>>> Can you post the build.sbt for your program? It needs to include
>>>>> hadoop-client for CDH4.3, and that 

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
i installed the plain vanilla spark 0.8 on our cluster, downloaded from
here:
http://spark-project.org/download/spark-0.8.0-incubating-bin-hadoop1.tgz
after a restart of all spark daemons i still see the same issue for every
task:
java.io.StreamCorruptedException: invalid type code: 00

so now i am guessing it must be something in my configuration. i guess this
is progress...

looking at the logs of a worker, i see the task gets launched like this:

Spark Executor Command: "java" "-cp"
":/usr/local/lib/spark/conf:/usr/local/lib/spark/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.0-incubating-hadoop1.0.4.jar"
"-Dspar
k.worker.timeout=3" "-Dspark.akka.timeout=3"
"-Dspark.storage.blockManagerHeartBeatMs=12"
"-Dspark.storage.blockManagerTimeoutIntervalMs=12" "-Dspark.akka.retry
.wait=3" "-Dspark.akka.frameSize=1"
"-Dspark.akka.logLifecycleEvents=true"
"-Dspark.serializer=org.apache.spark.serializer.KryoSerializer"
"-Dspark.worker.timeout=3
" "-Dspark.akka.timeout=3"
"-Dspark.storage.blockManagerHeartBeatMs=12"
"-Dspark.storage.blockManagerTimeoutIntervalMs=12"
"-Dspark.akka.retry.wait=3" "-Dspark.
akka.frameSize=1" "-Dspark.akka.logLifecycleEvents=true"
"-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" "-Xms512M"
"-Xmx512M" "org.apache.spark.executor.St
andaloneExecutorBackend" "akka://
spark@192.168.3.171:38472/user/StandaloneScheduler" "1" "node02" "7"


and finally this is my spark-env.sh:

export SCALA_HOME=/usr/local/lib/scala-2.9.3
export SPARK_MASTER_IP=node01
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=7
export SPARK_WORKER_MEMORY=14G
export SPARK_WORKER_PORT=7078
export SPARK_WORKER_WEBUI_PORT=8081
export SPARK_WORKER_DIR=/var/lib/spark
export SPARK_CLASSPATH=$SPARK_USER_CLASSPATH
export SPARK_JAVA_OPTS="-Dspark.worker.timeout=3
-Dspark.akka.timeout=3 -Dspark.storage.blockManagerHeartBeatMs=12
-Dspark.storage.blockManagerTimeoutIntervalMs=120
000 -Dspark.akka.retry.wait=3 -Dspark.akka.frameSize=1
-Dspark.akka.logLifecycleEvents=true
-Dspark.serializer=org.apache.spark.serializer.KryoSerializer $SPARK_JAVA_OP
TS"
export
SPARK_WORKER_OPTS="-Dspark.local.dir=/data/0/tmp,/data/1/tmp,/data/2/tmp,/data/3/tmp,/data/4/tmp,/data/5/tmp"





On Fri, Oct 18, 2013 at 2:02 PM, Koert Kuipers  wrote:

> i checked out the v0.8.0-incubating tag again, changed the settings to
> build against correct version of hadoop for our cluster, ran sbt-assembly,
> build tarball, installed it on cluster, restarted spark... same errors
>
>
> On Fri, Oct 18, 2013 at 12:49 PM, Koert Kuipers  wrote:
>
>> at this point i feel like it must be some sort of version mismatch? i am
>> gonna check the spark build that i deployed on the cluster
>>
>>
>> On Fri, Oct 18, 2013 at 12:46 PM, Koert Kuipers wrote:
>>
>>> name := "Simple Project"
>>>
>>> version := "1.0"
>>>
>>> scalaVersion := "2.9.3"
>>>
>>> libraryDependencies += "org.apache.spark" %% "spark-core" %
>>> "0.8.0-incubating"
>>>
>>> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>>>
>>> resolvers += "Cloudera Repository" at "
>>> https://repository.cloudera.com/artifactory/cloudera-repos/";
>>>
>>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" %
>>> "2.0.0-mr1-cdh4.3.0"
>>>
>>>
>>>
>>>
>>> On Fri, Oct 18, 2013 at 12:34 PM, Matei Zaharia >> > wrote:
>>>
>>>> Can you post the build.sbt for your program? It needs to include
>>>> hadoop-client for CDH4.3, and that should *not* be listed as provided.
>>>>
>>>> Matei
>>>>
>>>> On Oct 18, 2013, at 8:23 AM, Koert Kuipers  wrote:
>>>>
>>>> ok this has nothing to do with hadoop access. even a simple program
>>>> that uses sc.parallelize blows up in this way.
>>>>
>>>> so spark-shell works well on the same machine i launch this from.
>>>>
>>>> if i launch a simple program without using kryo for serializer and
>>>> closure serialize i get a different error. see below.
>>>> at this point it seems to me i have some issue with task
>>>> serialization???
>>>>
>>>>
>>>>
>>>> 13/10/18

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
i checked out the v0.8.0-incubating tag again, changed the settings to
build against correct version of hadoop for our cluster, ran sbt-assembly,
build tarball, installed it on cluster, restarted spark... same errors


On Fri, Oct 18, 2013 at 12:49 PM, Koert Kuipers  wrote:

> at this point i feel like it must be some sort of version mismatch? i am
> gonna check the spark build that i deployed on the cluster
>
>
> On Fri, Oct 18, 2013 at 12:46 PM, Koert Kuipers  wrote:
>
>> name := "Simple Project"
>>
>> version := "1.0"
>>
>> scalaVersion := "2.9.3"
>>
>> libraryDependencies += "org.apache.spark" %% "spark-core" %
>> "0.8.0-incubating"
>>
>> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>>
>> resolvers += "Cloudera Repository" at "
>> https://repository.cloudera.com/artifactory/cloudera-repos/";
>>
>> libraryDependencies += "org.apache.hadoop" % "hadoop-client" %
>> "2.0.0-mr1-cdh4.3.0"
>>
>>
>>
>>
>> On Fri, Oct 18, 2013 at 12:34 PM, Matei Zaharia 
>> wrote:
>>
>>> Can you post the build.sbt for your program? It needs to include
>>> hadoop-client for CDH4.3, and that should *not* be listed as provided.
>>>
>>> Matei
>>>
>>> On Oct 18, 2013, at 8:23 AM, Koert Kuipers  wrote:
>>>
>>> ok this has nothing to do with hadoop access. even a simple program that
>>> uses sc.parallelize blows up in this way.
>>>
>>> so spark-shell works well on the same machine i launch this from.
>>>
>>> if i launch a simple program without using kryo for serializer and
>>> closure serialize i get a different error. see below.
>>> at this point it seems to me i have some issue with task serialization???
>>>
>>>
>>>
>>> 13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 0
>>> 13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 1
>>> 13/10/18 11:20:37 INFO Executor: Running task ID 1
>>> 13/10/18 11:20:37 INFO Executor: Running task ID 0
>>> 13/10/18 11:20:37 INFO Executor: Fetching
>>> http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar with
>>> timestamp 1382109635095
>>> 13/10/18 11:20:37 INFO Utils: Fetching
>>> http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar to
>>> /tmp/fetchFileTemp378181753997570700.tmp
>>> 13/10/18 11:20:37 INFO Executor: Adding
>>> file:/var/lib/spark/app-20131018112035-0014/1/./simple-project_2.9.3-1.0.jar
>>> to class loader
>>> 13/10/18 11:20:37 INFO Executor: caught throwable with stacktrace
>>> java.io.StreamCorruptedException: invalid type code: 00
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2467)
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2502)
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2661)
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2583)
>>> at java.io.DataInputStream.readFully(DataInputStream.java:178)
>>> at java.io.DataInputStream.readLong(DataInputStream.java:399)
>>> at
>>> java.io.ObjectInputStream$BlockDataInputStream.readLong(ObjectInputStream.java:2803)
>>> at java.io.ObjectInputStream.readLong(ObjectInputStream.java:958)
>>> at
>>> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:72)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>> at java.lang.reflect.Method.invoke(Method.java:597)
>>> at
>>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
>>> at
>>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
>>> at
>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>>> at
>>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
&g

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
at this point i feel like it must be some sort of version mismatch? i am
gonna check the spark build that i deployed on the cluster


On Fri, Oct 18, 2013 at 12:46 PM, Koert Kuipers  wrote:

> name := "Simple Project"
>
> version := "1.0"
>
> scalaVersion := "2.9.3"
>
> libraryDependencies += "org.apache.spark" %% "spark-core" %
> "0.8.0-incubating"
>
> resolvers += "Akka Repository" at "http://repo.akka.io/releases/";
>
> resolvers += "Cloudera Repository" at "
> https://repository.cloudera.com/artifactory/cloudera-repos/";
>
> libraryDependencies += "org.apache.hadoop" % "hadoop-client" %
> "2.0.0-mr1-cdh4.3.0"
>
>
>
>
> On Fri, Oct 18, 2013 at 12:34 PM, Matei Zaharia 
> wrote:
>
>> Can you post the build.sbt for your program? It needs to include
>> hadoop-client for CDH4.3, and that should *not* be listed as provided.
>>
>> Matei
>>
>> On Oct 18, 2013, at 8:23 AM, Koert Kuipers  wrote:
>>
>> ok this has nothing to do with hadoop access. even a simple program that
>> uses sc.parallelize blows up in this way.
>>
>> so spark-shell works well on the same machine i launch this from.
>>
>> if i launch a simple program without using kryo for serializer and
>> closure serialize i get a different error. see below.
>> at this point it seems to me i have some issue with task serialization???
>>
>>
>>
>> 13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 0
>> 13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 1
>> 13/10/18 11:20:37 INFO Executor: Running task ID 1
>> 13/10/18 11:20:37 INFO Executor: Running task ID 0
>> 13/10/18 11:20:37 INFO Executor: Fetching
>> http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar with
>> timestamp 1382109635095
>> 13/10/18 11:20:37 INFO Utils: Fetching
>> http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar to
>> /tmp/fetchFileTemp378181753997570700.tmp
>> 13/10/18 11:20:37 INFO Executor: Adding
>> file:/var/lib/spark/app-20131018112035-0014/1/./simple-project_2.9.3-1.0.jar
>> to class loader
>> 13/10/18 11:20:37 INFO Executor: caught throwable with stacktrace
>> java.io.StreamCorruptedException: invalid type code: 00
>> at
>> java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2467)
>> at
>> java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2502)
>> at
>> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2661)
>> at
>> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2583)
>> at java.io.DataInputStream.readFully(DataInputStream.java:178)
>> at java.io.DataInputStream.readLong(DataInputStream.java:399)
>> at
>> java.io.ObjectInputStream$BlockDataInputStream.readLong(ObjectInputStream.java:2803)
>> at java.io.ObjectInputStream.readLong(ObjectInputStream.java:958)
>> at
>> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:72)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> at java.lang.reflect.Method.invoke(Method.java:597)
>> at
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
>> at
>> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>> at
>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
>> at
>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1795)
>> at
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1754)
>> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>> at
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
>> at
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Exec

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
name := "Simple Project"

version := "1.0"

scalaVersion := "2.9.3"

libraryDependencies += "org.apache.spark" %% "spark-core" %
"0.8.0-incubating"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/";

resolvers += "Cloudera Repository" at "
https://repository.cloudera.com/artifactory/cloudera-repos/";

libraryDependencies += "org.apache.hadoop" % "hadoop-client" %
"2.0.0-mr1-cdh4.3.0"




On Fri, Oct 18, 2013 at 12:34 PM, Matei Zaharia wrote:

> Can you post the build.sbt for your program? It needs to include
> hadoop-client for CDH4.3, and that should *not* be listed as provided.
>
> Matei
>
> On Oct 18, 2013, at 8:23 AM, Koert Kuipers  wrote:
>
> ok this has nothing to do with hadoop access. even a simple program that
> uses sc.parallelize blows up in this way.
>
> so spark-shell works well on the same machine i launch this from.
>
> if i launch a simple program without using kryo for serializer and closure
> serialize i get a different error. see below.
> at this point it seems to me i have some issue with task serialization???
>
>
>
> 13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 0
> 13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 1
> 13/10/18 11:20:37 INFO Executor: Running task ID 1
> 13/10/18 11:20:37 INFO Executor: Running task ID 0
> 13/10/18 11:20:37 INFO Executor: Fetching
> http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar with
> timestamp 1382109635095
> 13/10/18 11:20:37 INFO Utils: Fetching
> http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar to
> /tmp/fetchFileTemp378181753997570700.tmp
> 13/10/18 11:20:37 INFO Executor: Adding
> file:/var/lib/spark/app-20131018112035-0014/1/./simple-project_2.9.3-1.0.jar
> to class loader
> 13/10/18 11:20:37 INFO Executor: caught throwable with stacktrace
> java.io.StreamCorruptedException: invalid type code: 00
> at
> java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2467)
> at
> java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2502)
> at
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2661)
> at
> java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2583)
> at java.io.DataInputStream.readFully(DataInputStream.java:178)
> at java.io.DataInputStream.readLong(DataInputStream.java:399)
> at
> java.io.ObjectInputStream$BlockDataInputStream.readLong(ObjectInputStream.java:2803)
> at java.io.ObjectInputStream.readLong(ObjectInputStream.java:958)
> at
> org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:72)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
> at
> java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
> at
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
> at
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1795)
> at
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1754)
> at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
> at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
> at
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
> at
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:153)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
>
>
>
> On Fri, Oct 18, 2013 at 10:59 AM, Koert Kuipers  wrote:
>
>> i created a tiny sbt project as described here:
>> apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala<http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala>
>>
>> it h

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
ok this has nothing to do with hadoop access. even a simple program that
uses sc.parallelize blows up in this way.

so spark-shell works well on the same machine i launch this from.

if i launch a simple program without using kryo for serializer and closure
serialize i get a different error. see below.
at this point it seems to me i have some issue with task serialization???



13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 0
13/10/18 11:20:37 INFO StandaloneExecutorBackend: Got assigned task 1
13/10/18 11:20:37 INFO Executor: Running task ID 1
13/10/18 11:20:37 INFO Executor: Running task ID 0
13/10/18 11:20:37 INFO Executor: Fetching
http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar with timestamp
1382109635095
13/10/18 11:20:37 INFO Utils: Fetching
http://192.168.3.171:41629/jars/simple-project_2.9.3-1.0.jar to
/tmp/fetchFileTemp378181753997570700.tmp
13/10/18 11:20:37 INFO Executor: Adding
file:/var/lib/spark/app-20131018112035-0014/1/./simple-project_2.9.3-1.0.jar
to class loader
13/10/18 11:20:37 INFO Executor: caught throwable with stacktrace
java.io.StreamCorruptedException: invalid type code: 00
at
java.io.ObjectInputStream$BlockDataInputStream.readBlockHeader(ObjectInputStream.java:2467)
at
java.io.ObjectInputStream$BlockDataInputStream.refill(ObjectInputStream.java:2502)
at
java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2661)
at
java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2583)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at java.io.DataInputStream.readLong(DataInputStream.java:399)
at
java.io.ObjectInputStream$BlockDataInputStream.readLong(ObjectInputStream.java:2803)
at java.io.ObjectInputStream.readLong(ObjectInputStream.java:958)
at
org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:72)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
at
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1795)
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1754)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:153)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)



On Fri, Oct 18, 2013 at 10:59 AM, Koert Kuipers  wrote:

> i created a tiny sbt project as described here:
> apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala<http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala>
>
> it has the correct dependencies: spark-core and the correct hadoop-client
> for my platform. i tried both the generic spark-core dependency and
> spark-core dependency compiled against my platform. it runs fine in local
> mode, but when i switch to the cluster i still always get the following
> exceptions on tasks:
>
> 13/10/18 10:25:53 ERROR Executor: Uncaught exception in thread
> Thread[pool-5-thread-1,5,main]
>
> java.lang.NullPointerException
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
>
> after adding some additional debugging to Executor i see the cause is this:
> 13/10/18 10:54:47 INFO Executor: caught throwable with stacktrace
> java.lang.NullPointerException
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$2.apply(Executor.scala:155)
> at
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$2.apply(Exec

Re: spark 0.8

2013-10-18 Thread Koert Kuipers
i created a tiny sbt project as described here:
apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala<http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala>

it has the correct dependencies: spark-core and the correct hadoop-client
for my platform. i tried both the generic spark-core dependency and
spark-core dependency compiled against my platform. it runs fine in local
mode, but when i switch to the cluster i still always get the following
exceptions on tasks:

13/10/18 10:25:53 ERROR Executor: Uncaught exception in thread
Thread[pool-5-thread-1,5,main]
java.lang.NullPointerException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

after adding some additional debugging to Executor i see the cause is this:
13/10/18 10:54:47 INFO Executor: caught throwable with stacktrace
java.lang.NullPointerException
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$2.apply(Executor.scala:155)
at
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$2.apply(Executor.scala:155)
at org.apache.spark.Logging$class.logInfo(Logging.scala:48)
at org.apache.spark.executor.Executor.logInfo(Executor.scala:36)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:155)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)

so it seems the offending line is:
logInfo("Its epoch is " + task.epoch)

i am guessing accessing epoch on the task is throwing the NPE. any ideas?



On Thu, Oct 17, 2013 at 8:12 PM, Koert Kuipers  wrote:

> sorry one more related question:
> i compile against a spark build for hadoop 1.0.4, but the actual installed
> version of spark is build against cdh4.3.0-mr1. this also used to work, and
> i prefer to do this so i compile against a generic spark build. could this
> be the issue?
>
>
> On Thu, Oct 17, 2013 at 8:06 PM, Koert Kuipers  wrote:
>
>> i have my spark and hadoop related dependencies as "provided" for my
>> spark job. this used to work with previous versions. are these now supposed
>> to be compile/runtime/default dependencies?
>>
>>
>> On Thu, Oct 17, 2013 at 8:04 PM, Koert Kuipers  wrote:
>>
>>> yes i did that and i can see the correct jars sitting in lib_managed
>>>
>>>
>>> On Thu, Oct 17, 2013 at 7:56 PM, Matei Zaharia 
>>> wrote:
>>>
>>>> Koert, did you link your Spark job to the right version of HDFS as
>>>> well? In Spark 0.8, you have to add a Maven dependency on "hadoop-client"
>>>> for your version of Hadoop. See
>>>> http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>>>>  for
>>>> example.
>>>>
>>>> Matei
>>>>
>>>> On Oct 17, 2013, at 4:38 PM, Koert Kuipers  wrote:
>>>>
>>>> i got the job a little further along by also setting this:
>>>> System.setProperty("spark.closure.serializer",
>>>> "org.apache.spark.serializer.KryoSerializer")
>>>>
>>>> not sure why i need to... but anyhow, now my workers start and then
>>>> they blow up on this:
>>>>
>>>> 13/10/17 19:22:57 ERROR Executor: Uncaught exception in thread
>>>> Thread[pool-5-thread-1,5,main]
>>>> java.lang.NullPointerException
>>>> at
>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>> at java.lang.Thread.run(Thread.java:662)
>>>>
>>>>
>>>> which is:
>>>>  val metrics = attemptedTask.flatMap(t => t.metrics)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Oct 17, 2013 at 7:30 PM, dachuan  wrote:
>>>>
>>>>> thanks, Mark.
>>>>>
>>>>>
>>>>> On Thu, Oct 17, 2013 at 6:36 PM, Mark Hamstra >>>> > wrote:
>>>>>
>>>>>> SNAPSHOTs are not fixed versions, but are floati

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
sorry one more related question:
i compile against a spark build for hadoop 1.0.4, but the actual installed
version of spark is build against cdh4.3.0-mr1. this also used to work, and
i prefer to do this so i compile against a generic spark build. could this
be the issue?


On Thu, Oct 17, 2013 at 8:06 PM, Koert Kuipers  wrote:

> i have my spark and hadoop related dependencies as "provided" for my spark
> job. this used to work with previous versions. are these now supposed to be
> compile/runtime/default dependencies?
>
>
> On Thu, Oct 17, 2013 at 8:04 PM, Koert Kuipers  wrote:
>
>> yes i did that and i can see the correct jars sitting in lib_managed
>>
>>
>> On Thu, Oct 17, 2013 at 7:56 PM, Matei Zaharia 
>> wrote:
>>
>>> Koert, did you link your Spark job to the right version of HDFS as well?
>>> In Spark 0.8, you have to add a Maven dependency on "hadoop-client" for
>>> your version of Hadoop. See
>>> http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>>>  for
>>> example.
>>>
>>> Matei
>>>
>>> On Oct 17, 2013, at 4:38 PM, Koert Kuipers  wrote:
>>>
>>> i got the job a little further along by also setting this:
>>> System.setProperty("spark.closure.serializer",
>>> "org.apache.spark.serializer.KryoSerializer")
>>>
>>> not sure why i need to... but anyhow, now my workers start and then they
>>> blow up on this:
>>>
>>> 13/10/17 19:22:57 ERROR Executor: Uncaught exception in thread
>>> Thread[pool-5-thread-1,5,main]
>>> java.lang.NullPointerException
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>> at java.lang.Thread.run(Thread.java:662)
>>>
>>>
>>> which is:
>>>  val metrics = attemptedTask.flatMap(t => t.metrics)
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Oct 17, 2013 at 7:30 PM, dachuan  wrote:
>>>
>>>> thanks, Mark.
>>>>
>>>>
>>>> On Thu, Oct 17, 2013 at 6:36 PM, Mark Hamstra 
>>>> wrote:
>>>>
>>>>> SNAPSHOTs are not fixed versions, but are floating names associated
>>>>> with whatever is the most recent code.  So, Spark 0.8.0 is the current
>>>>> released version of Spark, which is exactly the same today as it was
>>>>> yesterday, and will be the same thing forever.  Spark 0.8.1-SNAPSHOT is
>>>>> whatever is currently in branch-0.8.  It changes every time new code is
>>>>> committed to that branch (which should be just bug fixes and the few
>>>>> additional features that we wanted to get into 0.8.0, but that didn't 
>>>>> quite
>>>>> make it.)  Not too long from now there will be a release of Spark 0.8.1, 
>>>>> at
>>>>> which time the SNAPSHOT will got to 0.8.2 and 0.8.1 will be forever 
>>>>> frozen.
>>>>>  Meanwhile, the wild new development is taking place on the master branch,
>>>>> and whatever is currently in that branch becomes 0.9.0-SNAPSHOT.  This
>>>>> could be quite different from day to day, and there are no guarantees that
>>>>> things won't be broken in 0.9.0-SNAPSHOT.  Several months from now there
>>>>> will be a release of Spark 0.9.0 (unless the decision is made to bump the
>>>>> version to 1.0.0), at which point the SNAPSHOT goes to 0.9.1 and the whole
>>>>> process advances to the next phase of development.
>>>>>
>>>>> The short answer is that releases are stable, SNAPSHOTs are not, and
>>>>> SNAPSHOTs that aren't on maintenance branches can break things.  You make
>>>>> your choice of which to use and pay the consequences.
>>>>>
>>>>>
>>>>> On Thu, Oct 17, 2013 at 3:18 PM, dachuan  wrote:
>>>>>
>>>>>> yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got..
>>>>>> what's the difference? I mean SNAPSHOT and non-SNAPSHOT.
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra <
>>>&

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
i have my spark and hadoop related dependencies as "provided" for my spark
job. this used to work with previous versions. are these now supposed to be
compile/runtime/default dependencies?


On Thu, Oct 17, 2013 at 8:04 PM, Koert Kuipers  wrote:

> yes i did that and i can see the correct jars sitting in lib_managed
>
>
> On Thu, Oct 17, 2013 at 7:56 PM, Matei Zaharia wrote:
>
>> Koert, did you link your Spark job to the right version of HDFS as well?
>> In Spark 0.8, you have to add a Maven dependency on "hadoop-client" for
>> your version of Hadoop. See
>> http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>>  for
>> example.
>>
>> Matei
>>
>> On Oct 17, 2013, at 4:38 PM, Koert Kuipers  wrote:
>>
>> i got the job a little further along by also setting this:
>> System.setProperty("spark.closure.serializer",
>> "org.apache.spark.serializer.KryoSerializer")
>>
>> not sure why i need to... but anyhow, now my workers start and then they
>> blow up on this:
>>
>> 13/10/17 19:22:57 ERROR Executor: Uncaught exception in thread
>> Thread[pool-5-thread-1,5,main]
>> java.lang.NullPointerException
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>> at java.lang.Thread.run(Thread.java:662)
>>
>>
>> which is:
>>  val metrics = attemptedTask.flatMap(t => t.metrics)
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On Thu, Oct 17, 2013 at 7:30 PM, dachuan  wrote:
>>
>>> thanks, Mark.
>>>
>>>
>>> On Thu, Oct 17, 2013 at 6:36 PM, Mark Hamstra 
>>> wrote:
>>>
>>>> SNAPSHOTs are not fixed versions, but are floating names associated
>>>> with whatever is the most recent code.  So, Spark 0.8.0 is the current
>>>> released version of Spark, which is exactly the same today as it was
>>>> yesterday, and will be the same thing forever.  Spark 0.8.1-SNAPSHOT is
>>>> whatever is currently in branch-0.8.  It changes every time new code is
>>>> committed to that branch (which should be just bug fixes and the few
>>>> additional features that we wanted to get into 0.8.0, but that didn't quite
>>>> make it.)  Not too long from now there will be a release of Spark 0.8.1, at
>>>> which time the SNAPSHOT will got to 0.8.2 and 0.8.1 will be forever frozen.
>>>>  Meanwhile, the wild new development is taking place on the master branch,
>>>> and whatever is currently in that branch becomes 0.9.0-SNAPSHOT.  This
>>>> could be quite different from day to day, and there are no guarantees that
>>>> things won't be broken in 0.9.0-SNAPSHOT.  Several months from now there
>>>> will be a release of Spark 0.9.0 (unless the decision is made to bump the
>>>> version to 1.0.0), at which point the SNAPSHOT goes to 0.9.1 and the whole
>>>> process advances to the next phase of development.
>>>>
>>>> The short answer is that releases are stable, SNAPSHOTs are not, and
>>>> SNAPSHOTs that aren't on maintenance branches can break things.  You make
>>>> your choice of which to use and pay the consequences.
>>>>
>>>>
>>>> On Thu, Oct 17, 2013 at 3:18 PM, dachuan  wrote:
>>>>
>>>>> yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got..
>>>>> what's the difference? I mean SNAPSHOT and non-SNAPSHOT.
>>>>>
>>>>>
>>>>> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra >>>> > wrote:
>>>>>
>>>>>> Of course, you mean 0.9.0-SNAPSHOT.  There is no Spark 0.9.0, and
>>>>>> won't be for several months.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 17, 2013 at 3:11 PM, dachuan  wrote:
>>>>>>
>>>>>>> I'm sorry if this doesn't answer your question directly, but I have
>>>>>>> tried spark 0.9.0 and hdfs 1.0.4 just now, it works..
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote:
>>>>>>>
>>>>>>>> after upgrading from spark 0.7 

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
yes i did that and i can see the correct jars sitting in lib_managed


On Thu, Oct 17, 2013 at 7:56 PM, Matei Zaharia wrote:

> Koert, did you link your Spark job to the right version of HDFS as well?
> In Spark 0.8, you have to add a Maven dependency on "hadoop-client" for
> your version of Hadoop. See
> http://spark.incubator.apache.org/docs/latest/quick-start.html#a-standalone-app-in-scala
>  for
> example.
>
> Matei
>
> On Oct 17, 2013, at 4:38 PM, Koert Kuipers  wrote:
>
> i got the job a little further along by also setting this:
> System.setProperty("spark.closure.serializer",
> "org.apache.spark.serializer.KryoSerializer")
>
> not sure why i need to... but anyhow, now my workers start and then they
> blow up on this:
>
> 13/10/17 19:22:57 ERROR Executor: Uncaught exception in thread
> Thread[pool-5-thread-1,5,main]
> java.lang.NullPointerException
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
> at java.lang.Thread.run(Thread.java:662)
>
>
> which is:
>  val metrics = attemptedTask.flatMap(t => t.metrics)
>
>
>
>
>
>
>
>
>
> On Thu, Oct 17, 2013 at 7:30 PM, dachuan  wrote:
>
>> thanks, Mark.
>>
>>
>> On Thu, Oct 17, 2013 at 6:36 PM, Mark Hamstra wrote:
>>
>>> SNAPSHOTs are not fixed versions, but are floating names associated with
>>> whatever is the most recent code.  So, Spark 0.8.0 is the current released
>>> version of Spark, which is exactly the same today as it was yesterday, and
>>> will be the same thing forever.  Spark 0.8.1-SNAPSHOT is whatever is
>>> currently in branch-0.8.  It changes every time new code is committed to
>>> that branch (which should be just bug fixes and the few additional features
>>> that we wanted to get into 0.8.0, but that didn't quite make it.)  Not too
>>> long from now there will be a release of Spark 0.8.1, at which time the
>>> SNAPSHOT will got to 0.8.2 and 0.8.1 will be forever frozen.  Meanwhile,
>>> the wild new development is taking place on the master branch, and whatever
>>> is currently in that branch becomes 0.9.0-SNAPSHOT.  This could be quite
>>> different from day to day, and there are no guarantees that things won't be
>>> broken in 0.9.0-SNAPSHOT.  Several months from now there will be a release
>>> of Spark 0.9.0 (unless the decision is made to bump the version to 1.0.0),
>>> at which point the SNAPSHOT goes to 0.9.1 and the whole process advances to
>>> the next phase of development.
>>>
>>> The short answer is that releases are stable, SNAPSHOTs are not, and
>>> SNAPSHOTs that aren't on maintenance branches can break things.  You make
>>> your choice of which to use and pay the consequences.
>>>
>>>
>>> On Thu, Oct 17, 2013 at 3:18 PM, dachuan  wrote:
>>>
>>>> yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got..
>>>> what's the difference? I mean SNAPSHOT and non-SNAPSHOT.
>>>>
>>>>
>>>> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra 
>>>> wrote:
>>>>
>>>>> Of course, you mean 0.9.0-SNAPSHOT.  There is no Spark 0.9.0, and
>>>>> won't be for several months.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Oct 17, 2013 at 3:11 PM, dachuan  wrote:
>>>>>
>>>>>> I'm sorry if this doesn't answer your question directly, but I have
>>>>>> tried spark 0.9.0 and hdfs 1.0.4 just now, it works..
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote:
>>>>>>
>>>>>>> after upgrading from spark 0.7 to spark 0.8 i can no longer access
>>>>>>> any files on HDFS.
>>>>>>>  i see the error below. any ideas?
>>>>>>>
>>>>>>> i am running spark standalone on a cluster that also has CDH4.3.0
>>>>>>> and rebuild spark accordingly. the jars in lib_managed look good to me.
>>>>>>>
>>>>>>> i noticed similar errors in the mailing list but found no suggested
>>>>>>> solutions.
>>>>>>>
>>>>>>> thanks! koert
>>>>>>>
>>>>>>&

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
i got the job a little further along by also setting this:
System.setProperty("spark.closure.serializer",
"org.apache.spark.serializer.KryoSerializer")

not sure why i need to... but anyhow, now my workers start and then they
blow up on this:

13/10/17 19:22:57 ERROR Executor: Uncaught exception in thread
Thread[pool-5-thread-1,5,main]
java.lang.NullPointerException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)


which is:
 val metrics = attemptedTask.flatMap(t => t.metrics)









On Thu, Oct 17, 2013 at 7:30 PM, dachuan  wrote:

> thanks, Mark.
>
>
> On Thu, Oct 17, 2013 at 6:36 PM, Mark Hamstra wrote:
>
>> SNAPSHOTs are not fixed versions, but are floating names associated with
>> whatever is the most recent code.  So, Spark 0.8.0 is the current released
>> version of Spark, which is exactly the same today as it was yesterday, and
>> will be the same thing forever.  Spark 0.8.1-SNAPSHOT is whatever is
>> currently in branch-0.8.  It changes every time new code is committed to
>> that branch (which should be just bug fixes and the few additional features
>> that we wanted to get into 0.8.0, but that didn't quite make it.)  Not too
>> long from now there will be a release of Spark 0.8.1, at which time the
>> SNAPSHOT will got to 0.8.2 and 0.8.1 will be forever frozen.  Meanwhile,
>> the wild new development is taking place on the master branch, and whatever
>> is currently in that branch becomes 0.9.0-SNAPSHOT.  This could be quite
>> different from day to day, and there are no guarantees that things won't be
>> broken in 0.9.0-SNAPSHOT.  Several months from now there will be a release
>> of Spark 0.9.0 (unless the decision is made to bump the version to 1.0.0),
>> at which point the SNAPSHOT goes to 0.9.1 and the whole process advances to
>> the next phase of development.
>>
>> The short answer is that releases are stable, SNAPSHOTs are not, and
>> SNAPSHOTs that aren't on maintenance branches can break things.  You make
>> your choice of which to use and pay the consequences.
>>
>>
>> On Thu, Oct 17, 2013 at 3:18 PM, dachuan  wrote:
>>
>>> yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got..
>>> what's the difference? I mean SNAPSHOT and non-SNAPSHOT.
>>>
>>>
>>> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra 
>>> wrote:
>>>
>>>> Of course, you mean 0.9.0-SNAPSHOT.  There is no Spark 0.9.0, and won't
>>>> be for several months.
>>>>
>>>>
>>>>
>>>> On Thu, Oct 17, 2013 at 3:11 PM, dachuan  wrote:
>>>>
>>>>> I'm sorry if this doesn't answer your question directly, but I have
>>>>> tried spark 0.9.0 and hdfs 1.0.4 just now, it works..
>>>>>
>>>>>
>>>>> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote:
>>>>>
>>>>>> after upgrading from spark 0.7 to spark 0.8 i can no longer access
>>>>>> any files on HDFS.
>>>>>>  i see the error below. any ideas?
>>>>>>
>>>>>> i am running spark standalone on a cluster that also has CDH4.3.0 and
>>>>>> rebuild spark accordingly. the jars in lib_managed look good to me.
>>>>>>
>>>>>> i noticed similar errors in the mailing list but found no suggested
>>>>>> solutions.
>>>>>>
>>>>>> thanks! koert
>>>>>>
>>>>>>
>>>>>> 13/10/17 17:43:23 ERROR Executor: Exception in task ID 0
>>>>>> java.io.EOFException
>>>>>>  at 
>>>>>> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2703)
>>>>>>  at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1008)
>>>>>>  at 
>>>>>> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
>>>>>>  at 
>>>>>> org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
>>>>>>  at org.apache.hadoop.io.UTF8.readChars(UTF8.java:258)
>>>>>>  at org.apache.hadoop.io.UTF8.readString(UTF8.java:250)
>>>>>>  at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
>>>>>>  at 
>>>>>> o

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
even the simplest job will produce these EOFException errors in the
workers. for example:

object SimpleJob extends App {
  System.setProperty("spark.serializer",
"org.apache.spark.serializer.KryoSerializer")
  val sc = new SparkContext("spark://master:7077", "Simple Job",
"/usr/local/lib/spark",
List("dist/myjar-0.1-SNAPSHOT.jar"))
  val songs = sc.textFile("hdfs://master:8020/user/koert/songs")
  println("songs count: " + songs.count)
}

did something change in how i am supposed to launch jobs?



On Thu, Oct 17, 2013 at 6:38 PM, Koert Kuipers  wrote:

> oh well i spoke too soon. spark-shell still works but any scala/java
> program still throws this error.
>
>
> On Thu, Oct 17, 2013 at 6:18 PM, dachuan  wrote:
>
>> yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got..
>> what's the difference? I mean SNAPSHOT and non-SNAPSHOT.
>>
>>
>> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra wrote:
>>
>>> Of course, you mean 0.9.0-SNAPSHOT.  There is no Spark 0.9.0, and won't
>>> be for several months.
>>>
>>>
>>>
>>> On Thu, Oct 17, 2013 at 3:11 PM, dachuan  wrote:
>>>
>>>> I'm sorry if this doesn't answer your question directly, but I have
>>>> tried spark 0.9.0 and hdfs 1.0.4 just now, it works..
>>>>
>>>>
>>>> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote:
>>>>
>>>>> after upgrading from spark 0.7 to spark 0.8 i can no longer access any
>>>>> files on HDFS.
>>>>>  i see the error below. any ideas?
>>>>>
>>>>> i am running spark standalone on a cluster that also has CDH4.3.0 and
>>>>> rebuild spark accordingly. the jars in lib_managed look good to me.
>>>>>
>>>>> i noticed similar errors in the mailing list but found no suggested
>>>>> solutions.
>>>>>
>>>>> thanks! koert
>>>>>
>>>>>
>>>>> 13/10/17 17:43:23 ERROR Executor: Exception in task ID 0
>>>>> java.io.EOFException
>>>>>   at 
>>>>> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2703)
>>>>>   at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1008)
>>>>>   at 
>>>>> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
>>>>>   at 
>>>>> org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
>>>>>   at org.apache.hadoop.io.UTF8.readChars(UTF8.java:258)
>>>>>   at org.apache.hadoop.io.UTF8.readString(UTF8.java:250)
>>>>>   at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
>>>>>   at 
>>>>> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
>>>>>   at 
>>>>> org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
>>>>>   at 
>>>>> org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>>   at 
>>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>>   at 
>>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>   at 
>>>>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
>>>>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
>>>>>   at 
>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>>>>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>>>>   at 
>>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1950)
>>>>>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1874)
>>>>>   at 
>>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>>>>>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>>>>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>>>>>   at 
>>>>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
>>>>>   at 
>>>>> java.io.ObjectInputStream.read

Re: spark 0.8

2013-10-17 Thread Koert Kuipers
oh well i spoke too soon. spark-shell still works but any scala/java
program still throws this error.


On Thu, Oct 17, 2013 at 6:18 PM, dachuan  wrote:

> yeah, I mean 0.9.0-SNAPSHOT. I use git clone and that's what I got..
> what's the difference? I mean SNAPSHOT and non-SNAPSHOT.
>
>
> On Thu, Oct 17, 2013 at 6:15 PM, Mark Hamstra wrote:
>
>> Of course, you mean 0.9.0-SNAPSHOT.  There is no Spark 0.9.0, and won't
>> be for several months.
>>
>>
>>
>> On Thu, Oct 17, 2013 at 3:11 PM, dachuan  wrote:
>>
>>> I'm sorry if this doesn't answer your question directly, but I have
>>> tried spark 0.9.0 and hdfs 1.0.4 just now, it works..
>>>
>>>
>>> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers wrote:
>>>
>>>> after upgrading from spark 0.7 to spark 0.8 i can no longer access any
>>>> files on HDFS.
>>>>  i see the error below. any ideas?
>>>>
>>>> i am running spark standalone on a cluster that also has CDH4.3.0 and
>>>> rebuild spark accordingly. the jars in lib_managed look good to me.
>>>>
>>>> i noticed similar errors in the mailing list but found no suggested
>>>> solutions.
>>>>
>>>> thanks! koert
>>>>
>>>>
>>>> 13/10/17 17:43:23 ERROR Executor: Exception in task ID 0
>>>> java.io.EOFException
>>>>at 
>>>> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2703)
>>>>at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1008)
>>>>at 
>>>> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
>>>>at 
>>>> org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
>>>>at org.apache.hadoop.io.UTF8.readChars(UTF8.java:258)
>>>>at org.apache.hadoop.io.UTF8.readString(UTF8.java:250)
>>>>at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
>>>>at 
>>>> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
>>>>at 
>>>> org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
>>>>at 
>>>> org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>>>>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>at 
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>at 
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>at java.lang.reflect.Method.invoke(Method.java:597)
>>>>at 
>>>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
>>>>at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
>>>>at 
>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>>>>at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>>>at 
>>>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1950)
>>>>at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1874)
>>>>at 
>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>>>>at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>>>at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>>>>at 
>>>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
>>>>at 
>>>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1795)
>>>>at 
>>>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1754)
>>>>at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>>>at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>>>>at 
>>>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
>>>>at 
>>>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
>>>>at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:153)
>>>>at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>>>at 
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>>>at java.lang.Thread.run(Thread.java:662)
>>>>
>>>>
>>>
>>>
>>> --
>>> Dachuan Huang
>>> Cellphone: 614-390-7234
>>> 2015 Neil Avenue
>>> Ohio State University
>>> Columbus, Ohio
>>> U.S.A.
>>> 43210
>>>
>>
>>
>
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>


Re: spark 0.8

2013-10-17 Thread Koert Kuipers
ah please disregard my question. i screwed up a single configuration
setting in spark! oh well that was 3 hours of fun


On Thu, Oct 17, 2013 at 6:11 PM, dachuan  wrote:

> I'm sorry if this doesn't answer your question directly, but I have tried
> spark 0.9.0 and hdfs 1.0.4 just now, it works..
>
>
> On Thu, Oct 17, 2013 at 6:05 PM, Koert Kuipers  wrote:
>
>> after upgrading from spark 0.7 to spark 0.8 i can no longer access any
>> files on HDFS.
>>  i see the error below. any ideas?
>>
>> i am running spark standalone on a cluster that also has CDH4.3.0 and
>> rebuild spark accordingly. the jars in lib_managed look good to me.
>>
>> i noticed similar errors in the mailing list but found no suggested
>> solutions.
>>
>> thanks! koert
>>
>>
>> 13/10/17 17:43:23 ERROR Executor: Exception in task ID 0
>> java.io.EOFException
>>  at 
>> java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2703)
>>  at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1008)
>>  at 
>> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
>>  at 
>> org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
>>  at org.apache.hadoop.io.UTF8.readChars(UTF8.java:258)
>>  at org.apache.hadoop.io.UTF8.readString(UTF8.java:250)
>>  at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
>>  at 
>> org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
>>  at 
>> org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
>>  at 
>> org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>  at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>  at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>  at java.lang.reflect.Method.invoke(Method.java:597)
>>  at 
>> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>  at 
>> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1950)
>>  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1874)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>>  at 
>> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
>>  at 
>> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1795)
>>  at 
>> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1754)
>>  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
>>  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
>>  at 
>> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
>>  at 
>> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
>>  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:153)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
>>  at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
>>  at java.lang.Thread.run(Thread.java:662)
>>
>>
>
>
> --
> Dachuan Huang
> Cellphone: 614-390-7234
> 2015 Neil Avenue
> Ohio State University
> Columbus, Ohio
> U.S.A.
> 43210
>


spark 0.8

2013-10-17 Thread Koert Kuipers
after upgrading from spark 0.7 to spark 0.8 i can no longer access any
files on HDFS.
i see the error below. any ideas?

i am running spark standalone on a cluster that also has CDH4.3.0 and
rebuild spark accordingly. the jars in lib_managed look good to me.

i noticed similar errors in the mailing list but found no suggested
solutions.

thanks! koert


13/10/17 17:43:23 ERROR Executor: Exception in task ID 0
java.io.EOFException
at 
java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2703)
at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1008)
at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:68)
at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:106)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:258)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:250)
at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
at 
org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
at 
org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
at 
org.apache.spark.SerializableWritable.readObject(SerializableWritable.scala:39)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:969)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1852)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1950)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1874)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1756)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at 
org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:135)
at 
java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1795)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1754)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1326)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:39)
at 
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:61)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:153)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)


HDFS user permissions

2013-09-19 Thread Koert Kuipers
hello all,
currently spark runs tasks as the user that runs the spark worker daemon.
at least, this is what i observer with standalone spark.

as a result spark does not respect user permissions on HDFS.

i guess this could be fixed by running the tasks as a different user, or
even just by using hadoop's proxy mechanism to access HDFS as that user.

are there plans to fix this? is there a JIRA tickets for this? i couldn't
find it but would like to track this issue.

and what about secure HDFS (kerberos)?

thanks! koert