Efficient hierarchical aggregation in Spark

2013-12-24 Thread Deenar Toraskar
Hi, I have a requirement to aggregate a large data set in Spark across a multi-level (25 levels) hierarchy. The data model (simplified) is as follows: Measures (leafNode: Long, measureType: String, measureValue: Array[Float]); Hierarchy (expanded) - a typical
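A minimal sketch of one way such a roll-up can be expressed, assuming the expanded hierarchy is available as (leafNode, ancestorNode) pairs, one pair per level: join each measure to all of its ancestors, then reduce per (ancestor, measureType) by summing the Float arrays elementwise. The object, case class and field names below are inferred from the post, not taken from the original code.

```scala
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object HierarchyRollUp {
  case class Measure(leafNode: Long, measureType: String, measureValue: Array[Float])

  // hierarchy: one (leafNode, ancestorNode) row per level of the expanded 25-level tree
  def rollUp(measures: RDD[Measure],
             hierarchy: RDD[(Long, Long)]): RDD[((Long, String), Array[Float])] =
    measures
      .map(m => (m.leafNode, (m.measureType, m.measureValue)))
      .join(hierarchy)                  // (leaf, ((type, values), ancestor))
      .map { case (_, ((mType, values), ancestor)) => ((ancestor, mType), values) }
      .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })  // elementwise sum
}
```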

Spark application development work flow in scala

2013-12-24 Thread Aureliano Buendia
Hi, What's a typical workflow for Spark application development in Scala? One option is to write a Scala application with a main function, and keep executing the app after every development change. Given the big overhead of a moderately sized development data set, this could mean slow iterations.

Re: Spark application development work flow in scala

2013-12-24 Thread Mayur Rustagi
I typically use the main method and a test-driven approach; for most simple applications that works out pretty well. Another technique is to create a jar containing the complex functionality and test it. Create another jar just for streaming/processing that hooks into it and handles all the data
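A sketch of that structure, with illustrative names: the transformation logic lives in a plain function that a unit test can call with a local SparkContext and sc.parallelize(...), while main stays thin.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object WordCountJob {
  // Pure logic: easy to exercise from a test with sc.parallelize(Seq("a b", "b c"))
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "WordCountJob")
    countWords(sc.textFile(args(0))).saveAsTextFile(args(1))
    sc.stop()
  }
}
```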

How to set Akka frame size

2013-12-24 Thread leosand...@gmail.com
Hi everyone, I have a question about the arg spark.akka.frameSize; its default value is 10m. I execute JavaWordCount reading data from HDFS, where there is a 7G file. There is an OOM error caused by some task result exceeding the Akka frame size, but when I modify the arg to 1G, 2G, 10G, it shows me

RE: Unable to load additional JARs in yarn-client mode

2013-12-24 Thread Karavany, Ido
Hi, Thanks for your responses. We already tried the one-jar approach and it worked - but it is a real pain to compile ~15 projects every time we need to make a small change in one of them. Just to make sure I understand you correctly - below is what we've tried to pass in our test constructor:
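For reference, a hedged sketch of what passing extra jars through the SparkContext constructor can look like: the four-argument constructor takes master, app name, Spark home and a sequence of jar paths to ship to executors. The paths and app name below are placeholders.

```scala
import org.apache.spark.SparkContext

// Placeholder paths: the jars built from the separate projects
val extraJars = Seq("/path/to/common-utils.jar", "/path/to/processing.jar")

val sc = new SparkContext(
  "yarn-client",                      // YARN client mode, as in the thread
  "MyApp",
  System.getenv("SPARK_HOME"),
  extraJars)
```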

Re: debugging NotSerializableException while using Kryo

2013-12-24 Thread Ameet Kini
Hi Michael, I re-ran this on another machine which is on spark's master branch 0.9.0-SNAPSHOT from Dec 14 (right after the scala 2.10 branch was merged back into master) and recreated the NPE towards the end of this message. I can't tell looking at the relevant code what may have caused the

Re: debugging NotSerializableException while using Kryo

2013-12-24 Thread Ameet Kini
If Java serialization is the only one that properly works for closures, then I shouldn't be setting spark.closure.serializer to org.apache.spark.serializer.KryoSerializer, and my only hope for getting lookup (and other such methods that still use closure serializers) to work is to either a) use

Re: debugging NotSerializableException while using Kryo

2013-12-24 Thread Eugen Cepoi
In Scala, case classes are serializable by default, so your TileIdWritable should be a case class. I usually enable Kryo serialization for objects and keep the default serializer for closures; this works pretty well. Eugen 2013/12/24 Ameet Kini ameetk...@gmail.com If Java serialization is the only one that properly
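A minimal sketch of that setup, with illustrative class names: Kryo as the data serializer plus a registrator for the case class, while spark.closure.serializer is left at its Java default. The properties are set as system properties before the SparkContext is created (on builds with SparkConf, the same keys go there).

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

case class TileIdWritable(id: Long)   // case classes are Serializable by default

class MyKryoRegistrator extends KryoRegistrator {
  def registerClasses(kryo: Kryo) {
    kryo.register(classOf[TileIdWritable])
  }
}

// Kryo for data only; do NOT override spark.closure.serializer
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
System.setProperty("spark.kryo.registrator", "MyKryoRegistrator")
```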

Re: Noob Spark questions

2013-12-24 Thread Jie Deng
@Mark Hamstra: Thanks, good to know. @Ognen Duzlevski: 2013/12/24 Ognen Duzlevski og...@nengoiksvelzud.com Hello, On Mon, Dec 23, 2013 at 3:23 PM, Jie Deng deng113...@gmail.com wrote: I am using Java, and Spark has APIs for Java as well. Though there is a saying that Java in Spark is

Re: mapWith and array index as key

2013-12-24 Thread Mark Hamstra
No, the index referred to in mapWith (as well as in mapPartitionsWithIndex and several other RDD methods) is the index of the RDD's partitions. So, for example, in a typical case of an RDD read in from a distributed filesystem where the input file occupies n blocks, the index values in mapWith
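A small illustration of that, assuming an existing SparkContext sc (mapPartitionsWithIndex shown here; mapWith exposes the same partition index):

```scala
val rdd = sc.parallelize(1 to 10, 3)          // 3 partitions => indices 0, 1, 2
val tagged = rdd.mapPartitionsWithIndex { (partitionIndex, iter) =>
  iter.map(x => (partitionIndex, x))          // tag each element with its partition number
}
// tagged.collect() gives pairs like (0, 1), (0, 2), ..., (2, 10)
```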

Re: reading LZO compressed file in spark

2013-12-24 Thread Berkeley Malagon
Andrew, This is great. Excuse my ignorance, but what do you mean by RF=3? Also, after reading the LZO files, are you able to access the contents directly, or do you have to decompress them after reading them? Sent from my iPhone On Dec 24, 2013, at 12:03 AM, Andrew Ash and...@andrewash.com
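(RF=3 presumably refers to an HDFS replication factor of 3.) For the second question, a hedged sketch of reading LZO text, assuming the hadoop-lzo library and its com.hadoop.mapreduce.LzoTextInputFormat are on the classpath and an existing SparkContext sc; the input format decompresses the splits, so the resulting RDD already holds plain text lines.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import com.hadoop.mapreduce.LzoTextInputFormat

val lines = sc.newAPIHadoopFile(
    "hdfs:///data/events.lzo",                // placeholder path
    classOf[LzoTextInputFormat],
    classOf[LongWritable],
    classOf[Text])
  .map { case (_, text) => text.toString }    // already-decompressed text lines
```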

Re: Spark application development work flow in scala

2013-12-24 Thread Hossein
…overhead of a moderately sized development data set, this could mean slow iterations. Another option is to somehow initialize the data in the REPL, and keep the development inside the REPL. This would mean faster development iterations, however, it's not clear to me how to keep the code in sync with the REPL. Do
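One low-tech way to keep source and the REPL in sync (a sketch with illustrative names and paths): keep the logic in a .scala file, load the expensive development data once in the shell, and re-apply edits with the standard Scala REPL :load command.

```scala
// Analysis.scala -- re-loaded into a running spark-shell after each edit:
//   scala> val data = sc.textFile("hdfs:///data/sample.txt").cache()
//   scala> :load /path/to/Analysis.scala
//   scala> Analysis.wordCounts(data).take(10).foreach(println)
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object Analysis {
  // Takes an already-loaded (cached) RDD so the expensive read survives re-loads
  def wordCounts(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
}
```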

Re: How to set Akka frame size

2013-12-24 Thread Aaron Davidson
The error you're receiving is because the Akka frame size must be a positive Java Integer, i.e., less than 2^31. However, the frame size is not intended to be anywhere near the size of the job's memory -- it is the smallest unit of data transfer that Spark does. In this case, your task result size is
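For reference, a minimal sketch of raising the limit, assuming the value is interpreted in megabytes (consistent with the 10m default mentioned above) and is set before the SparkContext is created:

```scala
// spark.akka.frameSize is given in MB; a few hundred MB is usually plenty
System.setProperty("spark.akka.frameSize", "100")
val sc = new org.apache.spark.SparkContext("local[2]", "JavaWordCount")
```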

Re: IMPORTANT: Spark mailing lists moving to Apache by September 1st

2013-12-24 Thread Patrick Wendell
Hey Andy - these Nabble groups look great! Thanks for setting them up. On Tue, Dec 24, 2013 at 10:49 AM, Evan Chan e...@ooyala.com wrote: Thanks Andy, at first glance Nabble seems great, it allows search plus posting new topics, so it appears to be bidirectional. Now just have to register

Re: multi-line elements

2013-12-24 Thread Philip Ogren
Thank you for pointing me in the right direction! On 12/24/2013 2:39 PM, suman bharadwaj wrote: Just one correction, I think NLineInputFormat won't fit your use case. I think you may have to write a custom record reader, use TextInputFormat, and plug it into Spark as shown above. Regards, Suman

Re: debugging NotSerializableException while using Kryo

2013-12-24 Thread Dmitriy Lyubimov
On Tue, Dec 24, 2013 at 7:29 AM, Ameet Kini ameetk...@gmail.com wrote: If Java serialization is the only one that properly works for closures, then I shouldn't be setting spark.closure.serializer to org.apache.spark.serializer.KryoSerializer, My understanding is that it's not that Kryo

Re: multi-line elements

2013-12-24 Thread Christopher Nguyen
Philip, if there are easily detectable line groups you might define your own InputFormat. Alternatively you can consider using mapPartitions() to get access to the entire data partition instead of row-at-a-time. You'd still have to worry about what happens at the partition boundaries. A third
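A hedged sketch of the mapPartitions() alternative, assuming an existing SparkContext sc and records delimited by lines starting with a known marker ("BEGIN" here is a placeholder); as noted above, records that straddle a partition boundary are not handled.

```scala
val records = sc.textFile("hdfs:///data/multiline.txt").mapPartitions { lines =>
  val buffered = lines.buffered
  new Iterator[List[String]] {
    def hasNext = buffered.hasNext
    def next() = {
      // Collect lines until the next record marker (or the end of the partition)
      val record = scala.collection.mutable.ListBuffer(buffered.next())
      while (buffered.hasNext && !buffered.head.startsWith("BEGIN"))
        record += buffered.next()
      record.toList
    }
  }
}
```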

Re: multi-line elements

2013-12-24 Thread Azuryy Yu
Hi Philip, you can specify org.apache.hadoop.streaming.StreamInputFormat, which should fit your case. You just specify stream.recordreader.begin and stream.recordreader.end, and then this reader can read the block of records between BEGIN and END each time. On Wed, Dec 25, 2013 at 11:11 AM, Christopher Nguyen
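A hedged sketch of that suggestion using the old ("mapred") Hadoop API, assuming the hadoop-streaming jar is on the classpath and an existing SparkContext sc; in Hadoop streaming the begin/end markers are consumed by StreamXmlRecordReader, so that reader class is configured as well, and the marker strings and path below are placeholders.

```scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.streaming.StreamInputFormat

val conf = new JobConf()
conf.set("stream.recordreader.class", "org.apache.hadoop.streaming.StreamXmlRecordReader")
conf.set("stream.recordreader.begin", "<record>")   // placeholder start marker
conf.set("stream.recordreader.end", "</record>")    // placeholder end marker
FileInputFormat.addInputPaths(conf, "hdfs:///data/multiline.xml")

// Each element is one BEGIN..END block; the value half of the pair is empty
val records = sc.hadoopRDD(conf, classOf[StreamInputFormat], classOf[Text], classOf[Text])
  .map { case (record, _) => record.toString }
```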