Hi
I have a requirement to aggregate a large data set in Spark across a multi-level
(25 levels) hierarchy. The data model (simplified) is as follows:
Measures
  leafNode      Long
  measureType   String
  measureValue  Array[Float]
Hierarchy (expanded) - a typical
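A rough sketch (not from the original post) of how that simplified model might look as Scala case classes; the field names follow the post, while the hierarchy shape is an assumption:

case class Measure(leafNode: Long, measureType: String, measureValue: Array[Float])

// one possible shape for the expanded hierarchy: each leaf keyed to its
// ancestor at every one of the 25 levels (levels(0) = root, levels(24) = parent)
case class HierarchyRow(leafNode: Long, levels: Array[Long])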
Hi,
What's a typical workflow for Spark application development in Scala?
One option is to write a Scala application with a main function, and keep
re-executing the app after every development change. Given the big overhead of
even a moderately sized development data set, this could mean slow iterations.
I typically use the main method and a test-driven approach; for most simple
applications that works out pretty well. Another technique is to create a
jar containing the complex functionality and test it, then create another jar
just for streaming/processing that hooks into it and handles all the data
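A minimal sketch of that workflow, with hypothetical names: the interesting logic lives in a plain function that both a unit test (run against a local SparkContext with tiny data) and the production main method can call.

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

object WordStats {
  // the testable core: no cluster or real data needed to exercise it
  def countWords(lines: RDD[String]): RDD[(String, Int)] =
    lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)

  // thin wrapper that wires the core to real data
  def main(args: Array[String]) {
    val sc = new SparkContext(args(0), "WordStats")
    countWords(sc.textFile(args(1))).saveAsTextFile(args(2))
    sc.stop()
  }
}

// in a test: val sc = new SparkContext("local", "test")
//            assert(WordStats.countWords(sc.parallelize(Seq("a a b"))).collectAsMap()("a") == 2)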
Hi everyone,
I have a question about the setting spark.akka.frameSize; its default value is 10 MB.
I ran JavaWordCount reading data from HDFS, where there is a 7 GB file.
There is an OOM error caused by
"some task result exceeded Akka frame size",
but when I change the setting to 1G, 2G, or 10G, it shows me
Hi,
Thanks for your responses.
We already tried the one-jar approach and it worked - but it is a real pain to
compile ~15 projects every time we need to make a small change in one of them.
Just to make sure I understand you correctly - below is what we've tried to
pass in our test constructor:
Hi Michael,
I re-ran this on another machine which is on Spark's master branch
0.9.0-SNAPSHOT from Dec 14 (right after the Scala 2.10 branch was merged
back into master) and recreated the NPE towards the end of this message. I
can't tell from looking at the relevant code what may have caused the
If Java serialization is the only one that properly works for closures,
then I shouldn't be setting spark.closure.serializer to
org.apache.spark.serializer.KryoSerializer, and my only hope for getting
lookup (and other such methods that still use closure serializers) to work
is to either a) use
In Scala, case classes are serializable by default, so your TileIdWritable
should be a case class. I usually enable Kryo serialization for objects and keep
the default serializer for closures; this works pretty well.
Eugen
2013/12/24 Ameet Kini ameetk...@gmail.com
If Java serialization is the only one that properly
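For reference, a rough sketch of the setup Eugen describes, using the pre-SparkConf system properties of that era (master URL and values are illustrative):

// Kryo for data/RDD serialization; spark.closure.serializer is deliberately
// left at its Java default, as discussed above
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// case classes are serializable out of the box, so instances ship around cleanly
case class TileIdWritable(id: Long)

val sc = new org.apache.spark.SparkContext("local[2]", "kryo-example")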
@Mark Hamstra:
Thanks, good to know.
@Ognen Duzlevski:
2013/12/24 Ognen Duzlevski og...@nengoiksvelzud.com
Hello,
On Mon, Dec 23, 2013 at 3:23 PM, Jie Deng deng113...@gmail.com wrote:
I am using Java, and Spark has APIs for Java as well. Though there is a
saying that Java in Spark is
No, the index referred to in mapWith (as well as in mapPartitionsWithIndex
and several other RDD methods) is the index of the RDD's partitions. So,
for example, in a typical case of an RDD read in from a distributed
filesystem where the input file occupies n blocks, the index values in
mapWith
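A small illustration of that point (hypothetical data): the index passed in is the partition's index, not a per-record index.

val rdd = sc.parallelize(1 to 10, 3)              // 3 partitions -> indices 0, 1, 2
val tagged = rdd.mapPartitionsWithIndex { (partIdx, iter) =>
  iter.map(x => (partIdx, x))                     // every element in a partition gets the same index
}
// tagged.collect() => (0,1), (0,2), (0,3), (1,4), (1,5), (1,6), (2,7), ...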
Andrew, this is great.
Excuse my ignorance, but what do you mean by RF=3? Also, after reading the LZO
files, are you able to access the contents directly, or do you have to
decompress them after reading them?
Sent from my iPhone
On Dec 24, 2013, at 12:03 AM, Andrew Ash and...@andrewash.com
Another option is to somehow initialize the data in the REPL, and keep the
development inside the REPL. This would mean faster development iterations;
however, it's not clear to me how to keep the code in sync with the REPL. Do
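One possible way to do that (paths and names are hypothetical): compile the slow-moving code into a jar, put it on the shell's classpath, and use the REPL's :load to re-evaluate the file being actively edited while the cached data stays resident.

// start the shell with the compiled jar available, e.g.
//   ADD_JARS=target/myapp.jar MASTER=local[4] ./spark-shell
val data = sc.textFile("hdfs:///dev/sample").cache()   // initialize the dev data once
:load src/main/scala/Analysis.scala                    // re-evaluate after each edit to stay in sync
Analysis.run(data).take(10).foreach(println)           // Analysis.run is a hypothetical entry point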
The error you're receiving is because the Akka frame size must be
a positive Java Integer, i.e., less than 2^31. However, the frame size is
not intended to be nearly the size of the job memory -- it is the smallest
unit of data transfer that Spark does. In this case, your task result
size is
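For completeness, a hedged sketch of how the setting is usually bumped in that era (the value is illustrative). spark.akka.frameSize is given in MB, and it must stay small enough that the byte count still fits a positive Java int, so 1G/2G/10G are not valid values.

// set before the SparkContext is created; 100 means 100 MB per Akka message
System.setProperty("spark.akka.frameSize", "100")
val sc = new org.apache.spark.SparkContext("spark://master:7077", "JavaWordCount")
// if a single task result is still larger than this, the better fix is usually
// to avoid pulling that much data back to the driver at once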
Hey Andy - these Nabble groups look great! Thanks for setting them up.
On Tue, Dec 24, 2013 at 10:49 AM, Evan Chan e...@ooyala.com wrote:
Thanks Andy, at first glance Nabble seems great: it allows search plus
posting new topics, so it appears to be bidirectional. Now just have to
register
Thank you for pointing me in the right direction!
On 12/24/2013 2:39 PM, suman bharadwaj wrote:
Just one correction: I think NLineInputFormat won't fit your use case.
I think you may have to write a custom record reader, use
TextInputFormat, and plug it into Spark as shown above.
Regards,
Suman
On Tue, Dec 24, 2013 at 7:29 AM, Ameet Kini ameetk...@gmail.com wrote:
If Java serialization is the only one that properly works for closures,
then I shouldn't be setting spark.closure.serializer to
org.apache.spark.serializer.KryoSerializer,
My understanding is that it's not that Kryo
Phillip, if there are easily detectable line groups you might define your
own InputFormat. Alternatively, you can consider using mapPartitions() to
get access to the entire data partition instead of a row at a time. You'd
still have to worry about what happens at the partition boundaries. A third
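A rough sketch of the mapPartitions() suggestion, with a hypothetical "BEGIN" marker for where a group starts; as noted above, groups that straddle a partition boundary are simply not handled here.

val lines = sc.textFile("hdfs:///path/to/records")
val groups = lines.mapPartitions { iter =>
  val out = scala.collection.mutable.ArrayBuffer.empty[List[String]]
  var current = List.empty[String]
  for (line <- iter) {
    if (line.startsWith("BEGIN") && current.nonEmpty) {  // a new group begins
      out += current.reverse
      current = Nil
    }
    current = line :: current
  }
  if (current.nonEmpty) out += current.reverse           // flush the trailing group
  out.iterator
}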
Hi Philip,
You can specify org.apache.hadoop.streaming.StreamInputFormat, which should fit
your case. You just specify stream.recordreader.begin
and stream.recordreader.end, and then this reader can read the block of records
between BEGIN and END each time.
On Wed, Dec 25, 2013 at 11:11 AM, Christopher Nguyen
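A hedged sketch of that suggestion (paths and the BEGIN/END markers are placeholders). It assumes the hadoop-streaming jar is on the classpath and, as an assumption beyond what was said above, that stream.recordreader.class is pointed at StreamXmlRecordReader, the reader that honours the begin/end markers.

import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("stream.recordreader.class",
  "org.apache.hadoop.streaming.StreamXmlRecordReader")
jobConf.set("stream.recordreader.begin", "BEGIN")   // marker that opens a record
jobConf.set("stream.recordreader.end", "END")       // marker that closes a record
FileInputFormat.addInputPaths(jobConf, "hdfs:///path/to/input")

val blocks = sc.hadoopRDD(jobConf,
  classOf[org.apache.hadoop.streaming.StreamInputFormat],
  classOf[Text], classOf[Text])
// each key holds one full BEGIN..END block as text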