Ah, looking at that InputFormat, it should just work out of the box using
sc.newAPIHadoopFile ...
I'd be interested to hear whether it works as expected for you (in Python
you'll end up with bytearray values).
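For reference, here's a rough sketch of what that call could look like from
PySpark. The InputFormat class name and the key/value classes below are my
assumptions for a fixed-length binary format (LongWritable keys, BytesWritable
values), so substitute the real ones from your repo:

    from pyspark import SparkContext

    sc = SparkContext("local", "binary-load-test")

    # Class names are assumed placeholders, not necessarily the exact
    # ones in the thunder repo; values arrive in Python as bytearray.
    rdd = sc.newAPIHadoopFile(
        "hdfs:///path/to/records.bin",
        "thunder.util.io.hadoop.FixedLengthBinaryInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.BytesWritable")

    key, value = rdd.first()  # value is a bytearray

From there you can decode each bytearray however your record layout requires,
e.g. numpy.frombuffer for flat float vectors.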
N
On Fri, Jun 6, 2014 at 9:38 PM, Jeremy Freeman wrote:
Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat
support. We recently wrote a custom Hadoop InputFormat so we can support
flat binary files
(https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop),
and have been testing it in Scala.
Hey Matei,
Wanted to let you know this issue appears to be fixed in 1.0.0. Great work!
-- Jeremy
Thanks Matei, unfortunately that doesn't seem to fix it. I tried batchSize = 10,
100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls
at the same point in each case.
-- Jeremy
-
jeremy freeman, phd
neuroscientist
@thefreemanlab
On Mar 23, 2014, at 9:56
Hi all,
Hitting a mysterious error loading large text files, specific to PySpark
0.9.0.
In PySpark 0.8.1, this works:
data = sc.textFile("path/to/myfile")
data.count()
But in 0.9.0, it stalls. The logs show tasks completing up to:
14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your
SparkContext? PySpark serializes that many objects together at a time, which
might be too much. By default the batchSize is 1024.
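For example, a minimal sketch (master and app name are placeholders):

    from pyspark import SparkContext

    # batchSize controls how many Python objects get pickled together.
    # The 0.9.0 default is 1024; batchSize=1 serializes one object at a
    # time, which should match the old 0.8.1 behavior.
    sc = SparkContext("local", "large-file-test", batchSize=10)

    data = sc.textFile("path/to/myfile")
    print(data.count())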
Matei
On Mar 23, 2014, at 10:11 AM, Jeremy Freeman freeman.jer...@gmail.com wrote:
Hi