Re: error loading large files in PySpark 0.9.0

2014-06-07 Thread Nick Pentreath
Ah, looking at that InputFormat, it should just work out of the box using sc.newAPIHadoopFile ... Would be interested to hear if it works as expected for you (in Python you'll end up with bytearray values). N — Sent from Mailbox
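In PySpark that call might look roughly like the sketch below. The fully qualified class names for the thunder format and the Hadoop writables are assumptions for illustration, as is running on a Spark version whose PySpark exposes newAPIHadoopFile:

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "binary-load")

    # Read flat binary records through the custom InputFormat; each
    # value reaches Python as a bytearray, as noted above.
    rdd = sc.newAPIHadoopFile(
        "path/to/binary/files",
        "thunder.util.io.hadoop.FixedLengthBinaryInputFormat",
        keyClass="org.apache.hadoop.io.LongWritable",
        valueClass="org.apache.hadoop.io.BytesWritable")

    key, value = rdd.first()  # value is a bytearray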

Re: error loading large files in PySpark 0.9.0

2014-06-06 Thread Jeremy Freeman
Oh cool, thanks for the heads-up! Especially for the Hadoop InputFormat support. We recently wrote a custom Hadoop InputFormat so we can support flat binary files (https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop), and have been testing it in Scala.
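Once such a format hands a fixed-length binary record back to Python, decoding it is a one-liner. A minimal sketch, assuming float64 records (the dtype and record length here are illustrative, not taken from the thread):

    import numpy as np

    # A fixed-length binary record arrives in Python as a bytearray;
    # reinterpret its bytes as a numeric vector.
    record = bytearray(np.arange(4, dtype=np.float64).tobytes())
    vector = np.frombuffer(bytes(record), dtype=np.float64)
    # vector -> array([0., 1., 2., 3.])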

Re: error loading large files in PySpark 0.9.0

2014-06-04 Thread Jeremy Freeman
Hey Matei, wanted to let you know this issue appears to be fixed in 1.0.0. Great work! -- Jeremy

Re: error loading large files in PySpark 0.9.0

2014-03-24 Thread Jeremy Freeman
Thanks Matei, unfortunately that doesn't seem to fix it. I tried batchSize = 10, 100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls at the same point in each case. -- Jeremy - jeremy freeman, phd neuroscientist @thefreemanlab

error loading large files in PySpark 0.9.0

2014-03-23 Thread Jeremy Freeman
Hi all, Hitting a mysterious error loading large text files, specific to PySpark 0.9.0. In PySpark 0.8.1, this works: data = sc.textFile("path/to/myfile") data.count() But in 0.9.0, it stalls. There are indications of completion up to: 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in
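Written out as a runnable sketch (the file path is a placeholder, and "large" means a file big enough to trigger the stall):

    from pyspark import SparkContext

    sc = SparkContext("local[4]", "repro")

    # Completes in PySpark 0.8.1; stalls in 0.9.0 on large inputs.
    data = sc.textFile("path/to/myfile")
    print(data.count())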

Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your SparkContext? It tries to serialize that many objects together at a time, which might be too much. By default the batchSize is 1024. Matei
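A sketch of that suggestion (master and app name are placeholders; batchSize is a keyword argument on the SparkContext constructor in the 0.9.x PySpark API):

    from pyspark import SparkContext

    # Serialize 10 Python objects per batch instead of the 0.9.0
    # default of 1024; batchSize=1 disables batching entirely.
    sc = SparkContext("local[4]", "large-files", batchSize=10)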