Thanks Matei, unfortunately that doesn't seem to fix it. I tried batchSize = 10, 
100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it stalls 
at the same point in each case.

-- Jeremy

---------------------
jeremy freeman, phd
neuroscientist
@thefreemanlab

On Mar 23, 2014, at 9:56 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> Hey Jeremy, what happens if you pass batchSize=10 as an argument to your 
> SparkContext? PySpark serializes that many objects together in each batch, 
> which might be too much. By default the batchSize is 1024.
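> 
> For example, if you're constructing the SparkContext yourself, something like
> this should do it (the master URL and app name below are just placeholders):
> 
> from pyspark import SparkContext
> # batchSize controls how many Python objects are pickled together per batch
> sc = SparkContext("spark://master:7077", "MyApp", batchSize=10)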
> 
> Matei
> 
> On Mar 23, 2014, at 10:11 AM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> Hitting a mysterious error loading large text files, specific to PySpark
>> 0.9.0.
>> 
>> In PySpark 0.8.1, this works:
>> 
>> data = sc.textFile("path/to/myfile")
>> data.count()
>> 
>> But in 0.9.0, it stalls. The logs show tasks completing normally up to:
>> 
>> 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
>> (progress: 15/537)
>> 14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)
>> 
>> And then this repeats indefinitely:
>> 
>> 14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
>> runningTasks: 144
>> 14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
>> runningTasks: 144
>> 
>> Always stalls at the same place. There's nothing in stderr on the workers,
>> but in stdout there are several of these messages:
>> 
>> INFO PythonRDD: stdin writer to Python finished early
>> 
>> So perhaps the real error is being suppressed as in
>> https://spark-project.atlassian.net/browse/SPARK-1025
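>> 
>> (If it would help to surface whatever is being suppressed, I believe the
>> worker log level can be raised by copying conf/log4j.properties.template to
>> conf/log4j.properties and setting log4j.rootCategory=DEBUG, console.)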
>> 
>> The data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k
>> characters per row. I'm running on a private cluster with 10 nodes, 100 GB of
>> memory / 16 cores each, Python 2.7.6.
>> 
>> I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
>> and in PySpark in 0.8.1. Happy to post the file, but it should repro for
>> anything with these dimensions. It *might* be specific to long strings: I
>> don't see it with fewer characters (10k) per row, but I also don't see it
>> with many fewer rows but the same number of characters per row.
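>> 
>> In case it's easier than posting the actual file, something like this should
>> generate data of roughly the right shape (the path and number format are
>> arbitrary, just chosen to give ~50k characters per row over 300k rows):
>> 
>> import random
>> # ~7000 numbers at ~7 characters each is roughly 50k characters per row
>> with open("/path/to/repro.txt", "w") as f:
>>     for _ in range(300000):
>>         f.write(" ".join("%0.4f" % random.random() for _ in range(7000)) + "\n")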
>> 
>> Happy to try and provide more info / help debug!
>> 
>> -- Jeremy
>> 
> 
