Hey Jeremy, what happens if you pass batchSize=10 as an argument to your 
SparkContext? PySpark serializes that many objects together at a time, which 
might be too much for rows this long. By default, batchSize is 1024.
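
For example (a minimal sketch; the master URL and file path are placeholders):

from pyspark import SparkContext

# batchSize controls how many Python objects get pickled together per
# batch; smaller batches mean smaller serialized chunks in worker memory
sc = SparkContext("local[4]", "batch-test", batchSize=10)
data = sc.textFile("path/to/myfile")
data.count()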

Matei

On Mar 23, 2014, at 10:11 AM, Jeremy Freeman <freeman.jer...@gmail.com> wrote:

> Hi all,
> 
> Hitting a mysterious error loading large text files, specific to PySpark
> 0.9.0.
> 
> In PySpark 0.8.1, this works:
> 
> data = sc.textFile("path/to/myfile")
> data.count()
> 
> But in 0.9.0, it stalls. The logs show tasks completing up to:
> 
> 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
> (progress: 15/537)
> 14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)
> 
> And then this repeats indefinitely:
> 
> 14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
> runningTasks: 144
> 14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
> runningTasks: 144
> 
> It always stalls at the same place. There's nothing in stderr on the workers,
> but in stdout there are several of these messages:
> 
> INFO PythonRDD: stdin writer to Python finished early
> 
> So perhaps the real error is being suppressed, as in
> https://spark-project.atlassian.net/browse/SPARK-1025
> 
> The data is just rows of space-separated numbers: ~20GB total, 300k rows,
> ~50k characters per row. I'm running on a private cluster with 10 nodes
> (100GB RAM / 16 cores each), Python 2.7.6.
> 
> I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
> and in PySpark in 0.8.1. Happy to post the file, but it should repro for
> anything with these dimensions. It *might* be specific to long strings: I
> don't see it with fewer characters (10k) per row, but I also don't see it
> with many fewer rows but the same number of characters per row.
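> 
> For reference, a file with roughly these dimensions could be generated with
> something like this (hypothetical sketch; the path and the exact count of
> numbers per row are illustrative):
> 
> import random
> 
> # ~300k rows of space-separated numbers, ~50k characters per row
> # (3800 numbers at 12 characters each); roughly 15GB on disk
> with open("path/to/myfile", "w") as f:
>     for _ in xrange(300000):
>         f.write(" ".join("%.10f" % random.random()
>                          for _ in xrange(3800)) + "\n")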
> 
> Happy to try and provide more info / help debug!
> 
> -- Jeremy
