Re: error loading large files in PySpark 0.9.0

2014-03-23 Thread Matei Zaharia
Hey Jeremy, what happens if you pass batchSize=10 as an argument to your 
SparkContext? It tries to serialize that many objects together at a time, which 
might be too much. By default the batchSize is 1024.
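For reference, a minimal sketch of that suggestion (the master URL and file 
path below are placeholders, and the exact SparkContext constructor arguments 
can vary across versions):

from pyspark import SparkContext

# batchSize controls how many Python objects are pickled together per batch;
# the suggestion here is to shrink it from the default of 1024 down to 10
sc = SparkContext("spark://master:7077", "batch-size-test", batchSize=10)
data = sc.textFile("path/to/myfile")
data.count()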

Matei

On Mar 23, 2014, at 10:11 AM, Jeremy Freeman  wrote:

> Hi all,
> 
> Hitting a mysterious error loading large text files, specific to PySpark
> 0.9.0.
> 
> In PySpark 0.8.1, this works:
> 
> data = sc.textFile("path/to/myfile")
> data.count()
> 
> But in 0.9.0, it stalls. There are indications of completion up to:
> 
> 14/03/17 16:54:24 INFO TaskSetManager: Finished TID 4 in 1699 ms on X.X.X.X
> (progress: 15/537)
> 14/03/17 16:54:24 INFO DAGScheduler: Completed ResultTask(5, 4)
> 
> And then this repeats indefinitely:
> 
> 14/03/17 16:54:24 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
> runningTasks: 144
> 14/03/17 16:54:25 DEBUG TaskSchedulerImpl: parentName: , name: TaskSet_5,
> runningTasks: 144
> 
> Always stalls at the same place. There's nothing in stderr on the workers,
> but in stdout there are several of these messages:
> 
> INFO PythonRDD: stdin writer to Python finished early
> 
> So perhaps the real error is being suppressed as in
> https://spark-project.atlassian.net/browse/SPARK-1025
> 
> Data is just rows of space-separated numbers, ~20GB, with 300k rows and 50k
> characters per row. Running on a private cluster with 10 nodes, 100GB / 16
> cores each, Python v 2.7.6.
> 
> I doubt the data is corrupted as it works fine in Scala in 0.8.1 and 0.9.0,
> and in PySpark in 0.8.1. Happy to post the file, but it should repro for
> anything with these dimensions. It *might* be specific to long strings: I
> don't see it with fewer characters (10k) per row, but I also don't see it
> with many fewer rows but the same number of characters per row.
> 
> Happy to try and provide more info / help debug!
> 
> -- Jeremy
> 
> 
> 



Re: error loading large files in PySpark 0.9.0

2014-03-24 Thread Jeremy Freeman
Thanks Matei, but unfortunately that doesn't seem to fix it. I tried batchSize 
= 10 and 100, as well as 1 (which should reproduce the 0.8.1 behavior?), and it 
stalls at the same point in each case.
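A sketch of those three attempts (the master URL and path are placeholders; 
each run uses a fresh context):

from pyspark import SparkContext

for bs in (10, 100, 1):
    # batchSize=1 was meant to approximate the unbatched 0.8.1 behavior
    sc = SparkContext("spark://master:7077", "repro-batchsize-%d" % bs,
                      batchSize=bs)
    sc.textFile("path/to/myfile").count()
    sc.stop()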

-- Jeremy

-
jeremy freeman, phd
neuroscientist
@thefreemanlab




Re: error loading large files in PySpark 0.9.0

2014-06-04 Thread Jeremy Freeman
Hey Matei,

Wanted to let you know this issue appears to be fixed in 1.0.0. Great work!

-- Jeremy





Re: error loading large files in PySpark 0.9.0

2014-06-04 Thread Matei Zaharia
Ah, good to know!

By the way, in master we now have saveAsPickleFile 
(https://github.com/apache/spark/pull/755), and Nick Pentreath has been working 
on Hadoop InputFormats: https://github.com/apache/spark/pull/455. It would be 
good to have your input on both of those if you have a chance to try them.
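For anyone following along, a rough sketch of how saveAsPickleFile and its 
counterpart might be used from PySpark (the paths are placeholders, and the 
exact signatures may differ from what ultimately ships):

# saveAsPickleFile / pickleFile from https://github.com/apache/spark/pull/755
rdd = sc.parallelize(range(1000))
rdd.saveAsPickleFile("hdfs:///tmp/pickled-rdd", batchSize=10)
restored = sc.pickleFile("hdfs:///tmp/pickled-rdd")
restored.count()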

Matei




Re: error loading large files in PySpark 0.9.0

2014-06-06 Thread Jeremy Freeman
Oh cool, thanks for the heads up! Especially for the Hadoop InputFormat
support. We recently wrote a custom Hadoop InputFormat so we can support
flat binary files
(https://github.com/freeman-lab/thunder/tree/master/scala/src/main/scala/thunder/util/io/hadoop),
and have been testing it in Scala. So I was following Nick's progress and
was eager to check this out when it's ready. Will let you guys know how it goes.

-- J





Re: error loading large files in PySpark 0.9.0

2014-06-07 Thread Nick Pentreath
Ah, looking at that InputFormat, it should just work out of the box using 
sc.newAPIHadoopFile ...


Would be interested to hear if it works as expected for you (in Python you'll 
end up with bytearray values).
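Once that lands, a hedged sketch of what the call might look like from PySpark; 
the InputFormat and key/value class names below are placeholders and would need 
to match what the thunder format actually declares:

records = sc.newAPIHadoopFile(
    "hdfs:///data/flat-binary-records",
    inputFormatClass="thunder.util.io.hadoop.FixedLengthBinaryInputFormat",
    keyClass="org.apache.hadoop.io.LongWritable",
    valueClass="org.apache.hadoop.io.BytesWritable")

# each value comes back to Python as a bytearray, as noted above
key, raw = records.first()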




N
