Re: pyspark is crashing in this case. why?

Sameer Farooqui Mon, 15 Dec 2014 13:36:07 -0800

Adding group back.


FYI Genesis - this was on an m3.xlarge with all default settings in Spark. I
used Spark version 1.3.0.

The 2nd case did work for me:

>>> a = [1,2,3,4,5,6,7,8,9]
>>> b = []
>>> for x in range(1000000):
...   b.append(a)
...
>>> rdd1 = sc.parallelize(b)
>>> rdd1.first()
14/12/15 16:33:01 WARN TaskSetManager: Stage 1 contains a task of very
large size (9766 KB). The maximum recommended task size is 100 KB.
[1, 2, 3, 4, 5, 6, 7, 8, 9]
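
A rough way to see how the payload grows between the 100K and 1M cases is to
measure the pickled size of each list in plain Python. This is only a sketch:
pickled_size_kb is a hypothetical helper (not a Spark API), and the standard
pickle module only approximates whatever serialization PySpark applies, so the
numbers should not be expected to match the 9766 KB in the warning above.

```python
import pickle


def pickled_size_kb(obj, protocol=2):
    """Return the size of obj after pickling, in KB (1 KB = 1024 bytes).

    Hypothetical helper for illustration only; not part of PySpark.
    """
    return len(pickle.dumps(obj, protocol)) / 1024.0


row = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# Use independent copies of the row. With b.append(a), every element is the
# same object, and pickle would store repeated references as short memo
# lookups, which understates the size of genuinely distinct rows.
small = [list(row) for _ in range(100000)]
large = [list(row) for _ in range(1000000)]

small_kb = pickled_size_kb(small)
large_kb = pickled_size_kb(large)

print("100K rows: %.0f KB" % small_kb)
print("1M rows:   %.0f KB" % large_kb)
```

The point is only the relative growth: the 1M list serializes to roughly ten
times the data of the 100K list, and all of it starts out on the driver.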


On Mon, Dec 15, 2014 at 1:33 PM, Sameer Farooqui <same...@databricks.com>
wrote:
>
> Hi Genesis,
>
>
> The 2nd case did work for me:
>
> >>> a = [1,2,3,4,5,6,7,8,9]
> >>> b = []
> >>> for x in range(1000000):
> ...   b.append(a)
> ...
> >>> rdd1 = sc.parallelize(b)
> >>> rdd1.first()
> 14/12/15 16:33:01 WARN TaskSetManager: Stage 1 contains a task of very
> large size (9766 KB). The maximum recommended task size is 100 KB.
> [1, 2, 3, 4, 5, 6, 7, 8, 9]
>
>
>
>
> On Sun, Dec 14, 2014 at 2:13 PM, Genesis Fatum <genesis.fa...@gmail.com>
> wrote:
>>
>> Hi Sameer,
>>
>> I have tried multiple configurations, for example executor and driver
>> memory at 2G. I also played with the JRE memory size parameters (-Xms) and
>> got the same error.
>>
>> Does it work for you? I think it is a setup issue on my side, although I
>> have tried a couple of laptops.
>>
>> Thanks
>>
>> On Sun, Dec 14, 2014 at 1:11 PM, Sameer Farooqui <same...@databricks.com>
>> wrote:
>>>
>>> How much executor memory are you setting for the JVM? What about the
>>> driver JVM memory?
>>>
>>> Also check the Windows Event Log for out-of-memory errors from either of
>>> the two JVMs above.
>>> On Dec 14, 2014 6:04 AM, "genesis fatum" <genesis.fa...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> My environment is: standalone Spark 1.1.1 on Windows 8.1 Pro.
>>>>
>>>> The following case works fine:
>>>> >>> a = [1,2,3,4,5,6,7,8,9]
>>>> >>> b = []
>>>> >>> for x in range(100000):
>>>> ...  b.append(a)
>>>> ...
>>>> >>> rdd1 = sc.parallelize(b)
>>>> >>> rdd1.first()
>>>> [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>>>
>>>> The following case does not work. The only difference is the size of the
>>>> array. Note the loop range: 100K vs. 1M.
>>>> >>> a = [1,2,3,4,5,6,7,8,9]
>>>> >>> b = []
>>>> >>> for x in range(1000000):
>>>> ...  b.append(a)
>>>> ...
>>>> >>> rdd1 = sc.parallelize(b)
>>>> >>> rdd1.first()
>>>> >>>
>>>> 14/12/14 07:52:19 ERROR PythonRDD: Python worker exited unexpectedly
>>>> (crashed)
>>>> java.net.SocketException: Connection reset by peer: socket write error
>>>>         at java.net.SocketOutputStream.socketWrite0(Native Method)
>>>>         at java.net.SocketOutputStream.socketWrite(Unknown Source)
>>>>         at java.net.SocketOutputStream.write(Unknown Source)
>>>>         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
>>>>         at java.io.BufferedOutputStream.write(Unknown Source)
>>>>         at java.io.DataOutputStream.write(Unknown Source)
>>>>         at java.io.FilterOutputStream.write(Unknown Source)
>>>>         at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:341)
>>>>         at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:339)
>>>>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>>>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>>>>         at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:339)
>>>>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
>>>>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>>>>         at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
>>>>         at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1364)
>>>>         at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)
>>>>
>>>> What I have tried:
>>>> 1. Replaced the 32-bit JRE with the 64-bit JRE
>>>> 2. Multiple configurations when I start pyspark: --driver-memory,
>>>> --executor-memory
>>>> 3. Tried to set the SparkConf with different settings
>>>> 4. Also tried Spark 1.1.0
>>>>
>>>> Being new to Spark, I am sure that it is something simple that I am
>>>> missing, and I would appreciate any thoughts.
>>>>
