[jira] [Commented] (SPARK-2580) broken pipe collecting schemardd results

2014-07-28  Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077193#comment-14077193 ]

Apache Spark commented on SPARK-2580:
-------------------------------------

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1625
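
The linked pull request targets the worker-side failure shown in the traceback quoted below. Purely as an illustration of the general technique, and not taken from that PR, the sketch below shows one way a length-prefixed write loop can treat an early hang-up by the reader (errno 32, EPIPE) as a clean stop rather than a crash; write_with_length and dump_stream_tolerant are hypothetical names that only loosely mirror pyspark/serializers.py.

{code}
# Hypothetical sketch only -- not the contents of https://github.com/apache/spark/pull/1625.
import errno
import struct

def write_with_length(serialized, stream):
    # Length-prefixed framing, loosely mirroring PySpark's _write_with_length.
    stream.write(struct.pack("!i", len(serialized)))
    stream.write(serialized)

def dump_stream_tolerant(serialized_rows, stream):
    """Write rows until done, or until the peer closes the connection early."""
    try:
        for row in serialized_rows:
            write_with_length(row, stream)
        stream.flush()
    except IOError as e:
        if e.errno == errno.EPIPE:
            # The reader (e.g. a first()/take(1) caller) already has the rows
            # it needs and hung up; exit quietly instead of crashing the worker.
            return
        raise
{code}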

 broken pipe collecting schemardd results
 

 Key: SPARK-2580
 URL: https://issues.apache.org/jira/browse/SPARK-2580
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.0.0
 Environment: Fedora 21 local and RHEL 7 clustered (standalone)
Reporter: Matthew Farrellee
Assignee: Davies Liu
  Labels: py4j, pyspark

 {code}
 from pyspark.sql import SQLContext
 sqlCtx = SQLContext(sc)
 # the size of the cluster affects where this breaks (e.g. 2**15 vs 2**2)
 data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20)
 sdata = sqlCtx.inferSchema(data)
 sdata.first()
 {code}
 Result (note: the result is returned along with the error):
 {code}
 >>> sdata.first()
 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290
 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true)
 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290)
 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List()
 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List()
 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally
 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2
 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s
 {u'name': u'index', u'value': 0}
  PySpark worker failed with exception:
 Traceback (most recent call last):
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py", line 77, in main
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 191, in dump_stream
     self.serializer.dump_stream(self._batched(iterator), stream)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 124, in dump_stream
     self._write_with_length(obj, stream)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py", line 139, in _write_with_length
     stream.write(serialized)
 IOError: [Errno 32] Broken pipe
 Traceback (most recent call last):
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 130, in launch_worker
     worker(listen_sock)
   File "/home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py", line 119, in worker
     outfile.flush()
 IOError: [Errno 32] Broken pipe
 {code}
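
The log above shows the requested row being returned ({u'name': u'index', u'value': 0} is printed after the job finishes) before the worker dies, which is consistent with a reader that stops consuming once first() has its row while the worker is still streaming the rest of the partition. The snippet below is a minimal, Spark-free sketch of that mechanism using an ordinary OS pipe: the parent reads one chunk and closes its end, and the still-writing child then hits IOError: [Errno 32] Broken pipe, just like worker.py/daemon.py do in the traceback above. It assumes a POSIX system (os.fork, os.pipe).

{code}
# Minimal sketch (no Spark involved): why a writer sees EPIPE when the reader
# stops after the first record.
import os
import sys

r, w = os.pipe()
pid = os.fork()

if pid == 0:
    # Child plays the "worker": keeps writing a large batch of rows.
    os.close(r)
    try:
        for _ in range(2 ** 20):
            os.write(w, b"{'name': 'index', 'value': 0}\n")
    except OSError as e:
        # errno 32 is EPIPE -- the same failure as in the traceback above.
        sys.stderr.write("worker failed: [Errno %d] %s\n" % (e.errno, e.strerror))
    os._exit(0)
else:
    # Parent plays the "driver": reads only the first record, then hangs up.
    os.close(w)
    first = os.read(r, 64)
    os.close(r)
    print(first)
    os.waitpid(pid, 0)
{code}

Running the sketch prints the first record on stdout and the broken-pipe message from the child on stderr, mirroring how the JIRA reproduction returns the row and the error together.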



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2580) broken pipe collecting schemardd results

2014-07-18  Matthew Farrellee (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066480#comment-14066480 ]

Matthew Farrellee commented on SPARK-2580:
------------------------------------------

FYI, this was discovered during Spark Summit 2014 using the wiki_parquet example.




--
This message was sent by Atlassian JIRA
(v6.2#6252)