Re: Pyspark: Issue using sql in foreachBatch sink
Thanks Jungtaek for your help.

On Fri, Jul 31, 2020 at 6:31 PM Jungtaek Lim wrote:
> Python doesn't allow abbreviating () with no param, whereas Scala does.
> Use `write()`, not `write`.
Re: Pyspark: Issue using sql in foreachBatch sink
Python doesn't allow abbreviating () with no param, whereas Scala does. Use `write()`, not `write`.

On Wed, Jul 29, 2020 at 9:09 AM muru wrote:
> In a pyspark SS job, trying to use sql instead of sql functions in a foreachBatch sink throws AttributeError: 'JavaMember' object has no attribute 'format'. However, the same thing works in the Scala API.
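For reference, an untested sketch applying that change to the handler from the original post (the view name, query, and output path are taken from that post):

    def process_data(bdf, bid):
        # bdf._jdf.sparkSession() returns a py4j JavaObject, so Java methods
        # must be invoked explicitly: write(), not write
        lspark = bdf._jdf.sparkSession()
        bdf.createOrReplaceTempView("tbl")
        outdf = lspark.sql("select action, count(*) as count from tbl "
                           "where date='2020-06-20' group by 1")
        outdf.coalesce(1).write().format("csv").save("/tmp/agg1")

It would still be registered the same way, via writeStream.foreachBatch(process_data).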
Pyspark: Issue using sql in foreachBatch sink
In a PySpark Structured Streaming job, using sql (instead of the SQL functions API) inside a foreachBatch sink throws an AttributeError: 'JavaMember' object has no attribute 'format' exception. However, the same thing works in the Scala API.

Please note, I tested on Spark 2.4.5/2.4.6 and 3.0.0 and got the same exception. Is this a bug or a known issue with the PySpark implementation? I noticed that I could perform other operations except the write method.

Please let me know how to fix this issue. See the code examples below.

# Spark Scala method
def processData(batchDF: DataFrame, batchId: Long) {
  batchDF.createOrReplaceTempView("tbl")
  val outdf = batchDF.sparkSession.sql("select action, count(*) as count from tbl where date='2020-06-20' group by 1")
  outdf.printSchema()
  outdf.show
  outdf.coalesce(1).write.format("csv").save("/tmp/agg")
}

## pyspark python method
def process_data(bdf, bid):
    lspark = bdf._jdf.sparkSession()
    bdf.createOrReplaceTempView("tbl")
    outdf = lspark.sql("select action, count(*) as count from tbl where date='2020-06-20' group by 1")
    outdf.printSchema()
    # it works
    outdf.show()
    # throws AttributeError: 'JavaMember' object has no attribute 'format' exception
    outdf.coalesce(1).write.format("csv").save("/tmp/agg1")

Here is the full exception:

20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id = 854a39d0-b944-4b52-bf05-cacf998e2cbd, runId = e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error
py4j.Py4JException: An exception was raised by the Python Proxy. Return Message: Traceback (most recent call last):
  File "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 2381, in _call_proxy
    return_value = getattr(self.pool[obj_id], method)(*params)
  File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call
    raise e
AttributeError: 'JavaMember' object has no attribute 'format'
        at py4j.Protocol.getReturnValue(Protocol.java:473)
        at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
        at com.sun.proxy.$Proxy20.call(Unknown Source)
        at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
        at org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
        at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
        at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
        at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
        at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
        at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
        at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
        at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
        at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
        at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)
Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."
I had the same problem. One forum post elsewhere suggested that too much network communication might be using up the available ports. I reduced the number of partitions via repartition(int) and it solved the problem.
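For what it's worth, a minimal sketch of that workaround (the input/output paths and the partition count of 200 are illustrative, not from the original job):

    from pyspark import SparkContext

    sc = SparkContext(appName="RepartitionWorkaround")
    rdd = sc.textFile("/path/to/input")        # illustrative input path
    # fewer partitions means fewer shuffle fetch connections between executors
    rdd = rdd.repartition(200)                 # illustrative partition count
    counts = rdd.map(lambda line: (line, 1)).reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("/path/to/output")   # illustrative output path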
Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."
Also, this is the command I use to submit the Spark application:

**

where recommendation_engine-0.1-py2.7.egg is a Python egg of my own library I've written for this application, and 'file' and '/home/spark/enigma_analytics/tests/msg-epims0730_small.json' are input arguments for the application.
Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."
Slight update, I suppose? For some reason it sometimes connects and continues, and the job completes. But most of the time I still run into this error, the job is killed, and the application doesn't finish. Still have no idea why this is happening. I could really use some help here.
PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."
dd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:209)
        at java.lang.Thread.run(Thread.java:745)

And after this point, the application kills itself without completing the rest of the job.

The following are the configurations that I give to the Python Spark application (named submission.py) before setting up the SparkContext variable:

submission.py:

I will include my configurations for Spark below, for each machine.

On the master machine (xx.xx.xx.248):

spark/conf/slaves:
spark/conf/log4j.properties:
spark/conf/spark-defaults.conf:
spark/conf/spark-env.sh:
spark/logs/SparkOut.log:

On the slave machine (xx.xx.xx.247):

spark/conf/log4j.properties:
spark/conf/spark-env.sh:

I'll also attach an image below of what my Spark WebUI looks like while the application is running:
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/spark_cluster_overview.png>

Finally, I will attach the output logs of the Spark connections for both machines to this post (SparkOut_248.log and SparkOut_247.log):
SparkOut_248.log <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/SparkOut_248.log>
SparkOut_247.log <http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/SparkOut_247.log>

Hopefully this is all of the information needed for this issue to be resolved. If not, let me know what additional information to include in this post and I will do so. Thank you to anyone taking the time to read this and help me out; I'm at a loss for what to do at the moment and I am getting very frustrated.

-Craig Ignatowski
pyspark issue
Hi,

I am running PySpark on Windows and I am seeing an error while adding pyFiles to the SparkContext. Below is the example:

    sc = SparkContext("local", "Sample", pyFiles="C:/sample/yattag.zip")

This fails with a "no file found" error for "C". The logic below is treating the path as individual files like "C", ":", "/", etc.
https://github.com/apache/spark/blob/master/python/pyspark/context.py#l195

It works if I use SparkConf:

    sparkConf.set("spark.submit.pyFiles", "C:/sample/yattag.zip")
    sc = SparkContext("local", "Sample", conf=sparkConf)

Is this an existing issue, or am I not including the files in the correct way in the SparkContext?

Thanks.
Re: pyspark issue
It expects an iterable, and if you iterate over a string, you get the individual characters. Use a list instead: pyFiles=['/path/to/file']

On Mon, Jul 27, 2015 at 2:40 PM, Naveen Madhire <vmadh...@umail.iu.edu> wrote:
> I am running PySpark on Windows and I am seeing an error while adding pyFiles to the SparkContext.

--
www.skrasser.com
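A minimal sketch of that fix applied to the snippet from the original post (note the keyword argument is pyFiles; the path is the one from the original post):

    from pyspark import SparkContext

    # pass pyFiles as a list, not a bare string, so the path is not
    # iterated character by character ("C", ":", "/", ...)
    sc = SparkContext("local", "Sample", pyFiles=["C:/sample/yattag.zip"])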
Re: PySpark issue with sortByKey: IndexError: list index out of range
Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen the IndexError yet. I've run into some other errors (too many open files), but those issues seem to have been discussed already. The dataset, by the way, was about 40 GB and 188 million lines; I'm running a sort on 3 worker nodes with a total of about 80 cores. Thanks again for the tips!

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu wrote:
> Could you tell how large the data set is? It will help us to debug this issue.
Re: PySpark issue with sortByKey: IndexError: list index out of range
The errors may happen because there is not enough memory in the worker, so it keeps spilling with many small files. Could you verify whether the PR [1] fixes your problem?

[1] https://github.com/apache/spark/pull/3252

On Thu, Nov 13, 2014 at 11:28 AM, santon <steven.m.an...@gmail.com> wrote:
> Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen the IndexError yet. The dataset, by the way, was about 40 GB and 188 million lines; I'm running a sort on 3 worker nodes with a total of about 80 cores.
Re: PySpark issue with sortByKey: IndexError: list index out of range
Sorry for the delay. I'll try to add some more details on Monday. Unfortunately, I don't have a script to reproduce the error. Actually, it seemed to be more about the data set than the script: the same code on different data sets led to different results, and only larger data sets, on the order of 40 GB, seemed to crash with the described error. Also, I believe our cluster was recently updated to CDH 5.2, which uses Spark 1.1. I'll check to see if the issue was resolved.

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu wrote:
> Could you tell how large the data set is? It will help us to debug this issue.
Re: PySpark issue with sortByKey: IndexError: list index out of range
Could you tell how large the data set is? It will help us to debug this issue.

On Thu, Nov 6, 2014 at 10:39 AM, skane <sk...@websense.com> wrote:
> I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example.
Re: PySpark issue with sortByKey: IndexError: list index out of range
I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example. On a smaller data set, it worked fine. On a larger data set I got this error:

Traceback (most recent call last):
  File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in <module>
    .sortByKey(lambda x: x)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
    bounds.append(samples[index])
IndexError: list index out of range
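For context, a minimal sketch in the spirit of the sort.py example that the traceback points at (the input path is illustrative, not the original script or data):

    from pyspark import SparkContext

    sc = SparkContext(appName="PythonSort")
    lines = sc.textFile("/path/to/numbers.txt")   # one or more numbers per line
    sorted_pairs = (lines.flatMap(lambda line: line.split(" "))
                         .map(lambda value: (float(value), 1))
                         .sortByKey())            # the call that raised IndexError on large inputs
    for key, _ in sorted_pairs.collect():
        print(key)
    sc.stop()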
Re: PySpark issue with sortByKey: IndexError: list index out of range
It should be fixed in 1.1+. Could you share a script to reproduce it?

On Thu, Nov 6, 2014 at 10:39 AM, skane <sk...@websense.com> wrote:
> I don't have any insight into this bug, but on Spark version 1.0.0 I ran into the same bug running the 'sort.py' example.