Re: Pyspark: Issue using sql in foreachBatch sink

2020-08-03 Thread muru
Thanks Jungtaek for your help.

On Fri, Jul 31, 2020 at 6:31 PM Jungtaek Lim 
wrote:

> Python doesn't allow abbreviating () with no param, whereas Scala does.
> Use `write()`, not `write`.
>
> On Wed, Jul 29, 2020 at 9:09 AM muru  wrote:
>
>> In a pyspark SS job, trying to use sql instead of sql functions in
>> foreachBatch sink
>> throws AttributeError: 'JavaMember' object has no attribute 'format'
>> exception.
>> However, the same thing works in Scala API.
>>
>> Please note, I tested in spark 2.4.5/2.4.6 and 3.0.0 and got the same
>> exception.
>> Is it a bug or known issue with Pyspark implementation? I noticed that I
>> could perform other operations except the write method.
>>
>> Please, let me know how to fix this issue.
>>
>> See below code examples
>> # Spark Scala method
>> def processData(batchDF: DataFrame, batchId: Long) {
>>batchDF.createOrReplaceTempView("tbl")
>>val outdf=batchDF.sparkSession.sql("select action, count(*) as count
>> from tbl where date='2020-06-20' group by 1")
>>outdf.printSchema()
>>outdf.show
>>outdf.coalesce(1).write.format("csv").save("/tmp/agg")
>> }
>>
>> ## pyspark python method
>> def process_data(bdf, bid):
>>   lspark = bdf._jdf.sparkSession()
>>   bdf.createOrReplaceTempView("tbl")
>>   outdf=lspark.sql("select action, count(*) as count from tbl where
>> date='2020-06-20' group by 1")
>>   outdf.printSchema()
>>   # it works
>>   outdf.show()
>>   # throws AttributeError: 'JavaMember' object has no attribute 'format'
>> exception
>>   outdf.coalesce(1).write.format("csv").save("/tmp/agg1")
>>
>> Here is the full exception
>> 20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id =
>> 854a39d0-b944-4b52-bf05-cacf998e2cbd, runId =
>> e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error
>> py4j.Py4JException: An exception was raised by the Python Proxy. Return
>> Message: Traceback (most recent call last):
>>   File
>> "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>> line 2381, in _call_proxy
>> return_value = getattr(self.pool[obj_id], method)(*params)
>>   File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call
>> raise e
>> AttributeError: 'JavaMember' object has no attribute 'format'
>> at py4j.Protocol.getReturnValue(Protocol.java:473)
>> at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
>> at com.sun.proxy.$Proxy20.call(Unknown Source)
>> at
>> org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
>> at
>> org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
>> at
>> org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
>> at
>> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
>> at
>> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>> at
>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>> at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org
>> $apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
>> at
>> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
>> at
>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
>> at
>> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>> at
>> 

Re: Pyspark: Issue using sql in foreachBatch sink

2020-07-31 Thread Jungtaek Lim
Python doesn't allow omitting the parentheses on a no-argument call, whereas Scala
does. Use `write()`, not `write`.
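
For reference, a minimal sketch of a corrected handler (assuming Spark 2.4/3.0 behavior): the session obtained through `bdf._jdf.sparkSession()` is a py4j proxy of the JVM session, so Java methods such as `write` need explicit parentheses; wrapping the Java DataFrame back into a PySpark DataFrame is shown only as an alternative.

from pyspark.sql import DataFrame

def process_data(bdf, bid):
    # py4j proxy of the JVM SparkSession, not a PySpark SparkSession
    jspark = bdf._jdf.sparkSession()
    bdf.createOrReplaceTempView("tbl")
    joutdf = jspark.sql("select action, count(*) as count from tbl "
                        "where date='2020-06-20' group by 1")
    # Option 1: stay on the py4j object and call write() with parentheses
    joutdf.coalesce(1).write().format("csv").save("/tmp/agg1")
    # Option 2 (alternative sketch): wrap the Java DataFrame back into a
    # PySpark DataFrame, then the usual Python writer property works
    outdf = DataFrame(joutdf, bdf.sql_ctx)
    outdf.coalesce(1).write.format("csv").save("/tmp/agg2")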

On Wed, Jul 29, 2020 at 9:09 AM muru  wrote:

> In a pyspark SS job, trying to use sql instead of sql functions in
> foreachBatch sink
> throws AttributeError: 'JavaMember' object has no attribute 'format'
> exception.
> However, the same thing works in Scala API.
>
> Please note, I tested in spark 2.4.5/2.4.6 and 3.0.0 and got the same
> exception.
> Is it a bug or known issue with Pyspark implementation? I noticed that I
> could perform other operations except the write method.
>
> Please, let me know how to fix this issue.
>
> See below code examples
> # Spark Scala method
> def processData(batchDF: DataFrame, batchId: Long) {
>batchDF.createOrReplaceTempView("tbl")
>val outdf=batchDF.sparkSession.sql("select action, count(*) as count
> from tbl where date='2020-06-20' group by 1")
>outdf.printSchema()
>outdf.show
>outdf.coalesce(1).write.format("csv").save("/tmp/agg")
> }
>
> ## pyspark python method
> def process_data(bdf, bid):
>   lspark = bdf._jdf.sparkSession()
>   bdf.createOrReplaceTempView("tbl")
>   outdf=lspark.sql("select action, count(*) as count from tbl where
> date='2020-06-20' group by 1")
>   outdf.printSchema()
>   # it works
>   outdf.show()
>   # throws AttributeError: 'JavaMember' object has no attribute 'format'
> exception
>   outdf.coalesce(1).write.format("csv").save("/tmp/agg1")
>
> Here is the full exception
> 20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id =
> 854a39d0-b944-4b52-bf05-cacf998e2cbd, runId =
> e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error
> py4j.Py4JException: An exception was raised by the Python Proxy. Return
> Message: Traceback (most recent call last):
>   File
> "/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
> line 2381, in _call_proxy
> return_value = getattr(self.pool[obj_id], method)(*params)
>   File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call
> raise e
> AttributeError: 'JavaMember' object has no attribute 'format'
> at py4j.Protocol.getReturnValue(Protocol.java:473)
> at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
> at com.sun.proxy.$Proxy20.call(Unknown Source)
> at
> org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
> at
> org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
> at
> org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
> at
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
> at
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
> at
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org
> $apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
> at
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
> at
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
> at org.apache.spark.sql.execution.streaming.StreamExecution.org
> $apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)

Pyspark: Issue using sql in foreachBatch sink

2020-07-28 Thread muru
In a PySpark Structured Streaming job, trying to use SQL instead of the SQL
functions API in the foreachBatch sink throws an
AttributeError: 'JavaMember' object has no attribute 'format'
exception. However, the same thing works with the Scala API.

Please note that I tested on Spark 2.4.5, 2.4.6, and 3.0.0 and got the same exception.
Is this a bug or a known issue in the PySpark implementation? I noticed that I could
perform every other operation except the write.

Please let me know how to fix this issue.

See the code examples below.
# Spark Scala method
def processData(batchDF: DataFrame, batchId: Long) {
  batchDF.createOrReplaceTempView("tbl")
  val outdf = batchDF.sparkSession.sql(
    "select action, count(*) as count from tbl where date='2020-06-20' group by 1")
  outdf.printSchema()
  outdf.show
  outdf.coalesce(1).write.format("csv").save("/tmp/agg")
}

## pyspark python method
def process_data(bdf, bid):
  lspark = bdf._jdf.sparkSession()
  bdf.createOrReplaceTempView("tbl")
  outdf = lspark.sql(
    "select action, count(*) as count from tbl where date='2020-06-20' group by 1")
  outdf.printSchema()
  # this works
  outdf.show()
  # this throws AttributeError: 'JavaMember' object has no attribute 'format'
  outdf.coalesce(1).write.format("csv").save("/tmp/agg1")

Here is the full exception
20/07/24 16:31:24 ERROR streaming.MicroBatchExecution: Query [id =
854a39d0-b944-4b52-bf05-cacf998e2cbd, runId =
e3d4dc7d-80e1-4164-8310-805d7713fc96] terminated with error
py4j.Py4JException: An exception was raised by the Python Proxy. Return
Message: Traceback (most recent call last):
  File
"/Users/muru/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
line 2381, in _call_proxy
return_value = getattr(self.pool[obj_id], method)(*params)
  File "/Users/muru/spark/python/pyspark/sql/utils.py", line 191, in call
raise e
AttributeError: 'JavaMember' object has no attribute 'format'
at py4j.Protocol.getReturnValue(Protocol.java:473)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:108)
at com.sun.proxy.$Proxy20.call(Unknown Source)
at
org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
at
org.apache.spark.sql.execution.streaming.sources.PythonForeachBatchHelper$$anonfun$callForeachBatch$1.apply(ForeachBatchSink.scala:55)
at
org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:35)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:537)
at
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535)
at
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org
$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166)
at
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351)
at
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166)
at
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:160)
at org.apache.spark.sql.execution.streaming.StreamExecution.org
$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:281)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:193)


Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-05-04 Thread HLee
I had the same problem. One forum post elsewhere suggested that too much
network communication might be using up the available ports. I reduced the
number of partitions via repartition(int) and that solved the problem.
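
For illustration only, a hedged sketch of that workaround (assuming an existing SparkContext `sc`; the input path, keying, and partition count are hypothetical, the point is simply that fewer partitions means fewer simultaneous fetch connections):

# hypothetical job: cut the partition count before the shuffle-heavy stage
rdd = sc.textFile("hdfs:///some/input")                    # placeholder path
pairs = rdd.map(lambda line: (line.split(",")[0], 1))      # hypothetical keying
counts = pairs.repartition(200).reduceByKey(lambda a, b: a + b)
counts.count()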






Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-20 Thread craigiggy
Also, this is the command I use to submit the Spark application:

**

where *recommendation_engine-0.1-py2.7.egg* is a Python egg of my own
library I've written for this application, and *'file'* and
*'/home/spark/enigma_analytics/tests/msg-epims0730_small.json'* are input
arguments for the application.






Re: PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-19 Thread craigiggy
A slight update, I suppose: for some reason it sometimes connects and continues,
and the job completes. But most of the time I still run into this error, the job
is killed, and the application doesn't finish.

I still have no idea why this is happening. I could really use some help here.






PySpark Issue: "org.apache.spark.shuffle.FetchFailedException: Failed to connect to..."

2016-03-15 Thread craigiggy
dd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
at
org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
at 
org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)

After this point, the application kills itself without completing the rest of
the job. The following are the configurations that I give to the Python Spark
application (named *submission.py*) before setting up the SparkContext variable:

*submission.py:*

I will include my configurations for Spark below on each machine:

On the master machine (xx.xx.xx.248)

*spark/conf/slaves:*
*spark/conf/log4j.properties:*
*spark/conf/spark-defaults.conf:*
*spark/conf/spark-env.sh:*

*spark/logs/SparkOut.log:*

On the slave machine (xx.xx.xx.247)

*spark/conf/log4j.properties:*
*spark/conf/spark-env.sh:*


I'll also attach an image below of what my Spark WebUI looks like while the
application is running:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/spark_cluster_overview.png>
 

Finally I will attach the output logs of the Spark connections for both
machines to this post (*SparkOut_248.log* and *SparkOut_247.log*)

SparkOut_248.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/SparkOut_248.log>
  
SparkOut_247.log
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n26511/SparkOut_247.log>
  

Hopefully this is all of the information needed for this issue to be resolved.
If not, let me know what additional information to include in this post and I
will do so. Thank you to anyone taking the time to read this and help me out;
I'm at a loss for what to do at the moment and am getting very frustrated.

-Craig Ignatowski






pyspark issue

2015-07-27 Thread Naveen Madhire
Hi,

I am running PySpark on Windows and I am seeing an error while adding
pyFiles to the SparkContext. Below is the example:

sc = SparkContext("local", "Sample", pyFiles="C:/sample/yattag.zip")

This fails with a file-not-found error for "C".

The logic below treats the path string as individual files, like "C", ":", "/", etc.:

https://github.com/apache/spark/blob/master/python/pyspark/context.py#L195

It works if I use SparkConf:

sparkConf.set("spark.submit.pyFiles", "C:/sample/yattag.zip")
sc = SparkContext("local", "Sample", conf=sparkConf)

Is this an existing issue, or am I not including the files in the correct way
in the SparkContext?

Thanks.





when I run this, I am getting


Re: pyspark issue

2015-07-27 Thread Sven Krasser
It expects an iterable, and if you iterate over a string, you get the
individual characters. Use a list instead:
pyFiles=['/path/to/file']
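
A minimal sketch of the corrected call, reusing the zip path from the question:

from pyspark import SparkContext

# pyFiles expects a collection of paths; a bare string is iterated character by character
sc = SparkContext("local", "Sample", pyFiles=["C:/sample/yattag.zip"])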

On Mon, Jul 27, 2015 at 2:40 PM, Naveen Madhire vmadh...@umail.iu.edu
wrote:

 Hi,

 I am running pyspark in windows and I am seeing an error while adding
 pyfiles to the sparkcontext. below is the example,

 sc = SparkContext(local,Sample,pyFiles=C:/sample/yattag.zip)

 this fails with no file found error for C


 The below logic is treating the path as individual files like C, : / 
 etc.

 https://github.com/apache/spark/blob/master/python/pyspark/context.py#l195


 It works if I use Spark Conf,

 sparkConf.set(spark.submit.pyFiles,***C:/sample/yattag.zip***)
 sc = SparkContext(local,Sample,conf=sparkConf)


 Is this an existing issue or I am not including the files in correct way in 
 Spark Context?


 Thanks.





 when I run this, I am getting




-- 
www.skrasser.com


Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-13 Thread santon
Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen
the IndexError yet. I've run into some other errors ("too many open files"),
but those issues seem to have been discussed already. The dataset, by the way,
was about 40 GB and 188 million lines; I'm running a sort on 3 worker nodes
with a total of about 80 cores.

Thanks again for the tips!

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] 
ml-node+s1001560n18393...@n3.nabble.com wrote:

 Could you tell how large is the data set? It will help us to debug this
 issue.

 On Thu, Nov 6, 2014 at 10:39 AM, skane [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18393i=0 wrote:

  I don't have any insight into this bug, but on Spark version 1.0.0 I ran
 into
  the same bug running the 'sort.py' example. On a smaller data set, it
 worked
  fine. On a larger data set I got this error:
 
  Traceback (most recent call last):
File /home/skane/spark/examples/src/main/python/sort.py, line 30, in
  module
  .sortByKey(lambda x: x)
File /usr/lib/spark/python/pyspark/rdd.py, line 480, in sortByKey
  bounds.append(samples[index])
  IndexError: list index out of range
 
 
 






Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-13 Thread Davies Liu
The errors may happen because there is not enough memory in the worker, so it
keeps spilling to many small files. Could you verify that the PR [1] fixes
your problem?

[1] https://github.com/apache/spark/pull/3252
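
Not part of the PR, but a related knob while testing: spark.python.worker.memory bounds how much each Python worker buffers before spilling to disk (default 512m). A hedged sketch of raising it, where the 1g figure is only an example:

from pyspark import SparkConf, SparkContext

# raising the per-worker aggregation memory cap can reduce the number of spill files
conf = SparkConf().set("spark.python.worker.memory", "1g")
sc = SparkContext(conf=conf)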

On Thu, Nov 13, 2014 at 11:28 AM, santon steven.m.an...@gmail.com wrote:
 Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen the
 IndexError yet. I've run into some other errors (too many open files), but
 these issues seem to have been discussed already. The dataset, by the way,
 was about 40 Gb and 188 million lines; I'm running a sort on 3 worker nodes
 with a total of about 80 cores.

 Thanks again for the tips!

 On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List]
 [hidden email] wrote:

 Could you tell how large is the data set? It will help us to debug this
 issue.

 On Thu, Nov 6, 2014 at 10:39 AM, skane [hidden email] wrote:

  I don't have any insight into this bug, but on Spark version 1.0.0 I ran
  into
  the same bug running the 'sort.py' example. On a smaller data set, it
  worked
  fine. On a larger data set I got this error:
 
  Traceback (most recent call last):
File /home/skane/spark/examples/src/main/python/sort.py, line 30, in
  module
  .sortByKey(lambda x: x)
File /usr/lib/spark/python/pyspark/rdd.py, line 480, in sortByKey
  bounds.append(samples[index])
  IndexError: list index out of range
 
 
 



 



Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-09 Thread santon
Sorry for the delay. I'll try to add some more details on Monday.

Unfortunately, I don't have a script to reproduce the error. Actually, it
seemed to be more about the data set than the script. The same code on
different data sets led to different results; only larger data sets on the
order of 40 GB seemed to crash with the described error. Also, I believe
our cluster was recently updated to CDH 5.2, which uses Spark 1.1. I'll
check to see whether the issue was resolved.

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] 
ml-node+s1001560n18393...@n3.nabble.com wrote:

 Could you tell how large is the data set? It will help us to debug this
 issue.

 On Thu, Nov 6, 2014 at 10:39 AM, skane [hidden email]
 http://user/SendEmail.jtp?type=nodenode=18393i=0 wrote:

  I don't have any insight into this bug, but on Spark version 1.0.0 I ran
 into
  the same bug running the 'sort.py' example. On a smaller data set, it
 worked
  fine. On a larger data set I got this error:
 
  Traceback (most recent call last):
File /home/skane/spark/examples/src/main/python/sort.py, line 30, in
  module
  .sortByKey(lambda x: x)
File /usr/lib/spark/python/pyspark/rdd.py, line 480, in sortByKey
  bounds.append(samples[index])
  IndexError: list index out of range
 
 
 

Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-07 Thread Davies Liu
Could you tell us how large the data set is? It will help us debug this issue.

On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote:
 I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
 the same bug running the 'sort.py' example. On a smaller data set, it worked
 fine. On a larger data set I got this error:

 Traceback (most recent call last):
   File /home/skane/spark/examples/src/main/python/sort.py, line 30, in
 module
 .sortByKey(lambda x: x)
   File /usr/lib/spark/python/pyspark/rdd.py, line 480, in sortByKey
 bounds.append(samples[index])
 IndexError: list index out of range






Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-06 Thread skane
I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
the same bug running the 'sort.py' example. On a smaller data set, it worked
fine. On a larger data set I got this error:

Traceback (most recent call last):
  File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in <module>
    .sortByKey(lambda x: x)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
    bounds.append(samples[index])
IndexError: list index out of range






Re: PySpark issue with sortByKey: IndexError: list index out of range

2014-11-06 Thread Davies Liu
It should be fixed in 1.1+.

Do you have a script to reproduce it?

On Thu, Nov 6, 2014 at 10:39 AM, skane sk...@websense.com wrote:
 I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
 the same bug running the 'sort.py' example. On a smaller data set, it worked
 fine. On a larger data set I got this error:

 Traceback (most recent call last):
   File /home/skane/spark/examples/src/main/python/sort.py, line 30, in
 module
 .sortByKey(lambda x: x)
   File /usr/lib/spark/python/pyspark/rdd.py, line 480, in sortByKey
 bounds.append(samples[index])
 IndexError: list index out of range


