[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704418#comment-14704418
 ] 

Apache Spark commented on SPARK-10100:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8332

 AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
 --

 Key: SPARK-10100
 URL: https://issues.apache.org/jira/browse/SPARK-10100
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Herman van Hovell
 Attachments: SPARK-10100.perf.test.scala


 It looks like Max (and probably Min) implemented on top of AggregateFunction2 
 is slower than the old MaxFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10133) loadLibSVMFile fails to detect zero-based lines

2015-08-20 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-10133:
-

 Summary: loadLibSVMFile fails to detect zero-based lines
 Key: SPARK-10133
 URL: https://issues.apache.org/jira/browse/SPARK-10133
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Xusen Yin
Priority: Minor


https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L88

The code intends to verify that the indices on each line are one-based and in 
ascending order, but the check fails because `previous` is initialized to -1, 
so a leading zero-based index still passes (0 > -1).

In this condition, a libSVM-format file that begins with zero-based indices is 
read in without error, but the size of the resulting SparseVector is wrong, 
i.e. numFeatures - 1.
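The failure mode can be illustrated with a minimal sketch of the check (a hypothetical Python simplification for illustration, not the actual MLUtils.scala code):

```python
def parse_libsvm_indices(line):
    # Hypothetical simplification of the validation described above:
    # `previous` starts at -1, so a leading zero-based index (0 > -1)
    # slips past the "one-based and ascending" check.
    previous = -1
    indices = []
    for item in line.split()[1:]:
        idx = int(item.split(":")[0])
        if idx <= previous:
            raise ValueError("indices must be one-based and in ascending order")
        previous = idx
        indices.append(idx - 1)  # indices are stored zero-based internally
    return indices

# A zero-based line is accepted without error, and the stored indices are
# shifted down by one, which is how the vector size ends up as numFeatures - 1:
print(parse_libsvm_indices("1 0:1.0 2:3.0"))  # → [-1, 1]
```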






[jira] [Assigned] (SPARK-9107) Include memory usage for each job stage

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9107:
---

Assignee: Apache Spark

 Include memory usage for each job  stage
 -

 Key: SPARK-9107
 URL: https://issues.apache.org/jira/browse/SPARK-9107
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Reporter: Zhang, Liye
Assignee: Apache Spark

 In [spark-9104|https://issues.apache.org/jira/browse/SPARK-9104], memory 
 usage is shown as a running-time view: we can only see the current memory 
 consumption, in the same way the Storage Tab works. Previous status is not 
 stored, and the same applies to the history server WebUI.
 The goal of this issue is to show, for each finished stage or job, the 
 memory usage at or just before the job/stage completed. We can also report 
 the maximum and minimum memory used during each job/stage.






[jira] [Commented] (SPARK-9106) Log the memory usage info into history server

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704511#comment-14704511
 ] 

Apache Spark commented on SPARK-9106:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/7753

 Log the memory usage info into history server
 -

 Key: SPARK-9106
 URL: https://issues.apache.org/jira/browse/SPARK-9106
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Reporter: Zhang, Liye

 Save the memory usage info to the event log so it can be traced from the 
 history server, allowing users to perform offline analysis.






[jira] [Commented] (SPARK-9105) Add an additional WebUI Tab for Memory Usage

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704510#comment-14704510
 ] 

Apache Spark commented on SPARK-9105:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/7753

 Add an additional WebUI Tab for Memory Usage
 

 Key: SPARK-9105
 URL: https://issues.apache.org/jira/browse/SPARK-9105
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Zhang, Liye

 Add a Spark WebUI Tab for memory usage that exposes the memory status of 
 different Spark components. It should show a summary per executor and may 
 also show details per task. The Tab may duplicate some information from the 
 Storage Tab, but presented differently: taking the RDD cache as an example, 
 the cached size shown on the Storage Tab is indexed by RDD name, while on 
 the Memory Usage Tab it can be indexed by executor or task. The two Tabs 
 can also share some Web Pages.






[jira] [Commented] (SPARK-9107) Include memory usage for each job stage

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704512#comment-14704512
 ] 

Apache Spark commented on SPARK-9107:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/7753

 Include memory usage for each job  stage
 -

 Key: SPARK-9107
 URL: https://issues.apache.org/jira/browse/SPARK-9107
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Reporter: Zhang, Liye

 In [spark-9104|https://issues.apache.org/jira/browse/SPARK-9104], memory 
 usage is shown as a running-time view: we can only see the current memory 
 consumption, in the same way the Storage Tab works. Previous status is not 
 stored, and the same applies to the history server WebUI.
 The goal of this issue is to show, for each finished stage or job, the 
 memory usage at or just before the job/stage completed. We can also report 
 the maximum and minimum memory used during each job/stage.






[jira] [Assigned] (SPARK-9107) Include memory usage for each job stage

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9107:
---

Assignee: (was: Apache Spark)

 Include memory usage for each job  stage
 -

 Key: SPARK-9107
 URL: https://issues.apache.org/jira/browse/SPARK-9107
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Reporter: Zhang, Liye

 In [spark-9104|https://issues.apache.org/jira/browse/SPARK-9104], memory 
 usage is shown as a running-time view: we can only see the current memory 
 consumption, in the same way the Storage Tab works. Previous status is not 
 stored, and the same applies to the history server WebUI.
 The goal of this issue is to show, for each finished stage or job, the 
 memory usage at or just before the job/stage completed. We can also report 
 the maximum and minimum memory used during each job/stage.






[jira] [Assigned] (SPARK-9106) Log the memory usage info into history server

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9106:
---

Assignee: Apache Spark

 Log the memory usage info into history server
 -

 Key: SPARK-9106
 URL: https://issues.apache.org/jira/browse/SPARK-9106
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Reporter: Zhang, Liye
Assignee: Apache Spark

 Save the memory usage info to the event log so it can be traced from the 
 history server, allowing users to perform offline analysis.






[jira] [Assigned] (SPARK-9105) Add an additional WebUI Tab for Memory Usage

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9105:
---

Assignee: (was: Apache Spark)

 Add an additional WebUI Tab for Memory Usage
 

 Key: SPARK-9105
 URL: https://issues.apache.org/jira/browse/SPARK-9105
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Zhang, Liye

 Add a Spark WebUI Tab for memory usage that exposes the memory status of 
 different Spark components. It should show a summary per executor and may 
 also show details per task. The Tab may duplicate some information from the 
 Storage Tab, but presented differently: taking the RDD cache as an example, 
 the cached size shown on the Storage Tab is indexed by RDD name, while on 
 the Memory Usage Tab it can be indexed by executor or task. The two Tabs 
 can also share some Web Pages.






[jira] [Assigned] (SPARK-9105) Add an additional WebUI Tab for Memory Usage

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9105:
---

Assignee: Apache Spark

 Add an additional WebUI Tab for Memory Usage
 

 Key: SPARK-9105
 URL: https://issues.apache.org/jira/browse/SPARK-9105
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Zhang, Liye
Assignee: Apache Spark

 Add a Spark WebUI Tab for memory usage that exposes the memory status of 
 different Spark components. It should show a summary per executor and may 
 also show details per task. The Tab may duplicate some information from the 
 Storage Tab, but presented differently: taking the RDD cache as an example, 
 the cached size shown on the Storage Tab is indexed by RDD name, while on 
 the Memory Usage Tab it can be indexed by executor or task. The two Tabs 
 can also share some Web Pages.






[jira] [Assigned] (SPARK-9106) Log the memory usage info into history server

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9106:
---

Assignee: (was: Apache Spark)

 Log the memory usage info into history server
 -

 Key: SPARK-9106
 URL: https://issues.apache.org/jira/browse/SPARK-9106
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core, Web UI
Reporter: Zhang, Liye

 Save the memory usage info to the event log so it can be traced from the 
 history server, allowing users to perform offline analysis.






[jira] [Commented] (SPARK-10133) loadLibSVMFile fails to detect zero-based lines

2015-08-20 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704521#comment-14704521
 ] 

Xusen Yin commented on SPARK-10133:
---

Thanks. I ignored the above line.

 loadLibSVMFile fails to detect zero-based lines
 ---

 Key: SPARK-10133
 URL: https://issues.apache.org/jira/browse/SPARK-10133
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Xusen Yin
Priority: Minor

 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L88
 The code intends to verify that the indices on each line are one-based and 
 in ascending order, but the check fails because previous is initialized to 
 -1, so a leading zero-based index still passes (0 > -1).
 In this condition, a libSVM-format file that begins with zero-based indices 
 is read in without error, but the size of the resulting SparseVector is 
 wrong, i.e. numFeatures - 1.






[jira] [Assigned] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9089:
---

Assignee: Apache Spark

 Failing to run simple job on Spark Standalone Cluster
 -

 Key: SPARK-9089
 URL: https://issues.apache.org/jira/browse/SPARK-9089
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Staging
Reporter: Amar Goradia
Assignee: Apache Spark
Priority: Critical

 We are trying out Spark and as part of that, we have set up a Standalone 
 Spark Cluster. As part of testing things out, we simply opened the PySpark 
 shell and ran this simple job: a=sc.parallelize([1,2,3]).count()
 As a result, we are getting errors. We tried googling around this error but 
 haven't been able to find the exact reason why we are running into this 
 state. Can somebody please help us look further into this issue and advise 
 us on what we are missing here?
 Here is the full error stack:
  >>> a=sc.parallelize([1,2,3]).count()
 15/07/16 00:52:15 INFO SparkContext: Starting job: count at <stdin>:1
 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at <stdin>:1) with 2 
 output partitions (allowLocal=false)
 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at 
 <stdin>:1)
 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] 
 at count at <stdin>:1), which has no missing parents
 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at <stdin>:1) 
 failed in Unknown s
 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at <stdin>:1, took 
 0.004963 s
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 972, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 963, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 771, in reduce
 vals = self.mapPartitions(func).collect()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 745, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
  line 538, in __call__
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 serialization failed: java.lang.reflect.InvocationTargetException
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
 org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
 org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
 org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
   at 
 

[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2015-08-20 Thread Silas Davis (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705278#comment-14705278
 ] 

Silas Davis commented on SPARK-3533:


I've looked at various solutions and have summarised what I found in my post 
here: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Writing-to-multiple-outputs-in-Spark-td13298.html.
The Stack Overflow question linked only addresses multiple Text outputs, and 
only does so for Hadoop 1. My code combines the idea of using a wrapping 
OutputFormat with that of another gist that uses MultipleOutputs but modifies 
saveAsNewAPIHadoopFile. My code also works within the current Spark API, but 
it took enough effort, and the aim seems common enough, that I'd argue some of 
it should be moved into Spark itself.

As for showing some code, my implementation is contained in the gist I have 
posted, and I have added it to the links attached to this ticket. I was hoping 
to get some comments on the code before embarking on a full pull request, 
which would require more consideration of where to place files etc. I'm not 
sure if you're suggesting it would be better to make a pull request now, or 
whether the gist is sufficient. I will open a pull request if you prefer. Is 
there anything else I should be doing to get committer buy-in?

[~nchammas] Have you been able to take a look at the code? 

 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
  a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
  'Frankie']).keyBy(lambda x: x[0])
  a.collect()
 [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
  a.keys().distinct().collect()
 ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.
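For reference, the requested end state can be sketched outside Spark with plain Python (hypothetical helper name and local filesystem only; the real feature would be an RDD method backed by Hadoop output formats):

```python
import os
import tempfile
from collections import defaultdict

def save_as_text_file_by_key(pairs, prefix):
    # Hypothetical local sketch of the requested behaviour: group values by
    # key and write one output directory per distinct key, with the values
    # for that key inside a part file.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    for key, values in groups.items():
        key_dir = os.path.join(prefix, str(key))
        os.makedirs(key_dir, exist_ok=True)
        with open(os.path.join(key_dir, "part-1"), "w") as f:
            f.write("\n".join(values) + "\n")

pairs = [(name[0], name) for name in ['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']]
out = tempfile.mkdtemp()
save_as_text_file_by_key(pairs, out)
print(sorted(os.listdir(out)))  # → ['B', 'F', 'N']
```

In Spark the grouping would happen per partition rather than in one in-memory dict, which is why each key directory can contain multiple part files.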






[jira] [Commented] (SPARK-9944) hive.metastore.warehouse.dir is not respected

2015-08-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705315#comment-14705315
 ] 

Yin Huai commented on SPARK-9944:
-

OK. I guess {{/user/ec2-user/warehouse}} is not the one you set, right? I took 
a look, and here is my finding.

The reason is that in Spark SQL, if you do not specify a database name when 
you create a table, the table is created under the current database, and we 
use the location of the current database as the parent dir of your table dir. 
If you create a table in the {{default}} database, we use the location of the 
{{default}} database as the parent dir of your tables. Once this default db is 
created, {{hive.metastore.warehouse.dir}} cannot override it. However, Hive 
will allow you to use {{hive.metastore.warehouse.dir}} as the parent dir of 
your tables if you are using the {{default}} database and you do not specify 
the location of your table. See 
https://github.com/apache/hive/blob/release-1.2.1/ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java#L4473-L4479
 (this was added by https://issues.apache.org/jira/browse/HIVE-6374).
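The resolution order described above can be modeled in a few lines (a hypothetical sketch of the behaviour with made-up function names, not Spark's or Hive's actual code):

```python
def spark_table_dir(db_location, table_name, explicit_location=None):
    # Spark SQL, as described above: the table dir is nested under the
    # current database's location unless a location is given explicitly,
    # so the warehouse-dir setting never re-enters the picture.
    return explicit_location or f"{db_location}/{table_name}"

def hive_table_dir(db_name, db_location, warehouse_dir, table_name,
                   explicit_location=None):
    # Hive, per HIVE-6374: for the default database, the current
    # hive.metastore.warehouse.dir is consulted instead of the database
    # location stored in the metastore.
    if explicit_location:
        return explicit_location
    if db_name == "default":
        return f"{warehouse_dir}/{table_name}"
    return f"{db_location}/{table_name}"

# With hive.metastore.warehouse.dir overridden after the default DB exists:
print(spark_table_dir("/user/ec2-user/warehouse", "x"))                        # → /user/ec2-user/warehouse/x
print(hive_table_dir("default", "/user/ec2-user/warehouse", "/my/warehouse", "x"))  # → /my/warehouse/x
```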

 hive.metastore.warehouse.dir is not respected
 -

 Key: SPARK-9944
 URL: https://issues.apache.org/jira/browse/SPARK-9944
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.4.1
Reporter: Manku Timma

 In 1.3.1, {{hive.metastore.warehouse.dir}} was honoured and table data was 
 stored there. In 1.4.0, it is no longer used; instead, 
 {{DBS.DB_LOCATION_URI}} from the metastore is always used. This breaks use 
 cases where the param is used to override the warehouse location.
 To reproduce the issue, start spark-shell with 
 {{hive.metastore.warehouse.dir}} set in hive-site.xml and run 
 {{df.saveAsTable(x)}}. You will see that the param is not honoured.






[jira] [Resolved] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type

2015-08-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-9982.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

 SparkR DataFrame fail to return data of Decimal type
 

 Key: SPARK-9982
 URL: https://issues.apache.org/jira/browse/SPARK-9982
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.1
Reporter: Alex Shkurenko
Assignee: Alex Shkurenko
 Fix For: 1.5.0


 Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but 
 with the Decimal datatype coming from a Postgres DB:
 //Set up SparkR
 Sys.setenv(SPARK_HOME="/Users/ashkurenko/work/git_repos/spark")
 Sys.setenv(SPARKR_SUBMIT_ARGS="--driver-class-path 
 ~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell")
 .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
 library(SparkR)
 sc <- sparkR.init(master="local")
 // Connect to a Postgres DB via JDBC
 sqlContext <- sparkRSQL.init(sc)
 sql(sqlContext, "
 CREATE TEMPORARY TABLE mytable 
 USING org.apache.spark.sql.jdbc 
 OPTIONS (url 'jdbc:postgresql://servername:5432/dbname'
 ,dbtable 'mydbtable'
 )
 ")
 // Try pulling a Decimal column from a table
 myDataFrame <- sql(sqlContext, "(select a_decimal_column from mytable)")
 // The schema shows up fine
 show(myDataFrame)
 DataFrame[a_decimal_column:decimal(10,0)]
 schema(myDataFrame)
 StructType
 |--name = "a_decimal_column", type = "DecimalType(10,0)", nullable = TRUE
 // ... but pulling data fails:
 localDF <- collect(myDataFrame)
 Error in as.data.frame.default(x[[i]], optional = TRUE) : 
   cannot coerce class "jobj" to a data.frame






[jira] [Updated] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type

2015-08-20 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-9982:
-
Assignee: Alex Shkurenko

 SparkR DataFrame fail to return data of Decimal type
 

 Key: SPARK-9982
 URL: https://issues.apache.org/jira/browse/SPARK-9982
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.1
Reporter: Alex Shkurenko
Assignee: Alex Shkurenko

 Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but 
 with the Decimal datatype coming from a Postgres DB:
 //Set up SparkR
 Sys.setenv(SPARK_HOME=/Users/ashkurenko/work/git_repos/spark)
 Sys.setenv(SPARKR_SUBMIT_ARGS=--driver-class-path 
 ~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell)
 .libPaths(c(file.path(Sys.getenv(SPARK_HOME), R, lib), .libPaths()))
 library(SparkR)
 sc - sparkR.init(master=local)
 // Connect to a Postgres DB via JDBC
 sqlContext - sparkRSQL.init(sc)
 sql(sqlContext, 
 CREATE TEMPORARY TABLE mytable 
 USING org.apache.spark.sql.jdbc 
 OPTIONS (url 'jdbc:postgresql://servername:5432/dbname'
 ,dbtable 'mydbtable'
 )
 )
 // Try pulling a Decimal column from a table
 myDataFrame - sql(sqlContext,(select a_decimal_column  from mytable ))
 // The schema shows up fine
 show(myDataFrame)
 DataFrame[a_decimal_column:decimal(10,0)]
 schema(myDataFrame)
 StructType
 |-name = a_decimal_column, type = DecimalType(10,0), nullable = TRUE
 // ... but pulling data fails:
 localDF - collect(myDataFrame)
 Error in as.data.frame.default(x[[i]], optional = TRUE) : 
   cannot coerce class jobj to a data.frame






[jira] [Commented] (SPARK-9982) SparkR DataFrame fail to return data of Decimal type

2015-08-20 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705328#comment-14705328
 ] 

Shivaram Venkataraman commented on SPARK-9982:
--

Resolved by https://github.com/apache/spark/pull/8239

 SparkR DataFrame fail to return data of Decimal type
 

 Key: SPARK-9982
 URL: https://issues.apache.org/jira/browse/SPARK-9982
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.1
Reporter: Alex Shkurenko

 Got an issue similar to https://issues.apache.org/jira/browse/SPARK-8897, but 
 with the Decimal datatype coming from a Postgres DB:
 //Set up SparkR
 Sys.setenv(SPARK_HOME=/Users/ashkurenko/work/git_repos/spark)
 Sys.setenv(SPARKR_SUBMIT_ARGS=--driver-class-path 
 ~/Downloads/postgresql-9.4-1201.jdbc4.jar sparkr-shell)
 .libPaths(c(file.path(Sys.getenv(SPARK_HOME), R, lib), .libPaths()))
 library(SparkR)
 sc - sparkR.init(master=local)
 // Connect to a Postgres DB via JDBC
 sqlContext - sparkRSQL.init(sc)
 sql(sqlContext, 
 CREATE TEMPORARY TABLE mytable 
 USING org.apache.spark.sql.jdbc 
 OPTIONS (url 'jdbc:postgresql://servername:5432/dbname'
 ,dbtable 'mydbtable'
 )
 )
 // Try pulling a Decimal column from a table
 myDataFrame - sql(sqlContext,(select a_decimal_column  from mytable ))
 // The schema shows up fine
 show(myDataFrame)
 DataFrame[a_decimal_column:decimal(10,0)]
 schema(myDataFrame)
 StructType
 |-name = a_decimal_column, type = DecimalType(10,0), nullable = TRUE
 // ... but pulling data fails:
 localDF - collect(myDataFrame)
 Error in as.data.frame.default(x[[i]], optional = TRUE) : 
   cannot coerce class jobj to a data.frame






[jira] [Commented] (SPARK-7544) pyspark.sql.types.Row should implement __getitem__

2015-08-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704430#comment-14704430
 ] 

Yanbo Liang commented on SPARK-7544:


I can work on it.

 pyspark.sql.types.Row should implement __getitem__
 --

 Key: SPARK-7544
 URL: https://issues.apache.org/jira/browse/SPARK-7544
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor

 Following from the related discussions in [SPARK-7505] and [SPARK-7133], the 
 {{Row}} type should implement {{\_\_getitem\_\_}} so that people can do this:
 {code}
 row['field']
 {code}
 instead of this:
 {code}
 row.field
 {code}
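The requested behaviour amounts to dispatching on the key type in {{\_\_getitem\_\_}}. A minimal sketch (hypothetical, not PySpark's actual implementation):

```python
class Row(tuple):
    # A tuple subclass that also supports row['field'] and row.field access.
    def __new__(cls, **kwargs):
        row = tuple.__new__(cls, kwargs.values())
        row.__fields__ = list(kwargs)  # remember field order
        return row

    def __getitem__(self, item):
        if isinstance(item, str):              # row['field']
            item = self.__fields__.index(item)
        return tuple.__getitem__(self, item)   # row[0] still works

    def __getattr__(self, name):               # row.field
        if name in self.__fields__:
            return self[name]
        raise AttributeError(name)

row = Row(name="Alice", age=30)
print(row["name"], row.age)  # → Alice 30
```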






[jira] [Resolved] (SPARK-10131) running spark job in docker by mesos-slave

2015-08-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10131.
---
Resolution: Invalid

Do you mind asking this at u...@spark.apache.org? As far as I can tell, you're 
just asking for assistance in understanding the problem rather than reporting 
a specific problem attributable to Spark. Please see 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark.

 running spark job in docker by mesos-slave
 --

 Key: SPARK-10131
 URL: https://issues.apache.org/jira/browse/SPARK-10131
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.4.1
 Environment: docker 1.8.1
 mesos 0.23.0
 spark 1.4.1
Reporter: Stream Liu

 I tried running a Spark job in Docker via mesos-slave,
 but I always get this ERROR from mesos-slave:
 E0820 07:46:08.780293 9 slave.cpp:1643] Failed to update resources for 
 container f2aeb5ee-2419-430c-be7d-8276947b909a of executor 
 '20150820-064813-1684252864-5050-1-S0' of framework 
 20150820-064813-1684252864-5050-1-0004, destroying container: Failed to 
 determine cgroup for the 'cpu' subsystem: Failed to read /proc/13071/cgroup: 
 Failed to open file '/proc/13071/cgroup': No such file or directory
 The target container does run, but it always exits with ExitCode 137.
 I think this could be caused by cgroups?
 The job works without --containerizers=docker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10134:


Assignee: Apache Spark

 Improve the performance of Binary Comparison
 

 Key: SPARK-10134
 URL: https://issues.apache.org/jira/browse/SPARK-10134
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
Assignee: Apache Spark
 Fix For: 1.6.0


 Currently, comparing binary data byte by byte is quite slow; using the Guava 
 utility, which compares 8 bytes at a time, improves performance.
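The idea is illustrated below in Python rather than on the JVM side (where a Guava comparator such as {{UnsignedBytes.lexicographicalComparator}} would be used): compare 8-byte words as big-endian unsigned integers, since big-endian word comparison preserves lexicographic byte order, and fall back to byte-by-byte only for the short tail. This is a sketch of the technique, not the actual patch:

```python
def compare_binary(a: bytes, b: bytes) -> int:
    """Lexicographic comparison, 8 bytes (one machine word) at a time."""
    n = min(len(a), len(b))
    # Compare full 8-byte words as big-endian unsigned integers:
    # big-endian word order preserves lexicographic byte order.
    for i in range(0, n - n % 8, 8):
        wa = int.from_bytes(a[i:i + 8], "big")
        wb = int.from_bytes(b[i:i + 8], "big")
        if wa != wb:
            return -1 if wa < wb else 1
    # Byte-by-byte fallback for the trailing (< 8 byte) remainder.
    for i in range(n - n % 8, n):
        if a[i] != b[i]:
            return -1 if a[i] < b[i] else 1
    # All shared bytes equal: the shorter input sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

On the JVM, the word loads additionally avoid per-byte bounds checks, which is where most of the speedup comes from.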



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704487#comment-14704487
 ] 

Apache Spark commented on SPARK-10134:
--

User 'chenghao-intel' has created a pull request for this issue:
https://github.com/apache/spark/pull/8335

 Improve the performance of Binary Comparison
 

 Key: SPARK-10134
 URL: https://issues.apache.org/jira/browse/SPARK-10134
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
 Fix For: 1.6.0


 Currently, comparing binary data byte by byte is quite slow; using the Guava 
 utility, which compares 8 bytes at a time, improves performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10134:


Assignee: (was: Apache Spark)

 Improve the performance of Binary Comparison
 

 Key: SPARK-10134
 URL: https://issues.apache.org/jira/browse/SPARK-10134
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
 Fix For: 1.6.0


 Currently, comparing binary data byte by byte is quite slow; using the Guava 
 utility, which compares 8 bytes at a time, improves performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10130) type coercion for IF should have children resolved first

2015-08-20 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704513#comment-14704513
 ] 

Cheng Hao commented on SPARK-10130:
---

Can you change the fix version to 1.5? I think lots of people suffer from this 
issue.

 type coercion for IF should have children resolved first
 

 Key: SPARK-10130
 URL: https://issues.apache.org/jira/browse/SPARK-10130
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang

 SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction

2015-08-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704351#comment-14704351
 ] 

Yin Huai commented on SPARK-10100:
--

How about we leave these functions as is for now? It looks like the improvement 
provided by updating the expressions is not very significant, and this also 
avoids code changes during the QA period.

 AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
 --

 Key: SPARK-10100
 URL: https://issues.apache.org/jira/browse/SPARK-10100
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Herman van Hovell
 Attachments: SPARK-10100.perf.test.scala


 Looks like Max (probably Min) implemented based on AggregateFunction2 is 
 slower than the old MaxFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9686) Spark hive jdbc client cannot get table from metadata store

2015-08-20 Thread pin_zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704374#comment-14704374
 ] 

pin_zhang commented on SPARK-9686:
--

What's the status of this bug? Will it be fixed in 1.4.x?

 Spark hive jdbc client cannot get table from metadata store
 ---

 Key: SPARK-9686
 URL: https://issues.apache.org/jira/browse/SPARK-9686
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0, 1.4.1
Reporter: pin_zhang
Assignee: Cheng Lian

 1. Start start-thriftserver.sh
 2. Connect with beeline
 3. Create a table
 4. show tables; the newly created table is returned
 5.
   Class.forName("org.apache.hive.jdbc.HiveDriver");
   String URL = "jdbc:hive2://localhost:1/default";
   Properties info = new Properties();
   Connection conn = DriverManager.getConnection(URL, info);
   ResultSet tables = conn.getMetaData().getTables(conn.getCatalog(),
       null, null, null);
 Problem:
 No tables are returned by this API, although it works in Spark 1.3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10130) type coercion for IF should have children resolved first

2015-08-20 Thread Adrian Wang (JIRA)
Adrian Wang created SPARK-10130:
---

 Summary: type coercion for IF should have children resolved first
 Key: SPARK-10130
 URL: https://issues.apache.org/jira/browse/SPARK-10130
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang


SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9040) StructField datatype Conversion Error

2015-08-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704393#comment-14704393
 ] 

Yanbo Liang commented on SPARK-9040:


[~vnayak053] The code works well on Spark 1.4. Have you tried it on Spark 1.4?

 StructField datatype Conversion Error
 -

 Key: SPARK-9040
 URL: https://issues.apache.org/jira/browse/SPARK-9040
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core, SQL
Affects Versions: 1.3.0
 Environment: Cloudera 5.3 on CDH 6
Reporter: Sandeep Pal

 The following issue occurs if I specify the StructFields in a specific order in 
 the StructType, as follows:
 fields = [StructField("d", IntegerType(), True), StructField("b", 
 IntegerType(), True), StructField("a", StringType(), True), StructField("c", 
 IntegerType(), True)]
 But the following code works fine:
 fields = [StructField("d", IntegerType(), True), StructField("b", 
 IntegerType(), True), StructField("c", IntegerType(), True), StructField("a", 
 StringType(), True)]
 <ipython-input-27-9d675dd6a2c9> in <module>()
  18 
  19 schema = StructType(fields)
 ---> 20 schemasimid_simple = 
 sqlContext.createDataFrame(simid_simplereqfields, schema)
  21 schemasimid_simple.registerTempTable("simid_simple")
 /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/context.py in 
 createDataFrame(self, data, schema, samplingRatio)
 302 
 303 for row in rows:
 --> 304 _verify_type(row, schema)
 305 
 306 # convert python objects to sql data
 /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in 
 _verify_type(obj, dataType)
 986 "length of fields (%d)" % (len(obj), 
 len(dataType.fields)))
 987 for v, f in zip(obj, dataType.fields):
 --> 988 _verify_type(v, f.dataType)
 989 
 990 _cached_cls = weakref.WeakValueDictionary()
 /usr/local/bin/spark-1.3.1-bin-hadoop2.6/python/pyspark/sql/types.py in 
 _verify_type(obj, dataType)
 970 if type(obj) not in _acceptable_types[_type]:
 971 raise TypeError("%s can not accept object in type %s"
 --> 972 % (dataType, type(obj)))
 973 
 974 if isinstance(dataType, ArrayType):
 TypeError: StringType can not accept object in type <type 'int'>
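The failure reported above follows from PySpark verifying each row positionally against the schema: the i-th value of every row must match the type of the i-th StructField. A toy re-creation of that positional check (hypothetical and heavily simplified from `_verify_type`; the type names and helper are illustrative):

```python
_acceptable = {"int": int, "string": str}

def verify_row(row, fields):
    """Check each value against the schema field at the SAME position."""
    if len(row) != len(fields):
        raise ValueError("row length does not match number of fields")
    for value, (name, typ) in zip(row, fields):
        if not isinstance(value, _acceptable[typ]):
            raise TypeError(
                "%s can not accept %r for field %s" % (typ, value, name))

# Values in the data are arranged as d, b, c, a:
row = (1, 2, 3, "x")
fields_ok = [("d", "int"), ("b", "int"), ("c", "int"), ("a", "string")]
fields_bad = [("d", "int"), ("b", "int"), ("a", "string"), ("c", "int")]

verify_row(row, fields_ok)  # passes: types line up position by position
```

With `fields_bad`, the string field "a" sits at position 2, where the data holds the integer 3, so the check raises a TypeError much like the traceback above.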



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9098) Inconsistent Dense Vectors hashing between PySpark and Scala

2015-08-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9098.
--
  Resolution: Duplicate
Target Version/s:   (was: 1.6.0)

I agree, I think this is a subset of the broader fix/issue in SPARK-9793

 Inconsistent Dense Vectors hashing between PySpark and Scala
 

 Key: SPARK-9098
 URL: https://issues.apache.org/jira/browse/SPARK-9098
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, PySpark
Affects Versions: 1.3.1, 1.4.0
Reporter: Maciej Szymkiewicz
Priority: Minor

 When using Scala, it is possible to group an RDD using a DenseVector as the key:
 {code}
 import org.apache.spark.mllib.linalg.Vectors
 val rdd = sc.parallelize(
 (Vectors.dense(1, 2, 3), 10) :: (Vectors.dense(1, 2, 3), 20) :: Nil)
 rdd.groupByKey.count
 {code}
 returns 1 as expected.
 In PySpark, {{DenseVector}}'s {{__hash__}} seems to be inherited from 
 {{object}} and based on the memory address:
 {code}
 from pyspark.mllib.linalg import DenseVector
 rdd = sc.parallelize(
 [(DenseVector([1, 2, 3]), 10), (DenseVector([1, 2, 3]), 20)])
 rdd.groupByKey().count()
 {code}
 returns 2.
 Since the underlying `numpy.ndarray` can be used to mutate a DenseVector, 
 hashing doesn't look meaningful at all:
 {code}
 >>> dv = DenseVector([1, 2, 3])
 >>> hdv1 = hash(dv)
 >>> dv.array[0] = 3.0
 >>> hdv2 = hash(dv)
 >>> hdv1 == hdv2
 True
 >>> dv == DenseVector([1, 2, 3])
 False
 {code}
 In my opinion the best approach would be to enforce immutability and provide 
 meaningful hashing. An alternative is to make {{DenseVector}} unhashable, the 
 same as {{numpy.ndarray}}.
 Source: 
 http://stackoverflow.com/questions/31449412/how-to-groupbykey-a-rdd-with-densevector-as-key-in-spark/31451752
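The "immutability plus meaningful hashing" option proposed above could look like the following sketch (a hypothetical illustration, not MLlib's actual DenseVector): contents are frozen at construction, and the hash is derived from them, so equal vectors hash alike and grouping by value works.

```python
class ImmutableVector:
    """Sketch of a dense vector with value-based (content) hashing."""

    def __init__(self, values):
        # Freeze the contents so the hash can never go stale.
        self._values = tuple(float(v) for v in values)

    def __eq__(self, other):
        return (isinstance(other, ImmutableVector)
                and self._values == other._values)

    def __hash__(self):
        # Equal contents => equal hash, so grouping by key works by value.
        return hash(self._values)
```

The trade-off is exactly the one the report names: once the hash reflects the contents, in-place mutation must be forbidden, which is why the alternative is to make the type unhashable like `numpy.ndarray`.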



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10067) Long delay (16 seconds) when running local session on offline machine

2015-08-20 Thread Daniel Pinyol (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Pinyol updated SPARK-10067:
--
Comment: was deleted

(was: Fixed after upgrading to JDK 1.8.0_60. Probably due to 
http://bugs.java.com/view_bug.do?bug_id=8077102)

 Long delay (16 seconds) when running local session on offline machine
 -

 Key: SPARK-10067
 URL: https://issues.apache.org/jira/browse/SPARK-10067
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
 Environment: Mac 10.10.5, java 1.8.0_51 from IntelliJ 14.1
Reporter: Daniel Pinyol
Priority: Minor

 If I run this
 {code:java}
 SparkContext sc = new SparkContext("local", "test");
 {code}
 on a machine with no network, it hangs for 15 or 16 seconds at this 
 point and then successfully resumes. It looks like the problem is in 
 checking the Kerberos realm (see the call stack below).
 Is there any way to avoid this annoying delay? I reviewed 
 https://spark.apache.org/docs/latest/configuration.html but couldn't find 
 a solution.
 Thanks
 {noformat}
 main@1 prio=5 tid=0x1 nid=NA runnable
   java.lang.Thread.State: RUNNABLE
 at 
 java.net.PlainDatagramSocketImpl.peekData(PlainDatagramSocketImpl.java:-1)
 - locked 0x758 (a java.net.PlainDatagramSocketImpl)
 at java.net.DatagramSocket.receive(DatagramSocket.java:787)
 - locked 0x732 (a java.net.DatagramSocket)
 - locked 0x759 (a java.net.DatagramPacket)
 at com.sun.jndi.dns.DnsClient.doUdpQuery(DnsClient.java:413)
 at com.sun.jndi.dns.DnsClient.query(DnsClient.java:207)
 at com.sun.jndi.dns.Resolver.query(Resolver.java:81)
 at com.sun.jndi.dns.DnsContext.c_getAttributes(DnsContext.java:434)
 at 
 com.sun.jndi.toolkit.ctx.ComponentDirContext.p_getAttributes(ComponentDirContext.java:235)
 at 
 com.sun.jndi.toolkit.ctx.PartialCompositeDirContext.getAttributes(PartialCompositeDirContext.java:141)
 at 
 com.sun.jndi.toolkit.url.GenericURLDirContext.getAttributes(GenericURLDirContext.java:103)
 at 
 sun.security.krb5.KrbServiceLocator.getKerberosService(KrbServiceLocator.java:85)
 at sun.security.krb5.Config.checkRealm(Config.java:1120)
 at sun.security.krb5.Config.getRealmFromDNS(Config.java:1093)
 at sun.security.krb5.Config.getDefaultRealm(Config.java:987)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:483)
 at 
 org.apache.hadoop.security.authentication.util.KerberosUtil.getDefaultRealm(KerberosUtil.java:75)
 at 
 org.apache.hadoop.security.authentication.util.KerberosName.<clinit>(KerberosName.java:85)
 at 
 org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:225)
 - locked 0x57d (a java.lang.Class)
 at 
 org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:214)
 at 
 org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:669)
 at 
 org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
 at 
 org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2162)
 at 
 org.apache.spark.util.Utils$$anonfun$getCurrentUserName$1.apply(Utils.scala:2162)
 at scala.Option.getOrElse(Option.scala:121)
 at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2162)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:301)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:155)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:170)
 at DataFrameSandbox.<init>(DataFrameSandbox.java:31)
 at DataFrameSandbox.main(DataFrameSandbox.java:45)
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10132) daemon crash caused by memory leak

2015-08-20 Thread ZemingZhao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZemingZhao updated SPARK-10132:
---
Attachment: xqjmap.live
xqjmap.all
oracle_gclog

Attaching the GC log and jmap info.

 daemon crash caused by memory leak
 --

 Key: SPARK-10132
 URL: https://issues.apache.org/jira/browse/SPARK-10132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.4.1
 Environment: 1. Cluster: 7 Redhat notes cluster, each has 32 cores
 2. OS type: Red Hat Enterprise Linux Server release 7.1 (Maipo)
 3. Java version:   tried both Oracle jdk 1.6 and 1.7 
 java version "1.6.0_13"
 Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
 Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
 java version "1.7.0"
 Java(TM) SE Runtime Environment (build 1.7.0-b147)
 Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
 4. JVM Option on spark-env.sh, 
 Notes: SPARK_DAEMON_MEMORY was set to 300M to speed up the crash process
 SPARK_DAEMON_JAVA_OPTS=-Xloggc:/root/spark/oracle_gclog
 SPARK_DAEMON_MEMORY=300m
Reporter: ZemingZhao
Priority: Critical
 Attachments: oracle_gclog, xqjmap.all, xqjmap.live


 Constantly submitting short batch workloads to Spark causes the 
 Spark master and worker to crash due to a memory leak.
 According to the GC log and jmap info, this leak should be related to Akka, but 
 we cannot find the root cause as of now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704540#comment-14704540
 ] 

Apache Spark commented on SPARK-9089:
-

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8337

 Failing to run simple job on Spark Standalone Cluster
 -

 Key: SPARK-9089
 URL: https://issues.apache.org/jira/browse/SPARK-9089
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Staging
Reporter: Amar Goradia
Priority: Critical

 We are trying out Spark and, as part of that, we have set up a standalone Spark 
 cluster. As part of testing things out, we simply opened a PySpark shell and ran 
 this simple job: a=sc.parallelize([1,2,3]).count()
 As a result, we are getting errors. We tried googling around this error but 
 haven't been able to find the exact reason why we are running into this 
 state. Can somebody please help us look into this issue further and advise us 
 on what we are missing here?
 Here is full error stack:
 >>> a=sc.parallelize([1,2,3]).count()
 15/07/16 00:52:15 INFO SparkContext: Starting job: count at <stdin>:1
 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at <stdin>:1) with 2 
 output partitions (allowLocal=false)
 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5 (count at 
 <stdin>:1)
 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] 
 at count at <stdin>:1), which has no missing parents
 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at <stdin>:1) 
 failed in Unknown s
 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at <stdin>:1, took 
 0.004963 s
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 972, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 963, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 771, in reduce
 vals = self.mapPartitions(func).collect()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 745, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
  line 538, in __call__
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 serialization failed: java.lang.reflect.InvocationTargetException
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
 org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
 org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
 org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
   at 
 

[jira] [Assigned] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9089:
---

Assignee: (was: Apache Spark)

 Failing to run simple job on Spark Standalone Cluster
 -

 Key: SPARK-9089
 URL: https://issues.apache.org/jira/browse/SPARK-9089
 Project: Spark
  Issue Type: Question
  Components: PySpark
Affects Versions: 1.4.0
 Environment: Staging
Reporter: Amar Goradia
Priority: Critical

 We are trying out Spark and, as part of that, we have set up a standalone Spark 
 cluster. As part of testing things out, we simply opened a PySpark shell and ran 
 this simple job: a=sc.parallelize([1,2,3]).count()
 As a result, we are getting errors. We tried googling around this error but 
 haven't been able to find the exact reason why we are running into this 
 state. Can somebody please help us look into this issue further and advise us 
 on what we are missing here?
 Here is full error stack:
 >>> a=sc.parallelize([1,2,3]).count()
 15/07/16 00:52:15 INFO SparkContext: Starting job: count at <stdin>:1
 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at <stdin>:1) with 2 
 output partitions (allowLocal=false)
 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5 (count at 
 <stdin>:1)
 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] 
 at count at <stdin>:1), which has no missing parents
 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at <stdin>:1) 
 failed in Unknown s
 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at <stdin>:1, took 
 0.004963 s
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 972, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 963, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 771, in reduce
 vals = self.mapPartitions(func).collect()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 745, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
  line 538, in __call__
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 serialization failed: java.lang.reflect.InvocationTargetException
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
 org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
 org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
 org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
   at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
   at 
 

[jira] [Comment Edited] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster

2015-08-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704559#comment-14704559
 ] 

Yanbo Liang edited comment on SPARK-9089 at 8/20/15 9:18 AM:
-

I think this issue is due to a failure constructing the compression codec, which 
throws an {{InvocationTargetException}}.
It usually happens when Snappy is configured as the compression codec.
Here we can catch the {{InvocationTargetException}} and throw an 
{{IllegalArgumentException}} that tells users to fall back to another codec such 
as LZF.


was (Author: yanboliang):
I think this issue is due to failure of compression codec construction and it 
throws {{InvocationTargetException}}.
It usually happened when Snappy was configured as the compression codec.
Here we can catch the {{InvocationTargetException}} and throw 
{{IllegalArgumentException}} that tells users to switch to fallback compression 
codec such as LZF.
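The pattern described in the comment above, translated into a short Python sketch for illustration (the actual fix is on the JVM side; the class and function names here are hypothetical): catch the opaque construction failure and rethrow a clearer error that names a fallback codec.

```python
class CodecUnavailableError(ValueError):
    """Clearer replacement for an opaque construction failure."""

class LzfCodec:
    """Stand-in codec that always constructs successfully."""

def broken_snappy():
    # Simulates a codec whose native library is missing.
    raise RuntimeError("native snappy library not loaded")

def create_codec(name, registry):
    """Instantiate a codec; surface construction failures with a hint."""
    try:
        return registry[name]()
    except Exception as exc:
        raise CodecUnavailableError(
            "codec %r failed to load (%s); "
            "try falling back to another codec such as 'lzf'" % (name, exc)
        ) from exc

registry = {"lzf": LzfCodec, "snappy": broken_snappy}
```

`create_codec("snappy", registry)` now fails with an actionable message instead of a bare reflection error, which is the point of the proposed change.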

 Failing to run simple job on Spark Standalone Cluster
 -

 Key: SPARK-9089
 URL: https://issues.apache.org/jira/browse/SPARK-9089
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: Staging
Reporter: Amar Goradia
Priority: Critical

 We are trying out Spark and, as part of that, we have set up a standalone Spark 
 cluster. As part of testing things out, we simply opened a PySpark shell and ran 
 this simple job: a=sc.parallelize([1,2,3]).count()
 As a result, we are getting errors. We tried googling around this error but 
 haven't been able to find the exact reason why we are running into this 
 state. Can somebody please help us look into this issue further and advise us 
 on what we are missing here?
 Here is full error stack:
 >>> a=sc.parallelize([1,2,3]).count()
 15/07/16 00:52:15 INFO SparkContext: Starting job: count at <stdin>:1
 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at <stdin>:1) with 2 
 output partitions (allowLocal=false)
 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5 (count at 
 <stdin>:1)
 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] 
 at count at <stdin>:1), which has no missing parents
 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at <stdin>:1) 
 failed in Unknown s
 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at <stdin>:1, took 
 0.004963 s
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 972, in count
 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 963, in sum
 return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 771, in reduce
 vals = self.mapPartitions(func).collect()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 
 745, in collect
 port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
  line 538, in __call__
   File 
 "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
  line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 serialization failed: java.lang.reflect.InvocationTargetException
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
 org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
 org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:80)
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
 

[jira] [Commented] (SPARK-8805) Spark shell not working

2015-08-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704497#comment-14704497
 ] 

Sean Owen commented on SPARK-8805:
--

Yeah, I think the problem is your version of bash is pretty old. bash 4 has 
been around for a long time; I'd use that.
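For what it's worth, a minimal version guard along these lines (purely illustrative; this is not the actual spark-class code) would fail fast with a clearer message than the parse error above when run under an old bash:

```shell
#!/usr/bin/env bash
# Illustrative guard only: refuse to run under pre-4 bash, where newer
# [[ ... ]] conditional operators can trigger "conditional binary operator
# expected" parse errors.
if (( BASH_VERSINFO[0] < 4 )); then
  echo "this script requires bash >= 4; found ${BASH_VERSION}" >&2
  exit 1
fi
echo "bash ${BASH_VERSINFO[0]} is new enough"
```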

 Spark shell not working
 ---

 Key: SPARK-8805
 URL: https://issues.apache.org/jira/browse/SPARK-8805
 Project: Spark
  Issue Type: Brainstorming
  Components: Spark Core, Windows
Reporter: Perinkulam I Ganesh

 I am using Git Bash on windows.  Installed Open jdk1.8.0_45 and spark 1.4.0
 I am able to build spark and install it. But when ever I execute spark shell 
 it gives me the following error:
 $ spark-shell
 /c/.../spark/bin/spark-class: line 76: conditional binary operator expected



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'

2015-08-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10122.
---
Resolution: Not A Problem

I don't think this is a bug. Yes, only the initial RDD is actually a Kafka RDD. 
You need to operate on it if you manipulate offset ranges.
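In other words, offset ranges have to be read inside the first transform, while the RDD is still the Kafka-specific type; later stages only see a plain RDD. A pure-Python sketch of that pattern (FakeKafkaRDD and all names here are illustrative stand-ins, not Spark's API):

```python
# Sketch: capture Kafka-specific metadata at the first hop of the chain,
# where the object still has offsetRanges(); downstream operations then see
# only the plain RDD interface. FakeKafkaRDD stands in for Spark's KafkaRDD.
class FakeKafkaRDD:
    def __init__(self, data, ranges):
        self._data = data
        self._ranges = ranges

    def offsetRanges(self):
        return self._ranges

    def count(self):
        return len(self._data)

captured = []

def attach_kafka_metadata(kafka_rdd):
    # Only this first transform sees the Kafka-specific type.
    captured.extend(kafka_rdd.offsetRanges())
    return kafka_rdd

n = attach_kafka_metadata(FakeKafkaRDD([1, 2, 3], [("topic", 0, 0, 3)])).count()
```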

 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 

 Key: SPARK-10122
 URL: https://issues.apache.org/jira/browse/SPARK-10122
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Reporter: Amit Ramesh
Priority: Critical
  Labels: kafka

 SPARK-8389 added the offsetRanges interface to Kafka direct streams. This 
 however appears to break when chaining operations after a transform 
 operation. Following is example code that would result in an error (stack 
 trace below). Note that if the 'count()' operation is taken out of the 
 example code then this error does not occur anymore, and the Kafka data is 
 printed.
 {code:title=kafka_test.py|collapse=true}
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.streaming.kafka import KafkaUtils
 def attach_kafka_metadata(kafka_rdd):
     offset_ranges = kafka_rdd.offsetRanges()
     return kafka_rdd

 if __name__ == '__main__':
     sc = SparkContext(appName='kafka-test')
     ssc = StreamingContext(sc, 10)
     kafka_stream = KafkaUtils.createDirectStream(
         ssc,
         [TOPIC],
         kafkaParams={
             'metadata.broker.list': BROKERS,
         },
     )
     kafka_stream.transform(attach_kafka_metadata).count().pprint()
     ssc.start()
     ssc.awaitTermination()
 {code}
 {code:title=Stack trace|collapse=true}
 Traceback (most recent call last):
   File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py", line 62, in call
     r = self.func(t, *rdds)
   File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
     self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
     self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
     self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py", line 616, in <lambda>
     self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File "/home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py", line 332, in <lambda>
     func = lambda t, rdd: oldfunc(rdd)
   File "/home/spark/ad_realtime/batch/kafka_test.py", line 7, in attach_kafka_metadata
     offset_ranges = kafka_rdd.offsetRanges()
 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10092) Multi-DB support follow up

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704516#comment-14704516
 ] 

Apache Spark commented on SPARK-10092:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8336

 Multi-DB support follow up
 --

 Key: SPARK-10092
 URL: https://issues.apache.org/jira/browse/SPARK-10092
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker

 It seems we need follow-up work for our multi-DB support. Here are the issues 
 we need to address:
 1. saveAsTable always saves the table in the folder of the current database.
 2. HiveContext's refreshTable and analyze do not support dbName.tableName.
 3. It would be good to use TableIdentifier in CreateTableUsing, 
 CreateTableUsingAsSelect, CreateTempTableUsing, CreateTempTableUsingAsSelect, 
 CreateMetastoreDataSource, and CreateMetastoreDataSourceAsSelect, instead of 
 the string representation (in several places we have actually already 
 parsed the string and obtained the TableIdentifier). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10130) type coercion for IF should have children resolved first

2015-08-20 Thread Adrian Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Wang updated SPARK-10130:

Fix Version/s: 1.5.0

 type coercion for IF should have children resolved first
 

 Key: SPARK-10130
 URL: https://issues.apache.org/jira/browse/SPARK-10130
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang
 Fix For: 1.5.0


 SELECT IF(a > 0, a, 0) FROM (SELECT key a FROM src) temp;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2015-08-20 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704534#comment-14704534
 ] 

Littlestar commented on SPARK-2883:
---

Spark 1.4.1: the ORC file writer relies on HiveContext and the Hive metastore.



 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: New Feature
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Assignee: Zhan Zhang
Priority: Critical
 Fix For: 1.4.0

 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support of OrcInputFormat in spark, fix issues if exists and add 
 documentation of its usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10132) daemon crash caused by memory leak

2015-08-20 Thread ZemingZhao (JIRA)
ZemingZhao created SPARK-10132:
--

 Summary: daemon crash caused by memory leak
 Key: SPARK-10132
 URL: https://issues.apache.org/jira/browse/SPARK-10132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1, 1.3.1
 Environment: 1. Cluster: 7-node Red Hat cluster, each node has 32 cores

2. OS type: Red Hat Enterprise Linux Server release 7.1 (Maipo)

3. Java version:   tried both Oracle jdk 1.6 and 1.7 
java version 1.6.0_13
Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
java version 1.7.0
Java(TM) SE Runtime Environment (build 1.7.0-b147)
Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)

4. JVM Option on spark-env.sh, 
Notes: SPARK_DAEMON_MEMORY was set to 300M to speed up the crash process
SPARK_DAEMON_JAVA_OPTS=-Xloggc:/root/spark/oracle_gclog
SPARK_DAEMON_MEMORY=300m
Reporter: ZemingZhao
Priority: Critical


Constantly submitting short batch workloads to Spark causes the Spark master 
and worker to crash due to a memory leak.
According to the GC log and jmap info, this leak appears to be related to Akka, 
but we have not found the root cause yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10134) Improve the performance of Binary Comparison

2015-08-20 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-10134:
-

 Summary: Improve the performance of Binary Comparison
 Key: SPARK-10134
 URL: https://issues.apache.org/jira/browse/SPARK-10134
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao
 Fix For: 1.6.0


Currently, comparing binary values byte by byte is quite slow; use the Guava 
utility, which takes 8 bytes at a time in the comparison, to improve performance.
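The idea can be sketched as follows (a pure-Python illustration of the technique behind Guava's lexicographical comparator, not Spark's actual code): compare eight bytes per step as big-endian unsigned words, falling back to single bytes for the tail.

```python
# Word-at-a-time lexicographic byte comparison: most steps compare 8 bytes
# packed as one big-endian unsigned integer instead of one byte at a time.
def compare_binary(a, b):
    n = min(len(a), len(b))
    i = 0
    while i + 8 <= n:
        wa = int.from_bytes(a[i:i + 8], "big")
        wb = int.from_bytes(b[i:i + 8], "big")
        if wa != wb:
            return -1 if wa < wb else 1
        i += 8
    while i < n:  # tail shorter than one word
        if a[i] != b[i]:
            return -1 if a[i] < b[i] else 1
        i += 1
    return (len(a) > len(b)) - (len(a) < len(b))
```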



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10132) daemon crash caused by memory leak

2015-08-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10132.
---
Resolution: Invalid

I don't think any of this suggests a problem in Spark though, right? You just 
ran out of memory.

I'm provisionally closing this since there is no detail about Spark here, a 
reproduction, or argument that there is a memory leak. It can be reopened if 
this detail is provided.

 daemon crash caused by memory leak
 --

 Key: SPARK-10132
 URL: https://issues.apache.org/jira/browse/SPARK-10132
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.1, 1.4.1
 Environment: 1. Cluster: 7-node Red Hat cluster, each node has 32 cores
 2. OS type: Red Hat Enterprise Linux Server release 7.1 (Maipo)
 3. Java version:   tried both Oracle jdk 1.6 and 1.7 
 java version 1.6.0_13
 Java(TM) SE Runtime Environment (build 1.6.0_13-b03)
 Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode)
 java version 1.7.0
 Java(TM) SE Runtime Environment (build 1.7.0-b147)
 Java HotSpot(TM) 64-Bit Server VM (build 21.0-b17, mixed mode)
 4. JVM Option on spark-env.sh, 
 Notes: SPARK_DAEMON_MEMORY was set to 300M to speed up the crash process
 SPARK_DAEMON_JAVA_OPTS=-Xloggc:/root/spark/oracle_gclog
 SPARK_DAEMON_MEMORY=300m
Reporter: ZemingZhao
Priority: Critical
 Attachments: oracle_gclog, xqjmap.all, xqjmap.live


 Constantly submitting short batch workloads to Spark causes the Spark master 
 and worker to crash due to a memory leak.
 According to the GC log and jmap info, this leak appears to be related to Akka, 
 but we have not found the root cause yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster

2015-08-20 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-9089:
---
Component/s: (was: PySpark)
 Spark Core

 Failing to run simple job on Spark Standalone Cluster
 -

 Key: SPARK-9089
 URL: https://issues.apache.org/jira/browse/SPARK-9089
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: Staging
Reporter: Amar Goradia
Priority: Critical

 We are trying out Spark and, as part of that, we have set up a standalone Spark 
 cluster. While testing things out, we simply opened the PySpark shell and ran 
 this simple job: a=sc.parallelize([1,2,3]).count()
 As a result, we are getting errors. We tried googling around for this error but 
 haven't been able to find the exact reason why we are running into this 
 state. Can somebody please help us look further into this issue and advise us 
 on what we are missing here?
 Here is the full error stack:
  a=sc.parallelize([1,2,3]).count()
 15/07/16 00:52:15 INFO SparkContext: Starting job: count at <stdin>:1
 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at <stdin>:1) with 2 
 output partitions (allowLocal=false)
 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at 
 <stdin>:1)
 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] 
 at count at <stdin>:1), which has no missing parents
 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at <stdin>:1) 
 failed in Unknown s
 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at <stdin>:1, took 
 0.004963 s
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 972, in count
     return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 963, in sum
     return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 771, in reduce
     vals = self.mapPartitions(func).collect()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 745, in collect
     port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 serialization failed: java.lang.reflect.InvocationTargetException
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
 org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
 org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
 org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
   at 
 

[jira] [Resolved] (SPARK-10133) loadLibSVMFile fails to detect zero-based lines

2015-08-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10133.
---
Resolution: Not A Problem

No, because the indices have already had 1 subtracted from them.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L82

I'm sure about this since I also made the same mistake when reviewing this 
change, enough that I'm provisionally closing this.
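Concretely, because parsing subtracts 1 from each index first, the previous = -1 sentinel does reject a raw 0 index. A minimal Python sketch that mirrors that logic (illustrative only; the real check lives in Scala's MLUtils.loadLibSVMFile):

```python
# Parse one libSVM line ("label idx:val idx:val ..."). Indices in the file
# are one-based; 1 is subtracted during parsing, so the ascending-order
# check against previous = -1 correctly rejects a raw 0 index (0 - 1 = -1,
# which is not greater than the sentinel).
def parse_libsvm_line(line):
    items = line.split()
    label = float(items[0])
    indices, values = [], []
    previous = -1
    for item in items[1:]:
        idx_str, val_str = item.split(":")
        current = int(idx_str) - 1  # convert one-based to zero-based
        if current <= previous:
            raise ValueError("indices should be one-based and in ascending order")
        indices.append(current)
        values.append(float(val_str))
        previous = current
    return label, indices, values
```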

 loadLibSVMFile fails to detect zero-based lines
 ---

 Key: SPARK-10133
 URL: https://issues.apache.org/jira/browse/SPARK-10133
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Xusen Yin
Priority: Minor

 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L88
 The code wants to ensure that each line's vector indices are one-based and in 
 ascending order, but it fails to do so because previous = -1 at the beginning.
 In this condition, a libSVM format file that begins with a 0-based index can 
 be read in normally, but the size of the corresponding SparseVector is wrong, 
 i.e. numFeatures - 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9089) Failing to run simple job on Spark Standalone Cluster

2015-08-20 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704559#comment-14704559
 ] 

Yanbo Liang commented on SPARK-9089:


I think this issue is due to a failure in compression codec construction, which 
throws {{InvocationTargetException}}.
It usually happens when Snappy is configured as the compression codec.
Here we can catch the {{InvocationTargetException}} and throw an 
{{IllegalArgumentException}} that tells users to switch to a fallback compression 
codec such as LZF.
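That suggestion can be sketched generically as follows (names are illustrative; the real change would go in Scala's CompressionCodec.createCodec):

```python
# Sketch: wrap codec construction so a reflective construction failure is
# re-raised with an actionable message instead of surfacing as a bare
# InvocationTargetException deep in a stack trace.
def create_codec(factory, codec_name="snappy"):
    try:
        return factory()
    except Exception as e:  # stands in for InvocationTargetException
        raise ValueError(
            "Failed to construct compression codec '%s'; consider a "
            "fallback, e.g. spark.io.compression.codec=lzf" % codec_name
        ) from e
```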

 Failing to run simple job on Spark Standalone Cluster
 -

 Key: SPARK-9089
 URL: https://issues.apache.org/jira/browse/SPARK-9089
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 1.4.0
 Environment: Staging
Reporter: Amar Goradia
Priority: Critical

 We are trying out Spark and, as part of that, we have set up a standalone Spark 
 cluster. While testing things out, we simply opened the PySpark shell and ran 
 this simple job: a=sc.parallelize([1,2,3]).count()
 As a result, we are getting errors. We tried googling around for this error but 
 haven't been able to find the exact reason why we are running into this 
 state. Can somebody please help us look further into this issue and advise us 
 on what we are missing here?
 Here is the full error stack:
  a=sc.parallelize([1,2,3]).count()
 15/07/16 00:52:15 INFO SparkContext: Starting job: count at <stdin>:1
 15/07/16 00:52:15 INFO DAGScheduler: Got job 5 (count at <stdin>:1) with 2 
 output partitions (allowLocal=false)
 15/07/16 00:52:15 INFO DAGScheduler: Final stage: ResultStage 5(count at 
 <stdin>:1)
 15/07/16 00:52:15 INFO DAGScheduler: Parents of final stage: List()
 15/07/16 00:52:15 INFO DAGScheduler: Missing parents: List()
 15/07/16 00:52:15 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[12] 
 at count at <stdin>:1), which has no missing parents
 15/07/16 00:52:15 INFO TaskSchedulerImpl: Cancelling stage 5
 15/07/16 00:52:15 INFO DAGScheduler: ResultStage 5 (count at <stdin>:1) 
 failed in Unknown s
 15/07/16 00:52:15 INFO DAGScheduler: Job 5 failed: count at <stdin>:1, took 
 0.004963 s
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 972, in count
     return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 963, in sum
     return self.mapPartitions(lambda x: [sum(x)]).reduce(operator.add)
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 771, in reduce
     vals = self.mapPartitions(func).collect()
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/pyspark/rdd.py", line 745, in collect
     port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
   File "/opt/spark/spark-1.4.0-bin-hadoop2.4/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 z:org.apache.spark.api.python.PythonRDD.collectAndServe.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 
 serialization failed: java.lang.reflect.InvocationTargetException
 sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:68)
 org.apache.spark.io.CompressionCodec$.createCodec(CompressionCodec.scala:60)
 org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$setConf(TorrentBroadcast.scala:73)
 org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:80)
 org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
 org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
 org.apache.spark.SparkContext.broadcast(SparkContext.scala:1289)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:874)
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:815)
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:799)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1419)
 org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)

[jira] [Resolved] (SPARK-8854) Documentation for Association Rules

2015-08-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8854.
--
Resolution: Duplicate

 Documentation for Association Rules
 ---

 Key: SPARK-8854
 URL: https://issues.apache.org/jira/browse/SPARK-8854
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Reporter: Feynman Liang
Priority: Minor

 Documentation describing how to generate association rules from frequent 
 itemsets needs to be provided. The relevant method is 
 {{FPGrowthModel.generateAssociationRules}}. This will likely be added to the 
 existing section for frequent-itemsets using FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10143) Parquet changed the behavior of calculating splits

2015-08-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10143:
-
Component/s: SQL

 Parquet changed the behavior of calculating splits
 --

 Key: SPARK-10143
 URL: https://issues.apache.org/jira/browse/SPARK-10143
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical

 When Parquet's task-side metadata is enabled (it is enabled by default, and it 
 needs to be enabled to deal with tables with many files), Parquet delegates 
 the work of calculating initial splits to FileInputFormat (see 
 https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
  If the filesystem's block size is smaller than the row group size and users do 
 not set a min split size, the initial split list will contain lots of 
 dummy splits, and they contribute empty tasks (because the starting point 
 and ending point of such a split do not cover the starting point of any row 
 group). 
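A toy model of the arithmetic (all sizes below are assumed for illustration, in MB): a split does useful work only if some row group starts inside its byte range.

```python
# Count how many FileInputFormat-style splits of a Parquet file actually
# contain the start of a row group; the rest become empty tasks.
def split_stats(file_size, split_size, row_group_size):
    row_group_starts = set(range(0, file_size, row_group_size))
    total = nonempty = 0
    start = 0
    while start < file_size:
        end = min(start + split_size, file_size)
        total += 1
        if any(start <= rg < end for rg in row_group_starts):
            nonempty += 1
        start = end
    return total, nonempty

# 256 MB file, 64 MB (block-sized) splits, one 256 MB row group:
total, nonempty = split_stats(256, 64, 256)  # 4 splits, only 1 does work
```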



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10146) Have an easy way to set data source reader/writer specific confs

2015-08-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706205#comment-14706205
 ] 

Yin Huai commented on SPARK-10146:
--

One possible way to do it is that every data source defines a list of confs 
that can be applied to its reader/writer and we let users set those confs in 
SQLConf or through data source options. Then, we propagate those confs to the 
reader/writer.
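As a sketch of that proposal (every name here is hypothetical, not an existing Spark API), the propagation could look like:

```python
# Hypothetical sketch: a data source declares the conf keys it understands,
# and matching entries are pulled out of the session conf and passed along
# to its reader/writer as options.
def extract_source_confs(session_conf, declared_keys):
    return {k: v for k, v in session_conf.items() if k in declared_keys}

conf = {
    "spark.sql.parquet.rowGroupSize": "134217728",  # illustrative key/value
    "spark.app.name": "demo",
}
options = extract_source_confs(conf, {"spark.sql.parquet.rowGroupSize"})
```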

 Have an easy way to set data source reader/writer specific confs
 

 Key: SPARK-10146
 URL: https://issues.apache.org/jira/browse/SPARK-10146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, it is hard to set data source reader/writer-specific confs 
 correctly (e.g. Parquet's row group size). Users need to set those confs in 
 the Hadoop conf before starting the application, or through 
 {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be 
 great if we had an easy way to set those confs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10146) Have an easy way to set data source reader/writer specific confs

2015-08-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10146:
-
Issue Type: Improvement  (was: Bug)

 Have an easy way to set data source reader/writer specific confs
 

 Key: SPARK-10146
 URL: https://issues.apache.org/jira/browse/SPARK-10146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, it is hard to set data source reader/writer-specific confs 
 correctly (e.g. Parquet's row group size). Users need to set those confs in 
 the Hadoop conf before starting the application, or through 
 {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be 
 great if we had an easy way to set those confs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10146) Have an easy way to set data source reader/writer specific confs

2015-08-20 Thread Yin Huai (JIRA)
Yin Huai created SPARK-10146:


 Summary: Have an easy way to set data source reader/writer 
specific confs
 Key: SPARK-10146
 URL: https://issues.apache.org/jira/browse/SPARK-10146
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Priority: Critical


Right now, it is hard to set data source reader/writer-specific confs 
correctly (e.g. Parquet's row group size). Users need to set those confs in 
the Hadoop conf before starting the application, or through 
{{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be 
great if we had an easy way to set those confs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs

2015-08-20 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-10147:
-
Description: 
Phenomenon: the app still shows in the HistoryServer web UI after the event file 
has been deleted on HDFS.
Cause: the *log-replay-executor* thread and the *clean log* thread both 
write values to the *application* object, so there is a synchronization problem.

  was:

It is because the *log-replay-executor* thread and the *clean log* thread both 
write values to the *application* object, so there is a synchronization problem.
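Generically, the fix for this kind of race is to serialize both writers behind one lock. A minimal sketch of the pattern (names are illustrative, not the actual HistoryServer code):

```python
import threading

# Shared state mutated by two threads, guarded by one lock so the replay
# thread and the cleaner thread cannot interleave their updates.
applications = {}
app_lock = threading.Lock()

def replay_log(app_id):
    with app_lock:
        applications[app_id] = "loaded"

def clean_log(app_id):
    with app_lock:
        applications.pop(app_id, None)

t1 = threading.Thread(target=replay_log, args=("app-1",))
t2 = threading.Thread(target=clean_log, args=("app-1",))
t1.start(); t2.start()
t1.join(); t2.join()
```

Depending on scheduling, the app ends up either fully loaded or fully cleaned, never half-written.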


 App shouldn't show in HistoryServer web when the event file has been deleted 
 on hdfs
 

 Key: SPARK-10147
 URL: https://issues.apache.org/jira/browse/SPARK-10147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 Phenomenon: the app still shows in the HistoryServer web UI after the event file 
 has been deleted on HDFS.
 Cause: the *log-replay-executor* thread and the *clean log* thread both 
 write values to the *application* object, so there is a synchronization problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs

2015-08-20 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-10147:
-
Summary: App shouldn't show in HistoryServer web when the event file has 
been deleted on hdfs  (was: App still shows in HistoryServer web when the event 
file has been deleted on hdfs)

 App shouldn't show in HistoryServer web when the event file has been deleted 
 on hdfs
 

 Key: SPARK-10147
 URL: https://issues.apache.org/jira/browse/SPARK-10147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 It is because the *log-replay-executor* thread and the *clean log* thread both 
 write values to the *application* object, so there is a synchronization problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9983) Local physical operators for query execution

2015-08-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9983:
---
Description: 
In distributed query execution, there are two kinds of operators:

(1) operators that exchange data between different executors or threads: 
examples include broadcast, shuffle.

(2) operators that process data in a single thread: examples include project, 
filter, group by, etc.

This ticket proposes clearly differentiating them and creating local operators in 
Spark. This leads to a lot of benefits: easier to test, easier to optimize data 
exchange, better design (single responsibility), and potentially even having a 
hyper-optimized single-node version of DataFrame.


  was:
In distributed query execution, there are two kinds of operators:

(1) operators that exchange data between different executors or threads: 
examples include broadcast, shuffle.

(2) operators that process data in a single thread: examples include project, 
filter, group by, etc.

This ticket proposes clearly differentiating them and creating local operators in 
Spark. This leads to a lot of benefits: easier to test, easier to optimize data 
exchange, and better design (single responsibility).




 Local physical operators for query execution
 

 Key: SPARK-9983
 URL: https://issues.apache.org/jira/browse/SPARK-9983
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Reynold Xin
Assignee: Shixiong Zhu

 In distributed query execution, there are two kinds of operators:
 (1) operators that exchange data between different executors or threads: 
 examples include broadcast, shuffle.
 (2) operators that process data in a single thread: examples include project, 
 filter, group by, etc.
 This ticket proposes clearly differentiating them and creating local operators 
 in Spark. This leads to a lot of benefits: easier to test, easier to optimize 
 data exchange, better design (single responsibility), and potentially even 
 having a hyper-optimized single-node version of DataFrame.
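The split described above can be sketched with plain Python iterators: local operators compose as single-threaded transformations, while exchange operators (shuffle, broadcast) stay outside the chain. The interfaces here are illustrative, not Spark's actual ones.

```python
def filter_op(rows, pred):
    """Local operator: keep only rows matching a predicate."""
    return (r for r in rows if pred(r))

def project(rows, cols):
    """Local operator: keep only the named columns."""
    for row in rows:
        yield {c: row[c] for c in cols}

# Composing local operators needs no data exchange at all, which is why
# separating them makes single-node testing and optimization easier.
data = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]
out = list(project(filter_op(data, lambda r: r["a"] > 1), ["b"]))
assert out == [{"b": 4}]
```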






[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'

2015-08-20 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706186#comment-14706186
 ] 

Saisai Shao commented on SPARK-10122:
-

Hi [~aramesh], thanks a lot for pointing this out. This is actually a bug, 
sorry for not covering it in the unit test.

The problem is that Python compacts a series of {{TransformedDStream}}s into one:

{code}
if (isinstance(prev, TransformedDStream) and
        not prev.is_cached and not prev.is_checkpointed):
    prev_func = prev.func
    self.func = lambda t, rdd: func(t, prev_func(t, rdd))
    self.prev = prev.prev
{code}

Since {{KafkaTransformedDStream}} is a subclass of {{TransformedDStream}}, it 
gets compacted and replaced with its parent DStream ({{self.prev = prev.prev}} 
in the code above). That parent is a plain DStream, so getting offset ranges on 
it throws the exception you mentioned.

I will submit a PR to fix this; you can try the patch to see whether it 
resolves the issue.
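The compaction described above can be reproduced with a small stand-alone sketch (plain Python, no Spark; the class names are illustrative stand-ins, not PySpark's actual classes):

```python
class DStream:
    pass

class TransformedDStreamSketch(DStream):
    def __init__(self, prev, func):
        if isinstance(prev, TransformedDStreamSketch):
            # The "compaction" from the snippet above: fold the previous
            # transform's function into this one and skip over that node.
            prev_func = prev.func
            self.func = lambda rdd: func(prev_func(rdd))
            self.prev = prev.prev
        else:
            self.func = func
            self.prev = prev

class KafkaTransformedSketch(TransformedDStreamSketch):
    """Stand-in for KafkaTransformedDStream, which attaches offset info."""

base = DStream()
kafka = KafkaTransformedSketch(base, lambda rdd: rdd)
counted = TransformedDStreamSketch(kafka, lambda rdd: rdd)

# The Kafka-aware node was compacted away: `counted.prev` is the plain base
# DStream, so offset information is no longer available downstream.
assert counted.prev is base
assert not isinstance(counted.prev, KafkaTransformedSketch)
```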

 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 

 Key: SPARK-10122
 URL: https://issues.apache.org/jira/browse/SPARK-10122
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Reporter: Amit Ramesh
  Labels: kafka

 SPARK-8389 added the offsetRanges interface to Kafka direct streams. This 
 however appears to break when chaining operations after a transform 
 operation. Following is example code that would result in an error (stack 
 trace below). Note that if the 'count()' operation is taken out of the 
 example code then this error does not occur anymore, and the Kafka data is 
 printed.
 {code:title=kafka_test.py|collapse=true}
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.streaming.kafka import KafkaUtils

 def attach_kafka_metadata(kafka_rdd):
     offset_ranges = kafka_rdd.offsetRanges()
     return kafka_rdd

 if __name__ == "__main__":
     sc = SparkContext(appName='kafka-test')
     ssc = StreamingContext(sc, 10)
     kafka_stream = KafkaUtils.createDirectStream(
         ssc,
         [TOPIC],
         kafkaParams={
             'metadata.broker.list': BROKERS,
         },
     )
     kafka_stream.transform(attach_kafka_metadata).count().pprint()
     ssc.start()
     ssc.awaitTermination()
 {code}
 {code:title=Stack trace|collapse=true}
 Traceback (most recent call last):
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, 
 line 62, in call
 r = self.func(t, *rdds)
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, 
 line 332, in lambda
 func = lambda t, rdd: oldfunc(rdd)
   File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in 
 attach_kafka_metadata
 offset_ranges = kafka_rdd.offsetRanges()
 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 {code}






[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706202#comment-14706202
 ] 

Apache Spark commented on SPARK-10122:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/8347

 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 

 Key: SPARK-10122
 URL: https://issues.apache.org/jira/browse/SPARK-10122
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Reporter: Amit Ramesh
  Labels: kafka

 SPARK-8389 added the offsetRanges interface to Kafka direct streams. This 
 however appears to break when chaining operations after a transform 
 operation. Following is example code that would result in an error (stack 
 trace below). Note that if the 'count()' operation is taken out of the 
 example code then this error does not occur anymore, and the Kafka data is 
 printed.
 {code:title=kafka_test.py|collapse=true}
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.streaming.kafka import KafkaUtils

 def attach_kafka_metadata(kafka_rdd):
     offset_ranges = kafka_rdd.offsetRanges()
     return kafka_rdd

 if __name__ == "__main__":
     sc = SparkContext(appName='kafka-test')
     ssc = StreamingContext(sc, 10)
     kafka_stream = KafkaUtils.createDirectStream(
         ssc,
         [TOPIC],
         kafkaParams={
             'metadata.broker.list': BROKERS,
         },
     )
     kafka_stream.transform(attach_kafka_metadata).count().pprint()
     ssc.start()
     ssc.awaitTermination()
 {code}
 {code:title=Stack trace|collapse=true}
 Traceback (most recent call last):
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, 
 line 62, in call
 r = self.func(t, *rdds)
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, 
 line 332, in lambda
 func = lambda t, rdd: oldfunc(rdd)
   File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in 
 attach_kafka_metadata
 offset_ranges = kafka_rdd.offsetRanges()
 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 {code}






[jira] [Comment Edited] (SPARK-10146) Have an easy way to set data source reader/writer specific confs

2015-08-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706205#comment-14706205
 ] 

Yin Huai edited comment on SPARK-10146 at 8/21/15 3:42 AM:
---

One possible way is that every data source defines a list of confs that can be 
applied to its reader/writer and we let users set those confs in SQLConf or 
through data source options. Then, we propagate those confs to the 
reader/writer.
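A minimal sketch of this idea, under the assumption that each data source declares the confs it understands and only those are propagated from user-supplied options into its reader/writer. All names here are illustrative, not Spark's actual API.

```python
# Hypothetical whitelist a parquet-like data source might declare.
PARQUET_CONFS = {"parquet.block.size", "parquet.page.size"}

def propagate_confs(declared, options):
    """Return only the user options that the data source declared."""
    return {k: v for k, v in options.items() if k in declared}

user_options = {
    "parquet.block.size": "134217728",  # row group size
    "some.unrelated.option": "x",       # silently ignored by this source
}
assert propagate_confs(PARQUET_CONFS, user_options) == {
    "parquet.block.size": "134217728"
}
```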


was (Author: yhuai):
One possible way to do it is that every data source defines a list of confs 
that can be applied to its reader/writer and we let users set those confs in 
SQLConf or through data source options. Then, we propagate those confs to the 
reader/writer.

 Have an easy way to set data source reader/writer specific confs
 

 Key: SPARK-10146
 URL: https://issues.apache.org/jira/browse/SPARK-10146
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Critical

 Right now, it is hard to set data source reader/writer specific confs 
 correctly (e.g. parquet's row group size). Users need to set those confs in 
 the hadoop conf before starting the application or through 
 {{org.apache.spark.deploy.SparkHadoopUtil.get.conf}} at runtime. It would be 
 great if we had an easy way to set those confs.






[jira] [Created] (SPARK-10147) App still shows in HistoryServer web when the event file has been deleted on hdfs

2015-08-20 Thread meiyoula (JIRA)
meiyoula created SPARK-10147:


 Summary: App still shows in HistoryServer web when the event file 
has been deleted on hdfs
 Key: SPARK-10147
 URL: https://issues.apache.org/jira/browse/SPARK-10147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula


The *log-replay-executor* thread and the *clean log* thread both write to 
the *application* object, so there is a synchronization problem
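The race can be illustrated with a small sketch (plain Python threading, not Spark's actual Scala code): two threads mutate the same applications map, and guarding both writers with one shared lock is a straightforward fix. Names are illustrative.

```python
import threading

applications = {}           # shared state, like the *application* object
lock = threading.Lock()

def replay_log(app_id, info):
    # Without the lock, a replay could race the cleaner and resurrect
    # an entry for an event file that was already deleted.
    with lock:
        applications[app_id] = info

def clean_expired(app_id):
    with lock:
        applications.pop(app_id, None)

replay_log("app-1", {"name": "demo"})
clean_expired("app-1")
assert "app-1" not in applications
```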






[jira] [Commented] (SPARK-8467) Add LDAModel.describeTopics() in Python

2015-08-20 Thread Hrishikesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706257#comment-14706257
 ] 

Hrishikesh commented on SPARK-8467:
---

[~yuu.ishik...@gmail.com], are you still working on this?

 Add LDAModel.describeTopics() in Python
 ---

 Key: SPARK-8467
 URL: https://issues.apache.org/jira/browse/SPARK-8467
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Yu Ishikawa

 Add LDAModel.describeTopics() in Python.






[jira] [Assigned] (SPARK-9669) Support PySpark with Mesos Cluster mode

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9669:
---

Assignee: (was: Apache Spark)

 Support PySpark with Mesos Cluster mode
 ---

 Key: SPARK-9669
 URL: https://issues.apache.org/jira/browse/SPARK-9669
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, PySpark
Reporter: Timothy Chen

 PySpark in cluster mode on Mesos is not yet supported.
 We need to enable it and make sure it can launch PySpark jobs.






[jira] [Assigned] (SPARK-9669) Support PySpark with Mesos Cluster mode

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9669:
---

Assignee: Apache Spark

 Support PySpark with Mesos Cluster mode
 ---

 Key: SPARK-9669
 URL: https://issues.apache.org/jira/browse/SPARK-9669
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, PySpark
Reporter: Timothy Chen
Assignee: Apache Spark

 PySpark in cluster mode on Mesos is not yet supported.
 We need to enable it and make sure it can launch PySpark jobs.






[jira] [Commented] (SPARK-9669) Support PySpark with Mesos Cluster mode

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706262#comment-14706262
 ] 

Apache Spark commented on SPARK-9669:
-

User 'tnachen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8349

 Support PySpark with Mesos Cluster mode
 ---

 Key: SPARK-9669
 URL: https://issues.apache.org/jira/browse/SPARK-9669
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, PySpark
Reporter: Timothy Chen

 PySpark in cluster mode on Mesos is not yet supported.
 We need to enable it and make sure it can launch PySpark jobs.






[jira] [Commented] (SPARK-9848) Add @Since annotation to new public APIs in 1.5

2015-08-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706178#comment-14706178
 ] 

Xiangrui Meng commented on SPARK-9848:
--

No, that would be too much for this release. We plan to do that after 1.5.

 Add @Since annotation to new public APIs in 1.5
 ---

 Key: SPARK-9848
 URL: https://issues.apache.org/jira/browse/SPARK-9848
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML, MLlib
Reporter: Xiangrui Meng
Assignee: Manoj Kumar
Priority: Critical
  Labels: starter

 We should get a list of new APIs from SPARK-9660. cc: [~fliang]






[jira] [Commented] (SPARK-8400) ml.ALS doesn't handle -1 block size

2015-08-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706182#comment-14706182
 ] 

Xiangrui Meng commented on SPARK-8400:
--

Sorry for my late reply! We check numBlocks in LocalIndexEncoder. However, I'm 
not sure whether this happens before any data shuffling. It might be better to 
check numUserBlocks and numItemBlocks directly.
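The intended behavior can be sketched as follows: a block count of -1 should resolve to a partition-based default rather than being passed through silently. The function name and the fallback formula here are purely illustrative, not Spark's actual implementation.

```python
def resolve_num_blocks(requested, num_partitions):
    """Resolve a user-requested block count, treating -1 as 'auto'."""
    if requested == -1:
        # Auto mode: derive a default from the input partition count
        # (illustrative formula only).
        return max(num_partitions // 2, 1)
    if requested < 1:
        # Fail loudly instead of silently misbehaving, which is the bug
        # described above for the spark.ml API.
        raise ValueError("number of blocks must be -1 (auto) or >= 1")
    return requested

assert resolve_num_blocks(-1, 8) == 4
assert resolve_num_blocks(10, 8) == 10
```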

 ml.ALS doesn't handle -1 block size
 ---

 Key: SPARK-8400
 URL: https://issues.apache.org/jira/browse/SPARK-8400
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.3.1
Reporter: Xiangrui Meng
Assignee: Bryan Cutler

 Under spark.mllib, if the number of blocks is set to -1, we set the block size 
 automatically based on the input partition size. However, this behavior is 
 not preserved in the spark.ml API. If a user sets -1 in Spark 1.3, it will not 
 work, but no error message will be shown.






[jira] [Updated] (SPARK-10137) Avoid to restart receivers if scheduleReceivers returns balanced results

2015-08-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10137:
--
Assignee: Shixiong Zhu

 Avoid to restart receivers if scheduleReceivers returns balanced results
 

 Key: SPARK-10137
 URL: https://issues.apache.org/jira/browse/SPARK-10137
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
Priority: Critical

 In some cases, even if scheduleReceivers returns balanced results, 
 ReceiverTracker still may reject some receivers and force them to restart. 
 See my PR for more details.






[jira] [Updated] (SPARK-10137) Avoid to restart receivers if scheduleReceivers returns balanced results

2015-08-20 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10137:
--
Priority: Critical  (was: Major)

 Avoid to restart receivers if scheduleReceivers returns balanced results
 

 Key: SPARK-10137
 URL: https://issues.apache.org/jira/browse/SPARK-10137
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Reporter: Shixiong Zhu
Priority: Critical

 In some cases, even if scheduleReceivers returns balanced results, 
 ReceiverTracker still may reject some receivers and force them to restart. 
 See my PR for more details.






[jira] [Comment Edited] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming

2015-08-20 Thread Baogang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706139#comment-14706139
 ] 

Baogang Wang edited comment on SPARK-10145 at 8/21/15 3:27 AM:
---

spark.serializer                    org.apache.spark.serializer.KryoSerializer
spark.akka.frameSize                1024
spark.driver.extraJavaOptions       -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions      -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                  900
spark.storage.memoryFraction        0.4
spark.rdd.compress                  true
spark.shuffle.blockTransferService  nio
spark.yarn.executor.memoryOverhead  1024


was (Author: heayin):
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master                      spark://master:7077
# spark.eventLog.enabled            true
# spark.eventLog.dir                hdfs://namenode:8021/directory
spark.serializer                    org.apache.spark.serializer.KryoSerializer
# spark.driver.memory               5g
# spark.executor.extraJavaOptions   -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
#spark.core.connection.ack.wait.timeout    3600
#spark.core.connection.auth.wait.timeout   3600
spark.akka.frameSize                1024
spark.driver.extraJavaOptions       -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions      -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                  900
spark.storage.memoryFraction        0.4
spark.rdd.compress                  true
spark.shuffle.blockTransferService  nio
spark.yarn.executor.memoryOverhead  1024

 Executor exit without useful messages when spark runs in spark-streaming
 

 Key: SPARK-10145
 URL: https://issues.apache.org/jira/browse/SPARK-10145
 Project: Spark
  Issue Type: Bug
  Components: Streaming, YARN
 Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 
 cores and 32g memory  
Reporter: Baogang Wang
Priority: Critical
   Original Estimate: 168h
  Remaining Estimate: 168h

 Each node is allocated 30g memory by Yarn.
 My application receives messages from Kafka via a direct stream. Each 
 application consists of 4 dstream windows.
 The Spark application is submitted with this command:
 spark-submit --class spark_security.safe.SafeSockPuppet  --driver-memory 3g 
 --executor-memory 3g --num-executors 3 --executor-cores 4  --name 
 safeSparkDealerUser --master yarn  --deploy-mode cluster  
 spark_Security-1.0-SNAPSHOT.jar.nocalse 
 hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties
 After about 1 hour, some executors exit. There are no more yarn logs after 
 an executor exits, and there is no stack trace when it exits.
 The yarn node manager log shows the following:
 2015-08-17 17:25:41,550 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Start request for container_1439803298368_0005_01_01 by user root
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Creating a new application reference for app application_1439803298368_0005
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root   
 IP=172.19.160.102   OPERATION=Start Container Request   
 TARGET=ContainerManageImpl  RESULT=SUCCESS  
 APPID=application_1439803298368_0005
 CONTAINERID=container_1439803298368_0005_01_01
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Application application_1439803298368_0005 transitioned from NEW to INITING
 2015-08-17 17:25:41,552 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Adding container_1439803298368_0005_01_01 to application 
 application_1439803298368_0005
 2015-08-17 17:25:41,557 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
  rollingMonitorInterval is set as -1. The log rolling mornitoring interval is 
 disabled. The logs will be aggregated after this application is finished.
 2015-08-17 17:25:41,663 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Application application_1439803298368_0005 transitioned from INITING to 
 RUNNING
 2015-08-17 17:25:41,664 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1439803298368_0005_01_01 transitioned from NEW to 
 LOCALIZING
 2015-08-17 17:25:41,664 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: 

[jira] [Assigned] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10147:


Assignee: (was: Apache Spark)

 App shouldn't show in HistoryServer web when the event file has been deleted 
 on hdfs
 

 Key: SPARK-10147
 URL: https://issues.apache.org/jira/browse/SPARK-10147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 Phenomenon: App still shows in HistoryServer web when the event file has been 
 deleted on hdfs.
 Cause: the *log-replay-executor* thread and the *clean log* thread both 
 write to the *application* object, so there is a synchronization problem






[jira] [Commented] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706217#comment-14706217
 ] 

Apache Spark commented on SPARK-10147:
--

User 'XuTingjun' has created a pull request for this issue:
https://github.com/apache/spark/pull/8348

 App shouldn't show in HistoryServer web when the event file has been deleted 
 on hdfs
 

 Key: SPARK-10147
 URL: https://issues.apache.org/jira/browse/SPARK-10147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 Phenomenon: App still shows in HistoryServer web when the event file has been 
 deleted on hdfs.
 Cause: the *log-replay-executor* thread and the *clean log* thread both 
 write to the *application* object, so there is a synchronization problem






[jira] [Assigned] (SPARK-10147) App shouldn't show in HistoryServer web when the event file has been deleted on hdfs

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10147:


Assignee: Apache Spark

 App shouldn't show in HistoryServer web when the event file has been deleted 
 on hdfs
 

 Key: SPARK-10147
 URL: https://issues.apache.org/jira/browse/SPARK-10147
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Assignee: Apache Spark

 Phenomenon: App still shows in HistoryServer web when the event file has been 
 deleted on hdfs.
 Cause: the *log-replay-executor* thread and the *clean log* thread both 
 write to the *application* object, so there is a synchronization problem






[jira] [Commented] (SPARK-9999) RDD-like API on top of Catalyst/DataFrame

2015-08-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706244#comment-14706244
 ] 

Reynold Xin commented on SPARK-:


This needs to be designed first. I'm not sure if static code analysis is a 
great idea, since such analyses often fail. I'm open to ideas though.


 RDD-like API on top of Catalyst/DataFrame
 -

 Key: SPARK-
 URL: https://issues.apache.org/jira/browse/SPARK-
 Project: Spark
  Issue Type: Story
  Components: SQL
Reporter: Reynold Xin

 The RDD API is very flexible, and as a result its execution is harder to 
 optimize in some cases. The DataFrame API, on the other hand, is much easier 
 to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
 use UDFs, lack of strong types in Scala/Java).
 As a Spark user, I want an API that sits somewhere in the middle of the 
 spectrum so I can write most of my applications with that API, and yet it can 
 be optimized well by Spark to achieve performance and stability.






[jira] [Created] (SPARK-10142) Python checkpoint recovery does not work with non-local file path

2015-08-20 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-10142:
-

 Summary: Python checkpoint recovery does not work with non-local 
file path
 Key: SPARK-10142
 URL: https://issues.apache.org/jira/browse/SPARK-10142
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.4.1, 1.3.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical









[jira] [Updated] (SPARK-10144) Actually show peak execution memory on UI by default

2015-08-20 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-10144:
--
Summary: Actually show peak execution memory on UI by default  (was: 
Actually show peak execution memory by default)

 Actually show peak execution memory on UI by default
 

 Key: SPARK-10144
 URL: https://issues.apache.org/jira/browse/SPARK-10144
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or

 The peak execution memory metric was introduced in SPARK-8735. That was 
 before Tungsten was enabled by default, so it assumed that 
 `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer 
 the case...






[jira] [Created] (SPARK-10144) Actually show peak execution memory by default

2015-08-20 Thread Andrew Or (JIRA)
Andrew Or created SPARK-10144:
-

 Summary: Actually show peak execution memory by default
 Key: SPARK-10144
 URL: https://issues.apache.org/jira/browse/SPARK-10144
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or


The peak execution memory metric was introduced in SPARK-8735. That was before 
Tungsten was enabled by default, so it assumed that `spark.sql.unsafe.enabled` 
must be explicitly set to true. This is no longer the case...






[jira] [Reopened] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'

2015-08-20 Thread Amit Ramesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amit Ramesh reopened SPARK-10122:
-

 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 

 Key: SPARK-10122
 URL: https://issues.apache.org/jira/browse/SPARK-10122
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Reporter: Amit Ramesh
Priority: Critical
  Labels: kafka

 SPARK-8389 added the offsetRanges interface to Kafka direct streams. This 
 however appears to break when chaining operations after a transform 
 operation. Following is example code that would result in an error (stack 
 trace below). Note that if the 'count()' operation is taken out of the 
 example code then this error does not occur anymore, and the Kafka data is 
 printed.
 {code:title=kafka_test.py|collapse=true}
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.streaming.kafka import KafkaUtils

 def attach_kafka_metadata(kafka_rdd):
     offset_ranges = kafka_rdd.offsetRanges()
     return kafka_rdd

 if __name__ == "__main__":
     sc = SparkContext(appName='kafka-test')
     ssc = StreamingContext(sc, 10)
     kafka_stream = KafkaUtils.createDirectStream(
         ssc,
         [TOPIC],
         kafkaParams={
             'metadata.broker.list': BROKERS,
         },
     )
     kafka_stream.transform(attach_kafka_metadata).count().pprint()
     ssc.start()
     ssc.awaitTermination()
 {code}
 {code:title=Stack trace|collapse=true}
 Traceback (most recent call last):
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, 
 line 62, in call
 r = self.func(t, *rdds)
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, 
 line 332, in lambda
 func = lambda t, rdd: oldfunc(rdd)
   File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in 
 attach_kafka_metadata
 offset_ranges = kafka_rdd.offsetRanges()
 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 {code}






[jira] [Comment Edited] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'

2015-08-20 Thread Amit Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704674#comment-14704674
 ] 

Amit Ramesh edited comment on SPARK-10122 at 8/20/15 10:51 AM:
---

[~srowen] as you can see in the example, offsetRanges() is being applied to the 
initial RDD as part of the transform operation. And the code works fine if the 
line 'kafka_stream.transform(attach_kafka_metadata).count().pprint()' is 
changed to 'kafka_stream.transform(attach_kafka_metadata).pprint()'.


was (Author: aramesh):
[~srowen] as you can see in the example, offsetRanges() is being applied to the 
initial RDD as part of the transform operation. And the code works file if the 
line 'kafka_stream.transform(attach_kafka_metadata).count().pprint()' is 
changed to 'kafka_stream.transform(attach_kafka_metadata).pprint()'.

 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 

 Key: SPARK-10122
 URL: https://issues.apache.org/jira/browse/SPARK-10122
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Reporter: Amit Ramesh
Priority: Critical
  Labels: kafka

 SPARK-8389 added the offsetRanges interface to Kafka direct streams. This 
 however appears to break when chaining operations after a transform 
 operation. Following is example code that would result in an error (stack 
 trace below). Note that if the 'count()' operation is taken out of the 
 example code then this error does not occur anymore, and the Kafka data is 
 printed.
 {code:title=kafka_test.py|collapse=true}
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.streaming.kafka import KafkaUtils

 def attach_kafka_metadata(kafka_rdd):
     offset_ranges = kafka_rdd.offsetRanges()
     return kafka_rdd

 if __name__ == '__main__':
     sc = SparkContext(appName='kafka-test')
     ssc = StreamingContext(sc, 10)
     kafka_stream = KafkaUtils.createDirectStream(
         ssc,
         [TOPIC],
         kafkaParams={
             'metadata.broker.list': BROKERS,
         },
     )
     kafka_stream.transform(attach_kafka_metadata).count().pprint()
     ssc.start()
     ssc.awaitTermination()
 {code}
 {code:title=Stack trace|collapse=true}
 Traceback (most recent call last):
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, 
 line 62, in call
 r = self.func(t, *rdds)
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, 
 line 332, in lambda
 func = lambda t, rdd: oldfunc(rdd)
   File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in 
 attach_kafka_metadata
 offset_ranges = kafka_rdd.offsetRanges()
 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 {code}






[jira] [Updated] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'

2015-08-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10122:
--
Priority: Major  (was: Critical)

Ah I see now, you are not operating on the transformed stream resulting from 
count(); it comes after and not before in the example. 

This should be OK in the Scala API in my experience, but it looks like Python 
operates a little differently. When a transformation is applied to a 
transformed stream, it collapses the two transformations. So the count and Kafka 
offset functions are turned into one function, which is applied to the raw DStream 
behind the Kafka DStream, and that fails. I think.

[~jerryshao] [~davies] [~tdas] worth a look. It may be a case for not 
implementing the dstream transformation this way, if this guess is right.
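The collapse described above can be sketched in plain Python (a hypothetical simplification of how PySpark composes chained transform functions; the class and function names are illustrative, not the actual pyspark.streaming internals):

```python
class PlainRDD:
    """Stands in for the raw RDD behind the Kafka DStream (no offsetRanges)."""
    pass

class KafkaRDD(PlainRDD):
    """Stands in for the Kafka RDD that does expose offset metadata."""
    def offsetRanges(self):
        return [("topic", 0, 0, 100)]

def attach_kafka_metadata(rdd):
    rdd.offsetRanges()  # only valid on a KafkaRDD
    return rdd

def count_transform(rdd):
    return rdd  # stand-in for the count() transformation

# PySpark composes chained transforms into a single function; if that
# composed function is then applied to the plain RDD instead of the
# KafkaRDD, the offsetRanges() call fails:
composed = lambda rdd: count_transform(attach_kafka_metadata(rdd))

composed(KafkaRDD())  # fine
try:
    composed(PlainRDD())
except AttributeError as e:
    print(e)  # 'PlainRDD' object has no attribute 'offsetRanges'
```

This reproduces the shape of the reported AttributeError if the guess about where the composed function is applied is right.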

 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 

 Key: SPARK-10122
 URL: https://issues.apache.org/jira/browse/SPARK-10122
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Reporter: Amit Ramesh
  Labels: kafka

 SPARK-8389 added the offsetRanges interface to Kafka direct streams. This 
 however appears to break when chaining operations after a transform 
 operation. Following is example code that would result in an error (stack 
 trace below). Note that if the 'count()' operation is taken out of the 
 example code then this error does not occur anymore, and the Kafka data is 
 printed.
 {code:title=kafka_test.py|collapse=true}
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.streaming.kafka import KafkaUtils

 def attach_kafka_metadata(kafka_rdd):
     offset_ranges = kafka_rdd.offsetRanges()
     return kafka_rdd

 if __name__ == '__main__':
     sc = SparkContext(appName='kafka-test')
     ssc = StreamingContext(sc, 10)
     kafka_stream = KafkaUtils.createDirectStream(
         ssc,
         [TOPIC],
         kafkaParams={
             'metadata.broker.list': BROKERS,
         },
     )
     kafka_stream.transform(attach_kafka_metadata).count().pprint()
     ssc.start()
     ssc.awaitTermination()
 {code}
 {code:title=Stack trace|collapse=true}
 Traceback (most recent call last):
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, 
 line 62, in call
 r = self.func(t, *rdds)
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, 
 line 332, in lambda
 func = lambda t, rdd: oldfunc(rdd)
   File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in 
 attach_kafka_metadata
 offset_ranges = kafka_rdd.offsetRanges()
 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 {code}






[jira] [Commented] (SPARK-10122) AttributeError: 'RDD' object has no attribute 'offsetRanges'

2015-08-20 Thread Amit Ramesh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704674#comment-14704674
 ] 

Amit Ramesh commented on SPARK-10122:
-

[~srowen] as you can see in the example, offsetRanges() is being applied to the 
initial RDD as part of the transform operation. And the code works fine if the 
line 'kafka_stream.transform(attach_kafka_metadata).count().pprint()' is 
changed to 'kafka_stream.transform(attach_kafka_metadata).pprint()'.

 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 

 Key: SPARK-10122
 URL: https://issues.apache.org/jira/browse/SPARK-10122
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Reporter: Amit Ramesh
Priority: Critical
  Labels: kafka

 SPARK-8389 added the offsetRanges interface to Kafka direct streams. This 
 however appears to break when chaining operations after a transform 
 operation. Following is example code that would result in an error (stack 
 trace below). Note that if the 'count()' operation is taken out of the 
 example code then this error does not occur anymore, and the Kafka data is 
 printed.
 {code:title=kafka_test.py|collapse=true}
 from pyspark import SparkContext
 from pyspark.streaming import StreamingContext
 from pyspark.streaming.kafka import KafkaUtils

 def attach_kafka_metadata(kafka_rdd):
     offset_ranges = kafka_rdd.offsetRanges()
     return kafka_rdd

 if __name__ == '__main__':
     sc = SparkContext(appName='kafka-test')
     ssc = StreamingContext(sc, 10)
     kafka_stream = KafkaUtils.createDirectStream(
         ssc,
         [TOPIC],
         kafkaParams={
             'metadata.broker.list': BROKERS,
         },
     )
     kafka_stream.transform(attach_kafka_metadata).count().pprint()
     ssc.start()
     ssc.awaitTermination()
 {code}
 {code:title=Stack trace|collapse=true}
 Traceback (most recent call last):
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/util.py, 
 line 62, in call
 r = self.func(t, *rdds)
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File 
 /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/dstream.py, line 
 616, in lambda
 self.func = lambda t, rdd: func(t, prev_func(t, rdd))
   File /home/spark/spark/python/lib/pyspark.zip/pyspark/streaming/kafka.py, 
 line 332, in lambda
 func = lambda t, rdd: oldfunc(rdd)
   File /home/spark/ad_realtime/batch/kafka_test.py, line 7, in 
 attach_kafka_metadata
 offset_ranges = kafka_rdd.offsetRanges()
 AttributeError: 'RDD' object has no attribute 'offsetRanges'
 {code}






[jira] [Commented] (SPARK-10015) ML model broadcasts should be stored in private vars: spark.ml tree ensembles

2015-08-20 Thread Sameer Abhyankar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704740#comment-14704740
 ] 

Sameer Abhyankar commented on SPARK-10015:
--

[~josephkb] I have created a common trait in the PR for SPARK-10017 that we can 
reuse for all of the other related JIRAs (10015-10020). If the PR for 
SPARK-10017 looks OK and is merged, I can update the PRs for the other JIRAs. 
Thx!

 ML model broadcasts should be stored in private vars: spark.ml tree ensembles
 -

 Key: SPARK-10015
 URL: https://issues.apache.org/jira/browse/SPARK-10015
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 See parent for details.  Applies to:
 * GBTClassifier
 * RandomForestClassifier
 * GBTRegressor
 * RandomForestRegressor






[jira] [Resolved] (SPARK-10092) Multi-DB support follow up

2015-08-20 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-10092.

   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8336
[https://github.com/apache/spark/pull/8336]

 Multi-DB support follow up
 --

 Key: SPARK-10092
 URL: https://issues.apache.org/jira/browse/SPARK-10092
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker
 Fix For: 1.5.0


 Seems we need follow-up work for our multi-db support. Here are the issues we 
 need to address:
 1. saveAsTable always saves the table in the folder of the current database.
 2. HiveContext's refreshTable and analyze do not support dbName.tableName.
 3. It will be good to use TableIdentifier in CreateTableUsing, 
 CreateTableUsingAsSelect, CreateTempTableUsing, CreateTempTableUsingAsSelect, 
 CreateMetastoreDataSource, and CreateMetastoreDataSourceAsSelect, instead of 
 using string representation (actually, in several places we have already 
 parsed the string and get the TableIdentifier). 






[jira] [Commented] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704712#comment-14704712
 ] 

Apache Spark commented on SPARK-8436:
-

User 'x1-' has created a pull request for this issue:
https://github.com/apache/spark/pull/8339

 Inconsistent behavior when converting a Timestamp column to Integer/Long and 
 then convert back to Timestamp
 ---

 Key: SPARK-8436
 URL: https://issues.apache.org/jira/browse/SPARK-8436
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Le Minh Tu
Priority: Minor

 I'm aware that when converting from Integer/LongType to Timestamp, the 
 column's values should be in milliseconds. However, I was surprised when 
 trying to do this 
 `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()`
  and got back a totally different datetime ('event_time' is initially a 
 TimestampType). There must be some constraints in implementation that I'm not 
 aware of but it would be nice if a double conversion like this returns the 
 initial value as one might expect.
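The asymmetry the reporter describes can be illustrated in plain Python, assuming (hypothetically, as the report suggests) that the cast from timestamp to long yields whole seconds while the cast back from long to timestamp interprets the value as milliseconds; this is a sketch of the mismatch, not pyspark code:

```python
from datetime import datetime, timezone

event_time = datetime(2015, 6, 18, 12, 0, tzinfo=timezone.utc)

# timestamp -> long: suppose this yields whole seconds since the epoch
as_long = int(event_time.timestamp())

# long -> timestamp: suppose this interprets the value as milliseconds,
# i.e. effectively divides by 1000 before building the datetime
round_trip = datetime.fromtimestamp(as_long / 1000, tz=timezone.utc)

print(round_trip.year)  # 1970 -- off by a factor of 1000, nowhere near 2015
```

Under these assumptions the round trip lands in early 1970, which matches the "totally different datetime" observation.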






[jira] [Assigned] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8436:
---

Assignee: (was: Apache Spark)

 Inconsistent behavior when converting a Timestamp column to Integer/Long and 
 then convert back to Timestamp
 ---

 Key: SPARK-8436
 URL: https://issues.apache.org/jira/browse/SPARK-8436
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Le Minh Tu
Priority: Minor

 I'm aware that when converting from Integer/LongType to Timestamp, the 
 column's values should be in milliseconds. However, I was surprised when 
 trying to do this 
 `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()`
  and got back a totally different datetime ('event_time' is initially a 
 TimestampType). There must be some constraints in implementation that I'm not 
 aware of but it would be nice if a double conversion like this returns the 
 initial value as one might expect.






[jira] [Commented] (SPARK-10100) AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction

2015-08-20 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704742#comment-14704742
 ] 

Herman van Hovell commented on SPARK-10100:
---

Let's leave it for 1.6.

 AggregateFunction2's Max is slower than AggregateExpression1's MaxFunction
 --

 Key: SPARK-10100
 URL: https://issues.apache.org/jira/browse/SPARK-10100
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Herman van Hovell
 Attachments: SPARK-10100.perf.test.scala


 Looks like Max (probably Min) implemented based on AggregateFunction2 is 
 slower than the old MaxFunction.






[jira] [Commented] (SPARK-6196) Add MAPR 4.0.2 support to the build

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704704#comment-14704704
 ] 

Apache Spark commented on SPARK-6196:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8338

 Add MAPR 4.0.2 support to the build
 ---

 Key: SPARK-6196
 URL: https://issues.apache.org/jira/browse/SPARK-6196
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Trystan Leftwich
Priority: Minor
  Labels: build

 MapR 4.0.2 upgraded to Hadoop 2.5.1, and the current MapR build doesn't 
 support building for 4.0.2:
 http://doc.mapr.com/display/RelNotes/Version+4.0.2+Release+Notes






[jira] [Commented] (SPARK-10109) NPE when saving Parquet To HDFS

2015-08-20 Thread Virgil Palanciuc (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14704710#comment-14704710
 ] 

Virgil Palanciuc commented on SPARK-10109:
--

I think I know what caused this - the code I was using above was in a parallel 
foreach, and the outputPath was always the same; I thought there shouldn't 
be a problem since the pids were always different (and the write is partitioned 
by pid), but I guess some metadata/temporary files were not? After removing the 
parallel iteration on 'pid', it works fine.

 NPE when saving Parquet To HDFS
 ---

 Key: SPARK-10109
 URL: https://issues.apache.org/jira/browse/SPARK-10109
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
 Environment: Sparc-ec2, standalone cluster on amazon
Reporter: Virgil Palanciuc

 Very simple code, trying to save a dataframe
 I get this in the driver
 {quote}
 15/08/19 11:21:41 INFO TaskSetManager: Lost task 9.2 in stage 217.0 (TID 
 4748) on executor 172.xx.xx.xx: java.lang.NullPointerException (null) 
 and  (not for that task):
 15/08/19 11:21:46 WARN TaskSetManager: Lost task 5.0 in stage 543.0 (TID 
 5607, 172.yy.yy.yy): java.lang.NullPointerException
 at 
 parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146)
 at 
 parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112)
 at 
 parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73)
 at 
 org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:88)
 at 
 org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
 at 
 org.apache.spark.sql.sources.DynamicPartitionWriterContainer$$anonfun$clearOutputWriters$1.apply(commands.scala:536)
 at 
 scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
 at 
 scala.collection.mutable.HashMap$$anon$2$$anonfun$foreach$3.apply(HashMap.scala:107)
 at 
 scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
 at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
 at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:107)
 at 
 org.apache.spark.sql.sources.DynamicPartitionWriterContainer.clearOutputWriters(commands.scala:536)
 at 
 org.apache.spark.sql.sources.DynamicPartitionWriterContainer.abortTask(commands.scala:552)
 at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$2(commands.scala:269)
 at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
 at 
 org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insertWithDynamicPartitions$3.apply(commands.scala:229)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {quote}
 I get this in the executor log:
 {quote}
 15/08/19 11:21:41 WARN DFSClient: DataStreamer Exception
 org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException):
  No lease on 
 /gglogs/2015-07-27/_temporary/_attempt_201508191119_0217_m_09_2/dpid=18432/pid=1109/part-r-9-46ac3a79-a95c-4d9c-a2f1-b3ee76f6a46c.snappy.parquet
  File does not exist. Holder DFSClient_NONMAPREDUCE_1730998114_63 does not 
 have any open files.
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2396)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:2387)
   at 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2183)
   at 
 org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:481)
   at 
 org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
   at 
 org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
   at 
 org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1002)
   at 

[jira] [Assigned] (SPARK-8436) Inconsistent behavior when converting a Timestamp column to Integer/Long and then convert back to Timestamp

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8436:
---

Assignee: Apache Spark

 Inconsistent behavior when converting a Timestamp column to Integer/Long and 
 then convert back to Timestamp
 ---

 Key: SPARK-8436
 URL: https://issues.apache.org/jira/browse/SPARK-8436
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Le Minh Tu
Assignee: Apache Spark
Priority: Minor

 I'm aware that when converting from Integer/LongType to Timestamp, the 
 column's values should be in milliseconds. However, I was surprised when 
 trying to do this 
 `a.select(a['event_time'].astype(LongType()).astype(TimestampType())).first()`
  and got back a totally different datetime ('event_time' is initially a 
 TimestampType). There must be some constraints in implementation that I'm not 
 aware of but it would be nice if a double conversion like this returns the 
 initial value as one might expect.






[jira] [Created] (SPARK-10135) Percent of pruned partitions is shown wrong

2015-08-20 Thread Romi Kuntsman (JIRA)
Romi Kuntsman created SPARK-10135:
-

 Summary: Percent of pruned partitions is shown wrong
 Key: SPARK-10135
 URL: https://issues.apache.org/jira/browse/SPARK-10135
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Romi Kuntsman
Priority: Trivial


When reading partitioned Parquet in SparkSQL, an info message about the number 
of pruned partitions is displayed.

Actual:
Selected 15 partitions out of 181, pruned -1106.7% partitions.

Expected:
Selected 15 partitions out of 181, pruned 91.71270718232044% partitions.

Fix (I'm a newbie here, so please help make a patch, thanks!):
In DataSourceStrategy.scala, in method apply(),

instead of:
val percentPruned = (1 - total.toDouble / selected.toDouble) * 100
it should be:
val percentPruned = (1 - selected.toDouble / total.toDouble) * 100
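Checked against the numbers in the report, the corrected formula gives the expected percentage, while swapping the operands reproduces the negative value (a plain-Python sketch of the arithmetic):

```python
def percent_pruned(selected, total):
    # fraction of partitions pruned, expressed as a percentage
    return (1 - selected / total) * 100

# Numbers from the report: 15 partitions selected out of 181
print(round(percent_pruned(15, 181), 2))   # 91.71 -- the expected value
# Swapped operands (the current code) reproduce the buggy output
print(round(percent_pruned(181, 15), 1))   # -1106.7
```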







[jira] [Commented] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706041#comment-14706041
 ] 

Apache Spark commented on SPARK-8580:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8341

 Add Parquet files generated by different systems to test interoperability and 
 compatibility
 ---

 Key: SPARK-8580
 URL: https://issues.apache.org/jira/browse/SPARK-8580
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 
 to improve interoperability with other systems (reading non-standard Parquet 
 files they generate, and generating standard Parquet files), it would be good 
 to have a set of standard test Parquet files generated by various 
 systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old 
 versions of Spark SQL) to ensure compatibility.






[jira] [Assigned] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8580:
---

Assignee: Cheng Lian  (was: Apache Spark)

 Add Parquet files generated by different systems to test interoperability and 
 compatibility
 ---

 Key: SPARK-8580
 URL: https://issues.apache.org/jira/browse/SPARK-8580
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Cheng Lian

 As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 
 to improve interoperability with other systems (reading non-standard Parquet 
 files they generate, and generating standard Parquet files), it would be good 
 to have a set of standard test Parquet files generated by various 
 systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old 
 versions of Spark SQL) to ensure compatibility.






[jira] [Assigned] (SPARK-8580) Add Parquet files generated by different systems to test interoperability and compatibility

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8580:
---

Assignee: Apache Spark  (was: Cheng Lian)

 Add Parquet files generated by different systems to test interoperability and 
 compatibility
 ---

 Key: SPARK-8580
 URL: https://issues.apache.org/jira/browse/SPARK-8580
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Assignee: Apache Spark

 As we are implementing Parquet backwards-compatibility rules for Spark 1.5.0 
 to improve interoperability with other systems (reading non-standard Parquet 
 files they generate, and generating standard Parquet files), it would be good 
 to have a set of standard test Parquet files generated by various 
 systems/tools (parquet-thrift, parquet-avro, parquet-hive, Impala, and old 
 versions of Spark SQL) to ensure compatibility.






[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10143:


Assignee: (was: Apache Spark)

 Parquet changed the behavior of calculating splits
 --

 Key: SPARK-10143
 URL: https://issues.apache.org/jira/browse/SPARK-10143
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical

 When Parquet's task side metadata is enabled (by default it is enabled and it 
 needs to be enabled to deal with tables with many files), Parquet delegates 
 the work of calculating initial splits to FileInputFormat (see 
 https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
  If filesystem's block size is smaller than the row group size and users do 
 not set min split size, splits in the initial split list will have lots of 
 dummy splits and they contribute to empty tasks (because the starting point 
 and ending point of a split does not cover the starting point of a row 
 group). 
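The arithmetic behind the empty tasks can be sketched with hypothetical sizes (illustrative numbers only; the real split computation also depends on min/max split size settings and exact block boundaries):

```python
MB = 1024 ** 2

# Hypothetical sizes: HDFS block smaller than the Parquet row group
block_size = 64 * MB
row_group_size = 128 * MB
file_size = 4 * row_group_size  # a 512 MB file containing 4 row groups

# With no min split size set, FileInputFormat produces roughly one split
# per block, but only a split that contains a row-group starting point
# reads any data; the rest become empty tasks.
splits = file_size // block_size
row_groups = file_size // row_group_size
empty_tasks = splits - row_groups

print(splits, empty_tasks)  # 8 4
```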






[jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706045#comment-14706045
 ] 

Apache Spark commented on SPARK-10143:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8346

 Parquet changed the behavior of calculating splits
 --

 Key: SPARK-10143
 URL: https://issues.apache.org/jira/browse/SPARK-10143
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical

 When Parquet's task side metadata is enabled (by default it is enabled and it 
 needs to be enabled to deal with tables with many files), Parquet delegates 
 the work of calculating initial splits to FileInputFormat (see 
 https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
  If the filesystem's block size is smaller than the row group size and users do 
 not set a min split size, the initial split list will contain lots of dummy 
 splits that turn into empty tasks (because such a split's starting and ending 
 points do not cover the starting point of any row group). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10143) Parquet changed the behavior of calculating splits

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10143:


Assignee: Apache Spark

 Parquet changed the behavior of calculating splits
 --

 Key: SPARK-10143
 URL: https://issues.apache.org/jira/browse/SPARK-10143
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Critical

 When Parquet's task side metadata is enabled (by default it is enabled and it 
 needs to be enabled to deal with tables with many files), Parquet delegates 
 the work of calculating initial splits to FileInputFormat (see 
 https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
  If the filesystem's block size is smaller than the row group size and users do 
 not set a min split size, the initial split list will contain lots of dummy 
 splits that turn into empty tasks (because such a split's starting and ending 
 points do not cover the starting point of any row group). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-10143) Parquet changed the behavior of calculating splits

2015-08-20 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10143:
-
Comment: was deleted

(was: For something quick, we can use the row group size set in hadoop conf to 
set the min split size.)

 Parquet changed the behavior of calculating splits
 --

 Key: SPARK-10143
 URL: https://issues.apache.org/jira/browse/SPARK-10143
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical

 When Parquet's task side metadata is enabled (by default it is enabled and it 
 needs to be enabled to deal with tables with many files), Parquet delegates 
 the work of calculating initial splits to FileInputFormat (see 
 https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
  If the filesystem's block size is smaller than the row group size and users do 
 not set a min split size, the initial split list will contain lots of dummy 
 splits that turn into empty tasks (because such a split's starting and ending 
 points do not cover the starting point of any row group). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10143) Parquet changed the behavior of calculating splits

2015-08-20 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705988#comment-14705988
 ] 

Yin Huai commented on SPARK-10143:
--

[~rdblue] Can you confirm the behavior change in Parquet? It looks like we are 
just asking FileInputFormat to give us the initial splits. I am thinking of 
using the current Parquet row group size setting as the FS min split size for 
the job. What do you think? Thanks :)

 Parquet changed the behavior of calculating splits
 --

 Key: SPARK-10143
 URL: https://issues.apache.org/jira/browse/SPARK-10143
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Yin Huai
Priority: Critical

 When Parquet's task side metadata is enabled (by default it is enabled and it 
 needs to be enabled to deal with tables with many files), Parquet delegates 
 the work of calculating initial splits to FileInputFormat (see 
 https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java#L301-L311).
  If the filesystem's block size is smaller than the row group size and users do 
 not set a min split size, the initial split list will contain lots of dummy 
 splits that turn into empty tasks (because such a split's starting and ending 
 points do not cover the starting point of any row group). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10144) Actually show peak execution memory on UI by default

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10144:


Assignee: Apache Spark  (was: Andrew Or)

 Actually show peak execution memory on UI by default
 

 Key: SPARK-10144
 URL: https://issues.apache.org/jira/browse/SPARK-10144
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Apache Spark

 The peak execution memory metric was introduced in SPARK-8735. That was 
 before Tungsten was enabled by default, so it assumed that 
 `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer 
 the case...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10144) Actually show peak execution memory on UI by default

2015-08-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706033#comment-14706033
 ] 

Apache Spark commented on SPARK-10144:
--

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/8345

 Actually show peak execution memory on UI by default
 

 Key: SPARK-10144
 URL: https://issues.apache.org/jira/browse/SPARK-10144
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or

 The peak execution memory metric was introduced in SPARK-8735. That was 
 before Tungsten was enabled by default, so it assumed that 
 `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer 
 the case...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10144) Actually show peak execution memory on UI by default

2015-08-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10144:


Assignee: Andrew Or  (was: Apache Spark)

 Actually show peak execution memory on UI by default
 

 Key: SPARK-10144
 URL: https://issues.apache.org/jira/browse/SPARK-10144
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.5.0
Reporter: Andrew Or
Assignee: Andrew Or

 The peak execution memory metric was introduced in SPARK-8735. That was 
 before Tungsten was enabled by default, so it assumed that 
 `spark.sql.unsafe.enabled` must be explicitly set to true. This is no longer 
 the case...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming

2015-08-20 Thread Baogang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706139#comment-14706139
 ] 

Baogang Wang commented on SPARK-10145:
--

the spark-defaults.conf is as follows:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled   true
# spark.eventLog.dir   hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory  5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value 
-Dnumbers=one two three
#spark.core.connection.ack.wait.timeout  3600
#spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize                1024
spark.driver.extraJavaOptions       -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions      -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                  900
spark.storage.memoryFraction        0.4
spark.rdd.compress                  true
spark.shuffle.blockTransferService  nio
spark.yarn.executor.memoryOverhead  1024

 Executor exit without useful messages when spark runs in spark-streaming
 

 Key: SPARK-10145
 URL: https://issues.apache.org/jira/browse/SPARK-10145
 Project: Spark
  Issue Type: Bug
  Components: Streaming, YARN
 Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 
 cores and 32g memory  
Reporter: Baogang Wang
Priority: Critical
   Original Estimate: 168h
  Remaining Estimate: 168h

 Each node is allocated 30g of memory by YARN.
 My application receives messages from Kafka via direct stream. Each application 
 consists of 4 DStream windows.
 The Spark application is submitted with this command:
 spark-submit --class spark_security.safe.SafeSockPuppet  --driver-memory 3g 
 --executor-memory 3g --num-executors 3 --executor-cores 4  --name 
 safeSparkDealerUser --master yarn  --deploy-mode cluster  
 spark_Security-1.0-SNAPSHOT.jar.nocalse 
 hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties
 After about 1 hour, some executors exit. There are no more YARN logs after 
 the executor exits, and there is no stack trace when the executor exits.
 When I look at the YARN node manager log, it shows the following:
 2015-08-17 17:25:41,550 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Start request for container_1439803298368_0005_01_01 by user root
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Creating a new application reference for app application_1439803298368_0005
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root   
 IP=172.19.160.102   OPERATION=Start Container Request   
 TARGET=ContainerManageImpl  RESULT=SUCCESS  
 APPID=application_1439803298368_0005
 CONTAINERID=container_1439803298368_0005_01_01
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Application application_1439803298368_0005 transitioned from NEW to INITING
 2015-08-17 17:25:41,552 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Adding container_1439803298368_0005_01_01 to application 
 application_1439803298368_0005
 2015-08-17 17:25:41,557 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
  rollingMonitorInterval is set as -1. The log rolling mornitoring interval is 
 disabled. The logs will be aggregated after this application is finished.
 2015-08-17 17:25:41,663 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Application application_1439803298368_0005 transitioned from INITING to 
 RUNNING
 2015-08-17 17:25:41,664 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
  Container container_1439803298368_0005_01_01 transitioned from NEW to 
 LOCALIZING
 2015-08-17 17:25:41,664 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got 
 event CONTAINER_INIT for appId application_1439803298368_0005
 2015-08-17 17:25:41,664 INFO 
 org.apache.spark.network.yarn.YarnShuffleService: Initializing container 
 container_1439803298368_0005_01_01
 2015-08-17 17:25:41,665 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource:
  Resource 
 hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/user/root/.sparkStaging/application_1439803298368_0005/spark-assembly-1.3.1-hadoop2.6.0.jar
  transitioned from INIT to 

[jira] [Comment Edited] (SPARK-10145) Executor exit without useful messages when spark runs in spark-streaming

2015-08-20 Thread Baogang Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14706139#comment-14706139
 ] 

Baogang Wang edited comment on SPARK-10145 at 8/21/15 2:34 AM:
---

# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled   true
# spark.eventLog.dir   hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory  5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value 
-Dnumbers=one two three
#spark.core.connection.ack.wait.timeout  3600
#spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize                1024
spark.driver.extraJavaOptions       -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions      -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                  900
spark.storage.memoryFraction        0.4
spark.rdd.compress                  true
spark.shuffle.blockTransferService  nio
spark.yarn.executor.memoryOverhead  1024


was (Author: heayin):
the spark-defaults.conf is as follows:
# Default system properties included when running spark-submit.
# This is useful for setting default environmental settings.

# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled   true
# spark.eventLog.dir   hdfs://namenode:8021/directory
spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory  5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value 
-Dnumbers=one two three
#spark.core.connection.ack.wait.timeout  3600
#spark.core.connection.auth.wait.timeout 3600
spark.akka.frameSize                1024
spark.driver.extraJavaOptions       -Dhdp.version=2.2.0.0-2041
spark.yarn.am.extraJavaOptions      -Dhdp.version=2.2.0.0-2041
spark.akka.timeout                  900
spark.storage.memoryFraction        0.4
spark.rdd.compress                  true
spark.shuffle.blockTransferService  nio
spark.yarn.executor.memoryOverhead  1024

 Executor exit without useful messages when spark runs in spark-streaming
 

 Key: SPARK-10145
 URL: https://issues.apache.org/jira/browse/SPARK-10145
 Project: Spark
  Issue Type: Bug
  Components: Streaming, YARN
 Environment: spark 1.3.1, hadoop 2.6.0, 6 nodes, each node has 32 
 cores and 32g memory  
Reporter: Baogang Wang
Priority: Critical
   Original Estimate: 168h
  Remaining Estimate: 168h

 Each node is allocated 30g of memory by YARN.
 My application receives messages from Kafka via direct stream. Each application 
 consists of 4 DStream windows.
 The Spark application is submitted with this command:
 spark-submit --class spark_security.safe.SafeSockPuppet  --driver-memory 3g 
 --executor-memory 3g --num-executors 3 --executor-cores 4  --name 
 safeSparkDealerUser --master yarn  --deploy-mode cluster  
 spark_Security-1.0-SNAPSHOT.jar.nocalse 
 hdfs://A01-R08-3-I160-102.JD.LOCAL:9000/spark_properties/safedealer.properties
 After about 1 hour, some executors exit. There are no more YARN logs after 
 the executor exits, and there is no stack trace when the executor exits.
 When I look at the YARN node manager log, it shows the following:
 2015-08-17 17:25:41,550 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Start request for container_1439803298368_0005_01_01 by user root
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl:
  Creating a new application reference for app application_1439803298368_0005
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root   
 IP=172.19.160.102   OPERATION=Start Container Request   
 TARGET=ContainerManageImpl  RESULT=SUCCESS  
 APPID=application_1439803298368_0005
 CONTAINERID=container_1439803298368_0005_01_01
 2015-08-17 17:25:41,551 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Application application_1439803298368_0005 transitioned from NEW to INITING
 2015-08-17 17:25:41,552 INFO 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application:
  Adding container_1439803298368_0005_01_01 to application 
 application_1439803298368_0005
 2015-08-17 17:25:41,557 WARN 
 org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
  rollingMonitorInterval is set as -1. The log rolling mornitoring interval is 
 disabled. The 

[jira] [Resolved] (SPARK-10140) Add target fields to @Since annotation

2015-08-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10140.
---
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8344
[https://github.com/apache/spark/pull/8344]

 Add target fields to @Since annotation
 --

 Key: SPARK-10140
 URL: https://issues.apache.org/jira/browse/SPARK-10140
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.5.0


 Add target fields to @Since so constructor params and fields also get 
 annotated.
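
In Scala, an annotation placed on a constructor parameter stays on the parameter unless the annotation class declares meta-annotation targets; the change described above is of this shape (a sketch of the mechanism, not the exact Spark source):

```scala
import scala.annotation.meta.{beanGetter, beanSetter, field, getter, param, setter}

// With these targets, @Since("1.5.0") written on a constructor parameter is
// also copied to the generated field, getter, setter, and bean accessors.
@param @field @getter @setter @beanGetter @beanSetter
class Since(val version: String) extends scala.annotation.StaticAnnotation

// Usage: the annotation now reaches the val and its accessor, so tooling such
// as scaladoc can surface the version on both.
class Model(@Since("1.5.0") val numFeatures: Int)
```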



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


