[jira] [Issue Comment Deleted] (SPARK-6724) Model import/export for FPGrowth

2015-08-31 Thread hujiayin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hujiayin updated SPARK-6724:

Comment: was deleted

(was: ok, : ))

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow

2015-08-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10351.
-
   Resolution: Fixed
 Assignee: Feynman Liang
Fix Version/s: 1.6.0

> UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Assignee: Feynman Liang
>Priority: Critical
> Fix For: 1.6.0
>
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.
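As background for the on-heap/off-heap distinction above: an off-heap row has no backing JVM object, so any read path that assumes an array-backed base fails. The sketch below is only an analogy using java.nio.ByteBuffer (a direct buffer plays the role of the null-base, off-heap case); it is not the Spark code or the actual fix.

{code}
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

// Analogy only: read UTF-8 bytes from storage that may be on-heap (array-backed)
// or off-heap (direct buffer with no backing array). A reader that handles only
// the array-backed case breaks for the off-heap case, which mirrors the
// null-base-object problem described in this issue.
def readUtf8(buf: ByteBuffer, offset: Int, numBytes: Int): String = {
  val bytes = new Array[Byte](numBytes)
  if (buf.hasArray) {
    // On-heap: copy straight out of the backing array.
    System.arraycopy(buf.array(), buf.arrayOffset() + offset, bytes, 0, numBytes)
  } else {
    // Off-heap: no backing array, so read through absolute getters instead.
    var i = 0
    while (i < numBytes) { bytes(i) = buf.get(offset + i); i += 1 }
  }
  new String(bytes, StandardCharsets.UTF_8)
}

val onHeap = ByteBuffer.wrap("hello".getBytes(StandardCharsets.UTF_8))
val offHeap = ByteBuffer.allocateDirect(5)
offHeap.put("world".getBytes(StandardCharsets.UTF_8))
println(readUtf8(onHeap, 0, 5))   // hello
println(readUtf8(offHeap, 0, 5))  // world
{code}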



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10259) Add @Since annotation to ml.classification

2015-08-31 Thread Hiroshi Takahashi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723027#comment-14723027
 ] 

Hiroshi Takahashi commented on SPARK-10259:
---

Because this is a 'starter' issue and I'm new, I'd like to work on it, if possible.

> Add @Since annotation to ml.classification
> --
>
> Key: SPARK-10259
> URL: https://issues.apache.org/jira/browse/SPARK-10259
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Priority: Minor
>  Labels: starter
>
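For anyone picking this up, the task is to tag public APIs in ml.classification with the version that first exposed them. A compile-only sketch of the pattern follows; the annotation is declared locally as a stand-in (Spark's own @Since lives in org.apache.spark.annotation), and the class and version numbers below are purely illustrative.

{code}
import scala.annotation.StaticAnnotation

// Local stand-in for Spark's @Since annotation so the sketch compiles on its own.
class Since(version: String) extends StaticAnnotation

// Illustrative only: each public class, constructor parameter, and method gets
// tagged with the Spark version that introduced it.
@Since("1.6.0")
class ExampleClassifier(@Since("1.6.0") val uid: String) {

  @Since("1.6.0")
  def explainParams(): String = s"uid=$uid"
}
{code}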




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10354) First cost RDD shouldn't be cached in k-means|| and the following cost RDD should use MEMORY_AND_DISK

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723031#comment-14723031
 ] 

Apache Spark commented on SPARK-10354:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8526

> First cost RDD shouldn't be cached in k-means|| and the following cost RDD 
> should use MEMORY_AND_DISK
> -
>
> Key: SPARK-10354
> URL: https://issues.apache.org/jira/browse/SPARK-10354
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> The first cost RDD doesn't need to be cached; the other cost RDDs should use 
> MEMORY_AND_DISK to avoid recomputation.
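For readers unfamiliar with the storage levels being contrasted: cache() is shorthand for MEMORY_ONLY, under which partitions that do not fit in memory are dropped and recomputed, while MEMORY_AND_DISK spills them to disk instead. A generic illustration follows (standard RDD API usage, not the actual k-means|| change):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Stand-in RDD; in the issue above this would be one of the k-means|| cost RDDs.
val sc = new SparkContext(new SparkConf().setAppName("storage-level-demo").setMaster("local[2]"))
val costs = sc.parallelize(1 to 1000000).map(i => i * 0.5)

// cache() == persist(StorageLevel.MEMORY_ONLY): evicted partitions are recomputed.
costs.cache()
costs.unpersist()

// MEMORY_AND_DISK: partitions that don't fit in memory are written to local disk,
// so they are re-read rather than recomputed.
costs.persist(StorageLevel.MEMORY_AND_DISK)
println(costs.sum())

sc.stop()
{code}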



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10354) First cost RDD shouldn't be cached in k-means|| and the following cost RDD should use MEMORY_AND_DISK

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10354:


Assignee: Xiangrui Meng  (was: Apache Spark)

> First cost RDD shouldn't be cached in k-means|| and the following cost RDD 
> should use MEMORY_AND_DISK
> -
>
> Key: SPARK-10354
> URL: https://issues.apache.org/jira/browse/SPARK-10354
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> The first cost RDD doesn't need to be cached; the other cost RDDs should use 
> MEMORY_AND_DISK to avoid recomputation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10354) First cost RDD shouldn't be cached in k-means|| and the following cost RDD should use MEMORY_AND_DISK

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10354:


Assignee: Apache Spark  (was: Xiangrui Meng)

> First cost RDD shouldn't be cached in k-means|| and the following cost RDD 
> should use MEMORY_AND_DISK
> -
>
> Key: SPARK-10354
> URL: https://issues.apache.org/jira/browse/SPARK-10354
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Minor
>
> The first cost RDD doesn't need to be cached; the other cost RDDs should use 
> MEMORY_AND_DISK to avoid recomputation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10354) First cost RDD shouldn't be cached in k-means|| and the following cost RDD should use MEMORY_AND_DISK

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10354.
---
  Resolution: Fixed
   Fix Version/s: 1.5.1
  1.4.2
  1.3.2
Target Version/s: 1.3.2, 1.4.2, 1.5.1  (was: 1.5.1)

Resolved by https://github.com/apache/spark/pull/8526.

> First cost RDD shouldn't be cached in k-means|| and the following cost RDD 
> should use MEMORY_AND_DISK
> -
>
> Key: SPARK-10354
> URL: https://issues.apache.org/jira/browse/SPARK-10354
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
> Fix For: 1.3.2, 1.4.2, 1.5.1
>
>
> The first cost RDD doesn't need to be cached; the other cost RDDs should use 
> MEMORY_AND_DISK to avoid recomputation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10360) Real time analytics using SparkR and Apache Kafka

2015-08-31 Thread Niharika (JIRA)
Niharika created SPARK-10360:


 Summary: Real time analytics using SparkR and Apache Kafka
 Key: SPARK-10360
 URL: https://issues.apache.org/jira/browse/SPARK-10360
 Project: Spark
  Issue Type: Question
  Components: SparkR
Affects Versions: 1.4.1
 Environment: SparkR
Reporter: Niharika


I want to do real time analytics in sparkR where I want to fetch the data from 
Apache Kafka every second. Is there any way I can do real time analytics in R 
language(similar to Spark Streaming in Scala or Java using KafkaUtils)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10361) model.predictAll() fails at user_product.first()

2015-08-31 Thread Velu nambi (JIRA)
Velu nambi created SPARK-10361:
--

 Summary: model.predictAll() fails at user_product.first()
 Key: SPARK-10361
 URL: https://issues.apache.org/jira/browse/SPARK-10361
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 1.4.1, 1.3.1, 1.5.0
 Environment: Windows 10, Python 2.7 and with all the three versions of 
Spark
Reporter: Velu nambi


This code, adapted from the documentation, fails when calling PredictAll() 
after an ALS.train()


15/08/31 00:11:45 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at java.io.BufferedOutputStream.write(Unknown Source)
at java.io.DataOutputStream.write(Unknown Source)
at java.io.FilterOutputStream.write(Unknown Source)
at 
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
15/08/31 00:11:45 ERROR PythonRDD: This may have been caused by a prior 
exception:
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at java.io.BufferedOutputStream.write(Unknown Source)
at java.io.DataOutputStream.write(Unknown Source)
at java.io.FilterOutputStream.write(Unknown Source)
at 
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
15/08/31 00:11:45 ERROR Executor: Exception in task 0.0 in stage 187.0 (TID 85)
java.net.SocketException: Connection reset by peer: socket write error
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(Unknown Source)
at java.net.SocketOutputStream.write(Unknown Source)
at java.io.BufferedOutputStream.write(Unknown Source)
at java.io.DataOutputStream.write(Unknown Source)
at java.io.FilterOutputStream.write(Unknown Source)
at 
org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
at 
org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
at 
org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
at 
org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
15/08/31 00:11:45 WARN TaskSetManager: Lost task 0.0 in stage 187.0 (TID 85, 
localhost): java.net.SocketException: Connection reset by peer: socket write 
error
at 

[jira] [Assigned] (SPARK-10149) Locality Level is ANY on "Details for Stage" WebUI page

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10149:


Assignee: (was: Apache Spark)

> Locality Level is ANY on "Details for Stage" WebUI page
> ---
>
> Key: SPARK-10149
> URL: https://issues.apache.org/jira/browse/SPARK-10149
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.4.1
>Reporter: Yun Zhao
>
> Locality Level is ANY on "Details for Stage" WebUI page
> When an sc.textFile(XX) program is running, the locality level on stage 0 is ANY, 
> when it should be NODE_LOCAL. 
> org.apache.spark.scheduler.TaskSetManager
> {quote}
>   // Check for node-local tasks
>   if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
> for (index <- speculatableTasks if canRunOnHost(index)) {
>   val locations = tasks(index).preferredLocations.map(_.host)
>   if (locations.contains(host)) {
> speculatableTasks -= index
> return Some((index, TaskLocality.NODE_LOCAL))
>   }
> }
>   }
> {quote} 
> The variable "locations" is hostname of HDFS split, which is from 
> InetAddress.getHostName.
> The variable "host" is ip of Executor, which is from 
> InetAddress.getLocalHost.getHostAddress.
> org.apache.spark.deploy.worker.WorkerArguments
> {quote}
> var host = Utils.localHostName()
> {quote}
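The mismatch described above is a hostname-versus-IP comparison for the same machine: the HDFS split reports a hostname while the executor registers an IP address, so a plain string comparison generally fails to match. A minimal JDK-only illustration (not Spark code):

{code}
import java.net.InetAddress

// The two ways of identifying the same machine that the issue says get compared:
// a hostname (from reverse lookup) versus a dotted-quad IP address string.
val local = InetAddress.getLocalHost
val byName = local.getHostName        // e.g. "worker-1.example.com"
val byAddr = local.getHostAddress     // e.g. "192.168.1.10"
println(s"hostname = $byName, address = $byAddr, string-equal = ${byName == byAddr}")
{code}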



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10149) Locality Level is ANY on "Details for Stage" WebUI page

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10149:


Assignee: Apache Spark

> Locality Level is ANY on "Details for Stage" WebUI page
> ---
>
> Key: SPARK-10149
> URL: https://issues.apache.org/jira/browse/SPARK-10149
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.4.1
>Reporter: Yun Zhao
>Assignee: Apache Spark
>
> Locality Level is ANY on "Details for Stage" WebUI page
> When an sc.textFile(XX) program is running, the locality level on stage 0 is ANY, 
> when it should be NODE_LOCAL. 
> org.apache.spark.scheduler.TaskSetManager
> {quote}
>   // Check for node-local tasks
>   if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
> for (index <- speculatableTasks if canRunOnHost(index)) {
>   val locations = tasks(index).preferredLocations.map(_.host)
>   if (locations.contains(host)) {
> speculatableTasks -= index
> return Some((index, TaskLocality.NODE_LOCAL))
>   }
> }
>   }
> {quote} 
> The variable "locations" is hostname of HDFS split, which is from 
> InetAddress.getHostName.
> The variable "host" is ip of Executor, which is from 
> InetAddress.getLocalHost.getHostAddress.
> org.apache.spark.deploy.worker.WorkerArguments
> {quote}
> var host = Utils.localHostName()
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10149) Locality Level is ANY on "Details for Stage" WebUI page

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723073#comment-14723073
 ] 

Apache Spark commented on SPARK-10149:
--

User 'wulei-bj-cn' has created a pull request for this issue:
https://github.com/apache/spark/pull/8533

> Locality Level is ANY on "Details for Stage" WebUI page
> ---
>
> Key: SPARK-10149
> URL: https://issues.apache.org/jira/browse/SPARK-10149
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 1.4.1
>Reporter: Yun Zhao
>
> Locality Level is ANY on "Details for Stage" WebUI page
> When an sc.textFile(XX) program is running, the locality level on stage 0 is ANY, 
> when it should be NODE_LOCAL. 
> org.apache.spark.scheduler.TaskSetManager
> {quote}
>   // Check for node-local tasks
>   if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
> for (index <- speculatableTasks if canRunOnHost(index)) {
>   val locations = tasks(index).preferredLocations.map(_.host)
>   if (locations.contains(host)) {
> speculatableTasks -= index
> return Some((index, TaskLocality.NODE_LOCAL))
>   }
> }
>   }
> {quote} 
> The variable "locations" is hostname of HDFS split, which is from 
> InetAddress.getHostName.
> The variable "host" is ip of Executor, which is from 
> InetAddress.getLocalHost.getHostAddress.
> org.apache.spark.deploy.worker.WorkerArguments
> {quote}
> var host = Utils.localHostName()
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-6839) BlockManager.dataDeserialize leaks resources on user exceptions

2015-08-31 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid closed SPARK-6839.
---
Resolution: Won't Fix

> BlockManager.dataDeserialize leaks resources on user exceptions
> ---
>
> Key: SPARK-6839
> URL: https://issues.apache.org/jira/browse/SPARK-6839
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> From a discussion with [~vanzin] on {{ByteBufferInputStream}}, we realized 
> that 
> [{{BlockManager.dataDeserialize}}|https://github.com/apache/spark/blob/b5c51c8df480f1a82a82e4d597d8eea631bffb4e/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L1202]
>  doesn't  guarantee the underlying InputStream is properly closed.  In 
> particular, {{BlockManager.dispose(byteBuffer)}} will not get called any time 
> there is an exception in user code.
> The problem is that right now, we convert the input streams to iterators, and 
> only close the input stream if the end of the iterator is reached.  But, we 
> might never reach the end of the iterator -- the obvious case is if there is 
> a bug in the user code, so tasks fail part of the way through the iterator.
> I think the solution is to give {{BlockManager.dataDeserialize}} a 
> {{TaskContext}} so it can call {{context.addTaskCompletionListener}} to do 
> the cleanup (as is done in {{ShuffleBlockFetcherIterator}}).
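A rough sketch of the completion-listener pattern the last paragraph proposes, assuming the TaskContext.addTaskCompletionListener overload that takes a function (present in Spark 1.x); the stream below is a placeholder and this is not the actual BlockManager change:

{code}
import java.io.{ByteArrayInputStream, InputStream}
import org.apache.spark.TaskContext

// Placeholder for whatever produces the deserialization stream in BlockManager.
def openBlockStream(): InputStream = new ByteArrayInputStream(Array[Byte](1, 2, 3))

def dataDeserializeSketch(context: TaskContext): Iterator[Byte] = {
  val in = openBlockStream()
  // The listener runs when the task finishes, whether it succeeds or fails, so the
  // stream is closed even if user code throws partway through the iterator.
  context.addTaskCompletionListener { _ => in.close() }
  Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte)
}
{code}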



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2015-08-31 Thread Mao, Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724616#comment-14724616
 ] 

Mao, Wei commented on SPARK-8426:
-

Posted the design doc under the related JIRAs (both 8425 and 8426):
https://docs.google.com/document/d/1EqdocdbOH0eZ0Vp1RAHsE-8gKv9yPez1Xt8W5xgXn3I/edit?usp=sharing

[~sandyr] Sandy, could you help review and comment?

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8426) Add blacklist mechanism for YARN container allocation

2015-08-31 Thread Mao, Wei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724616#comment-14724616
 ] 

Mao, Wei edited comment on SPARK-8426 at 9/1/15 2:29 AM:
-

Posted the design doc under the related JIRAs (both 8425 and 8426):
https://docs.google.com/document/d/1EqdocdbOH0eZ0Vp1RAHsE-8gKv9yPez1Xt8W5xgXn3I/edit?usp=sharing

[~sandyr]
Sandy, could you help review and comment?


was (Author: mwws):
post the design doc under related JIRA (both 8425 and 8426)
https://docs.google.com/document/d/1EqdocdbOH0eZ0Vp1RAHsE-8gKv9yPez1Xt8W5xgXn3I/edit?usp=sharing

[~sandyr]Sandy, could you help to review and comment

> Add blacklist mechanism for YARN container allocation
> -
>
> Key: SPARK-8426
> URL: https://issues.apache.org/jira/browse/SPARK-8426
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, YARN
>Reporter: Saisai Shao
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10377) Cassandra connector affected by backport change

2015-08-31 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724612#comment-14724612
 ] 

Yin Huai commented on SPARK-10377:
--

[~frodeso] Is 
https://github.com/datastax/spark-cassandra-connector/blob/v1.4.0-M3/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala#L90
 the only place where we refer to TakeOrdered in the Cassandra connector? If so, we 
can rename it back.

> Cassandra connector affected by backport change
> ---
>
> Key: SPARK-10377
> URL: https://issues.apache.org/jira/browse/SPARK-10377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2
>Reporter: Frode Sormo
>
> The backport in SPARK-7289 and SPARK-9949 includes the refactor of 
> TakeOrdered to TakeOrderedAndProject, which breaks code that refers to 
> TakeOrdered. Such a change is perhaps not expected in a minor version update; 
> specifically, the Cassandra connector refers to this class by name and no longer 
> works.
> An example use case is to use the Cassandra connector in Scala and create a 
> CassandraSQLContext:
> import com.datastax.spark.connector._
> import sqlContext.implicits._
> import org.apache.spark.sql.cassandra.CassandraSQLContext
> val cassandraSQLContext = new CassandraSQLContext(sc);
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.TakeOrdered()Lorg/apache/spark/sql/execution/SparkStrategies$TakeOrdered$;
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.(CassandraSQLContext.scala:90)
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext.(CassandraSQLContext.scala:85)
> (Source code link: 
> https://github.com/datastax/spark-cassandra-connector/blob/v1.4.0-M3/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala)
>  
> This is with version 1.4.0M3 of the Datastax Cassandra connector, but affects 
> other 1.4 versions as well.
> Issue has also been reported to Datastax, here: 
> https://datastax-oss.atlassian.net/browse/SPARKC-238



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10377) Cassandra connector affected by backport change

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724620#comment-14724620
 ] 

Apache Spark commented on SPARK-10377:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8545

> Cassandra connector affected by backport change
> ---
>
> Key: SPARK-10377
> URL: https://issues.apache.org/jira/browse/SPARK-10377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2
>Reporter: Frode Sormo
>Assignee: Yin Huai
>
> The backport in SPARK-7289 and SPARK-9949 includes the refactor of 
> TakeOrdered to TakeOrderedAndProject, which breaks code that refers to 
> TakeOrdered. Such a change is perhaps not expected in a minor version update; 
> specifically, the Cassandra connector refers to this class by name and no longer 
> works.
> An example use case is to use the Cassandra connector in Scala and create a 
> CassandraSQLContext:
> import com.datastax.spark.connector._
> import sqlContext.implicits._
> import org.apache.spark.sql.cassandra.CassandraSQLContext
> val cassandraSQLContext = new CassandraSQLContext(sc);
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.TakeOrdered()Lorg/apache/spark/sql/execution/SparkStrategies$TakeOrdered$;
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.(CassandraSQLContext.scala:90)
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext.(CassandraSQLContext.scala:85)
> (Source code link: 
> https://github.com/datastax/spark-cassandra-connector/blob/v1.4.0-M3/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala)
>  
> This is with version 1.4.0M3 of the Datastax Cassandra connector, but affects 
> other 1.4 versions as well.
> Issue has also been reported to Datastax, here: 
> https://datastax-oss.atlassian.net/browse/SPARKC-238



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10377) Cassandra connector affected by backport change

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10377:


Assignee: Yin Huai  (was: Apache Spark)

> Cassandra connector affected by backport change
> ---
>
> Key: SPARK-10377
> URL: https://issues.apache.org/jira/browse/SPARK-10377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2
>Reporter: Frode Sormo
>Assignee: Yin Huai
>
> The backport in SPARK-7289 and SPARK-9949 includes the refactor of 
> TakeOrdered to TakeOrderedAndProject, which breaks code that refers to 
> TakeOrdered. Such a change is perhaps not expected in a minor version update; 
> specifically, the Cassandra connector refers to this class by name and no longer 
> works.
> An example use case is to use the Cassandra connector in Scala and create a 
> CassandraSQLContext:
> import com.datastax.spark.connector._
> import sqlContext.implicits._
> import org.apache.spark.sql.cassandra.CassandraSQLContext
> val cassandraSQLContext = new CassandraSQLContext(sc);
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.TakeOrdered()Lorg/apache/spark/sql/execution/SparkStrategies$TakeOrdered$;
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.(CassandraSQLContext.scala:90)
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext.(CassandraSQLContext.scala:85)
> (Source code link: 
> https://github.com/datastax/spark-cassandra-connector/blob/v1.4.0-M3/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala)
>  
> This is with version 1.4.0M3 of the Datastax Cassandra connector, but affects 
> other 1.4 versions as well.
> Issue has also been reported to Datastax, here: 
> https://datastax-oss.atlassian.net/browse/SPARKC-238



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10377) Cassandra connector affected by backport change

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10377:


Assignee: Apache Spark  (was: Yin Huai)

> Cassandra connector affected by backport change
> ---
>
> Key: SPARK-10377
> URL: https://issues.apache.org/jira/browse/SPARK-10377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2
>Reporter: Frode Sormo
>Assignee: Apache Spark
>
> The backport in SPARK-7289 and SPARK-9949 includes the refactor of 
> TakeOrdered to TakeOrderedAndProject, which breaks code that refers to 
> TakeOrdered. Such a change is perhaps not expected in a minor version update; 
> specifically, the Cassandra connector refers to this class by name and no longer 
> works.
> An example use case is to use the Cassandra connector in Scala and create a 
> CassandraSQLContext:
> import com.datastax.spark.connector._
> import sqlContext.implicits._
> import org.apache.spark.sql.cassandra.CassandraSQLContext
> val cassandraSQLContext = new CassandraSQLContext(sc);
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.TakeOrdered()Lorg/apache/spark/sql/execution/SparkStrategies$TakeOrdered$;
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.(CassandraSQLContext.scala:90)
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext.(CassandraSQLContext.scala:85)
> (Source code link: 
> https://github.com/datastax/spark-cassandra-connector/blob/v1.4.0-M3/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala)
>  
> This is with version 1.4.0M3 of the Datastax Cassandra connector, but affects 
> other 1.4 versions as well.
> Issue has also been reported to Datastax, here: 
> https://datastax-oss.atlassian.net/browse/SPARKC-238



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724701#comment-14724701
 ] 

Apache Spark commented on SPARK-10329:
--

User 'HuJiayin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8546

> Cost RDD in k-means|| initialization is not storage-efficient
> -
>
> Key: SPARK-10329
> URL: https://issues.apache.org/jira/browse/SPARK-10329
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Xiangrui Meng
>Assignee: hujiayin
>  Labels: clustering
>
> Currently we use `RDD[Vector]` to store point cost during k-means|| 
> initialization, where each `Vector` has size `runs`. This is not 
> storage-efficient because `runs` is usually 1 and then each record is a 
> Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
> introduce two objects (DenseVector and its values array), which could cost 16 
> bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
> for reporting this issue!
> There are several solutions:
> 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
> record.
> 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
> `Array[Double]` object covers 1024 instances, which could remove most of the 
> overhead.
> Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
> kicking out the training dataset from memory.
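A rough sketch of solution 2 from the list above: packing costs into fixed-size Array[Double] batches so per-record object overhead is amortized, and persisting with MEMORY_AND_DISK. The RDD is synthetic and the batch size of 1024 is simply the figure quoted in the description; this is not the actual patch.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Synthetic stand-in for the per-point cost values (the runs == 1 case).
val sc = new SparkContext(new SparkConf().setAppName("cost-batching-sketch").setMaster("local[2]"))
val costs = sc.parallelize(1 to 1000000).map(i => i * 0.1)

// Batch ~1024 costs per record to avoid one small object per point, and spill to
// disk rather than recompute (or evict the training data) if memory is tight.
val batched = costs.mapPartitions(_.grouped(1024).map(_.toArray))
batched.persist(StorageLevel.MEMORY_AND_DISK)

println(batched.map(_.sum).sum())   // aggregate over the batched representation
sc.stop()
{code}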



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10329:


Assignee: hujiayin  (was: Apache Spark)

> Cost RDD in k-means|| initialization is not storage-efficient
> -
>
> Key: SPARK-10329
> URL: https://issues.apache.org/jira/browse/SPARK-10329
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Xiangrui Meng
>Assignee: hujiayin
>  Labels: clustering
>
> Currently we use `RDD[Vector]` to store point cost during k-means|| 
> initialization, where each `Vector` has size `runs`. This is not 
> storage-efficient because `runs` is usually 1 and then each record is a 
> Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
> introduce two objects (DenseVector and its values array), which could cost 16 
> bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
> for reporting this issue!
> There are several solutions:
> 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
> record.
> 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
> `Array[Double]` object covers 1024 instances, which could remove most of the 
> overhead.
> Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
> kicking out the training dataset from memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10329) Cost RDD in k-means|| initialization is not storage-efficient

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10329:


Assignee: Apache Spark  (was: hujiayin)

> Cost RDD in k-means|| initialization is not storage-efficient
> -
>
> Key: SPARK-10329
> URL: https://issues.apache.org/jira/browse/SPARK-10329
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>  Labels: clustering
>
> Currently we use `RDD[Vector]` to store point cost during k-means|| 
> initialization, where each `Vector` has size `runs`. This is not 
> storage-efficient because `runs` is usually 1 and then each record is a 
> Vector of size 1. What we need is just the 8 bytes to store the cost, but we 
> introduce two objects (DenseVector and its values array), which could cost 16 
> bytes. That is 200% overhead. Thanks [~Grace Huang] and Jiayin Hu from Intel 
> for reporting this issue!
> There are several solutions:
> 1. Use `RDD[Array[Double]]` instead of `RDD[Vector]`, which saves 8 bytes per 
> record.
> 2. Use `RDD[Array[Double]]` but batch the values for storage, e.g. each 
> `Array[Double]` object covers 1024 instances, which could remove most of the 
> overhead.
> Besides, using MEMORY_AND_DISK instead of MEMORY_ONLY could prevent cost RDDs 
> kicking out the training dataset from memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10368) "Could not parse Master URL" leaves spark-shell unusable

2015-08-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10368:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

I'm not sure it's a bug; the error is reported correctly. Failing fast for this 
and other types of errors is probably appropriate, though maybe not for all of them. That 
would give a better user experience, as you currently have to dig to see that this has 
nothing to do with sqlContext.

> "Could not parse Master URL" leaves spark-shell unusable
> 
>
> Key: SPARK-10368
> URL: https://issues.apache.org/jira/browse/SPARK-10368
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 1.5.0
> Environment: Spark built from today's sources: 
> {{f0f563a3c43fc9683e6920890cce44611c0c5f4b}}
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When executing {{spark-shell}} with incorrect value for {{--master}} the 
> exception is thrown (twice!), but the shell remains open and is almost 
> unusable.
> {code}
> ➜  spark git:(master) ✗ ./bin/spark-shell --master mesos:localhost:8080
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/08/31 15:00:10 INFO SecurityManager: Changing view acls to: jacek
> 15/08/31 15:00:10 INFO SecurityManager: Changing modify acls to: jacek
> 15/08/31 15:00:10 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jacek); users 
> with modify permissions: Set(jacek)
> 15/08/31 15:00:11 INFO HttpServer: Starting HTTP Server
> 15/08/31 15:00:11 INFO Utils: Successfully started service 'HTTP server' on 
> port 56110.
> 15/08/31 15:00:14 INFO Main: Spark class server started at 
> http://192.168.99.1:56110
> 15/08/31 15:00:14 INFO SparkContext: Running Spark version 1.5.0-SNAPSHOT
> 15/08/31 15:00:14 INFO SecurityManager: Changing view acls to: jacek
> 15/08/31 15:00:14 INFO SecurityManager: Changing modify acls to: jacek
> 15/08/31 15:00:14 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(jacek); users 
> with modify permissions: Set(jacek)
> 15/08/31 15:00:15 INFO Slf4jLogger: Slf4jLogger started
> 15/08/31 15:00:15 INFO Remoting: Starting remoting
> 15/08/31 15:00:15 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@192.168.99.1:56112]
> 15/08/31 15:00:15 INFO Utils: Successfully started service 'sparkDriver' on 
> port 56112.
> 15/08/31 15:00:15 INFO SparkEnv: Registering MapOutputTracker
> 15/08/31 15:00:15 INFO SparkEnv: Registering BlockManagerMaster
> 15/08/31 15:00:15 INFO DiskBlockManager: Created local directory at 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/blockmgr-291429cd-8ca8-4622-87e5-b4c7ee68afcd
> 15/08/31 15:00:15 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
> 15/08/31 15:00:15 INFO HttpFileServer: HTTP File server directory is 
> /private/var/folders/0w/kb0d3rqn4zb9fcc91pxhgn8wgn/T/spark-803119c2-e940-4817-a166-836f16c9027d/httpd-585b0b63-0512-4ae9-8739-8032c4125c77
> 15/08/31 15:00:15 INFO HttpServer: Starting HTTP Server
> 15/08/31 15:00:15 INFO Utils: Successfully started service 'HTTP file server' 
> on port 56113.
> 15/08/31 15:00:15 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/08/31 15:00:15 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 15/08/31 15:00:15 INFO SparkUI: Started SparkUI at http://192.168.99.1:4040
> 15/08/31 15:00:15 ERROR SparkContext: Error initializing SparkContext.
> org.apache.spark.SparkException: Could not parse Master URL: 
> 'mesos:localhost:8080'
>   at 
> org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2693)
>   at org.apache.spark.SparkContext.(SparkContext.scala:506)
>   at org.apache.spark.repl.Main$.createSparkContext(Main.scala:78)
>   at $line3.$read$$iw$$iw.(:12)
>   at $line3.$read$$iw.(:21)
>   at $line3.$read.(:23)
>   at $line3.$read$.(:27)
>   at $line3.$read$.()
>   at $line3.$eval$.$print$lzycompute(:7)
>   at $line3.$eval$.$print(:6)
>   at $line3.$eval.$print()
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   

[jira] [Created] (SPARK-10369) Fix a bug that Receiver could not be started after deregistering

2015-08-31 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-10369:


 Summary: Fix a bug that Receiver could not be started after 
deregistering
 Key: SPARK-10369
 URL: https://issues.apache.org/jira/browse/SPARK-10369
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Shixiong Zhu
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10370) After a stage's map outputs are registered, all running attempts should be marked as zombies

2015-08-31 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-10370:


 Summary: After a stage's map outputs are registered, all running 
attempts should be marked as zombies
 Key: SPARK-10370
 URL: https://issues.apache.org/jira/browse/SPARK-10370
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Imran Rashid
Assignee: Imran Rashid


Follow-up to SPARK-5259.  During a stage retry, it's possible for a stage to 
"complete" by registering all its map output and starting the downstream 
stages before the latest task set has completed.  This will result in the 
earlier task set continuing to submit tasks that are unnecessary and that 
increase the chance of hitting SPARK-8029.

Spark should mark all task sets for a stage as zombies as soon as its map 
output is registered.  Note that this involves coordination between the various 
scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at least), which 
isn't easily testable with the current setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10369) Fix a bug that Receiver could not be started after deregistering

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10369:


Assignee: (was: Apache Spark)

> Fix a bug that Receiver could not be started after deregistering
> 
>
> Key: SPARK-10369
> URL: https://issues.apache.org/jira/browse/SPARK-10369
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10369) Fix a bug that Receiver could not be started after deregistering

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10369:


Assignee: Apache Spark

> Fix a bug that Receiver could not be started after deregistering
> 
>
> Key: SPARK-10369
> URL: https://issues.apache.org/jira/browse/SPARK-10369
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10369) Fix a bug that Receiver could not be started after deregistering

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723537#comment-14723537
 ] 

Apache Spark commented on SPARK-10369:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/8538

> Fix a bug that Receiver could not be started after deregistering
> 
>
> Key: SPARK-10369
> URL: https://issues.apache.org/jira/browse/SPARK-10369
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9642) LinearRegression should supported weighted data

2015-08-31 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723562#comment-14723562
 ] 

Seth Hendrickson commented on SPARK-9642:
-

NP, thanks for responding!

> LinearRegression should supported weighted data
> ---
>
> Key: SPARK-9642
> URL: https://issues.apache.org/jira/browse/SPARK-9642
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Meihua Wu
>  Labels: 1.6
>
> In many modeling applications, data points are not necessarily sampled with 
> equal probabilities. Linear regression should support weighting to account for 
> the over- or under-sampling. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10371) Optimize sequential projections

2015-08-31 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10371:
-

 Summary: Optimize sequential projections
 Key: SPARK-10371
 URL: https://issues.apache.org/jira/browse/SPARK-10371
 Project: Spark
  Issue Type: New Feature
  Components: ML, SQL
Affects Versions: 1.5.0
Reporter: Xiangrui Meng


In ML pipelines, each transformer/estimator appends new columns to the input 
DataFrame. For example, it might produce a DataFrame with the following columns: 
a, b, c, d, where a is from the raw input, b = udf_b(a), c = udf_c(b), and d = 
udf_d(c). Some UDFs could be expensive. However, if we materialize both c and d, 
udf_b and udf_c are triggered twice, i.e., the value of c is not re-used.

It would be nice to detect this pattern and re-use intermediate values.
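To make the pattern concrete, the sketch below builds a DataFrame in the shape described (a, b = udf_b(a), c = udf_c(b), d = udf_d(c)) using the Spark 1.5 DataFrame API; the UDF bodies are trivial placeholders, and the sketch only reproduces the duplication, it does not implement the proposed optimization.

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{col, udf}

val sc = new SparkContext(new SparkConf().setAppName("sequential-projections").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)

// Trivial placeholders standing in for the potentially expensive udf_b, udf_c, udf_d.
val udfB = udf((a: Int) => a + 1)
val udfC = udf((b: Int) => b * 2)
val udfD = udf((c: Int) => c - 3)

val df = sqlContext.createDataFrame((1 to 5).map(Tuple1.apply)).toDF("a")
  .withColumn("b", udfB(col("a")))
  .withColumn("c", udfC(col("b")))
  .withColumn("d", udfD(col("c")))

// Both output columns expand to expressions that contain udf_b and udf_c, so the
// upstream UDFs may be evaluated more than once per row -- the duplication above.
df.select("c", "d").show()
sc.stop()
{code}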



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10371) Optimize sequential projections

2015-08-31 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723585#comment-14723585
 ] 

Xiangrui Meng commented on SPARK-10371:
---

ping [~yhuai]

> Optimize sequential projections
> ---
>
> Key: SPARK-10371
> URL: https://issues.apache.org/jira/browse/SPARK-10371
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>
> In ML pipelines, each transformer/estimator appends new columns to the input 
> DataFrame. For example, it might produce a DataFrame with the following 
> columns: a, b, c, d, where a is from the raw input, b = udf_b(a), c = udf_c(b), 
> and d = udf_d(c). Some UDFs could be expensive. However, if we materialize both c 
> and d, udf_b and udf_c are triggered twice, i.e., the value of c is not re-used.
> It would be nice to detect this pattern and re-use intermediate values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10372) Tests for entire scheduler

2015-08-31 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-10372:


 Summary: Tests for entire scheduler
 Key: SPARK-10372
 URL: https://issues.apache.org/jira/browse/SPARK-10372
 Project: Spark
  Issue Type: Test
  Components: Spark Core
Affects Versions: 1.5.0
Reporter: Imran Rashid
Assignee: Imran Rashid


The current testing framework for the scheduler only tests individual classes 
in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc.  Of 
course that is useful, but we are missing tests which cover the interaction 
between these components.  We also have larger tests which run entire Spark 
jobs, but that doesn't allow fine-grained control of failures for verifying 
Spark's fault tolerance.

Adding a framework for testing the scheduler as a whole will:

1. Allow testing bugs which involve the interaction between multiple parts of 
the scheduler, e.g. SPARK-10370.

2. Give greater confidence in refactoring the scheduler as a whole.  Given the tight 
coordination between the components, it's hard to consider any refactoring, since 
it would be unlikely to be covered by any tests.

3. Make it easier to increase test coverage.  Writing tests for the 
{{DAGScheduler}} now requires intimate knowledge of exactly how the components 
fit together -- a lot of work goes into mimicking the appropriate behavior of 
the other components.  Furthermore, it makes the tests harder to understand for 
the un-initiated -- which parts are simulating some condition of an external 
system (e.g., losing an executor), and which parts are just interaction with 
other parts of the scheduler (e.g., task resubmission)?  These tests will allow 
us to work just at the level of the interaction with the executors -- tasks 
complete, tasks fail, executors are lost, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10361) model.predictAll() fails at user_product.first()

2015-08-31 Thread Velu nambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723621#comment-14723621
 ] 

Velu nambi commented on SPARK-10361:


Thanks [~srowen].

Is this a known issue, any suggestions ?

> model.predictAll() fails at user_product.first()
> 
>
> Key: SPARK-10361
> URL: https://issues.apache.org/jira/browse/SPARK-10361
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Environment: Windows 10, Python 2.7 and with all the three versions 
> of Spark
>Reporter: Velu nambi
>
> This code, adapted from the documentation, fails when calling PredictAll() 
> after an ALS.train()
> 15/08/31 00:11:45 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> java.net.SocketException: Connection reset by peer: socket write error
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>   at java.net.SocketOutputStream.write(Unknown Source)
>   at java.io.BufferedOutputStream.write(Unknown Source)
>   at java.io.DataOutputStream.write(Unknown Source)
>   at java.io.FilterOutputStream.write(Unknown Source)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
> 15/08/31 00:11:45 ERROR PythonRDD: This may have been caused by a prior 
> exception:
> java.net.SocketException: Connection reset by peer: socket write error
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>   at java.net.SocketOutputStream.write(Unknown Source)
>   at java.io.BufferedOutputStream.write(Unknown Source)
>   at java.io.DataOutputStream.write(Unknown Source)
>   at java.io.FilterOutputStream.write(Unknown Source)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
> 15/08/31 00:11:45 ERROR Executor: Exception in task 0.0 in stage 187.0 (TID 
> 85)
> java.net.SocketException: Connection reset by peer: socket write error
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>   at java.net.SocketOutputStream.write(Unknown Source)
>   at java.io.BufferedOutputStream.write(Unknown Source)
>   at java.io.DataOutputStream.write(Unknown Source)
>   at java.io.FilterOutputStream.write(Unknown Source)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
>   at 

[jira] [Commented] (SPARK-10360) Real time analytics using SparkR and Apache Kafka

2015-08-31 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723625#comment-14723625
 ] 

Shivaram Venkataraman commented on SPARK-10360:
---

There is no way to do this right now as far as I know --  I guess one way to do 
this might be to build a DataFrame from your stream and try to access that from 
SparkR.
BTW this question belongs in the user email list u...@spark.apache.org (more 
details at http://spark.apache.org/community.html) rather than on the JIRA. 

> Real time analytics using SparkR and Apache Kafka
> -
>
> Key: SPARK-10360
> URL: https://issues.apache.org/jira/browse/SPARK-10360
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 1.4.1
> Environment: SparkR
>Reporter: Niharika
>
> I want to do real time analytics in sparkR where I want to fetch the data 
> from Apache Kafka every second. Is there any way I can do real time analytics 
> in R language(similar to Spark Streaming in Scala or Java using KafkaUtils)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10360) Real time analytics using SparkR and Apache Kafka

2015-08-31 Thread Niharika (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723652#comment-14723652
 ] 

Niharika commented on SPARK-10360:
--

Thank you Shivaram for the reply, it is indeed helpful. I will close the issue.

> Real time analytics using SparkR and Apache Kafka
> -
>
> Key: SPARK-10360
> URL: https://issues.apache.org/jira/browse/SPARK-10360
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 1.4.1
> Environment: SparkR
>Reporter: Niharika
>
> I want to do real time analytics in sparkR where I want to fetch the data 
> from Apache Kafka every second. Is there any way I can do real time analytics 
> in R language(similar to Spark Streaming in Scala or Java using KafkaUtils)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10360) Real time analytics using SparkR and Apache Kafka

2015-08-31 Thread Niharika (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niharika closed SPARK-10360.

Resolution: Fixed

> Real time analytics using SparkR and Apache Kafka
> -
>
> Key: SPARK-10360
> URL: https://issues.apache.org/jira/browse/SPARK-10360
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 1.4.1
> Environment: SparkR
>Reporter: Niharika
>
> I want to do real time analytics in sparkR where I want to fetch the data 
> from Apache Kafka every second. Is there any way I can do real time analytics 
> in R language(similar to Spark Streaming in Scala or Java using KafkaUtils)?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7454) Perf test for power iteration clustering (PIC)

2015-08-31 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723666#comment-14723666
 ] 

Feynman Liang commented on SPARK-7454:
--

[~mengxr] [~josephkb] can we close this since PR 86 was merged?

> Perf test for power iteration clustering (PIC)
> --
>
> Key: SPARK-7454
> URL: https://issues.apache.org/jira/browse/SPARK-7454
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10361) model.predictAll() fails at user_product.first()

2015-08-31 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723668#comment-14723668
 ] 

Sean Owen commented on SPARK-10361:
---

The tests are passing and I haven't heard anything like this. It does point to 
a local problem. At least, this stack trace is not the problem per se; the 
Python process wasn't able to connect to the JVM. You'd need to see why.

> model.predictAll() fails at user_product.first()
> 
>
> Key: SPARK-10361
> URL: https://issues.apache.org/jira/browse/SPARK-10361
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
> Environment: Windows 10, Python 2.7 and with all the three versions 
> of Spark
>Reporter: Velu nambi
>
> This code, adapted from the documentation, fails when calling PredictAll() 
> after an ALS.train()
> 15/08/31 00:11:45 ERROR PythonRDD: Python worker exited unexpectedly (crashed)
> java.net.SocketException: Connection reset by peer: socket write error
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>   at java.net.SocketOutputStream.write(Unknown Source)
>   at java.io.BufferedOutputStream.write(Unknown Source)
>   at java.io.DataOutputStream.write(Unknown Source)
>   at java.io.FilterOutputStream.write(Unknown Source)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
> 15/08/31 00:11:45 ERROR PythonRDD: This may have been caused by a prior 
> exception:
> java.net.SocketException: Connection reset by peer: socket write error
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>   at java.net.SocketOutputStream.write(Unknown Source)
>   at java.io.BufferedOutputStream.write(Unknown Source)
>   at java.io.DataOutputStream.write(Unknown Source)
>   at java.io.FilterOutputStream.write(Unknown Source)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:248)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1772)
>   at 
> org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:208)
> 15/08/31 00:11:45 ERROR Executor: Exception in task 0.0 in stage 187.0 (TID 
> 85)
> java.net.SocketException: Connection reset by peer: socket write error
>   at java.net.SocketOutputStream.socketWrite0(Native Method)
>   at java.net.SocketOutputStream.socketWrite(Unknown Source)
>   at java.net.SocketOutputStream.write(Unknown Source)
>   at java.io.BufferedOutputStream.write(Unknown Source)
>   at java.io.DataOutputStream.write(Unknown Source)
>   at java.io.FilterOutputStream.write(Unknown Source)
>   at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:413)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:425)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at 
> org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>   at 
> 

[jira] [Updated] (SPARK-10264) Add @Since annotation to ml.recoomendation

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10264:
--
Assignee: Tijo Thomas

> Add @Since annotation to ml.recoomendation
> --
>
> Key: SPARK-10264
> URL: https://issues.apache.org/jira/browse/SPARK-10264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Tijo Thomas
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10264) Add @Since annotation to ml.recoomendation

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10264:
--
Target Version/s: 1.6.0

> Add @Since annotation to ml.recoomendation
> --
>
> Key: SPARK-10264
> URL: https://issues.apache.org/jira/browse/SPARK-10264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Tijo Thomas
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10373) Move @since annotator to pyspark to be shared by all components

2015-08-31 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-10373:
-

 Summary: Move @since annotator to pyspark to be shared by all 
components
 Key: SPARK-10373
 URL: https://issues.apache.org/jira/browse/SPARK-10373
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 1.5.0
Reporter: Xiangrui Meng
Assignee: Davies Liu


Python's `@since` is defined under `pyspark.sql`. It would be nice to move it 
under `pyspark` to be shared by all components.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10258) Add @Since annotation to ml.feature

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10258:
--
Assignee: Martin Brown

> Add @Since annotation to ml.feature
> ---
>
> Key: SPARK-10258
> URL: https://issues.apache.org/jira/browse/SPARK-10258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Martin Brown
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10259) Add @Since annotation to ml.classification

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10259:
--
Assignee: Hiroshi Takahashi

> Add @Since annotation to ml.classification
> --
>
> Key: SPARK-10259
> URL: https://issues.apache.org/jira/browse/SPARK-10259
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Hiroshi Takahashi
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10259) Add @Since annotation to ml.classification

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10259:
--
Target Version/s: 1.6.0

> Add @Since annotation to ml.classification
> --
>
> Key: SPARK-10259
> URL: https://issues.apache.org/jira/browse/SPARK-10259
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Hiroshi Takahashi
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10358) Spark-sql throws IOException on exit when using HDFS to store event log.

2015-08-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723705#comment-14723705
 ] 

Marcelo Vanzin commented on SPARK-10358:


This is fixed in 1.4 and later; I doubt we'll accept patches to 1.3 at this 
point in time...

> Spark-sql throws IOException on exit when using HDFS to store event log.
> 
>
> Key: SPARK-10358
> URL: https://issues.apache.org/jira/browse/SPARK-10358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: * spark-1.3.1-bin-hadoop2.6
> * hadoop-2.6.0
> * Red hat 2.6.32-504.el6.x86_64
>Reporter: Sioa Song
>Priority: Minor
> Fix For: 1.3.1
>
>
> h2. Summary 
> In Spark 1.3.1, if using HDFS to store event log, spark-sql will throw an 
> "java.io.IOException: Filesystem closed" when exit. 
> h2. How to reproduce 
> 1. Enable event log mechanism, and configure the file location to HDFS. 
>You can do this by setting these two properties in spark-defaults.conf: 
> spark.eventLog.enabled  true 
> spark.eventLog.dir  hdfs://x:x/spark-events 
> 2. start spark-sql, and type exit once it starts. 
> {noformat} 
> spark-sql> exit; 
> 15/08/14 06:29:20 ERROR scheduler.LiveListenerBus: Listener 
> EventLoggingListener threw an exception 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  
> at java.lang.reflect.Method.invoke(Method.java:597) 
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>  
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>  
> at scala.Option.foreach(Option.scala:236) 
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
>  
> at 
> org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:181)
>  
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
>  
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>  
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>  
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) 
> at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36)
>  
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76)
>  
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
>  
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
>  
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1678) 
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60)
>  
> Caused by: java.io.IOException: Filesystem closed 
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795) 
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1985) 
> at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1946) 
> at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130) 
> ... 19 more 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/metrics/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/stage/kill,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/threadDump/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/threadDump,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/environment/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/environment,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/storage/rdd/json,null} 
> 15/08/14 06:29:20 INFO 

[jira] [Commented] (SPARK-10358) Spark-sql throws IOException on exit when using HDFS to store event log.

2015-08-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723757#comment-14723757
 ] 

Marcelo Vanzin commented on SPARK-10358:


Original issue: SPARK-6014

> Spark-sql throws IOException on exit when using HDFS to store event log.
> 
>
> Key: SPARK-10358
> URL: https://issues.apache.org/jira/browse/SPARK-10358
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1
> Environment: * spark-1.3.1-bin-hadoop2.6
> * hadoop-2.6.0
> * Red hat 2.6.32-504.el6.x86_64
>Reporter: Sioa Song
>Priority: Minor
> Fix For: 1.3.1
>
>
> h2. Summary 
> In Spark 1.3.1, if using HDFS to store event log, spark-sql will throw an 
> "java.io.IOException: Filesystem closed" when exit. 
> h2. How to reproduce 
> 1. Enable event log mechanism, and configure the file location to HDFS. 
>You can do this by setting these two properties in spark-defaults.conf: 
> spark.eventLog.enabled  true 
> spark.eventLog.dir  hdfs://x:x/spark-events 
> 2. start spark-sql, and type exit once it starts. 
> {noformat} 
> spark-sql> exit; 
> 15/08/14 06:29:20 ERROR scheduler.LiveListenerBus: Listener 
> EventLoggingListener threw an exception 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  
> at java.lang.reflect.Method.invoke(Method.java:597) 
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>  
> at 
> org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
>  
> at scala.Option.foreach(Option.scala:236) 
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144)
>  
> at 
> org.apache.spark.scheduler.EventLoggingListener.onApplicationEnd(EventLoggingListener.scala:181)
>  
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:54)
>  
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>  
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
>  
> at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:53) 
> at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:36)
>  
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:76)
>  
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
>  
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply(AsynchronousListenerBus.scala:61)
>  
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1678) 
> at 
> org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:60)
>  
> Caused by: java.io.IOException: Filesystem closed 
> at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:795) 
> at 
> org.apache.hadoop.hdfs.DFSOutputStream.flushOrSync(DFSOutputStream.java:1985) 
> at org.apache.hadoop.hdfs.DFSOutputStream.hflush(DFSOutputStream.java:1946) 
> at 
> org.apache.hadoop.fs.FSDataOutputStream.hflush(FSDataOutputStream.java:130) 
> ... 19 more 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/metrics/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/stages/stage/kill,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/static,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/threadDump/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/threadDump,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/executors,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/environment/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/environment,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> o.s.j.s.ServletContextHandler{/storage/rdd/json,null} 
> 15/08/14 06:29:20 INFO handler.ContextHandler: stopped 
> 

[jira] [Created] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Matt Cheah (JIRA)
Matt Cheah created SPARK-10374:
--

 Summary: Spark-core 1.5.0-RC2 can create version conflicts with 
apps depending on protobuf-2.4
 Key: SPARK-10374
 URL: https://issues.apache.org/jira/browse/SPARK-10374
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: Matt Cheah
Priority: Blocker
 Fix For: 1.5.0


My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
When I run the driver application, I can hit the following error:

{code}
… java.lang.UnsupportedOperationException: This is 
supposed to be overridden by subclasses.
at 
com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
at 
com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
{code}

This application used to work when pulling in Spark 1.4.1 dependencies, and 
thus this is a regression.

I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
modules. It appears that Spark used to shade its protobuf dependency, and hence 
Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However, when I ran 
dependencyInsight against Spark 1.5, it looks like protobuf is no longer shaded 
in the Spark module.

1.4.1 dependencyInsight:

{code}
com.google.protobuf:protobuf-java:2.4.0a
+--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
|    \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
|         +--- compile
|         \--- org.apache.spark:spark-core_2.10:1.4.1
|              +--- compile
|              +--- org.apache.spark:spark-sql_2.10:1.4.1
|              |    \--- compile
|              \--- org.apache.spark:spark-catalyst_2.10:1.4.1
|                   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
\--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
     \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)

org.spark-project.protobuf:protobuf-java:2.5.0-spark
\--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
     \--- org.apache.spark:spark-core_2.10:1.4.1
          +--- compile
          +--- org.apache.spark:spark-sql_2.10:1.4.1
          |    \--- compile
          \--- org.apache.spark:spark-catalyst_2.10:1.4.1
               \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
{code}

1.5.0-rc2 dependencyInsight:

{code}
com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
\--- com.typesafe.akka:akka-remote_2.10:2.3.11
     \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
          +--- compile
          +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
          |    \--- compile
          \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
               \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)

com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
+--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
|    \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
|         +--- compile
|         \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
|              +--- compile
|              +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
|              |    \--- compile
|              \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
|                   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
\--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
     \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
{code}

Clearly we can't force the version to be one way or the other. If I force 
protobuf to use 2.5.0, then invoking Hadoop code from my application will break 
as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other hand, 
forcing protobuf to use version 2.4 breaks spark-core code that is compiled 
against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are not binary 
compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723768#comment-14723768
 ] 

Matt Cheah commented on SPARK-10374:


I intend to create a smaller standalone program that reproduces the issue, where 
I can paste the full dependency graph. The application where I first saw the 
issue is pretty big, and viewing the whole graph would be pretty much 
intractable.

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>Priority: Blocker
> Fix For: 1.5.0
>
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> {code}
> Clearly we can't force the version to be one way or the other. If I force 
> protobuf to use 2.5.0, then invoking Hadoop code from my application will 
> break as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other 
> hand, forcing protobuf to use version 2.4 breaks spark-core code that is 
> compiled against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are 
> not binary compatible.



--
This 

[jira] [Updated] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10374:
--
 Priority: Major  (was: Blocker)
Fix Version/s: (was: 1.5.0)

([~mcheah] don't set Fix version or Blocker: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)

I think the more basic issue is that you have a build for Hadoop 2.2+ and are 
using 2.0. The artifacts in Maven won't necessarily work for you. You would need 
something like the cdh4 profile and a custom build ... though here I think it's 
the akka dependency that would need the custom build.

Also, it's not clear to me whether you marked the Spark dependencies as 
"provided", although I don't know that that's the issue.

TBH I don't know if Spark necessarily works with Hadoop 2.0.0; 1.4 didn't fully 
work with 1.x.
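
For reference, a one-line sketch of that "provided" marking in sbt syntax (illustrative 
only -- the reporter's build is Gradle, which expresses the same idea with its own 
provided/compileOnly-style configuration; the version below is an assumption):

{code}
// Spark is supplied by the cluster at runtime, so don't bundle or force-resolve it:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.0" % "provided"
{code}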

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> {code}
> Clearly we can't force the version to be one way or the other. If I force 
> protobuf to use 

[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723792#comment-14723792
 ] 

Patrick Wendell commented on SPARK-10374:
-

Hey Matt,

I think the only thing that could have influenced you is that we changed our 
default advertised akka dependency. We used to advertise an older version of 
akka that shaded protobuf. What happens if you manually coerce that version of 
akka in your application?

Spark itself doesn't directly use protobuf. But some of our dependencies do, 
including both akka and Hadoop. My guess is that you are now in a situation 
where you can't reconcile the akka and hadoop protobuf versions and make them 
both happy. This would be consistent with the changes we made in 1.5 in 
SPARK-7042.

The fix would be to exclude all com.typesafe.akka artifacts from Spark and 
manually add org.spark-project.akka to your build.

However, since you didn't post a full stack trace, I can't know for sure 
whether it is akka that complains when you try to fix the protobuf version at 
2.4.
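
A sketch of that exclude-and-re-add, in sbt syntax purely for illustration (the reporter's 
build is Gradle; the Spark and akka versions below are assumptions):

{code}
libraryDependencies ++= Seq(
  // Drop the com.typesafe.akka artifacts that spark-core 1.5 pulls in transitively...
  ("org.apache.spark" %% "spark-core" % "1.5.0")
    .excludeAll(ExclusionRule(organization = "com.typesafe.akka")),
  // ...and add back the Spark-built akka, which bundles a shaded protobuf.
  "org.spark-project.akka" %% "akka-remote" % "2.3.4-spark"
)
{code}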

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- 

[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723797#comment-14723797
 ] 

Marcelo Vanzin commented on SPARK-10374:


This is actually caused by the akka version change. In 1.4, Spark depends on a 
custom build of akka ({{2.3.4-spark}}) that has a shaded protobuf dependency. 
1.5 depends on {{2.3.11}} which depends on the unshaded protobuf 2.5.0.

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> {code}
> Clearly we can't force the version to be one way or the other. If I force 
> protobuf to use 2.5.0, then invoking Hadoop code from my application will 
> break as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other 
> hand, forcing protobuf to use version 2.4 breaks spark-core code that is 
> compiled against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are 
> not binary compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723802#comment-14723802
 ] 

Marcelo Vanzin commented on SPARK-10374:


BTW, since akka depends on protobuf, one cannot simply override the dependency, 
since then akka might break. Does anyone know the extent of akka's use of 
protobuf?

This does sound pretty bad (it may make Spark's hadoop-1 builds unusable, at 
least in certain situations).

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> {code}
> Clearly we can't force the version to be one way or the other. If I force 
> protobuf to use 2.5.0, then invoking Hadoop code from my application will 
> break as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other 
> hand, forcing protobuf to use version 2.4 breaks spark-core code that is 
> compiled against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are 
> not binary compatible.



--
This message was sent by 

[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723817#comment-14723817
 ] 

Marcelo Vanzin commented on SPARK-10374:


Never mind; Patrick pointed out that the hadoop-1 binaries are actually built with a 
different version of akka, so this is restricted to the published Maven 
artifacts. Adding a dependency exclusion for {{protobuf-java}} on the 
spark-core dependency should fix this.
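
In sbt syntax that exclusion would look roughly like the following (illustrative only; 
the version is an assumption, and a Gradle build would use an exclude on the same 
group/module, letting protobuf resolve from the Hadoop dependency instead):

{code}
libraryDependencies += ("org.apache.spark" %% "spark-core" % "1.5.0")
  .exclude("com.google.protobuf", "protobuf-java")
{code}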

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> {code}
> Clearly we can't force the version to be one way or the other. If I force 
> protobuf to use 2.5.0, then invoking Hadoop code from my application will 
> break as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other 
> hand, forcing protobuf to use version 2.4 breaks spark-core code that is 
> compiled against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are 
> not binary compatible.



--
This message was sent by Atlassian JIRA

[jira] [Resolved] (SPARK-8730) Deser primitive class with Java serialization

2015-08-31 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-8730.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 7122
[https://github.com/apache/spark/pull/7122]

> Deser primitive class with Java serialization
> -
>
> Key: SPARK-8730
> URL: https://issues.apache.org/jira/browse/SPARK-8730
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Eugen Cepoi
>Priority: Critical
> Fix For: 1.6.0
>
>
> Objects that contain a primitive Class as a property cannot be deserialized 
> using the Java serializer: Class.forName does not work for primitives.
> Example of such an object:
> class Foo extends Serializable {
>   val intClass = classOf[Int]
> }
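
A self-contained Scala sketch of the kind of primitive-name fallback a Class.forName-based 
resolveClass needs (illustrative only, against plain java.io; this is not the actual Spark 
patch from PR 7122):

{code}
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream, ObjectStreamClass}

class Foo extends Serializable {
  val intClass: Class[Int] = classOf[Int]  // classOf[Int] is the primitive `int` class
}

object PrimitiveClassSerdeSketch {
  // Class.forName("int") throws ClassNotFoundException, so a resolveClass override
  // that resolves names via Class.forName needs an explicit primitive lookup table.
  private val primitives: Map[String, Class[_]] = Map(
    "boolean" -> classOf[Boolean], "byte" -> classOf[Byte], "char" -> classOf[Char],
    "short" -> classOf[Short], "int" -> classOf[Int], "long" -> classOf[Long],
    "float" -> classOf[Float], "double" -> classOf[Double])

  def main(args: Array[String]): Unit = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    out.writeObject(new Foo)
    out.close()

    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray)) {
      override def resolveClass(desc: ObjectStreamClass): Class[_] =
        primitives.getOrElse(desc.getName, super.resolveClass(desc))
    }
    println(in.readObject().asInstanceOf[Foo].intClass)  // prints: int
  }
}
{code}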



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10374) Spark-core 1.5.0-RC2 can create version conflicts with apps depending on protobuf-2.4

2015-08-31 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723826#comment-14723826
 ] 

Matt Cheah commented on SPARK-10374:


I'll try switching the Akka version pulled in by Spark and see how it goes. 
Thanks!

> Spark-core 1.5.0-RC2 can create version conflicts with apps depending on 
> protobuf-2.4
> -
>
> Key: SPARK-10374
> URL: https://issues.apache.org/jira/browse/SPARK-10374
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.0
>Reporter: Matt Cheah
>
> My Hadoop cluster is running 2.0.0-CDH4.7.0, and I have an application that 
> depends on the Spark 1.5.0 libraries via Gradle, and Hadoop 2.0.0 libraries. 
> When I run the driver application, I can hit the following error:
> {code}
> … java.lang.UnsupportedOperationException: This is 
> supposed to be overridden by subclasses.
> at 
> com.google.protobuf.GeneratedMessage.getUnknownFields(GeneratedMessage.java:180)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$GetFileInfoRequestProto.getSerializedSize(ClientNamenodeProtocolProtos.java:30108)
> at 
> com.google.protobuf.AbstractMessageLite.toByteString(AbstractMessageLite.java:49)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.constructRpcRequest(ProtobufRpcEngine.java:149)
> {code}
> This application used to work when pulling in Spark 1.4.1 dependencies, and 
> thus this is a regression.
> I used Gradle’s dependencyInsight task to dig a bit deeper. Against our Spark 
> 1.4.1-backed project, it shows that dependency resolution pulls in Protobuf 
> 2.4.0a from the Hadoop CDH4 modules and Protobuf 2.5.0-spark from the Spark 
> modules. It appears that Spark used to shade its protobuf dependencies and 
> hence Spark’s and Hadoop’s protobuf dependencies wouldn’t collide. However 
> when I ran dependencyInsight again against Spark 1.5 and it looks like 
> protobuf is no longer shaded from the Spark module.
> 1.4.1 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.4.0a
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.4.1
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.4.1
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.4.1
> |   \--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> org.spark-project.protobuf:protobuf-java:2.5.0-spark
> \--- org.spark-project.akka:akka-remote_2.10:2.3.4-spark
>  \--- org.apache.spark:spark-core_2.10:1.4.1
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.4.1
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.4.1
>\--- org.apache.spark:spark-sql_2.10:1.4.1 (*)
> {code}
> 1.5.0-rc2 dependencyInsight:
> {code}
> com.google.protobuf:protobuf-java:2.5.0 (conflict resolution)
> \--- com.typesafe.akka:akka-remote_2.10:2.3.11
>  \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
>   +--- compile
>   +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
>   |\--- compile
>   \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
>\--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> com.google.protobuf:protobuf-java:2.4.0a -> 2.5.0
> +--- org.apache.hadoop:hadoop-common:2.0.0-cdh4.6.0
> |\--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0
> | +--- compile
> | \--- org.apache.spark:spark-core_2.10:1.5.0-rc2
> |  +--- compile
> |  +--- org.apache.spark:spark-sql_2.10:1.5.0-rc2
> |  |\--- compile
> |  \--- org.apache.spark:spark-catalyst_2.10:1.5.0-rc2
> |   \--- org.apache.spark:spark-sql_2.10:1.5.0-rc2 (*)
> \--- org.apache.hadoop:hadoop-hdfs:2.0.0-cdh4.6.0
>  \--- org.apache.hadoop:hadoop-client:2.0.0-mr1-cdh4.6.0 (*)
> {code}
> Clearly we can't force the version to be one way or the other. If I force 
> protobuf to use 2.5.0, then invoking Hadoop code from my application will 
> break as Hadoop 2.0.0 jars are compiled against protobuf-2.4. On the other 
> hand, forcing protobuf to use version 2.4 breaks spark-core code that is 
> compiled against protobuf-2.5. Note that protobuf-2.4 and protobuf-2.5 are 
> not binary compatible.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-10359) Enumerate Spark's dependencies in a file and diff against it for new pull requests

2015-08-31 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723844#comment-14723844
 ] 

Patrick Wendell commented on SPARK-10359:
-

The approach in SPARK-4123 was a bit different, but there is some overlap. We 
ended up reverting that patch because it wasn't working consistently. I'll 
close that one as a dup of this one.

> Enumerate Spark's dependencies in a file and diff against it for new pull 
> requests 
> ---
>
> Key: SPARK-10359
> URL: https://issues.apache.org/jira/browse/SPARK-10359
> Project: Spark
>  Issue Type: New Feature
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>
> Sometimes when we have dependency changes it can be pretty unclear what 
> transitive set of things are changing. If we enumerate all of the 
> dependencies and put them in a source file in the repo, we can make it so 
> that it is very explicit what is changing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4123) Show dependency changes in pull requests

2015-08-31 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-4123.

Resolution: Duplicate

I've proposed a slightly different approach in SPARK-10359, so I'm closing this 
since there is high overlap.

> Show dependency changes in pull requests
> 
>
> Key: SPARK-4123
> URL: https://issues.apache.org/jira/browse/SPARK-4123
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Reporter: Patrick Wendell
>Assignee: Brennon York
>Priority: Critical
>
> We should inspect the classpath of Spark's assembly jar for every pull 
> request. This only takes a few seconds in Maven and it will help weed out 
> dependency changes from the master branch. Ideally we'd post any dependency 
> changes in the pull request message.
> {code}
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > my-classpath
> $ git checkout apache/master
> $ mvn -Phive -Phadoop-2.4 dependency:build-classpath -pl assembly  | grep -v 
> INFO | tr : "\n" | awk -F/ '{print $NF}' | sort > master-classpath
> $ diff my-classpath master-classpath
> < chill-java-0.3.6.jar
> < chill_2.10-0.3.6.jar
> ---
> > chill-java-0.5.0.jar
> > chill_2.10-0.5.0.jar
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is valid

2015-08-31 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723851#comment-14723851
 ] 

Zhan Zhang commented on SPARK-10304:


[~yhuai] I tried to reproduce the problem with the same directory structure 
(table is saved in /tmp/table/peoplePartitioned), but didn't hit the problem. 
val table = sqlContext.read.format("orc").load("/tmp/table")
table.registerTempTable("table")
sqlContext.sql("SELECT * FROM table WHERE age = 19").show
sqlContext.sql("SELECT * FROM table").show
val table = sqlContext.read.format("orc").load("/tmp/table/peoplePartitioned")
table.registerTempTable("table")
sqlContext.sql("SELECT * FROM table WHERE age = 19").show
sqlContext.sql("SELECT * FROM table").show

I went through the partition parsing code, which parses the leaf directories, so 
the starting point does not matter; both /tmp/table and 
/tmp/table/peoplePartitioned are valid directories as long as they only contain 
files from the same ORC table. 

Does your /tmp/table have some extra files not belonging to the same table? If 
not, can you please provide exact reproducing steps?

cc [~lian cheng]

> Partition discovery does not throw an exception if the dir structure is valid
> -
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> it can even return rows. We should complain to users about the dir struct 
> because {{table1}} does not meet our format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is valid

2015-08-31 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723858#comment-14723858
 ] 

Yin Huai commented on SPARK-10304:
--

The dir struct I have is something like the following.

{code}
/tmp/tables
|-> /tmp/tables/partitionedTable
 |-> /tmp/tables/partitionedTable/p=1/
|-> /tmp/tables/nonPartitionedTable1
|-> /tmp/tables/nonPartitionedTable2
{code}

> Partition discovery does not throw an exception if the dir structure is valid
> -
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> it can even return rows. We should complain to users about the dir struct 
> because {{table1}} does not meet our format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work

2015-08-31 Thread Thomas (JIRA)
Thomas created SPARK-10375:
--

 Summary: Setting the driver memory with 
SparkConf().set("spark.driver.memory","1g") does not work
 Key: SPARK-10375
 URL: https://issues.apache.org/jira/browse/SPARK-10375
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0
 Environment: Running with yarn
Reporter: Thomas
Priority: Minor


When running pyspark 1.3.0 with yarn, the following code has no effect:
pyspark.SparkConf().set("spark.driver.memory","1g")

The Environment tab in yarn shows that the driver has 1g, however, the 
Executors tab only shows 512 M (the default value) for the driver memory.  This 
issue goes away when the driver memory is specified via the command line (i.e. 
--driver-memory 1g)




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10376) Once/When YARN permits it, only use POST for kill action

2015-08-31 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-10376:
--

 Summary: Once/When YARN permits it, only use POST for kill action
 Key: SPARK-10376
 URL: https://issues.apache.org/jira/browse/SPARK-10376
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.5.0
 Environment: YARN
Reporter: Steve Loughran
Priority: Minor


YARN's RM proxy doesn't currently support verbs other than GET; YARN-2084 
and YARN-2031 cover this. 

Once YARN permits it, the SparkUI setup logic that allows job kills on a GET 
can be removed (either the minimum supported version of YARN has to be recent 
enough, or some API probe has to imply the feature is present).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10369) Fix a bug that Receiver could not be started after deregistering

2015-08-31 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-10369:
--
Fix Version/s: 1.5.0

> Fix a bug that Receiver could not be started after deregistering
> 
>
> Key: SPARK-10369
> URL: https://issues.apache.org/jira/browse/SPARK-10369
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10369) Fix a bug that Receiver could not be started after deregistering

2015-08-31 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-10369.
---
Resolution: Fixed
  Assignee: Shixiong Zhu

> Fix a bug that Receiver could not be started after deregistering
> 
>
> Key: SPARK-10369
> URL: https://issues.apache.org/jira/browse/SPARK-10369
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work

2015-08-31 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-10375.

Resolution: Not A Problem

You can't set the driver memory after the driver has already started. If you 
want to set it, you need to set it either in your config file or in the 
spark-submit command ({{--driver-memory 1g}} or {{--conf 
spark.driver.memory=1g}}).

The UI discrepancy is unfortunate but not easy (nor important enough) to fix, 
at least at the moment. It affects quite a lot of properties that can't really 
change after the context is initialized.
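
To illustrate the point, a hedged sketch (the class and jar names in the comments are placeholders): the driver JVM's heap is sized before any user code runs, so a SparkConf setting applied inside the application arrives too late, while the spark-submit flag or spark-defaults.conf is read before launch.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Has no effect on the already-running driver JVM; its heap was sized
// before this line executes:
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.driver.memory", "1g")
val sc = new SparkContext(conf)

// Works, because the setting is known before the driver JVM is launched:
//   spark-submit --driver-memory 1g --class example.App app.jar
// or in conf/spark-defaults.conf:
//   spark.driver.memory  1g
{code}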

> Setting the driver memory with SparkConf().set("spark.driver.memory","1g") 
> does not work
> 
>
> Key: SPARK-10375
> URL: https://issues.apache.org/jira/browse/SPARK-10375
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Running with yarn
>Reporter: Thomas
>Priority: Minor
>
> When running pyspark 1.3.0 with yarn, the following code has no effect:
> pyspark.SparkConf().set("spark.driver.memory","1g")
> The Environment tab in yarn shows that the driver has 1g, however, the 
> Executors tab only shows 512 M (the default value) for the driver memory.  
> This issue goes away when the driver memory is specified via the command line 
> (i.e. --driver-memory 1g)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10375) Setting the driver memory with SparkConf().set("spark.driver.memory","1g") does not work

2015-08-31 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723902#comment-14723902
 ] 

Marcelo Vanzin edited comment on SPARK-10375 at 8/31/15 7:27 PM:
-

You can't set the driver memory after the driver has already started. If you 
want to set it, you need to set it either in your config file or in the 
spark-submit command ({{--driver-memory 1g}} or {{--conf 
spark.driver.memory=1g}}).

The UI discrepancy is unfortunate but not easy (nor important enough) to fix, 
at least at the moment. It affects quite a lot of properties that can't really 
change after the context is initialized.


was (Author: vanzin):
You can't set the driver memory after the driver has already started. If you 
want to set it, you need to set if either in your config file or in the 
spark-submit command ({{--driver-memory 1g}} or {{--conf 
spark.driver.memory=1g").

The UI discrepancy is unfortunate but not easy (nor important enough) to fix, 
at least at the moment. It affects quite a lot of properties that can't really 
change after the context is initialized.

> Setting the driver memory with SparkConf().set("spark.driver.memory","1g") 
> does not work
> 
>
> Key: SPARK-10375
> URL: https://issues.apache.org/jira/browse/SPARK-10375
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.3.0
> Environment: Running with yarn
>Reporter: Thomas
>Priority: Minor
>
> When running pyspark 1.3.0 with yarn, the following code has no effect:
> pyspark.SparkConf().set("spark.driver.memory","1g")
> The Environment tab in yarn shows that the driver has 1g, however, the 
> Executors tab only shows 512 M (the default value) for the driver memory.  
> This issue goes away when the driver memory is specified via the command line 
> (i.e. --driver-memory 1g)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10370) After a stages map outputs are registered, all running attempts should be marked as zombies

2015-08-31 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-10370:
-
Assignee: (was: Imran Rashid)

> After a stages map outputs are registered, all running attempts should be 
> marked as zombies
> ---
>
> Key: SPARK-10370
> URL: https://issues.apache.org/jira/browse/SPARK-10370
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>
> Follow up to SPARK-5259.  During stage retry, it's possible for a stage to 
> "complete" by registering all its map output and starting the downstream 
> stages before the latest task set has completed.  This will result in the 
> earlier task set continuing to submit tasks that are both unnecessary and 
> increase the chance of hitting SPARK-8029.
> Spark should mark all task sets for a stage as zombies as soon as its map 
> output is registered.  Note that this involves coordination between the 
> various scheduler components ({{DAGScheduler}} and {{TaskSetManager}} at 
> least) which isn't easily testable with the current setup.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10372) Tests for entire scheduler

2015-08-31 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-10372:
-
Fix Version/s: 1.6.0

> Tests for entire scheduler
> --
>
> Key: SPARK-10372
> URL: https://issues.apache.org/jira/browse/SPARK-10372
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
> Fix For: 1.6.0
>
>
> The current testing framework for the scheduler only tests individual classes 
> in isolation: {{DAGSchedulerSuite}}, {{TaskSchedulerImplSuite}}, etc.  Of 
> course that is useful, but we are missing tests which cover the interaction 
> between these components.  We also have larger tests which run entire Spark 
> jobs, but those don't allow fine-grained control of failures for verifying 
> Spark's fault tolerance.
> Adding a framework for testing the scheduler as a whole will:
> 1. Allow testing bugs which involve the interaction between multiple parts of 
> the scheduler, eg. SPARK-10370
> 2. Give greater confidence in refactoring the scheduler as a whole.  Given the 
> tight coordination between the components, it's hard to consider any 
> refactoring, since it would be unlikely to be covered by any tests.
> 3. Make it easier to increase test coverage.  Writing tests for the 
> {{DAGScheduler}} now requires intimate knowledge of exactly how the 
> components fit together -- a lot of work goes into mimicking the appropriate 
> behavior of the other components.  Furthermore, it makes the tests harder to 
> understand for the un-initiated -- which parts are simulating some condition 
> of an external system (eg., losing an executor), and which parts are just 
> interaction with other parts of the scheduler (eg., task resubmission)?  
> These tests will allow us to work just at the level of the interaction with the 
> executors -- tasks complete, tasks fail, executors are lost, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10170) Add DB2 JDBC dialect support

2015-08-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10170.
-
   Resolution: Fixed
 Assignee: Suresh Thalamati
Fix Version/s: 1.6.0

> Add DB2 JDBC dialect support
> 
>
> Key: SPARK-10170
> URL: https://issues.apache.org/jira/browse/SPARK-10170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Suresh Thalamati
>Assignee: Suresh Thalamati
> Fix For: 1.6.0
>
>
> Repro :
> -- start spark shell with classpath set to the db2 jdbc driver. 
> SPARK_CLASSPATH=~/myjars/db2jcc.jar ./spark-shell
>  
> // set connection properties 
> val properties = new java.util.Properties()
> properties.setProperty("user" , "user")
> properties.setProperty("password" , "password")
> // load the driver.
> Class.forName("com.ibm.db2.jcc.DB2Driver").newInstance
> // create data frame with a String type
> val empdf = sc.parallelize( Array((1,"John"), (2,"Mike"))).toDF("id", "name" )
> // write the data frame.  this will fail with error.  
> empdf.write.jdbc("jdbc:db2://bdvs150.svl.ibm.com:6/SAMPLE:retrieveMessagesFromServerOnGetMessage=true;",
>  "emp_data", properties)
> Error :
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: TEXT
>   at com.ibm.db2.jcc.am.fd.a(fd.java:679)
>   at com.ibm.db2.jcc.am.fd.a(fd.java:60)
> ..
> // create data frame with String , and Boolean types 
> val empdf = sc.parallelize( Array((1,"true".toBoolean ), (2, 
> "false".toBoolean ))).toDF("id", "isManager")
> // write the data frame.  this will fail with error.  
> empdf.write.jdbc("jdbc:db2://: 
> /SAMPLE:retrieveMessagesFromServerOnGetMessage=true;", "emp_data", properties)
> Error :
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: TEXT
>   at com.ibm.db2.jcc.am.fd.a(fd.java:679)
>   at com.ibm.db2.jcc.am.fd.a(fd.java:60)
> The write is failing because, by default, the JDBC data source implementation 
> generates a table schema with data types DB2 does not support: TEXT for String 
> and BIT1(1) for Boolean. I think the String type should be mapped to 
> CLOB/VARCHAR, and the Boolean type should be mapped to CHAR(1) for DB2.  
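
For readers who hit this before picking up the fix, a rough sketch of the kind of workaround this issue formalizes, assuming the JdbcDialect developer API available since 1.4 (the object name is made up, and the merged patch's exact mappings may differ): register a custom dialect that overrides the column-type mapping for DB2 URLs.
{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Hedged sketch of a DB2 dialect; not the exact code merged for this issue.
object DB2DialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  // Replace the unsupported defaults (TEXT, BIT(1)) with DB2-friendly types.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("CLOB", Types.CLOB))
    case BooleanType => Some(JdbcType("CHAR(1)", Types.CHAR))
    case _           => None
  }
}

// Register before calling DataFrameWriter.jdbc:
JdbcDialects.registerDialect(DB2DialectSketch)
{code}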



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10170) Add DB2 JDBC dialect support

2015-08-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10170:

Summary: Add DB2 JDBC dialect support  (was: Writing from data frame into 
db2 database using jdbc data source api fails with error for string, and 
boolean column types.)

> Add DB2 JDBC dialect support
> 
>
> Key: SPARK-10170
> URL: https://issues.apache.org/jira/browse/SPARK-10170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Suresh Thalamati
> Fix For: 1.6.0
>
>
> Repro :
> -- start spark shell with classpath set to the db2 jdbc driver. 
> SPARK_CLASSPATH=~/myjars/db2jcc.jar ./spark-shell
>  
> // set connection properties 
> val properties = new java.util.Properties()
> properties.setProperty("user" , "user")
> properties.setProperty("password" , "password")
> // load the driver.
> Class.forName("com.ibm.db2.jcc.DB2Driver").newInstance
> // create data frame with a String type
> val empdf = sc.parallelize( Array((1,"John"), (2,"Mike"))).toDF("id", "name" )
> // write the data frame.  this will fail with error.  
> empdf.write.jdbc("jdbc:db2://bdvs150.svl.ibm.com:6/SAMPLE:retrieveMessagesFromServerOnGetMessage=true;",
>  "emp_data", properties)
> Error :
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: TEXT
>   at com.ibm.db2.jcc.am.fd.a(fd.java:679)
>   at com.ibm.db2.jcc.am.fd.a(fd.java:60)
> ..
> // create data frame with String , and Boolean types 
> val empdf = sc.parallelize( Array((1,"true".toBoolean ), (2, 
> "false".toBoolean ))).toDF("id", "isManager")
> // write the data frame.  this will fail with error.  
> empdf.write.jdbc("jdbc:db2://: 
> /SAMPLE:retrieveMessagesFromServerOnGetMessage=true;", "emp_data", properties)
> Error :
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: TEXT
>   at com.ibm.db2.jcc.am.fd.a(fd.java:679)
>   at com.ibm.db2.jcc.am.fd.a(fd.java:60)
> The write is failing because, by default, the JDBC data source implementation 
> generates a table schema with data types DB2 does not support: TEXT for String 
> and BIT1(1) for Boolean. I think the String type should be mapped to 
> CLOB/VARCHAR, and the Boolean type should be mapped to CHAR(1) for DB2.  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10377) Backported refactor cause

2015-08-31 Thread Frode Sormo (JIRA)
Frode Sormo created SPARK-10377:
---

 Summary: Backported refactor cause
 Key: SPARK-10377
 URL: https://issues.apache.org/jira/browse/SPARK-10377
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.2
Reporter: Frode Sormo


The backport in SPARK-7289 and SPARK-9949 includes the refactor of TakeOrdered 
to TakeOrderedAndProject, which breaks code that refers to TakeOrdered. That is 
perhaps not expected in a minor version update; specifically, the Cassandra 
connector refers to this by name and no longer works.

An example use case is to use the Cassandra connector in Scala and create a 
CassandraSQLContext:
import com.datastax.spark.connector._
import sqlContext.implicits._
import org.apache.spark.sql.cassandra.CassandraSQLContext

val cassandraSQLContext = new CassandraSQLContext(sc);

java.lang.NoSuchMethodError: 
org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.TakeOrdered()Lorg/apache/spark/sql/execution/SparkStrategies$TakeOrdered$;
at 
org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.<init>(CassandraSQLContext.scala:90)
at 
org.apache.spark.sql.cassandra.CassandraSQLContext.<init>(CassandraSQLContext.scala:85)

(Source code link: 
https://github.com/datastax/spark-cassandra-connector/blob/v1.4.0-M3/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala)
 
This is with version 1.4.0M3 of the Datastax Cassandra connector, but affects 
other 1.4 versions as well.

Issue has also been reported to Datastax, here: 
https://datastax-oss.atlassian.net/browse/SPARKC-238




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10377) Cassandra connector affected by backport change

2015-08-31 Thread Frode Sormo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frode Sormo updated SPARK-10377:

Summary: Cassandra connector affected by backport change  (was: Backported 
refactor cause)

> Cassandra connector affected by backport change
> ---
>
> Key: SPARK-10377
> URL: https://issues.apache.org/jira/browse/SPARK-10377
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.2
>Reporter: Frode Sormo
>
> The backport in SPARK-7289 and SPARK-9949 includes the refactor of 
> TakeOrdered to TakeOrderedAndProject, which breaks code that refers to 
> TakeOrdered. That is perhaps not expected in a minor version update; 
> specifically, the Cassandra connector refers to this by name and no longer 
> works.
> An example use case is to use the Cassandra connector in Scala and create a 
> CassandraSQLContext:
> import com.datastax.spark.connector._
> import sqlContext.implicits._
> import org.apache.spark.sql.cassandra.CassandraSQLContext
> val cassandraSQLContext = new CassandraSQLContext(sc);
> java.lang.NoSuchMethodError: 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.TakeOrdered()Lorg/apache/spark/sql/execution/SparkStrategies$TakeOrdered$;
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext$$anon$1.<init>(CassandraSQLContext.scala:90)
>   at 
> org.apache.spark.sql.cassandra.CassandraSQLContext.<init>(CassandraSQLContext.scala:85)
> (Source code link: 
> https://github.com/datastax/spark-cassandra-connector/blob/v1.4.0-M3/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/CassandraSQLContext.scala)
>  
> This is with version 1.4.0M3 of the Datastax Cassandra connector, but affects 
> other 1.4 versions as well.
> Issue has also been reported to Datastax, here: 
> https://datastax-oss.atlassian.net/browse/SPARKC-238



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6961) Cannot save data to parquet files when executing from Windows from a Maven Project

2015-08-31 Thread matt hoffman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723933#comment-14723933
 ] 

matt hoffman commented on SPARK-6961:
-

How is that an installation-side issue? There is no requirement that users 
install any Hadoop utilities along with Spark, correct?  WINUTILS.EXE is 
something that is bundled by Hadoop (although it isn't packaged in Hadoop by 
default, according to http://www.barik.net/archive/2015/01/19/172716/). 
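
For anyone blocked by this NPE, a commonly cited workaround (a hedged sketch, not an official fix, and the directory path is a placeholder) is to point hadoop.home.dir at a folder whose bin subdirectory contains winutils.exe before the SparkContext is created; the NPE in Shell.runCommand is Hadoop failing to locate that executable when setting file permissions.
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder path: expects C:\hadoop\bin\winutils.exe to exist on this machine.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val conf = new SparkConf().setMaster("local[4]").setAppName("DataFrameTest")
val sc = new SparkContext(conf)
{code}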


> Cannot save data to parquet files when executing from Windows from a Maven 
> Project
> --
>
> Key: SPARK-6961
> URL: https://issues.apache.org/jira/browse/SPARK-6961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0
>Reporter: Bogdan Niculescu
>Priority: Critical
>
> I have set up a project where I am trying to save a DataFrame into a parquet 
> file. My project is a Maven one with Spark 1.3.0 and Scala 2.11.5 :
> {code:xml}
> <properties>
>   <spark.version>1.3.0</spark.version>
> </properties>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.11</artifactId>
>     <version>${spark.version}</version>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.11</artifactId>
>     <version>${spark.version}</version>
>   </dependency>
> </dependencies>
> {code}
> A simple version of my code that consistently reproduces the problem that I 
> am seeing is:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> case class Person(name: String, age: Int)
> object DataFrameTest extends App {
>   val conf = new SparkConf().setMaster("local[4]").setAppName("DataFrameTest")
>   val sc = new SparkContext(conf)
>   val sqlContext = new SQLContext(sc)
>   val persons = List(Person("a", 1), Person("b", 2))
>   val rdd = sc.parallelize(persons)
>   val dataFrame = sqlContext.createDataFrame(rdd)
>   dataFrame.saveAsParquetFile("test.parquet")
> }
> {code}
> All the time the exception that I am getting is :
> {code}
> Exception in thread "main" java.lang.NullPointerException
>   at java.lang.ProcessBuilder.start(ProcessBuilder.java:1010)
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
>   at org.apache.hadoop.util.Shell.run(Shell.java:379)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
>   at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
>   at 
> org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
>   at 
> org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
>   at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:886)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:783)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:772)
>   at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:409)
>   at 
> parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:401)
>   at 
> org.apache.spark.sql.parquet.ParquetTypesConverter$.writeMetaData(ParquetTypes.scala:443)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.prepareMetadata(newParquet.scala:240)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:256)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$6.apply(newParquet.scala:251)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:251)
>   at 
> org.apache.spark.sql.parquet.ParquetRelation2.<init>(newParquet.scala:370)
>   at 
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:96)
>   at 
> org.apache.spark.sql.parquet.DefaultSource.createRelation(newParquet.scala:125)
>   at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:308)
>   at org.apache.spark.sql.DataFrame.save(DataFrame.scala:1123)
>   at org.apache.spark.sql.DataFrame.saveAsParquetFile(DataFrame.scala:922)
>   at 
> sparkTest.DataFrameTest$.delayedEndpoint$sparkTest$DataFrameTest$1(DataFrameTest.scala:17)
>   at 

[jira] [Assigned] (SPARK-10291) Add statsByKey method to compute StatCounters for each key in an RDD

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10291:


Assignee: (was: Apache Spark)

> Add statsByKey method to compute StatCounters for each key in an RDD
> 
>
> Key: SPARK-10291
> URL: https://issues.apache.org/jira/browse/SPARK-10291
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Erik Shilts
>Priority: Minor
>
> A common task is to summarize numerical data for different groups. Having a 
> `statsByKey` method would simplify this so the user would not have to write 
> the aggregators for all the statistics or manage collecting by key and 
> computing individual StatCounters.
> This should be a straightforward addition to PySpark. I can look into adding 
> to Scala and R if we want to maintain feature parity.
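
As a rough illustration of what such a method would wrap (a sketch only; the name statsByKey and the helper below are not existing API), the same result can be computed today by aggregating a StatCounter per key:
{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.util.StatCounter

// Hypothetical helper, not part of Spark: one StatCounter per key.
def statsByKey[K: ClassTag](rdd: RDD[(K, Double)]): Map[K, StatCounter] =
  rdd.aggregateByKey(new StatCounter())(
      (stats, value) => stats.merge(value),  // fold each value into the per-key stats
      (s1, s2) => s1.merge(s2))              // combine partial stats across partitions
    .collectAsMap()
    .toMap
{code}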



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10291) Add statsByKey method to compute StatCounters for each key in an RDD

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723944#comment-14723944
 ] 

Apache Spark commented on SPARK-10291:
--

User 'eshilts' has created a pull request for this issue:
https://github.com/apache/spark/pull/8539

> Add statsByKey method to compute StatCounters for each key in an RDD
> 
>
> Key: SPARK-10291
> URL: https://issues.apache.org/jira/browse/SPARK-10291
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Erik Shilts
>Priority: Minor
>
> A common task is to summarize numerical data for different groups. Having a 
> `statsByKey` method would simplify this so the user would not have to write 
> the aggregators for all the statistics or manage collecting by key and 
> computing individual StatCounters.
> This should be a straightforward addition to PySpark. I can look into adding 
> to Scala and R if we want to maintain feature parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10291) Add statsByKey method to compute StatCounters for each key in an RDD

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10291:


Assignee: Apache Spark

> Add statsByKey method to compute StatCounters for each key in an RDD
> 
>
> Key: SPARK-10291
> URL: https://issues.apache.org/jira/browse/SPARK-10291
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Erik Shilts
>Assignee: Apache Spark
>Priority: Minor
>
> A common task is to summarize numerical data for different groups. Having a 
> `statsByKey` method would simplify this so the user would not have to write 
> the aggregators for all the statistics or manage collecting by key and 
> computing individual StatCounters.
> This should be a straightforward addition to PySpark. I can look into adding 
> to Scala and R if we want to maintain feature parity.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is valid

2015-08-31 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14723994#comment-14723994
 ] 

Zhan Zhang commented on SPARK-10304:


[~yhuai] I think the NPE is caused by the directory having multiple tables inside.
[~lian cheng] Does the design allow different partition specs? For example, one 
part of the table is partitioned by key1 and key2, the second part is partitioned 
by key1 only, and the third part is not partitioned. If so, then it is hard to 
decide whether the directory is valid or not. Otherwise, we can simply check the 
consistency of the partition specs and throw an exception if they are not consistent.
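
A hedged sketch of that kind of check (illustrative only, not Spark's partition-discovery code; the function name is made up): once the partition-column names have been parsed from each leaf directory, conflicting layouts can be rejected up front.
{code}
// Given the partition-column names parsed from each leaf directory,
// fail fast when the layouts disagree instead of silently building a
// partial or bogus schema.
def assertConsistentPartitionColumns(specs: Seq[Seq[String]]): Unit = {
  val layouts = specs.map(_.mkString("/")).distinct
  require(layouts.size <= 1,
    s"Conflicting partition column layouts: ${layouts.mkString(", ")}")
}

// e.g. /path/key1=1/key2=1 vs. /path/key1=1 would be rejected:
// assertConsistentPartitionColumns(Seq(Seq("key1", "key2"), Seq("key1")))
{code}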

> Partition discovery does not throw an exception if the dir structure is valid
> -
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> it can even return rows. We should complain to users about the dir struct 
> because {{table1}} does not meet our format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10373) Move @since annotator to pyspark to be shared by all components

2015-08-31 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724009#comment-14724009
 ] 

Davies Liu commented on SPARK-10373:


[~mengxr] Do we want to add @since for the MLLib APIs in  1.5 docs ?

> Move @since annotator to pyspark to be shared by all components
> ---
>
> Key: SPARK-10373
> URL: https://issues.apache.org/jira/browse/SPARK-10373
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Davies Liu
>
> Python's `@since` is defined under `pyspark.sql`. It would be nice to move it 
> under `pyspark` to be shared by all components.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Sudarshan Kadambi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724038#comment-14724038
 ] 

Sudarshan Kadambi commented on SPARK-10320:
---

Good questions Cody. 

When adding a topic after the streaming context has started, we should at a 
minimum be able to start consumption from the beginning or end of each topic 
partition. When a 
topic is removed from subscription, no offsets should be retained. When it is 
added later, there is no difference from a brand new topic and the same options 
(beginning, end or specific offset) are available.

When the driver restarts, for all existing topics, the consumption should 
restart from the saved offsets by default, but jobs should have the flexibility 
to choose different consumption points (start, end, specific offset).  If you 
restart the job and specify a new offset, that is where consumption should 
start, in effect overriding any saved offsets.

Topics can be repartitioned in Kafka today. So we need to handle partition 
count increase or decrease even in the absence of dynamic topic registration in 
Spark Streaming. How is this handled? I expect the same solution to carry over.

The topic changes happen in the same thread of execution where the initial list 
of topics was provided before starting the streaming context. I'm not sure of 
the implications of doing it in the onBatchCompleted handler.

> Support new topic subscriptions without requiring restart of the streaming 
> context
> --
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
> to current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics, before streaming updates can be applied to 
> it, it is more convenient to only subscribe or unsubscribe to the incremental 
> changes to the topic list. Without such a feature, updates go unprocessed for 
> longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9957) Spark ML trees should filter out 1-category features

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9957:
---

Assignee: Apache Spark

> Spark ML trees should filter out 1-category features
> 
>
> Key: SPARK-9957
> URL: https://issues.apache.org/jira/browse/SPARK-9957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Spark ML trees should filter out 1-category categorical features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9957) Spark ML trees should filter out 1-category features

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9957:
---

Assignee: (was: Apache Spark)

> Spark ML trees should filter out 1-category features
> 
>
> Key: SPARK-9957
> URL: https://issues.apache.org/jira/browse/SPARK-9957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Spark ML trees should filter out 1-category categorical features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9957) Spark ML trees should filter out 1-category features

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724067#comment-14724067
 ] 

Apache Spark commented on SPARK-9957:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8540

> Spark ML trees should filter out 1-category features
> 
>
> Key: SPARK-9957
> URL: https://issues.apache.org/jira/browse/SPARK-9957
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Spark ML trees should filter out 1-category categorical features.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724074#comment-14724074
 ] 

Cody Koeninger commented on SPARK-10320:


" If you restart the job and specify a new offset, that is where consumption 
should start, in effect overriding any saved offsets."

That's not the way checkpoints work.  You're either restarting from a 
checkpoint, or you're not restarting from a checkpoint; the decision is up to 
you.  If you want to specify a new offset, start the job clean.

"The topic changes happen in the same thread of execution where the initial 
list of topics was provided before starting the streaming context."

Can you say a little more about what you're actually doing here?  How do you 
know when topics need to be modified?  Typically streaming jobs just call 
ssc.awaitTermination in their main thread, which seems incompatible with what 
you're describing.

> Support new topic subscriptions without requiring restart of the streaming 
> context
> --
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to newer topics or unsubscribe 
> to current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics, before streaming updates can be applied to 
> it, it is more convenient to only subscribe or unsubscribe to the incremental 
> changes to the topic list. Without such a feature, updates go unprocessed for 
> longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-10320:
---
Summary: Kafka Support new topic subscriptions without requiring restart of 
the streaming context  (was: Support new topic subscriptions without requiring 
restart of the streaming context)

> Kafka Support new topic subscriptions without requiring restart of the 
> streaming context
> 
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to new topics or unsubscribe 
> from current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics before streaming updates can be applied to 
> them, it is more convenient to subscribe or unsubscribe only to the 
> incremental changes to the topic list. Without such a feature, updates go 
> unprocessed for longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9962) Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9962:
---

Assignee: Apache Spark

> Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
> --
>
> Key: SPARK-9962
> URL: https://issues.apache.org/jira/browse/SPARK-9962
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of 
> training.
> This applies to both the ML and MLlib implementations, but it is Ok to skip 
> the MLlib implementation since it will eventually be replaced by the ML one.
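
A rough sketch of the cleanup being asked for, using the field name from the issue
title; this is an illustration only, not the actual NodeIdCache code:

{code}
import org.apache.spark.rdd.RDD

def releaseNodeIdCache(prevNodeIdsForInstances: RDD[Array[Int]]): Unit = {
  if (prevNodeIdsForInstances != null) {
    // Non-blocking unpersist so cleanup does not stall the end of training.
    prevNodeIdsForInstances.unpersist(blocking = false)
  }
}
{code}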



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9962) Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9962:
---

Assignee: (was: Apache Spark)

> Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
> --
>
> Key: SPARK-9962
> URL: https://issues.apache.org/jira/browse/SPARK-9962
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of 
> training.
> This applies to both the ML and MLlib implementations, but it is Ok to skip 
> the MLlib implementation since it will eventually be replaced by the ML one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9962) Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724096#comment-14724096
 ] 

Apache Spark commented on SPARK-9962:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/8541

> Decision Tree training: prevNodeIdsForInstances.unpersist() at end of training
> --
>
> Key: SPARK-9962
> URL: https://issues.apache.org/jira/browse/SPARK-9962
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at end of 
> training.
> This applies to both the ML and MLlib implementations, but it is Ok to skip 
> the MLlib implementation since it will eventually be replaced by the ML one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10320) Kafka Support new topic subscriptions without requiring restart of the streaming context

2015-08-31 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724113#comment-14724113
 ] 

Cody Koeninger commented on SPARK-10320:


It's possible this might be solvable with a user-supplied callback of the form

(Map[TopicAndPartition, LeaderOffset]) => Map[TopicAndPartition, LeaderOffset]

or maybe

(Time, Map[TopicAndPartition, LeaderOffset]) => Map[TopicAndPartition, LeaderOffset]

that would get called in the compute method of DirectKafkaInputDStream. That would 
avoid threading issues, and would also allow more-or-less arbitrary modification of 
the topics and offsets, including some use cases that currently require subclassing.

Actually, that only handles the ending offsets of the batch, so it would need to be 
a pair of maps, one for the beginning and one for the ending. But the basic idea 
remains.
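
A rough Scala sketch of that callback shape; the case classes below merely stand in
for the real Kafka and Spark types, and none of the names are existing APIs:

{code}
object TopicRefreshSketch {
  // Stand-ins for kafka.common.TopicAndPartition and the internal LeaderOffset.
  case class TopicAndPartition(topic: String, partition: Int)
  case class LeaderOffset(host: String, port: Int, offset: Long)

  type OffsetMap = Map[TopicAndPartition, LeaderOffset]

  // Proposed user-supplied callback: given the batch's beginning and ending
  // offsets, return possibly modified maps (topics added, dropped, or re-seeked).
  type RefreshCallback = (OffsetMap, OffsetMap) => (OffsetMap, OffsetMap)

  // How compute() could consult the callback before building the batch's RDD.
  // Running it on the driver's job-generation path avoids any cross-thread
  // mutation of the topic list.
  def offsetsForNextBatch(
      fromOffsets: OffsetMap,
      untilOffsets: OffsetMap,
      refresh: RefreshCallback): (OffsetMap, OffsetMap) =
    refresh(fromOffsets, untilOffsets)
}
{code}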



> Kafka Support new topic subscriptions without requiring restart of the 
> streaming context
> 
>
> Key: SPARK-10320
> URL: https://issues.apache.org/jira/browse/SPARK-10320
> Project: Spark
>  Issue Type: New Feature
>  Components: Streaming
>Reporter: Sudarshan Kadambi
>
> Spark Streaming lacks the ability to subscribe to new topics or unsubscribe 
> from current ones once the streaming context has been started. Restarting the 
> streaming context increases the latency of update handling.
> Consider a streaming application subscribed to n topics. Let's say 1 of the 
> topics is no longer needed in streaming analytics and hence should be 
> dropped. We could do this by stopping the streaming context, removing that 
> topic from the topic list and restarting the streaming context. Since with 
> some DStreams such as DirectKafkaStream, the per-partition offsets are 
> maintained by Spark, we should be able to resume uninterrupted (I think?) 
> from where we left off with a minor delay. However, in instances where 
> expensive state initialization (from an external datastore) may be needed for 
> datasets published to all topics before streaming updates can be applied to 
> them, it is more convenient to subscribe or unsubscribe only to the 
> incremental changes to the topic list. Without such a feature, updates go 
> unprocessed for longer than they need to be, thus affecting QoS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-08-31 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724122#comment-14724122
 ] 

Zhan Zhang commented on SPARK-10304:


[~lian cheng] Forget about my question. From the code, it is not allowed. It 
happens here because the directory contains nonPartitionedTableX, and the 
validation check does not handle this case.

> Partition discovery does not throw an exception if the dir structure is invalid
> -
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But, if it is Parquet, 
> we even can return rows. We should complain to users about the dir struct 
> because {{table1}} does not meet our format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10199) Avoid using reflections for parquet model save

2015-08-31 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724138#comment-14724138
 ] 

Feynman Liang commented on SPARK-10199:
---

CC [~mengxr] [~josephkb]

> Avoid using reflections for parquet model save
> --
>
> Key: SPARK-10199
> URL: https://issues.apache.org/jira/browse/SPARK-10199
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Feynman Liang
>Priority: Minor
>
> These items are not high priority since the overhead of writing to Parquet is 
> much greater than that of runtime reflection.
> Multiple model save/load implementations in MLlib use case classes to infer a 
> schema for the data frame saved to Parquet. However, inferring a schema from 
> case classes or tuples uses [runtime 
> reflection|https://github.com/apache/spark/blob/d7b4c095271c36fcc7f9ded267ecf5ec66fac803/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L361]
>  which is unnecessary since the types are already known at the time `save` is 
> called.
> It would be better to specify the schema for the data frame directly using 
> {{sqlContext.createDataFrame(dataRDD, schema)}}.
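
For illustration, a minimal sketch of supplying the schema directly instead of
relying on case-class reflection; the column names and types are hypothetical, and
{{sc}}/{{sqlContext}} are assumed to exist as in a 1.x shell:

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}

// Hypothetical column layout for a saved model's data.
val schema = StructType(Seq(
  StructField("feature", IntegerType, nullable = false),
  StructField("weight", DoubleType, nullable = false)))

val dataRDD = sc.parallelize(Seq(Row(0, 1.5), Row(1, -0.25)))

// The schema is supplied directly, so no runtime reflection over case classes
// or tuples is needed.
val df = sqlContext.createDataFrame(dataRDD, schema)
df.write.parquet("/tmp/model-data")
{code}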



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10378) Remove HashJoinCompatibilitySuite

2015-08-31 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-10378:
---

 Summary: Remove HashJoinCompatibilitySuite
 Key: SPARK-10378
 URL: https://issues.apache.org/jira/browse/SPARK-10378
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


This will help reduce the test time.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10378) Remove HashJoinCompatibilitySuite

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10378:


Assignee: Apache Spark  (was: Reynold Xin)

> Remove HashJoinCompatibilitySuite
> -
>
> Key: SPARK-10378
> URL: https://issues.apache.org/jira/browse/SPARK-10378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> This will help reduce the test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10378) Remove HashJoinCompatibilitySuite

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10378:


Assignee: Reynold Xin  (was: Apache Spark)

> Remove HashJoinCompatibilitySuite
> -
>
> Key: SPARK-10378
> URL: https://issues.apache.org/jira/browse/SPARK-10378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This will help reduce the test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10378) Remove HashJoinCompatibilitySuite

2015-08-31 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14724159#comment-14724159
 ] 

Apache Spark commented on SPARK-10378:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/8542

> Remove HashJoinCompatibilitySuite
> -
>
> Key: SPARK-10378
> URL: https://issues.apache.org/jira/browse/SPARK-10378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This will help reduce the test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9954) Vector.hashCode should use nonzeros

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-9954.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8182
[https://github.com/apache/spark/pull/8182]

> Vector.hashCode should use nonzeros
> ---
>
> Key: SPARK-9954
> URL: https://issues.apache.org/jira/browse/SPARK-9954
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.6.0
>
>
> Using only zeros is likely to cause hash collisions.
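
An illustrative sketch (not the actual patch) of hashing only the nonzero
(index, value) pairs, so that sparse vectors sharing the same leading zeros do not
all collide:

{code}
import scala.util.hashing.MurmurHash3

def nonzeroHash(indices: Array[Int], values: Array[Double]): Int = {
  var h = MurmurHash3.arraySeed
  var i = 0
  while (i < indices.length) {
    if (values(i) != 0.0) {
      // Mix in both the position and the value of each nonzero entry.
      h = MurmurHash3.mix(h, indices(i))
      h = MurmurHash3.mix(h, values(i).hashCode())
    }
    i += 1
  }
  MurmurHash3.finalizeHash(h, indices.length)
}
{code}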



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10378) Remove HashJoinCompatibilitySuite

2015-08-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10378:

Description: 
They don't bring much value since we now have better unit test coverage for 
hash joins. This will also help reduce the test time.


  was:
This will help reduce the test time.



> Remove HashJoinCompatibilitySuite
> -
>
> Key: SPARK-10378
> URL: https://issues.apache.org/jira/browse/SPARK-10378
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> They don't bring much value since we now have better unit test coverage for 
> hash joins. This will also help reduce the test time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8472) Python API for DCT

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8472:
-
Assignee: Yanbo Liang

> Python API for DCT
> --
>
> Key: SPARK-8472
> URL: https://issues.apache.org/jira/browse/SPARK-8472
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> We need to implement a wrapper for enabling the DCT feature transformer to be 
> used from the Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8472) Python API for DCT

2015-08-31 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-8472.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8485
[https://github.com/apache/spark/pull/8485]

> Python API for DCT
> --
>
> Key: SPARK-8472
> URL: https://issues.apache.org/jira/browse/SPARK-8472
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> We need to implement a wrapper for enabling the DCT feature transformer to be 
> used from the Python API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3976) Detect block matrix partitioning schemes

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3976:
---

Assignee: (was: Apache Spark)

> Detect block matrix partitioning schemes
> 
>
> Key: SPARK-3976
> URL: https://issues.apache.org/jira/browse/SPARK-3976
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>
> Provide repartitioning methods for block matrices to repartition matrix for 
> add/multiply of non-identically partitioned matrices



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10341) SMJ fail with unable to acquire memory

2015-08-31 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-10341.
-
   Resolution: Fixed
Fix Version/s: 1.5.0

> SMJ fail with unable to acquire memory
> --
>
> Key: SPARK-10341
> URL: https://issues.apache.org/jira/browse/SPARK-10341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.5.0
>
>
> In SMJ, the first ExternalSorter could consume all the memory before 
> spilling, so the second cannot even acquire its first page.
> {code}
> java.io.IOException: Unable to acquire 16777216 bytes of memory
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:368)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.(UnsafeExternalSorter.java:138)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:106)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:68)
>   at 
> org.apache.spark.sql.execution.TungstenSort.org$apache$spark$sql$execution$TungstenSort$$preparePartition$1(sort.scala:146)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.sql.execution.TungstenSort$$anonfun$doExecute$3.apply(sort.scala:169)
>   at 
> org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:45)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3976) Detect block matrix partitioning schemes

2015-08-31 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3976:
---

Assignee: Apache Spark

> Detect block matrix partitioning schemes
> 
>
> Key: SPARK-3976
> URL: https://issues.apache.org/jira/browse/SPARK-3976
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Reza Zadeh
>Assignee: Apache Spark
>
> Provide repartitioning methods for block matrices to repartition matrix for 
> add/multiply of non-identically partitioned matrices



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


