[jira] [Updated] (SPARK-5180) Data source API improvement

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5180:

Target Version/s: 1.4.0  (was: 1.3.0)

 Data source API improvement
 ---

 Key: SPARK-5180
 URL: https://issues.apache.org/jira/browse/SPARK-5180
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Priority: Blocker








[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4768:

Priority: Blocker  (was: Critical)

 Add Support For Impala Encoded Timestamp (INT96)
 

 Key: SPARK-4768
 URL: https://issues.apache.org/jira/browse/SPARK-4768
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Pat McDonough
Priority: Blocker
 Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, 
 string_timestamp.gz


 Impala uses INT96 for timestamps. Spark SQL should be able to read this data 
 even though INT96 timestamps are not part of the Parquet spec.
 Perhaps adding a flag to act like Impala when reading Parquet (like we do for 
 strings already) would be useful; a sketch of such a flag follows the stack 
 trace below.
 Here's an example of the error you might see:
 {code}
 Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
 convert INT96
 at scala.sys.package$.error(package.scala:27)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
 at 
 org.apache.spark.sql.parquet.ParquetRelation.&lt;init&gt;(ParquetRelation.scala:66)
 at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
 {code}
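 A sketch of what such a flag could look like, by analogy with the existing 
 spark.sql.parquet.binaryAsString option; the name int96AsTimestamp below is 
 illustrative only, not a setting that exists yet:
 {code}
 import org.apache.spark.sql.SQLContext
 
 val sqlContext = new SQLContext(sc)
 // Existing compatibility knob for Impala/Hive strings:
 sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
 // Hypothetical analogous knob for Impala INT96 timestamps:
 sqlContext.setConf("spark.sql.parquet.int96AsTimestamp", "true")
 val data = sqlContext.parquetFile("hdfs://path/to/impala_table")
 {code}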






[jira] [Updated] (SPARK-4768) Add Support For Impala Encoded Timestamp (INT96)

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4768:

Assignee: Yin Huai

 Add Support For Impala Encoded Timestamp (INT96)
 

 Key: SPARK-4768
 URL: https://issues.apache.org/jira/browse/SPARK-4768
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Pat McDonough
Assignee: Yin Huai
Priority: Blocker
 Attachments: 5e4481a02f951e29-651ee94ed14560bf_922627129_data.0.parq, 
 string_timestamp.gz


 Impala uses INT96 for timestamps. Spark SQL should be able to read this data 
 even though INT96 timestamps are not part of the Parquet spec.
 Perhaps adding a flag to act like Impala when reading Parquet (like we do for 
 strings already) would be useful.
 Here's an example of the error you might see:
 {code}
 Caused by: java.lang.RuntimeException: Potential loss of precision: cannot 
 convert INT96
 at scala.sys.package$.error(package.scala:27)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.toPrimitiveDataType(ParquetTypes.scala:61)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.toDataType(ParquetTypes.scala:113)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:314)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$convertToAttributes$1.apply(ParquetTypes.scala:311)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at 
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.convertToAttributes(ParquetTypes.scala:310)
 at 
 org.apache.spark.sql.parquet.ParquetTypesConverter$.readSchemaFromFile(ParquetTypes.scala:441)
 at 
 org.apache.spark.sql.parquet.ParquetRelation.&lt;init&gt;(ParquetRelation.scala:66)
 at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:141)
 {code}






[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3851:

Priority: Blocker  (was: Critical)

 Support for reading parquet files with different but compatible schema
 --

 Key: SPARK-3851
 URL: https://issues.apache.org/jira/browse/SPARK-3851
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

 Right now it is required that all of the Parquet files have the same schema.  
 It would be nice to support some safe subset of cases where the schemas of the 
 files differ.  For example (see the sketch after this list):
  - Adding and removing nullable columns.
  - Widening types (e.g. a column that is Int in some files and Long in others).
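 The sketch below illustrates the two cases; it shows the desired behavior, not 
 an existing feature, and the paths and case classes are made up for the 
 example:
 {code}
 import org.apache.spark.sql.SQLContext
 val sqlContext = new SQLContext(sc)
 import sqlContext.createSchemaRDD
 
 case class V1(id: Int)                          // older files
 case class V2(id: Long, name: Option[String])   // newer files: widened id, added nullable column
 
 sc.parallelize(Seq(V1(1))).saveAsParquetFile("data/old")
 sc.parallelize(Seq(V2(2L, Some("a")))).saveAsParquetFile("data/new")
 
 // Desired: a single read over a location containing both kinds of files would
 // succeed and expose the merged schema (id: Long, name: nullable String);
 // today this fails because the file footers carry different schemas.
 val merged = sqlContext.parquetFile("data")
 {code}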






[jira] [Updated] (SPARK-5425) ConcurrentModificationException during SparkConf creation

2015-02-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5425:
--
Target Version/s: 1.2.2

I've merged [~jlewandowski]'s patch (https://github.com/apache/spark/pull/4222) 
to fix this in `master` (1.3.0) and `branch-1.1` (1.1.2), and I've added the 
{{backport-needed}} tag so we remember to merge it into 1.2.2.

 ConcurrentModificationException during SparkConf creation
 -

 Key: SPARK-5425
 URL: https://issues.apache.org/jira/browse/SPARK-5425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1, 1.2.0
Reporter: Jacek Lewandowski
Assignee: Jacek Lewandowski
 Fix For: 1.3.0, 1.1.2


 This fragment of code:
 {code}
   if (loadDefaults) {
     // Load any spark.* system properties
     for ((k, v) <- System.getProperties.asScala if k.startsWith("spark.")) {
       settings(k) = v
     }
   }
 {code}
 causes 
 {noformat}
 ERROR 09:43:15  SparkMaster service caused error in state 
 STARTINGjava.util.ConcurrentModificationException: null
   at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) 
 ~[na:1.7.0_60]
   at 
 scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
  ~[scala-library-2.10.4.jar:na]
   at 
 scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
  ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
 ~[scala-library-2.10.4.jar:na]
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  ~[scala-library-2.10.4.jar:na]
   at org.apache.spark.SparkConf.&lt;init&gt;(SparkConf.scala:53) 
 ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
   at org.apache.spark.SparkConf.&lt;init&gt;(SparkConf.scala:47) 
 ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
 {noformat}
 when another thread modifies the system properties at the same time. 
 The Scala bug https://issues.scala-lang.org/browse/SI-7775 is related to this 
 issue and shows that the problem has also been seen elsewhere. 
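 One way to avoid the race, shown only as a sketch (not necessarily what the 
 merged patch does), is to take a snapshot of the system properties before 
 iterating, so a concurrent System.setProperty cannot invalidate the iterator 
 (loadDefaults and settings are the fields from the fragment above):
 {code}
 import scala.collection.JavaConverters._
 
 if (loadDefaults) {
   // Hashtable.clone() is synchronized, so the copy is taken atomically with
   // respect to concurrent setProperty calls.
   val snapshot = System.getProperties.clone().asInstanceOf[java.util.Properties]
   for ((k, v) <- snapshot.asScala if k.startsWith("spark.")) {
     settings(k) = v
   }
 }
 {code}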






[jira] [Commented] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values

2015-02-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302043#comment-14302043
 ] 

Joseph K. Bradley commented on SPARK-5534:
--

Note: This is needed for [https://github.com/apache/spark/pull/4047], which is 
a PR for this JIRA: [https://issues.apache.org/jira/browse/SPARK-1405]

 EdgeRDD, VertexRDD getStorageLevel return bad values
 

 Key: SPARK-5534
 URL: https://issues.apache.org/jira/browse/SPARK-5534
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 After caching a graph, its edge and vertex RDDs still return 
 StorageLevel.None.
 Reproduce error:
 {code}
 import org.apache.spark.graphx.{Edge, Graph}
 val edges = Seq(
   Edge[Double](0, 1, 0),
   Edge[Double](1, 2, 0),
   Edge[Double](2, 3, 0),
   Edge[Double](3, 4, 0))
 val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0)
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 g.cache()
 g.vertices.count()
 g.edges.count()
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 {code}
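 For contrast, a plain RDD does reflect caching in getStorageLevel; after a fix 
 the vertex and edge RDDs above would be expected to behave the same way 
 (sketch):
 {code}
 import org.apache.spark.storage.StorageLevel
 
 val rdd = sc.parallelize(1 to 10)
 rdd.getStorageLevel == StorageLevel.NONE         // true before caching
 rdd.cache()
 rdd.getStorageLevel == StorageLevel.MEMORY_ONLY  // true after cache() is called
 {code}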






[jira] [Assigned] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values

2015-02-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-5534:


Assignee: Joseph K. Bradley

 EdgeRDD, VertexRDD getStorageLevel return bad values
 

 Key: SPARK-5534
 URL: https://issues.apache.org/jira/browse/SPARK-5534
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 After caching a graph, its edge and vertex RDDs still return 
 StorageLevel.None.
 Reproduce error:
 {code}
 import org.apache.spark.graphx.{Edge, Graph}
 val edges = Seq(
   Edge[Double](0, 1, 0),
   Edge[Double](1, 2, 0),
   Edge[Double](2, 3, 0),
   Edge[Double](3, 4, 0))
 val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0)
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 g.cache()
 g.vertices.count()
 g.edges.count()
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 {code}






[jira] [Commented] (SPARK-5505) ConsumerRebalanceFailedException from Kafka consumer

2015-02-02 Thread Tathagata Das (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302062#comment-14302062
 ] 

Tathagata Das commented on SPARK-5505:
--

Since this is a problem with the HighLevel consumer, solving this requires 
completely rearchitecting the KafkaReceiver. This is hard to do. Possible 
workarounds:
http://mail-archives.apache.org/mod_mbox/kafka-users/201312.mbox/%3CCAFbh0Q38qQ0aAg_cj=jzk-kbi8xwf+1m6xlj+fzf6eetj9z...@mail.gmail.com%3E
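A commonly suggested mitigation (a sketch, not a guaranteed fix) is to give the 
high-level consumer more room to complete a rebalance by raising its retry and 
backoff settings, which can be passed through kafkaParams. The values are 
illustrative and ssc is assumed to be an existing StreamingContext:
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map(
  "zookeeper.connect"     -> "zkhost:2181",
  "group.id"              -> "mygroup",
  // Give the high-level consumer more rebalance attempts and a longer backoff.
  "rebalance.max.retries" -> "10",
  "rebalance.backoff.ms"  -> "5000")

val stream = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Map("mytopic" -> 1), StorageLevel.MEMORY_AND_DISK_SER_2)
{code}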

 ConsumerRebalanceFailedException from Kafka consumer
 

 Key: SPARK-5505
 URL: https://issues.apache.org/jira/browse/SPARK-5505
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
 Environment: CentOS6 / Linux 2.6.32-358.2.1.el6.x86_64
 java version 1.7.0_21
 Scala compiler version 2.9.3
 2 cores Intel(R) Xeon(R) CPU E5620  @ 2.40GHz / 16G RAM
 VMWare VM.
Reporter: Greg Temchenko
Priority: Critical

 From time to time Spark Streaming produces a ConsumerRebalanceFailedException 
 and stops receiving messages. After that, all subsequent RDDs are empty.
 {code}
 15/01/30 18:18:36 ERROR consumer.ZookeeperConsumerConnector: 
 [terran_vmname-1422670149779-243b4e10], error during syncedRebalance
 kafka.common.ConsumerRebalanceFailedException: 
 terran_vmname-1422670149779-243b4e10 can't rebalance after 4 retries
   at 
 kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:432)
   at 
 kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener$$anon$1.run(ZookeeperConsumerConnector.scala:355)
 {code}
 The problem is also described in the mailing list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Error-when-Spark-streaming-consumes-from-Kafka-td19570.html
 As I understand it, this is a critical blocker for using Kafka with Spark 
 Streaming in production.






[jira] [Updated] (SPARK-5514) collect should call executeCollect

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5514:

Assignee: Reynold Xin

 collect should call executeCollect
 --

 Key: SPARK-5514
 URL: https://issues.apache.org/jira/browse/SPARK-5514
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker








[jira] [Resolved] (SPARK-5491) Chi-square feature selection

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5491.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 1484
[https://github.com/apache/spark/pull/1484]

 Chi-square feature selection
 

 Key: SPARK-5491
 URL: https://issues.apache.org/jira/browse/SPARK-5491
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Alexander Ulanov
 Fix For: 1.3.0


 Implement chi-square feature selection. PR: 
 https://github.com/apache/spark/pull/1484






[jira] [Closed] (SPARK-5437) DriverSuite and SparkSubmitSuite incorrect timeout behavior

2015-02-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5437.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Target Version/s: 1.3.0

 DriverSuite and SparkSubmitSuite incorrect timeout behavior
 ---

 Key: SPARK-5437
 URL: https://issues.apache.org/jira/browse/SPARK-5437
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or
 Fix For: 1.3.0


 In DriverSuite, we currently set a timeout of 60 seconds. If after this time 
 the process has not terminated, we leak the process because we never destroy 
 it.
 In SparkSubmitSuite, we currently do not have a timeout so the test can hang 
 indefinitely.
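 A sketch of the intended behavior (not necessarily the exact patch; builder 
 stands in for the ProcessBuilder the test uses to launch the child JVM): bound 
 the wait and destroy the process when the timeout fires instead of leaking it.
 {code}
 val process = builder.start()
 val deadline = System.currentTimeMillis() + 60 * 1000L
 var exitCode: Option[Int] = None
 try {
   while (exitCode.isEmpty && System.currentTimeMillis() < deadline) {
     try {
       exitCode = Some(process.exitValue())  // throws while the child is still running
     } catch {
       case _: IllegalThreadStateException => Thread.sleep(100)
     }
   }
 } finally {
   if (exitCode.isEmpty) process.destroy()   // never leak the child on timeout
 }
 assert(exitCode == Some(0), "child process did not exit cleanly within 60 seconds")
 {code}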






[jira] [Updated] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode

2015-02-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5388:
-
Summary: Provide a stable application submission gateway in standalone 
cluster mode  (was: Provide a stable application submission gateway)

 Provide a stable application submission gateway in standalone cluster mode
 --

 Key: SPARK-5388
 URL: https://issues.apache.org/jira/browse/SPARK-5388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Attachments: Stable Spark Standalone Submission.pdf


 The existing submission gateway in standalone mode is not compatible across 
 Spark versions. If you have a newer version of Spark submitting to an older 
 version of the standalone Master, it is currently not guaranteed to work. The 
 goal is to provide a stable REST interface to replace this channel.
 The first cut implementation will target standalone cluster mode because 
 there are very few messages exchanged. The design, however, will be general 
 enough to eventually support this for other cluster managers too. Note that 
 this is not necessarily required in YARN because we already use YARN's stable 
 interface to submit applications there.






[jira] [Updated] (SPARK-5388) Provide a stable application submission gateway in standalone cluster mode

2015-02-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5388:
-
Description: 
The existing submission gateway in standalone mode is not compatible across 
Spark versions. If you have a newer version of Spark submitting to an older 
version of the standalone Master, it is currently not guaranteed to work. The 
goal is to provide a stable REST interface to replace this channel.

The first cut implementation will target standalone cluster mode because there 
are very few messages exchanged. The design, however, should be general enough 
to potentially support this for other cluster managers too. Note that this is 
not necessarily required in YARN because we already use YARN's stable interface 
to submit applications there.

  was:
The existing submission gateway in standalone mode is not compatible across 
Spark versions. If you have a newer version of Spark submitting to an older 
version of the standalone Master, it is currently not guaranteed to work. The 
goal is to provide a stable REST interface to replace this channel.

The first cut implementation will target standalone cluster mode because there 
are very few messages exchanged. The design, however, will be general enough to 
eventually support this for other cluster managers too. Note that this is not 
necessarily required in YARN because we already use YARN's stable interface to 
submit applications there.


 Provide a stable application submission gateway in standalone cluster mode
 --

 Key: SPARK-5388
 URL: https://issues.apache.org/jira/browse/SPARK-5388
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Attachments: Stable Spark Standalone Submission.pdf


 The existing submission gateway in standalone mode is not compatible across 
 Spark versions. If you have a newer version of Spark submitting to an older 
 version of the standalone Master, it is currently not guaranteed to work. The 
 goal is to provide a stable REST interface to replace this channel.
 The first cut implementation will target standalone cluster mode because 
 there are very few messages exchanged. The design, however, should be general 
 enough to potentially support this for other cluster managers too. Note that 
 this is not necessarily required in YARN because we already use YARN's stable 
 interface to submit applications there.






[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib

2015-02-02 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301792#comment-14301792
 ] 

Xiangrui Meng commented on SPARK-5226:
--

[~alitouka] Thanks for implementing DBSCAN on top of Spark! I'd recommend 
registering it as a package on http://spark-packages.org so it is more visible 
to the community.

 Add DBSCAN Clustering Algorithm to MLlib
 

 Key: SPARK-5226
 URL: https://issues.apache.org/jira/browse/SPARK-5226
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Muhammad-Ali A'rabi
Priority: Minor
  Labels: DBSCAN

 MLlib only has k-means right now, and I think we should add some new 
 clustering algorithms to it. The first candidate, I think, is DBSCAN.






[jira] [Updated] (SPARK-4523) Improve handling of serialized schema information

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4523:

Priority: Critical  (was: Blocker)

 Improve handling of serialized schema information
 -

 Key: SPARK-4523
 URL: https://issues.apache.org/jira/browse/SPARK-4523
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical

 There are several issues with our current handling of metadata serialization, 
 which is especially troublesome since this is the only place that we persist 
 information directly using Spark SQL.  Moving forward we should do the 
 following:
  - Relax the parsing so that it does not fail when optional fields are 
 missing (i.e. containsNull or metadata); a sketch follows this list.
  - Include a regression suite that attempts to read old parquet files written 
 by previous versions of Spark SQL.
  - Provide better warning messages when various forms of parsing fail (I 
 think that it is silent right now which makes tracking down bugs more 
 difficult than it needs to be).
  - Deprecate (display a warning) when reading data with the old case class 
 schema representation and eventually remove it.
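 A sketch of what the relaxed parsing in the first item could look like (the 
 helper name and the default chosen here are illustrative, not the actual 
 parser's):
 {code}
 import org.json4s._
 
 // Fall back to a default instead of failing when an optional field is absent.
 def readContainsNull(field: JValue): Boolean = (field \ "containsNull") match {
   case JBool(b) => b
   case JNothing => true  // older writers omitted the field; assume nullable
   case other    => sys.error("Malformed containsNull: " + other)
 }
 {code}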






[jira] [Updated] (SPARK-3851) Support for reading parquet files with different but compatible schema

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3851:

Assignee: Cheng Lian

 Support for reading parquet files with different but compatible schema
 --

 Key: SPARK-3851
 URL: https://issues.apache.org/jira/browse/SPARK-3851
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Critical

 Right now it is required that all of the Parquet files have the same schema.  
 It would be nice to support some safe subset of cases where the schemas of the 
 files differ.  For example:
  - Adding and removing nullable columns.
  - Widening types (e.g. a column that is Int in some files and Long in others).






[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3575:

Priority: Blocker  (was: Critical)

 Hive Schema is ignored when using convertMetastoreParquet
 -

 Key: SPARK-3575
 URL: https://issues.apache.org/jira/browse/SPARK-3575
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
Priority: Blocker

 This can cause problems when, for example, one of the columns is defined as 
 TINYINT.  A ClassCastException will be thrown since the Parquet table scan 
 produces Ints while the rest of the execution is expecting Bytes.
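 An illustrative repro sketch (the table and column names are hypothetical, and 
 hiveContext is assumed to be an existing HiveContext):
 {code}
 hiveContext.sql("SET spark.sql.hive.convertMetastoreParquet=true")
 hiveContext.sql("CREATE TABLE tiny_tbl (id TINYINT) STORED AS PARQUET")
 // Once the table has data, scanning the TINYINT column can throw a
 // ClassCastException: the Parquet scan produces Ints while the rest of the
 // plan expects Bytes.
 hiveContext.sql("SELECT id FROM tiny_tbl WHERE id = 1").collect()
 {code}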






[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302027#comment-14302027
 ] 

Apache Spark commented on SPARK-3039:
-

User 'medale' has created a pull request for this issue:
https://github.com/apache/spark/pull/4315

 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 
 1 API
 --

 Key: SPARK-3039
 URL: https://issues.apache.org/jira/browse/SPARK-3039
 Project: Spark
  Issue Type: Bug
  Components: Build, Input/Output, Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
 Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy
Assignee: Bertrand Bossy
 Fix For: 1.2.0


 The spark assembly contains the artifact org.apache.avro:avro-mapred as a 
 dependency of org.spark-project.hive:hive-serde.
 The avro-mapred package provides a hadoop FileInputFormat to read and write 
 avro files. There are two versions of this package, distinguished by a 
 classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. 
 avro-mapred for the old Hadoop API uses no classifier.
 E.g. when reading avro files using 
 {code}
 sc.newAPIHadoopFile[AvroKey[SomeClass],NullWritable,AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
 {code}
 The following error occurs:
 {code}
 java.lang.IncompatibleClassChangeError: Found interface 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at 
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
 at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
 at org.apache.spark.scheduler.Task.run(Task.scala:51)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 {code}
 This error is usually a hint that the old and the new Hadoop APIs were mixed 
 up. As a workaround, if avro-mapred for hadoop2 is forced to appear on the 
 classpath before the version that is bundled with Spark (see the sketch 
 below), reading Avro files works fine. 
 Also, if Spark is built using avro-mapred for hadoop2, it works fine as well.
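 For an sbt-built application, forcing the hadoop2 variant could look like the 
 following sketch (the version shown is illustrative):
 {code}
 // build.sbt: depend on the hadoop2 classifier of avro-mapred so it is picked
 // up ahead of the hadoop1 variant bundled in the Spark assembly.
 libraryDependencies += "org.apache.avro" % "avro-mapred" % "1.7.6" classifier "hadoop2"
 {code}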






[jira] [Updated] (SPARK-4497) HiveThriftServer2 does not exit properly on failure

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4497:

Target Version/s: 1.4.0  (was: 1.3.0)

 HiveThriftServer2 does not exit properly on failure
 ---

 Key: SPARK-4497
 URL: https://issues.apache.org/jira/browse/SPARK-4497
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Yana Kadiyska
Priority: Critical

 Start the Thrift server with 
  sbin/start-thriftserver.sh --master ...
 If there is an error (in my case the namenode was in standby mode), the driver 
 shuts down properly:
 14/11/19 16:32:58 ERROR HiveThriftServer2: Error starting HiveThriftServer2
 
 14/11/19 16:32:59 INFO SparkUI: Stopped Spark web UI at http://myip:4040
 14/11/19 16:32:59 INFO DAGScheduler: Stopping DAGScheduler
 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Shutting down all 
 executors
 14/11/19 16:32:59 INFO SparkDeploySchedulerBackend: Asking each executor to 
 shut down
 14/11/19 16:33:00 INFO MapOutputTrackerMasterActor: MapOutputTrackerActor 
 stopped!
 14/11/19 16:33:00 INFO MemoryStore: MemoryStore cleared
 14/11/19 16:33:00 INFO BlockManager: BlockManager stopped
 14/11/19 16:33:00 INFO BlockManagerMaster: BlockManagerMaster stopped
 14/11/19 16:33:00 INFO SparkContext: Successfully stopped SparkContext
 but trying to run  sbin/start-thriftserver.sh --master ... again results in 
 an error that the Thriftserver is already running.
 Running ps -aef | grep for the offending PID shows:
 root 32334 1  0 16:32 ?00:00:00 /usr/local/bin/java 
 org.apache.spark.deploy.SparkSubmitDriverBootstrapper --class 
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --master 
 spark://myip:7077 --conf -spark.executor.extraJavaOptions=-verbose:gc 
 -XX:-PrintGCDetails -XX:+PrintGCTimeStamps spark-internal --hiveconf 
 hive.root.logger=INFO,console
 This is problematic since we have a process that tries to restart the driver 
 if it dies.






[jira] [Updated] (SPARK-5530) ApplicationMaster can't kill executor when using dynamicAllocation

2015-02-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5530:
-
Affects Version/s: 1.3.0

 ApplicationMaster can't kill executor when using dynamicAllocation
 --

 Key: SPARK-5530
 URL: https://issues.apache.org/jira/browse/SPARK-5530
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: meiyoula
Assignee: meiyoula
Priority: Critical
 Fix For: 1.3.0


 The YARN allocator logs "Attempted to kill unknown executor 3", and the 
 executor can't be killed.






[jira] [Commented] (SPARK-5514) collect should call executeCollect

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301953#comment-14301953
 ] 

Apache Spark commented on SPARK-5514:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/4313

 collect should call executeCollect
 --

 Key: SPARK-5514
 URL: https://issues.apache.org/jira/browse/SPARK-5514
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker








[jira] [Updated] (SPARK-5501) Write support for the data source API

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5501:

Assignee: Yin Huai

 Write support for the data source API
 -

 Key: SPARK-5501
 URL: https://issues.apache.org/jira/browse/SPARK-5501
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker








[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5463:

Assignee: Cheng Lian

 Fix Parquet filter push-down
 

 Key: SPARK-5463
 URL: https://issues.apache.org/jira/browse/SPARK-5463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker








[jira] [Updated] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT

2015-02-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5532:
-
Description: 
Deterministic failure:
{code}
import org.apache.spark.mllib.linalg._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._
val data = sc.parallelize(Seq((1.0, Vectors.dense(1, 2, 3)))).toDataFrame("label", "features")
data.repartition(1).saveAsParquetFile("blah")
{code}
If you remove the repartition, then this succeeds.

Here's the stack trace:
{code}
15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 
192.168.1.230): java.lang.ClassCastException: 
org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
org.apache.spark.sql.Row
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 
7, 192.168.1.230): java.lang.ClassCastException: 
org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
org.apache.spark.sql.Row
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
at 
org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
at 
org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at 
org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at 

[jira] [Updated] (SPARK-5463) Fix Parquet filter push-down

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5463:

Priority: Blocker  (was: Critical)

 Fix Parquet filter push-down
 

 Key: SPARK-5463
 URL: https://issues.apache.org/jira/browse/SPARK-5463
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.2.1, 1.2.2
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Blocker








[jira] [Resolved] (SPARK-5184) Improve the performance of metadata operations

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5184.
-
Resolution: Won't Fix

 Improve the performance of metadata operations
 --

 Key: SPARK-5184
 URL: https://issues.apache.org/jira/browse/SPARK-5184
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Yin Huai
Priority: Blocker








[jira] [Comment Edited] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-02 Thread Markus Dale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780
 ] 

Markus Dale edited comment on SPARK-3039 at 2/2/15 8:40 PM:


For me, Spark 1.2.0 either downloading spark-1.2.0-bin-hadoop2.4.tgz or
compiling the source with 

{code}
mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests clean package
{code}

still had the same problem:

{noformat}
java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at 
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:135)

{noformat}

Starting the build with a clean .m2/repository, the repository afterwards 
contained:

* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want). 

It seemed that shading these two dependencies into the spark-assembly jar 
resulted in the error above, at least in the downloaded hadoop2.4 Spark bin 
and in my own build.

Running the following (after doing a mvn install and by-hand copy of all the 
spark artifacts into my local repo for spark-repl/yarn):

{code}
 mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree 
-Dincludes=org.apache.avro:avro-mapred
{code}

Showed that the culprit was in the Hive project, namely 
org.spark-project.hive:hive-exec's
dependency on 1.7.5.

{noformat}
Building Spark Project Hive 1.2.0
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}

Editing spark-1.2.0/sql/hive/pom.xml to exclude avro-mapred from hive-exec and 
then recompiling fixed the problem; the resulting distribution works well 
against Avro/Hadoop2 code:

{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}
   
Only the last exclusion was added. I will try to submit a pull request if this 
is not already addressed in the latest code.


was (Author: medale):
For me, Spark 1.2.0 either downloading spark-1.2.0-bin-hadoop2.4.tgz or
compiling the source with 

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package
{code}

still had the same problem:

{noformat}
java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at 
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:135)

{noformat}

Starting the build with a clean .m2/repository, the repository afterwards 
contained:

* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want). 

It seemed that shading these two dependencies into the spark-assembly jar 
resulted in the error above, at least in the downloaded hadoop2.4 Spark bin 
and in my own build.

Running the following (after doing a mvn install and by-hand copy of all the 
spark artifacts into my local repo for spark-repl/yarn):

{code}
 mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree 
-Dincludes=org.apache.avro:avro-mapred
{code}

Showed that the culprit was in the Hive project, namely 
org.spark-project.hive:hive-exec's
dependency on 1.7.5.

{noformat}
Building Spark Project Hive 1.2.0
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}

Editing spark-1.2.0/sql/hive/pom.xml to exclude avro-mapred from hive-exec and 
then recompiling fixed the problem; the resulting distribution works well 
against Avro/Hadoop2 code:

{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
 

[jira] [Commented] (SPARK-3039) Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API

2015-02-02 Thread Markus Dale (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14301780#comment-14301780
 ] 

Markus Dale commented on SPARK-3039:


For me, Spark 1.2.0 either downloading spark-1.2.0-bin-hadoop2.4.tgz or
compiling the source with 

{code}
mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests clean package
{code}

still had the same problem:

{noformat}
java.lang.IncompatibleClassChangeError: Found interface 
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at 
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:135)

{noformat}

Starting the build with a clean .m2/repository, the repository afterwards 
contained:

* avro-mapred/1.7.5 (with the default jar - i.e. hadoop1)
* avro-mapred/1.7.6 with the avro-mapred-1.7.6-hadoop2.jar (the one we want). 

It seemed that shading these two dependencies into the spark-assembly jar 
resulted in the error above, at least in the downloaded hadoop2.4 Spark bin 
and in my own build.

Running the following (after doing a mvn install and by-hand copy of all the 
spark artifacts into my local repo for spark-repl/yarn):

{code}
 mvn -Pyarn -Phadoop-2.4 -Phive -DskipTests dependency:tree 
-Dincludes=org.apache.avro:avro-mapred
{code}

Showed that the culprit was in the Hive project, namely 
org.spark-project.hive:hive-exec's
dependency on 1.7.5.

{noformat}
Building Spark Project Hive 1.2.0
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
[INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
[INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
[INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
[INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile
[INFO]
{noformat}

Editing spark-1.2.0/sql/hive/pom.xml to exclude avro-mapred from hive-exec and 
then recompiling fixed the problem; the resulting distribution works well 
against Avro/Hadoop2 code:

{code:xml}
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-exec</artifactId>
  <version>${hive.version}</version>
  <exclusions>
    <exclusion>
      <groupId>commons-logging</groupId>
      <artifactId>commons-logging</artifactId>
    </exclusion>
    <exclusion>
      <groupId>com.esotericsoftware.kryo</groupId>
      <artifactId>kryo</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-mapred</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}
   
Only the last exclusion was added. I will try to submit a pull request if this 
is not already addressed in the latest code.

 Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 
 1 API
 --

 Key: SPARK-3039
 URL: https://issues.apache.org/jira/browse/SPARK-3039
 Project: Spark
  Issue Type: Bug
  Components: Build, Input/Output, Spark Core
Affects Versions: 0.9.1, 1.0.0, 1.1.0
 Environment: hadoop2, hadoop-2.4.0, HDP-2.1
Reporter: Bertrand Bossy
Assignee: Bertrand Bossy
 Fix For: 1.2.0


 The spark assembly contains the artifact org.apache.avro:avro-mapred as a 
 dependency of org.spark-project.hive:hive-serde.
 The avro-mapred package provides a hadoop FileInputFormat to read and write 
 avro files. There are two versions of this package, distinguished by a 
 classifier. avro-mapred for the new Hadoop API uses the classifier hadoop2. 
 avro-mapred for the old Hadoop API uses no classifier.
 E.g. when reading avro files using 
 {code}
 sc.newAPIHadoopFile[AvroKey[SomeClass],NullWritable,AvroKeyInputFormat[SomeClass]]("hdfs://path/to/file.avro")
 {code}
 The following error occurs:
 {code}
 java.lang.IncompatibleClassChangeError: Found interface 
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at 
 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at 
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.&lt;init&gt;(NewHadoopRDD.scala:111)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:99)
 at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 at 

[jira] [Assigned] (SPARK-5518) Error messages for plans with invalid AttributeReferences

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-5518:
---

Assignee: Michael Armbrust

 Error messages for plans with invalid AttributeReferences
 -

 Key: SPARK-5518
 URL: https://issues.apache.org/jira/browse/SPARK-5518
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Michael Armbrust
Priority: Blocker

 It is now possible for users to put invalid attribute references into query 
 plans.  We should check for this case at the end of analysis.






[jira] [Updated] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3267:

Target Version/s: 1.4.0  (was: 1.3.0)

 Deadlock between ScalaReflectionLock and Data type initialization
 -

 Key: SPARK-3267
 URL: https://issues.apache.org/jira/browse/SPARK-3267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Priority: Critical

 Deadlock here:
 {code}
 Executor task launch worker-0 daemon prio=10 tid=0x7fab50036000 
 nid=0x27a in Object.wait() [0x7fab60c2e000
 ]
java.lang.Thread.State: RUNNABLE
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:202)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scal
 a:175)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:304)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:314)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
 93)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:313)
 at 
 org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
 a:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 ...
 {code}
 and
 {code}
 Executor task launch worker-2 daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000]
    java.lang.Thread.State: RUNNABLE
 at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
 - locked 0x00064e5d9a48 (a org.apache.spark.sql.catalyst.expressions.Cast)
 at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
 at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at 

[jira] [Updated] (SPARK-5258) Clean up exposed classes in sql.hive package

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5258:

Priority: Blocker  (was: Major)

 Clean up exposed classes in sql.hive package
 

 Key: SPARK-5258
 URL: https://issues.apache.org/jira/browse/SPARK-5258
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values

2015-02-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5534:


 Summary: EdgeRDD, VertexRDD getStorageLevel return bad values
 Key: SPARK-5534
 URL: https://issues.apache.org/jira/browse/SPARK-5534
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


After caching a graph, its edge and vertex RDDs still return StorageLevel.None.

Reproduce error:
{code}
import org.apache.spark.graphx.{Edge, Graph}
val edges = Seq(
  Edge[Double](0, 1, 0),
  Edge[Double](1, 2, 0),
  Edge[Double](2, 3, 0),
  Edge[Double](3, 4, 0))
val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0)
g.vertices.getStorageLevel  // returns value for StorageLevel.None
g.edges.getStorageLevel  // returns value for StorageLevel.None
g.cache()
g.vertices.count()
g.edges.count()
g.vertices.getStorageLevel  // returns value for StorageLevel.None
g.edges.getStorageLevel  // returns value for StorageLevel.None
{code}
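For reference, a minimal sketch of the behavior one would expect once this is fixed, continuing from the snippet above (this is my assumption about the intended semantics, not current behavior):
{code}
import org.apache.spark.storage.StorageLevel

// Assumed expectation: after g.cache() and materializing the graph,
// both RDD views should report a cached storage level rather than NONE.
assert(g.vertices.getStorageLevel != StorageLevel.NONE)
assert(g.edges.getStorageLevel != StorageLevel.NONE)
{code}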




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5531) Spark download .tgz file does not get unpacked

2015-02-02 Thread DeepakVohra (JIRA)
DeepakVohra created SPARK-5531:
--

 Summary: Spark download .tgz file does not get unpacked
 Key: SPARK-5531
 URL: https://issues.apache.org/jira/browse/SPARK-5531
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: Linux
Reporter: DeepakVohra


The spark-1.2.0-bin-cdh4.tgz file downloaded from 
http://spark.apache.org/downloads.html does not get unpacked.

tar xvf spark-1.2.0-bin-cdh4.tgz

gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4811) Custom UDTFs not working in Spark SQL

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4811:

Target Version/s: 1.4.0  (was: 1.3.0)

 Custom UDTFs not working in Spark SQL
 -

 Key: SPARK-4811
 URL: https://issues.apache.org/jira/browse/SPARK-4811
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0, 1.1.1
Reporter: Saurabh Santhosh
Priority: Critical

 I am using the Thrift server interface to Spark SQL and using beeline to 
 connect to it.
 I tried Spark SQL versions 1.1.0 and 1.1.1, and both throw the 
 following exception when using any custom UDTF.
 These are the steps I did:
 *Created a UDTF 'com.x.y.xxx'.*
 Registered the UDTF using the following query: 
 *create temporary function xxx as 'com.x.y.xxx'*
 The registration went through without any errors, but when I tried executing 
 the UDTF I got the following error:
 *java.lang.ClassNotFoundException: xxx*
 Oddly, it is trying to load the function name instead of the 
 function class. The exception is at *line no: 81 in hiveudfs.scala*.
 I have been at it for quite a long time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT

2015-02-02 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-5532:


 Summary: Repartitioning DataFrame causes saveAsParquetFile to fail 
with VectorUDT
 Key: SPARK-5532
 URL: https://issues.apache.org/jira/browse/SPARK-5532
 Project: Spark
  Issue Type: Bug
  Components: MLlib, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Deterministic failure:
{code}
import org.apache.spark.mllib.linalg._
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
import sqlContext._
val data = sc.parallelize(Seq((1.0, Vectors.dense(1, 2, 3)))).toDataFrame("label", "features")
data.repartition(1).saveAsParquetFile("blah")
{code}
If you remove the repartition, then this succeeds.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4553:

Assignee: Cheng Lian

 query for parquet table with string fields in spark sql hive get binary result
 --

 Key: SPARK-4553
 URL: https://issues.apache.org/jira/browse/SPARK-4553
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
Assignee: Cheng Lian

 run 
 create table test_parquet(key int, value string) stored as parquet;
 insert into table test_parquet select * from src;
 select * from test_parquet;
 get result as follow
 ...
 282 [B@38fda3b
 138 [B@1407a24
 238 [B@12de6fb
 419 [B@6c97695
 15 [B@4885067
 118 [B@156a8d3
 72 [B@65d20dd
 90 [B@4c18906
 307 [B@60b24cc
 19 [B@59cf51b
 435 [B@39fdf37
 10 [B@4f799d7
 277 [B@3950951
 273 [B@596bf4b
 306 [B@3e91557
 224 [B@3781d61
 309 [B@2d0d128
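 A possible workaround (my assumption, not confirmed in this ticket) is to tell Spark SQL to 
 interpret un-annotated Parquet BINARY columns as strings; hiveContext below stands for whatever 
 HiveContext is in use:
 {code}
 // Assumed workaround: treat Parquet BINARY fields without a UTF8 annotation as strings.
 hiveContext.setConf("spark.sql.parquet.binaryAsString", "true")
 {code}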



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4553) query for parquet table with string fields in spark sql hive get binary result

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-4553:

Priority: Blocker  (was: Major)

 query for parquet table with string fields in spark sql hive get binary result
 --

 Key: SPARK-4553
 URL: https://issues.apache.org/jira/browse/SPARK-4553
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: wangfei
Assignee: Cheng Lian
Priority: Blocker

 run 
 create table test_parquet(key int, value string) stored as parquet;
 insert into table test_parquet select * from src;
 select * from test_parquet;
 get result as follow
 ...
 282 [B@38fda3b
 138 [B@1407a24
 238 [B@12de6fb
 419 [B@6c97695
 15 [B@4885067
 118 [B@156a8d3
 72 [B@65d20dd
 90 [B@4c18906
 307 [B@60b24cc
 19 [B@59cf51b
 435 [B@39fdf37
 10 [B@4f799d7
 277 [B@3950951
 273 [B@596bf4b
 306 [B@3e91557
 224 [B@3781d61
 309 [B@2d0d128



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4585) Spark dynamic executor allocation shouldn't use maxExecutors as initial number

2015-02-02 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4585.

  Resolution: Fixed
   Fix Version/s: 1.3.0
Assignee: Sandy Ryza
Target Version/s: 1.3.0

 Spark dynamic executor allocation shouldn't use maxExecutors as initial number
 --

 Key: SPARK-4585
 URL: https://issues.apache.org/jira/browse/SPARK-4585
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, YARN
Affects Versions: 1.1.0
Reporter: Chengxiang Li
Assignee: Sandy Ryza
 Fix For: 1.3.0


 With SPARK-3174, one can configure a minimum and maximum number of executors 
 for a Spark application on Yarn. However, the application always starts with 
 the maximum. It seems more reasonable, at least for Hive on Spark, to start 
 from the minimum and scale up as needed up to the maximum.
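 For illustration, a sketch of the relevant settings (the min/max keys exist today; starting 
 near the minimum is the behavior this issue asks for, and the numbers are only examples):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .set("spark.dynamicAllocation.enabled", "true")
   .set("spark.dynamicAllocation.minExecutors", "2")
   .set("spark.dynamicAllocation.maxExecutors", "50")
 // With this change the application would start near the minimum and scale up
 // toward the maximum as load requires, instead of starting at the maximum.
 {code}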



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5425) ConcurrentModificationException during SparkConf creation

2015-02-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5425:
--
Assignee: Jacek Lewandowski

 ConcurrentModificationException during SparkConf creation
 -

 Key: SPARK-5425
 URL: https://issues.apache.org/jira/browse/SPARK-5425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1, 1.2.0
Reporter: Jacek Lewandowski
Assignee: Jacek Lewandowski
 Fix For: 1.3.0, 1.1.2


 This fragment of code:
 {code}
   if (loadDefaults) {
 // Load any spark.* system properties
  for ((k, v) <- System.getProperties.asScala if k.startsWith("spark.")) {
   settings(k) = v
 }
   }
 {code}
 causes 
 {noformat}
 ERROR 09:43:15  SparkMaster service caused error in state STARTING
 java.util.ConcurrentModificationException: null
   at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) 
 ~[na:1.7.0_60]
   at 
 scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
  ~[scala-library-2.10.4.jar:na]
   at 
 scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
  ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
 ~[scala-library-2.10.4.jar:na]
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  ~[scala-library-2.10.4.jar:na]
   at org.apache.spark.SparkConf.init(SparkConf.scala:53) 
 ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
   at org.apache.spark.SparkConf.init(SparkConf.scala:47) 
 ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
 {noformat}
 when another thread modifies the system properties at the same time. 
 The Scala bug https://issues.scala-lang.org/browse/SI-7775 is related to 
 this issue and shows that the problem has been seen elsewhere. 
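 One way to avoid the race, sketched below purely as an illustration (this is my assumption, 
 not necessarily the fix that was merged), is to take a snapshot of the property names before 
 iterating:
 {code}
 import scala.collection.JavaConverters._

 if (loadDefaults) {
   // Snapshot the property names so a concurrent modification of the
   // underlying Hashtable cannot break the iteration below.
   val snapshot = System.getProperties.stringPropertyNames.asScala.toSeq
     .flatMap(k => Option(System.getProperty(k)).map(k -> _))
   for ((k, v) <- snapshot if k.startsWith("spark.")) {
     settings(k) = v
   }
 }
 {code}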



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5425) ConcurrentModificationException during SparkConf creation

2015-02-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5425:
--
Fix Version/s: 1.1.2
   1.3.0

 ConcurrentModificationException during SparkConf creation
 -

 Key: SPARK-5425
 URL: https://issues.apache.org/jira/browse/SPARK-5425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1, 1.2.0
Reporter: Jacek Lewandowski
Assignee: Jacek Lewandowski
 Fix For: 1.3.0, 1.1.2


 This fragment of code:
 {code}
   if (loadDefaults) {
 // Load any spark.* system properties
  for ((k, v) <- System.getProperties.asScala if k.startsWith("spark.")) {
   settings(k) = v
 }
   }
 {code}
 causes 
 {noformat}
 ERROR 09:43:15  SparkMaster service caused error in state STARTING
 java.util.ConcurrentModificationException: null
   at java.util.Hashtable$Enumerator.next(Hashtable.java:1167) 
 ~[na:1.7.0_60]
   at 
 scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:458)
  ~[scala-library-2.10.4.jar:na]
   at 
 scala.collection.convert.Wrappers$JPropertiesWrapper$$anon$3.next(Wrappers.scala:454)
  ~[scala-library-2.10.4.jar:na]
   at scala.collection.Iterator$class.foreach(Iterator.scala:727) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) 
 ~[scala-library-2.10.4.jar:na]
   at scala.collection.AbstractIterable.foreach(Iterable.scala:54) 
 ~[scala-library-2.10.4.jar:na]
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
  ~[scala-library-2.10.4.jar:na]
   at org.apache.spark.SparkConf.init(SparkConf.scala:53) 
 ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
   at org.apache.spark.SparkConf.init(SparkConf.scala:47) 
 ~[spark-core_2.10-1.2.1_dse-20150121.075638-2.jar:1.2.1_dse-SNAPSHOT]
 {noformat}
 when another thread modifies the system properties at the same time. 
 The Scala bug https://issues.scala-lang.org/browse/SI-7775 is related to 
 this issue and shows that the problem has been seen elsewhere. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode

2015-02-02 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4986:
-
Priority: Blocker  (was: Major)

 Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
 --

 Key: SPARK-4986
 URL: https://issues.apache.org/jira/browse/SPARK-4986
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jesper Lundgren
Priority: Blocker
 Fix For: 1.3.0


 When using the graceful stop API of Spark Streaming in Spark Standalone 
 cluster the stop signal never reaches the receivers. I have tested this with 
 Spark 1.2 and Kafka receivers. 
 ReceiverTracker will send StopReceiver message to ReceiverSupervisorImpl.
 In local mode ReceiverSupervisorImpl receives this message but in Standalone 
 cluster mode the message seems to be lost.
 (I have modified the code to send my own string message as a stop signal from 
 ReceiverTracker to ReceiverSupervisorImpl and it works as a workaround.)
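 For context, the graceful stop call being discussed looks like this (a usage sketch; ssc is 
 the running StreamingContext):
 {code}
 // Graceful shutdown: stop the receivers first, then wait for all received
 // data to be processed before stopping the SparkContext.
 ssc.stop(stopSparkContext = true, stopGracefully = true)
 {code}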



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode

2015-02-02 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-4986:
-
Fix Version/s: (was: 1.2.1)

 Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
 --

 Key: SPARK-4986
 URL: https://issues.apache.org/jira/browse/SPARK-4986
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jesper Lundgren
Priority: Blocker
 Fix For: 1.3.0


 When using the graceful stop API of Spark Streaming in Spark Standalone 
 cluster the stop signal never reaches the receivers. I have tested this with 
 Spark 1.2 and Kafka receivers. 
 ReceiverTracker will send StopReceiver message to ReceiverSupervisorImpl.
 In local mode ReceiverSupervisorImpl receives this message but in Standalone 
 cluster mode the message seems to be lost.
 (I have modified the code to send my own string message as a stop signal from 
 ReceiverTracker to ReceiverSupervisorImpl and it works as a workaround.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5027) add SVMWithLBFGS interface in MLLIB

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5027:
-
Assignee: zhengbing li

 add SVMWithLBFGS interface in MLLIB
 ---

 Key: SPARK-5027
 URL: https://issues.apache.org/jira/browse/SPARK-5027
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: zhengbing li
Assignee: zhengbing li
   Original Estimate: 120h
  Remaining Estimate: 120h

   Our team has done a comparison test for ANN; the test results are in 
 https://github.com/apache/spark/pull/1290.
   We find that the performance of SVM using LBFGS is higher than SVM using 
 SGD, so I want to add an SVMWithLBFGS interface to MLlib. 
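 A sketch of how the proposed interface might sit next to the existing one (SVMWithLBFGS is 
 the proposal and does not exist yet; the SGD call and the tiny dataset are just illustration):
 {code}
 import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.LabeledPoint

 val training = sc.parallelize(Seq(
   LabeledPoint(1.0, Vectors.dense(1.0, 0.5)),
   LabeledPoint(0.0, Vectors.dense(-1.0, -0.5))))

 // Existing interface, optimized with SGD:
 val sgdModel = SVMWithSGD.train(training, 100)

 // Proposed interface (hypothetical), mirroring LogisticRegressionWithLBFGS:
 // val lbfgsModel = new SVMWithLBFGS().run(training)
 {code}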



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2206) Automatically infer the number of classification classes in multiclass classification

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2206:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 Automatically infer the number of classification classes in multiclass 
 classification
 -

 Key: SPARK-2206
 URL: https://issues.apache.org/jira/browse/SPARK-2206
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Manish Amde
Assignee: Manish Amde

 Currently, the user needs to specify the numClassesForClassification 
 parameter explicitly during multiclass classification for decision trees. 
 This feature will automatically infer this information (and possibly class 
 histograms) from the training data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5027) add SVMWithLBFGS interface in MLLIB

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5027:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 add SVMWithLBFGS interface in MLLIB
 ---

 Key: SPARK-5027
 URL: https://issues.apache.org/jira/browse/SPARK-5027
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.2.0
Reporter: zhengbing li
Assignee: zhengbing li
   Original Estimate: 120h
  Remaining Estimate: 120h

   Our team has done the comparison test for ann. The test results are in   
 “https://github.com/apache/spark/pull/1290”
   And we find the performance of svm using LBFGS is higher than svm using 
 SGD, so I want to add SVMWithLBFGS interface to mllib. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5536) Wrap the old ALS to use the new ALS implementation.

2015-02-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5536:


 Summary: Wrap the old ALS to use the new ALS implementation.
 Key: SPARK-5536
 URL: https://issues.apache.org/jira/browse/SPARK-5536
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


The new implementation performs better. We should replace the old one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5542) Decouple publishing, packaging, and tagging in release script

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302578#comment-14302578
 ] 

Apache Spark commented on SPARK-5542:
-

User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/4319

 Decouple publishing, packaging, and tagging in release script
 -

 Key: SPARK-5542
 URL: https://issues.apache.org/jira/browse/SPARK-5542
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Patrick Wendell

 Our release script should make it easy to do these separately. I.e. it should 
 be possible to publish a release from a tag that we already cut. This would 
 help with things such as publishing nightly releases (SPARK-1517).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections

2015-02-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3883.
---
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3571
[https://github.com/apache/spark/pull/3571]

 Provide SSL support for Akka and HttpServer based connections
 -

 Key: SPARK-3883
 URL: https://issues.apache.org/jira/browse/SPARK-3883
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jacek Lewandowski
 Fix For: 1.3.0


 Spark uses at least 4 logical communication channels:
 1. Control messages - Akka based
 2. JARs and other files - Jetty based (HttpServer)
 3. Computation results - Java NIO based
 4. Web UI - Jetty based
 The aim of this feature is to enable SSL for (1) and (2).
 Why:
 Spark configuration is sent through (1). Spark configuration may contain 
 sensitive information like credentials for accessing external data sources or 
 streams. Application JAR files (2) may include the application logic and 
 therefore they may include information about the structure of the external 
 data sources, and credentials as well. 
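 A sketch of what enabling this could look like (the spark.ssl.* property names are assumptions 
 based on the pull request, not a documented API at the time of writing, and the paths and 
 passwords are placeholders):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .set("spark.ssl.enabled", "true")
   .set("spark.ssl.keyStore", "/path/to/keystore.jks")
   .set("spark.ssl.keyStorePassword", "secret")
   .set("spark.ssl.trustStore", "/path/to/truststore.jks")
   .set("spark.ssl.trustStorePassword", "secret")
   .set("spark.ssl.protocol", "TLSv1.2")
 {code}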



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5532:
-
Assignee: Cheng Lian

 Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
 

 Key: SPARK-5532
 URL: https://issues.apache.org/jira/browse/SPARK-5532
 Project: Spark
  Issue Type: Bug
  Components: MLlib, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Cheng Lian
Priority: Blocker

 Deterministic failure:
 {code}
 import org.apache.spark.mllib.linalg._
 import org.apache.spark.sql.SQLContext
 val sqlContext = new SQLContext(sc)
 import sqlContext._
 val data = sc.parallelize(Seq((1.0, Vectors.dense(1, 2, 3)))).toDataFrame("label", "features")
 data.repartition(1).saveAsParquetFile("blah")
 {code}
 If you remove the repartition, then this succeeds.
 Here's the stack trace:
 {code}
 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 
 192.168.1.230): java.lang.ClassCastException: 
 org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
 org.apache.spark.sql.Row
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
   at 
 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
 aborting job
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
 (TID 7, 192.168.1.230): java.lang.ClassCastException: 
 org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
 org.apache.spark.sql.Row
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
   at 
 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
   at 
 

[jira] [Commented] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn

2015-02-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302290#comment-14302290
 ] 

Patrick Wendell commented on SPARK-3778:


/cc [~hshreedharan]

 newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
 -

 Key: SPARK-3778
 URL: https://issues.apache.org/jira/browse/SPARK-3778
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Blocker

 The newAPIHadoopRDD routine doesn't properly add the credentials to the conf 
 to be able to access secure hdfs.
 Note that newAPIHadoopFile does handle these because the 
 org.apache.hadoop.mapreduce.Job automatically adds it for you.
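 A workaround sketch following that observation (my assumption, not the fix tracked here; the 
 input path and TextInputFormat are only illustrative): build the Configuration through a 
 mapreduce Job so the credentials get attached, then hand it to newAPIHadoopRDD.
 {code}
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.io.{LongWritable, Text}
 import org.apache.hadoop.mapreduce.Job
 import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}

 // Creating a Job attaches the current user's credentials to its configuration,
 // which is what newAPIHadoopFile relies on internally.
 val job = Job.getInstance(sc.hadoopConfiguration)
 FileInputFormat.addInputPath(job, new Path("hdfs:///secure/data"))

 val rdd = sc.newAPIHadoopRDD(
   job.getConfiguration,
   classOf[TextInputFormat],
   classOf[LongWritable],
   classOf[Text])
 {code}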



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3778) newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn

2015-02-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3778:
---
Priority: Blocker  (was: Critical)

 newAPIHadoopRDD doesn't properly pass credentials for secure hdfs on yarn
 -

 Key: SPARK-3778
 URL: https://issues.apache.org/jira/browse/SPARK-3778
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Blocker

 The newAPIHadoopRDD routine doesn't properly add the credentials to the conf 
 to be able to access secure hdfs.
 Note that newAPIHadoopFile does handle these because the 
 org.apache.hadoop.mapreduce.Job automatically adds it for you.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-02-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4550:
---
Target Version/s: 1.4.0

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
 Attachments: SPARK-4550-design-v1.pdf


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.
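 To make the caveat concrete, a sketch of the serializer settings under which record relocation 
 would be safe (these Kryo flags exist today; nothing here is specific to the proposal itself):
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   // Reference tracking must be off so each record's serialized bytes are
   // self-contained and can be moved around during the sort.
   .set("spark.kryo.referenceTracking", "false")
 {code}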



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-02-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4550:
---
Priority: Critical  (was: Major)

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5537) Expand user guide for multinomial logistic regression

2015-02-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5537:


 Summary: Expand user guide for multinomial logistic regression
 Key: SPARK-5537
 URL: https://issues.apache.org/jira/browse/SPARK-5537
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: DB Tsai


We probably don't need to work out the math in the user guide. We can point 
users to Wikipedia for details and focus on the public APIs and how to use them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4508) Native Date type for SQL92 Date

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4508.
-
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3732
[https://github.com/apache/spark/pull/3732]

 Native Date type for SQL92 Date
 ---

 Key: SPARK-4508
 URL: https://issues.apache.org/jira/browse/SPARK-4508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
 Fix For: 1.3.0


 Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 
 bytes as Long) in catalyst row.
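 A minimal sketch of the representation described above (the helper names are mine, and this 
 ignores the calendar/timezone subtleties a real implementation must handle):
 {code}
 import java.sql.Date

 val MillisPerDay = 24L * 60 * 60 * 1000

 // Store a date as the number of whole days since the Unix epoch; it fits in an Int.
 def toDays(d: Date): Int = (d.getTime / MillisPerDay).toInt

 // Recover a java.sql.Date from the compact representation.
 def fromDays(days: Int): Date = new Date(days * MillisPerDay)
 {code}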



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5540) Hide ALS.solveLeastSquares.

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302495#comment-14302495
 ] 

Apache Spark commented on SPARK-5540:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4318

 Hide ALS.solveLeastSquares.
 ---

 Key: SPARK-5540
 URL: https://issues.apache.org/jira/browse/SPARK-5540
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 This method survived the code review and it has been there since v1.1.0. It 
 exposes jblas types. Let's remove it from the public API. I expect that no 
 one calls it directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5513) Add NMF option to the new ALS implementation

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5513.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4302
[https://github.com/apache/spark/pull/4302]

 Add NMF option to the new ALS implementation
 

 Key: SPARK-5513
 URL: https://issues.apache.org/jira/browse/SPARK-5513
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.3.0


 Then we can swap spark.mllib's implementation to use the new ALS impl.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5231) History Server shows wrong job submission time.

2015-02-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5231:
--
Target Version/s: 1.3.0, 1.2.2  (was: 1.3.0)

 History Server shows wrong job submission time.
 ---

 Key: SPARK-5231
 URL: https://issues.apache.org/jira/browse/SPARK-5231
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
  Labels: backport-needed
 Fix For: 1.3.0


 History Server doesn't show the correct job submission time.
 It's because JobProgressListener updates the job submission time every time 
 the onJobStart method is invoked from ReplayListenerBus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5195) when hive table is query with alias the cache data lose effectiveness.

2015-02-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5195:
---
Fix Version/s: (was: 1.2.1)

 when hive table is query with alias  the cache data  lose effectiveness.
 

 Key: SPARK-5195
 URL: https://issues.apache.org/jira/browse/SPARK-5195
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: yixiaohua
 Fix For: 1.3.0


 Override MetastoreRelation's sameResult method to compare only the database name 
 and table name.
 Previously:
 cache table t1;
 select count() from t1;
 would read data from memory, but the query below would not; instead it read from 
 hdfs:
 select count() from t1 t;
 Cached data is keyed by the logical plan and matched with sameResult, so when a 
 table is referenced with an alias its logical plan differs from the plan without 
 the alias. Hence this change makes the sameResult method compare only the database 
 name and table name.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5231) History Server shows wrong job submission time.

2015-02-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5231:
--
Labels: backport-needed  (was: )

 History Server shows wrong job submission time.
 ---

 Key: SPARK-5231
 URL: https://issues.apache.org/jira/browse/SPARK-5231
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
  Labels: backport-needed
 Fix For: 1.3.0


 History Server doesn't show the correct job submission time.
 It's because JobProgressListener updates the job submission time every time 
 the onJobStart method is invoked from ReplayListenerBus.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5454) [SQL] Self join with ArrayType columns problems

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5454:

Priority: Blocker  (was: Major)

 [SQL] Self join with ArrayType columns problems
 ---

 Key: SPARK-5454
 URL: https://issues.apache.org/jira/browse/SPARK-5454
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Pierre Borckmans
Priority: Blocker

 Weird behaviour when performing a self join on a table with some ArrayType 
 field (potential bug?). 
 I have set up a minimal non-working example here: 
 https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
 In a nutshell, if the ArrayType column used for the pivot is created manually 
 in the StructType definition, everything works as expected. 
 However, if the ArrayType pivot column is obtained by a SQL query (be it by 
 using an array wrapper, or a collect_list operator, for instance), then the 
 results are completely off. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5514) collect should call executeCollect

2015-02-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5514.

   Resolution: Fixed
Fix Version/s: 1.3.0

 collect should call executeCollect
 --

 Key: SPARK-5514
 URL: https://issues.apache.org/jira/browse/SPARK-5514
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.3.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4508) Native Date type for SQL92 Date

2015-02-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-4508:
---
Fix Version/s: (was: 1.3.0)

 Native Date type for SQL92 Date
 ---

 Key: SPARK-4508
 URL: https://issues.apache.org/jira/browse/SPARK-4508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang

 Store daysSinceEpoch as an Int(4 bytes), instead of using java.sql.Date(8 
 bytes as Long) in catalyst row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5544) wholeTextFiles should recognize multiple input paths delimited by ,

2015-02-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5544:


 Summary: wholeTextFiles should recognize multiple input paths 
delimited by ,
 Key: SPARK-5544
 URL: https://issues.apache.org/jira/browse/SPARK-5544
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Xiangrui Meng


textFile takes delimited paths in a single path string. wholeTextFiles should 
behave the same.
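In other words (paths are illustrative):
{code}
// Works today: comma-delimited paths in a single string.
val lines = sc.textFile("/data/logs1,/data/logs2")

// Requested: the same convention for whole-file reads.
val files = sc.wholeTextFiles("/data/logs1,/data/logs2")
{code}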



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5500) Document that feeding hadoopFile into a shuffle operation will cause problems

2015-02-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-5500.

   Resolution: Fixed
Fix Version/s: 1.3.0

 Document that feeding hadoopFile into a shuffle operation will cause problems
 -

 Key: SPARK-5500
 URL: https://issues.apache.org/jira/browse/SPARK-5500
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
Priority: Critical
 Fix For: 1.3.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2309.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 3833
[https://github.com/apache/spark/pull/3833]

 Generalize the binary logistic regression into multinomial logistic regression
 --

 Key: SPARK-2309
 URL: https://issues.apache.org/jira/browse/SPARK-2309
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai
Priority: Critical
 Fix For: 1.3.0


 Currently, there is no multi-class classifier in MLlib. Logistic regression 
 can be extended to a multinomial one straightforwardly. 
 The following formula will be implemented. 
 http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25
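 For reference, a common pivot-based parameterization of the multinomial model is sketched 
 below; the slides linked above give the exact form to be implemented.
 {code}
 P(y = k \mid x) = \frac{\exp(w_k^T x)}{1 + \sum_{j=1}^{K-1} \exp(w_j^T x)}, \quad k = 1, \dots, K-1
 P(y = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_j^T x)} \quad \text{(pivot class } K\text{)}
 {code}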



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4980) Add decay factors to streaming linear methods

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4980:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 Add decay factors to streaming linear methods
 -

 Key: SPARK-4980
 URL: https://issues.apache.org/jira/browse/SPARK-4980
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Streaming
Reporter: Jeremy Freeman
Priority: Minor

 Our implementation of streaming k-means uses a decay factor that allows 
 users to control how quickly the model adjusts to new data: whether it treats 
 all data equally, or only bases its estimate on the most recent batch. It is 
 intuitively parameterized, and can be specified in units of either batches or 
 points. We should add a similar decay factor to the streaming linear methods 
 using SGD, including streaming linear regression (currently implemented) and 
 streaming logistic regression (in development).
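 A sketch of what this could look like, modeled on the existing streaming k-means API (the 
 decay/half-life setters on the linear models are hypothetical; the other calls exist today):
 {code}
 import org.apache.spark.mllib.linalg.Vectors
 import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

 val model = new StreamingLinearRegressionWithSGD()
   .setInitialWeights(Vectors.zeros(3))
   .setStepSize(0.1)
   .setNumIterations(10)
 // Hypothetical additions mirroring StreamingKMeans:
 //   .setDecayFactor(0.9)          // exponential forgetting per batch
 //   .setHalfLife(10, "batches")   // or specified in points
 {code}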



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5520) Make FP-Growth implementation take generic item types

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5520:
-
Priority: Critical  (was: Major)

 Make FP-Growth implementation take generic item types
 -

 Key: SPARK-5520
 URL: https://issues.apache.org/jira/browse/SPARK-5520
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Critical

 There is no technical restriction on the item types in the FP-Growth 
 implementation. We used String in the first PR for simplicity. Maybe we could 
 make the type generic before 1.3 (and specialize it for Int/Long).
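 As a sketch of what a generic item type means at the API surface (the helper below is 
 illustrative only, not the actual FP-Growth implementation):
 {code}
 import scala.reflect.ClassTag
 import org.apache.spark.SparkContext._
 import org.apache.spark.rdd.RDD

 // Count frequent single items over transactions of any item type; with a
 // generic Item parameter the same code serves String, Int, or Long items.
 def freqItems[Item: ClassTag](data: RDD[Array[Item]], minCount: Long): RDD[(Item, Long)] =
   data.flatMap(_.toSeq.distinct)
     .map(item => (item, 1L))
     .reduceByKey(_ + _)
     .filter { case (_, count) => count >= minCount }
 {code}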



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4526) Gradient should be added batch computing interface

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4526:
-
Assignee: Guoqiang Li

 Gradient should be added batch computing interface
 --

 Key: SPARK-4526
 URL: https://issues.apache.org/jira/browse/SPARK-4526
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Guoqiang Li

 If Gradient supported batch computing, we could use efficient numerical 
 libraries (e.g., BLAS).
 In some cases, this can improve performance by more than ten times.
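 A sketch of what a batch interface could look like (shape and names are my assumption, shown 
 only to make the idea concrete):
 {code}
 import org.apache.spark.mllib.linalg.{DenseMatrix, DenseVector, Vector}

 trait BatchGradient {
   /**
    * Compute the cumulative gradient and loss over a block of examples, so the
    * implementation can call into BLAS (e.g. a single gemm) instead of looping
    * over rows one at a time.
    *
    * @param data    numExamples x numFeatures matrix, one example per row
    * @param labels  label for each row of `data`
    * @param weights current model weights
    * @return (cumulative gradient, cumulative loss) for the batch
    */
   def compute(data: DenseMatrix, labels: DenseVector, weights: Vector): (Vector, Double)
 }
 {code}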



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4526) Gradient should be added batch computing interface

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4526:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 Gradient should be added batch computing interface
 --

 Key: SPARK-4526
 URL: https://issues.apache.org/jira/browse/SPARK-4526
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Guoqiang Li

 If Gradient supported batch computing, we could use efficient numerical 
 libraries (e.g., BLAS).
 In some cases, this can improve performance by more than ten times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2309) Generalize the binary logistic regression into multinomial logistic regression

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2309:
-
Priority: Critical  (was: Major)

 Generalize the binary logistic regression into multinomial logistic regression
 --

 Key: SPARK-2309
 URL: https://issues.apache.org/jira/browse/SPARK-2309
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: DB Tsai
Assignee: DB Tsai
Priority: Critical

 Currently, there is no multi-class classifier in MLlib. Logistic regression 
 can be extended to a multinomial one straightforwardly. 
 The following formula will be implemented. 
 http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297/25



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5539) User guide for LDA

2015-02-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5539:


 Summary: User guide for LDA
 Key: SPARK-5539
 URL: https://issues.apache.org/jira/browse/SPARK-5539
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley


Add a section for LDA in the user guide. We probably don't need to explain the 
algorithm in detail but can point people to the Wikipedia page. The user guide 
should focus on public APIs and how to use LDA.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2005) Investigate linux container-based solution

2015-02-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302554#comment-14302554
 ] 

Nicholas Chammas commented on SPARK-2005:
-

[~mengxr] - Do you mind if I rename this issue to "Containerize execution of 
Spark tests"? I'm thinking it would be helpful to convert {{dev/run-tests}} to 
execute tests within a container so we can run more test configurations in 
parallel on the same server without worrying about things like port or file 
collisions.

Or were you thinking we should develop a way to deploy full Spark clusters 
within containers?

 Investigate linux container-based solution
 --

 Key: SPARK-2005
 URL: https://issues.apache.org/jira/browse/SPARK-2005
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Xiangrui Meng

 We can set up container-based cluster environment and automatically test 
 against a deployment matrix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5454) [SQL] Self join with ArrayType columns problems

2015-02-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-5454:

Target Version/s: 1.3.0

 [SQL] Self join with ArrayType columns problems
 ---

 Key: SPARK-5454
 URL: https://issues.apache.org/jira/browse/SPARK-5454
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Pierre Borckmans
Priority: Blocker

 Weird behaviour when performing a self join on a table with some ArrayType 
 field (potential bug?). 
 I have set up a minimal non-working example here: 
 https://gist.github.com/pierre-borckmans/4853cd6d0b2f2388bf4f
 In a nutshell, if the ArrayType column used for the pivot is created manually 
 in the StructType definition, everything works as expected. 
 However, if the ArrayType pivot column is obtained by a SQL query (be it by 
 using an array wrapper, or a collect_list operator, for instance), then the 
 results are completely off. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5534.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4317
[https://github.com/apache/spark/pull/4317]

 EdgeRDD, VertexRDD getStorageLevel return bad values
 

 Key: SPARK-5534
 URL: https://issues.apache.org/jira/browse/SPARK-5534
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
 Fix For: 1.3.0


 After caching a graph, its edge and vertex RDDs still return 
 StorageLevel.None.
 Reproduce error:
 {code}
 import org.apache.spark.graphx.{Edge, Graph}
 val edges = Seq(
   Edge[Double](0, 1, 0),
   Edge[Double](1, 2, 0),
   Edge[Double](2, 3, 0),
   Edge[Double](3, 4, 0))
 val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0)
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 g.cache()
 g.vertices.count()
 g.edges.count()
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1406) PMML model evaluation support via MLib

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1406:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 PMML model evaluation support via MLib
 --

 Key: SPARK-1406
 URL: https://issues.apache.org/jira/browse/SPARK-1406
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Thomas Darimont
Assignee: Vincenzo Selvaggio
 Attachments: MyJPMMLEval.java, SPARK-1406.pdf, SPARK-1406_v2.pdf, 
 kmeans.xml


 It would be useful if spark would provide support the evaluation of PMML 
 models (http://www.dmg.org/v4-2/GeneralStructure.html).
 This would allow to use analytical models that were created with a 
 statistical modeling tool like R, SAS, SPSS, etc. with Spark (MLib) which 
 would perform the actual model evaluation for a given input tuple. The PMML 
 model would then just contain the parameterization of an analytical model.
 Other projects like JPMML-Evaluator do a similar thing.
 https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5520) Make FP-Growth implementation take generic item types

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5520:
-
Assignee: Jacky Li

 Make FP-Growth implementation take generic item types
 -

 Key: SPARK-5520
 URL: https://issues.apache.org/jira/browse/SPARK-5520
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Jacky Li
Priority: Critical

 There is no technical restriction on the item types in the FP-Growth 
 implementation. We used String in the first PR for simplicity. Maybe we could 
 make the type generic before 1.3 (and specialize it for Int/Long).
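 For illustration, a minimal sketch (assumed names and signatures, not a 
 committed MLlib API) of what a generic item type could look like:
 {code}
 import scala.reflect.ClassTag
 import org.apache.spark.rdd.RDD

 // Hypothetical model class: frequent itemsets keep the caller's item type.
 class FPGrowthModel[Item](val freqItemsets: RDD[(Array[Item], Long)])

 // Hypothetical algorithm class: transactions are arrays of an arbitrary item type.
 class FPGrowth(minSupport: Double) {
   def run[Item: ClassTag](data: RDD[Array[Item]]): FPGrowthModel[Item] = ???
 }
 {code}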



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5519) Add user guide for FP-Growth

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5519:
-
Assignee: Jacky Li

 Add user guide for FP-Growth
 

 Key: SPARK-5519
 URL: https://issues.apache.org/jira/browse/SPARK-5519
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Reporter: Xiangrui Meng
Assignee: Jacky Li

 We need to add a section for FP-Growth to the user guide after the FP-Growth 
 PR is merged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4285) Transpose RDD[Vector] to column store for ML

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4285:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 Transpose RDD[Vector] to column store for ML
 

 Key: SPARK-4285
 URL: https://issues.apache.org/jira/browse/SPARK-4285
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor

 For certain ML algorithms, a column store is more efficient than a row store 
 (which is currently used everywhere).  E.g., deep decision trees can be 
 faster to train when partitioning by features.
 Proposal: Provide a method with the following API (probably in util/):
 ```
 def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
 ```
 The input Vectors will be data rows/instances, and the output Vectors will be 
 columns/features paired with column/feature indices.
 **Question**: Is it important to maintain matrix structure?  That is, should 
 output Vectors in the same partition be adjacent columns in the matrix?
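 For illustration only, a naive sketch of the proposed method, assuming dense 
 vectors; it shuffles one entry per matrix element and does not preserve row 
 order within a column, so it shows the shape of the API rather than an 
 efficient implementation:
 {code}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.mllib.linalg.{Vector, Vectors}

 def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)] = {
   // Emit (columnIndex, value) pairs, then regroup them by column.
   val entries = data.flatMap { row =>
     row.toArray.zipWithIndex.map { case (value, colIndex) => (colIndex, value) }
   }
   entries.groupByKey().mapValues(values => Vectors.dense(values.toArray))
 }
 {code}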



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-02-02 Thread Sandy Ryza (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandy Ryza updated SPARK-4550:
--
Attachment: SPARK-4550-design-v1.pdf

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
 Attachments: SPARK-4550-design-v1.pdf


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.
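 For reference, a minimal configuration sketch of the precondition described 
 above: with Kryo, reference tracking has to be disabled so that each record's 
 serialized bytes are self-contained and can be relocated by the sort.
 {code}
 import org.apache.spark.SparkConf

 val conf = new SparkConf()
   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
   // Default is true; false keeps serialized records free of back-references.
   .set("spark.kryo.referenceTracking", "false")
 {code}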



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3505) Augmenting SparkStreaming updateStateByKey API with timestamp

2015-02-02 Thread Xi Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xi Liu closed SPARK-3505.
-
Resolution: Won't Fix

Close this issue for now. Will re-open later when I find time to work on it.

 Augmenting SparkStreaming updateStateByKey API with timestamp
 -

 Key: SPARK-3505
 URL: https://issues.apache.org/jira/browse/SPARK-3505
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.0
Reporter: Xi Liu
Priority: Minor

 The current updateStateByKey API in Spark Streaming does not expose the batch 
 timestamp to the application. 
 In our use case, the application needs to know the batch timestamp to decide 
 whether to keep the state or not. We do not want to use the real system time 
 because we want to decouple the two (the same code base is used for both 
 streaming and offline processing).
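 For context, a minimal sketch of how the current API is used, with the kind of 
 timestamp-aware variant this issue asks for noted as a comment (hypothetical, 
 not an existing Spark method):
 {code}
 import org.apache.spark.streaming.StreamingContext._
 import org.apache.spark.streaming.dstream.DStream

 // Running word counts: the update function sees only the new values and the
 // previous state, never the batch time.
 def runningCounts(stream: DStream[(String, Int)]): DStream[(String, Int)] =
   stream.updateStateByKey[Int] { (newValues: Seq[Int], state: Option[Int]) =>
     Some(newValues.sum + state.getOrElse(0))
   }

 // Hypothetical variant requested here (illustrative only): the update function
 // would also receive the batch time, e.g.
 //   (batchTime: Time, newValues: Seq[V], state: Option[S]) => Option[S]
 {code}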



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5535) Add parameter for storage levels.

2015-02-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5535:


 Summary: Add parameter for storage levels.
 Key: SPARK-5535
 URL: https://issues.apache.org/jira/browse/SPARK-5535
 Project: Spark
  Issue Type: New Feature
  Components: ML
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Add a special parameter type for storage levels that accepts both StorageLevel 
values and their string representations.
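For illustration, a hypothetical helper (assumed name, not the actual spark.ml 
Param implementation) that normalizes either form to a StorageLevel:

{code}
import org.apache.spark.storage.StorageLevel

// Hypothetical: accept either a StorageLevel or one of its string names.
def asStorageLevel(value: Any): StorageLevel = value match {
  case level: StorageLevel   => level
  case "MEMORY_ONLY"         => StorageLevel.MEMORY_ONLY
  case "MEMORY_AND_DISK"     => StorageLevel.MEMORY_AND_DISK
  case "MEMORY_AND_DISK_SER" => StorageLevel.MEMORY_AND_DISK_SER
  case other =>
    throw new IllegalArgumentException(s"Unsupported storage level: $other")
}
{code}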



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5538) CachedTableSuite failure due to unpersisting RDDs in a non-blocking way

2015-02-02 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-5538:
-

 Summary: CachedTableSuite failure due to unpersisting RDDs in a 
non-blocking way
 Key: SPARK-5538
 URL: https://issues.apache.org/jira/browse/SPARK-5538
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Lian
Priority: Minor


[PR 
#4173|https://github.com/apache/spark/pull/4173/files#diff-726d84ece1e6f6197b98a5868c881ac7R164]
 made RDD unpersisting non-blocking, which introduced a race condition in 
{{CachedTableSuite}}.
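For illustration (a spark-shell session with an existing {{sc}} is assumed): a 
non-blocking unpersist returns before the cached blocks are removed, so a test 
that immediately inspects the cache can race, while a blocking unpersist waits.

{code}
val rdd = sc.parallelize(1 to 10).cache()
rdd.count()                      // materialize the cached blocks
rdd.unpersist(blocking = false)  // returns immediately; blocks may still be present
rdd.unpersist(blocking = true)   // waits until the blocks are actually dropped
{code}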



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5540) Hide ALS.solveLeastSquares.

2015-02-02 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-5540:


 Summary: Hide ALS.solveLeastSquares.
 Key: SPARK-5540
 URL: https://issues.apache.org/jira/browse/SPARK-5540
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


This method survived code review and has been public since v1.1.0. It exposes 
jblas types, so let's remove it from the public API. I expect that no one 
calls it directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5541) Allow running Maven or SBT in the Spark build

2015-02-02 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-5541:
--

 Summary: Allow running Maven or SBT in the Spark build
 Key: SPARK-5541
 URL: https://issues.apache.org/jira/browse/SPARK-5541
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Nicholas Chammas


It would be nice if we had a hook for the Spark test scripts to run with Maven 
in addition to running with SBT. Right now it is difficult for us to test pull 
requests with Maven, and we get master build breaks because of it. A simple 
first step is to modify run-tests to allow building with Maven. Then we can add 
a second PRB that invokes this Maven build. I would just add an env var called 
SPARK_BUILD_TOOL that can be set to sbt or mvn, and make sure the 
associated logic works in either case. If we don't want to have the fancy 
SQL-only stuff in Maven, that's fine too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5541) Allow running Maven or SBT in run-tests

2015-02-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-5541:
---
Summary: Allow running Maven or SBT in run-tests  (was: Allow running Maven 
or SBT in the Spark build)

 Allow running Maven or SBT in run-tests
 ---

 Key: SPARK-5541
 URL: https://issues.apache.org/jira/browse/SPARK-5541
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Nicholas Chammas

 It would be nice if we had a hook for the Spark test scripts to run with 
 Maven in addition to running with SBT. Right now it is difficult for us to 
 test pull requests with Maven, and we get master build breaks because of it. A 
 simple first step is to modify run-tests to allow building with Maven. Then 
 we can add a second PRB that invokes this Maven build. I would just add an 
 env var called SPARK_BUILD_TOOL that can be set to sbt or mvn, and make 
 sure the associated logic works in either case. If we don't want to have the 
 fancy SQL-only stuff in Maven, that's fine too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4508) Native Date type for SQL92 Date

2015-02-02 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell reopened SPARK-4508:


This has caused several date-related test failures in the master and pull 
request builds, so I'm reverting it:

https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26560/testReport/org.apache.spark.sql/ScalaReflectionRelationSuite/query_case_class_RDD/

 Native Date type for SQL92 Date
 ---

 Key: SPARK-4508
 URL: https://issues.apache.org/jira/browse/SPARK-4508
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Adrian Wang
Assignee: Adrian Wang
 Fix For: 1.3.0


 Store daysSinceEpoch as an Int (4 bytes) instead of a java.sql.Date (8 bytes 
 as a Long) in the Catalyst row.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5536) Wrap the old ALS to use the new ALS implementation.

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302653#comment-14302653
 ] 

Apache Spark commented on SPARK-5536:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/4321

 Wrap the old ALS to use the new ALS implementation.
 ---

 Key: SPARK-5536
 URL: https://issues.apache.org/jira/browse/SPARK-5536
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 The new implementation performs better. We should replace the old one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode

2015-02-02 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reopened SPARK-4986:
--

 Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
 --

 Key: SPARK-4986
 URL: https://issues.apache.org/jira/browse/SPARK-4986
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.2.0
Reporter: Jesper Lundgren
Priority: Blocker
 Fix For: 1.3.0


 When using the graceful stop API of Spark Streaming in a Spark Standalone 
 cluster, the stop signal never reaches the receivers. I have tested this with 
 Spark 1.2 and Kafka receivers. 
 ReceiverTracker will send a StopReceiver message to ReceiverSupervisorImpl.
 In local mode ReceiverSupervisorImpl receives this message, but in Standalone 
 cluster mode the message seems to be lost.
 (As a workaround, I have modified the code to send my own string message as a 
 stop signal from ReceiverTracker to ReceiverSupervisorImpl, and it works.)
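 For reference, the graceful stop call being exercised (driver code with an 
 existing StreamingContext {{ssc}} is assumed):
 {code}
 // Ask receivers to stop first and let already-received data finish processing.
 ssc.stop(stopSparkContext = true, stopGracefully = true)
 {code}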



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4588) Add API for feature attributes

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-4588:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 Add API for feature attributes
 --

 Key: SPARK-4588
 URL: https://issues.apache.org/jira/browse/SPARK-4588
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng

 Feature attributes (e.g., continuous/categorical, feature names, feature 
 dimension, number of categories, number of nonzeros/support) could be useful 
 for ML algorithms.
 In SPARK-3569, we added metadata to schema, which can be used to store 
 feature attributes along with the dataset. We need to provide a wrapper over 
 the Metadata class for ML usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5534) EdgeRDD, VertexRDD getStorageLevel return bad values

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302285#comment-14302285
 ] 

Apache Spark commented on SPARK-5534:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/4317

 EdgeRDD, VertexRDD getStorageLevel return bad values
 

 Key: SPARK-5534
 URL: https://issues.apache.org/jira/browse/SPARK-5534
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley

 After caching a graph, its edge and vertex RDDs still return 
 StorageLevel.None.
 Reproduce error:
 {code}
 import org.apache.spark.graphx.{Edge, Graph}
 val edges = Seq(
   Edge[Double](0, 1, 0),
   Edge[Double](1, 2, 0),
   Edge[Double](2, 3, 0),
   Edge[Double](3, 4, 0))
 val g = Graph.fromEdges[Double,Double](sc.parallelize(edges), 0)
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 g.cache()
 g.vertices.count()
 g.edges.count()
 g.vertices.getStorageLevel  // returns value for StorageLevel.None
 g.edges.getStorageLevel  // returns value for StorageLevel.None
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-02-02 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302326#comment-14302326
 ] 

Patrick Wendell commented on SPARK-4550:


Yeah, this is a good idea. I don't see why we don't serialize these immediately.

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5131) A typo in configuration doc

2015-02-02 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302342#comment-14302342
 ] 

Sean Owen commented on SPARK-5131:
--

Sometimes site changes reflect changes that are not in the current stable 
release, so in general the site is updated along with a release. Typos could be 
fixed directly in the interim. In this case the site will be updated very 
shortly for 1.2.1 anyway. 

 A typo in configuration doc
 ---

 Key: SPARK-5131
 URL: https://issues.apache.org/jira/browse/SPARK-5131
 Project: Spark
  Issue Type: Bug
Reporter: uncleGen
Assignee: uncleGen
Priority: Minor
 Fix For: 1.3.0, 1.2.1






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5543) Remove unused import JsonUtil from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core

2015-02-02 Thread Nathan M (JIRA)
Nathan M created SPARK-5543:
---

 Summary: Remove unused import JsonUtil from 
org.apache.spark.util.JsonProtocol.scala which fails builds with older versions 
of hadoop-core
 Key: SPARK-5543
 URL: https://issues.apache.org/jira/browse/SPARK-5543
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Nathan M
Priority: Minor


There is an unused import in org.apache.spark.util.JsonProtocol.scala 

`import org.apache.hadoop.hdfs.web.JsonUtil`

This fails builds with older versions of hadoop-core. In particular, 
building against mapr3 causes a compile error:

[ERROR] 
/var/lib/jenkins/workspace/cse-Apache-Spark/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala:35:
 object web is not a member of package org.apache.hadoop.hdfs
[ERROR] import org.apache.hadoop.hdfs.web.JsonUtil

This import is unused. It was introduced in PR #4029 
https://github.com/apache/spark/pull/4029 as a part of JIRA SPARK-5231 




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3883) Provide SSL support for Akka and HttpServer based connections

2015-02-02 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3883:
--
Assignee: Jacek Lewandowski

 Provide SSL support for Akka and HttpServer based connections
 -

 Key: SPARK-3883
 URL: https://issues.apache.org/jira/browse/SPARK-3883
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jacek Lewandowski
Assignee: Jacek Lewandowski
 Fix For: 1.3.0


 Spark uses at least 4 logical communication channels:
 1. Control messages - Akka based
 2. JARs and other files - Jetty based (HttpServer)
 3. Computation results - Java NIO based
 4. Web UI - Jetty based
 The aim of this feature is to enable SSL for (1) and (2).
 Why:
 Spark configuration is sent through (1). Spark configuration may contain 
 sensitive information like credentials for accessing external data sources or 
 streams. Application JAR files (2) may include the application logic and 
 therefore they may include information about the structure of the external 
 data sources, and credentials as well. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5543) Remove unused import JsonUtil from org.apache.spark.util.JsonProtocol.scala which fails builds with older versions of hadoop-core

2015-02-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302641#comment-14302641
 ] 

Apache Spark commented on SPARK-5543:
-

User 'nemccarthy' has created a pull request for this issue:
https://github.com/apache/spark/pull/4320

 Remove unused import JsonUtil from 
 org.apache.spark.util.JsonProtocol.scala which fails builds with older 
 versions of hadoop-core
 --

 Key: SPARK-5543
 URL: https://issues.apache.org/jira/browse/SPARK-5543
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Nathan M
Priority: Minor
  Labels: easyfix

 There is an unused import in org.apache.spark.util.JsonProtocol.scala 
 `import org.apache.hadoop.hdfs.web.JsonUtil`
 This fails builds with older versions of hadoop-core. In particular, 
 building against mapr3 causes a compile error:
 [ERROR] 
 /var/lib/jenkins/workspace/cse-Apache-Spark/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala:35:
  object web is not a member of package org.apache.hadoop.hdfs
 [ERROR] import org.apache.hadoop.hdfs.web.JsonUtil
 This import is unused. It was introduced in PR #4029 
 https://github.com/apache/spark/pull/4029 as a part of JIRA SPARK-5231 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5461) Graph should have isCheckpointed, getCheckpointFiles methods

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5461.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

Issue resolved by pull request 4253
[https://github.com/apache/spark/pull/4253]

 Graph should have isCheckpointed, getCheckpointFiles methods
 

 Key: SPARK-5461
 URL: https://issues.apache.org/jira/browse/SPARK-5461
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Minor
 Fix For: 1.3.0


 Graph has a checkpoint method but does not have other helper functionality 
 which RDD has.  Proposal:
 {code}
   /**
    * Return whether this Graph has been checkpointed or not
    */
   def isCheckpointed: Boolean

   /**
    * Gets the name of the files to which this Graph was checkpointed
    */
   def getCheckpointFiles: Seq[String]
 {code}
 I need this for [SPARK-1405].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5532) Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT

2015-02-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-5532:
-
Priority: Critical  (was: Blocker)

 Repartitioning DataFrame causes saveAsParquetFile to fail with VectorUDT
 

 Key: SPARK-5532
 URL: https://issues.apache.org/jira/browse/SPARK-5532
 Project: Spark
  Issue Type: Bug
  Components: MLlib, SQL
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Cheng Lian
Priority: Critical

 Deterministic failure:
 {code}
 import org.apache.spark.mllib.linalg._
 import org.apache.spark.sql.SQLContext
 val sqlContext = new SQLContext(sc)
 import sqlContext._
 val data = sc.parallelize(Seq((1.0, 
 Vectors.dense(1,2,3)))).toDataFrame("label", "features")
 data.repartition(1).saveAsParquetFile("blah")
 {code}
 If you remove the repartition, then this succeeds.
 Here's the stack trace:
 {code}
 15/02/02 12:10:53 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 4, 
 192.168.1.230): java.lang.ClassCastException: 
 org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
 org.apache.spark.sql.Row
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
   at 
 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 15/02/02 12:10:54 ERROR TaskSetManager: Task 0 in stage 2.0 failed 4 times; 
 aborting job
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
 stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 
 (TID 7, 192.168.1.230): java.lang.ClassCastException: 
 org.apache.spark.mllib.linalg.DenseVector cannot be cast to 
 org.apache.spark.sql.Row
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:177)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:166)
   at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:129)
   at 
 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
   at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable.org$apache$spark$sql$parquet$InsertIntoParquetTable$$writeShard$1(ParquetTableOperations.scala:315)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at 
 org.apache.spark.sql.parquet.InsertIntoParquetTable$$anonfun$saveAsHadoopFile$1.apply(ParquetTableOperations.scala:332)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:194)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
   at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
   at 
 

[jira] [Commented] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form

2015-02-02 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302303#comment-14302303
 ] 

Sandy Ryza commented on SPARK-4550:
---

Just posted a design doc. Would love to get feedback [~ilikerps] [~matei] 
[~jerryshao].

 In sort-based shuffle, store map outputs in serialized form
 ---

 Key: SPARK-4550
 URL: https://issues.apache.org/jira/browse/SPARK-4550
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0
Reporter: Sandy Ryza
Priority: Critical
 Attachments: SPARK-4550-design-v1.pdf


 One drawback with sort-based shuffle compared to hash-based shuffle is that 
 it ends up storing many more java objects in memory.  If Spark could store 
 map outputs in serialized form, it could
 * spill less often because the serialized form is more compact
 * reduce GC pressure
 This will only work when the serialized representations of objects are 
 independent from each other and occupy contiguous segments of memory.  E.g. 
 when Kryo reference tracking is left on, objects may contain pointers to 
 objects farther back in the stream, which means that the sort can't relocate 
 objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5541) Allow running Maven or SBT in run-tests

2015-02-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302525#comment-14302525
 ] 

Nicholas Chammas commented on SPARK-5541:
-

Dup of SPARK-3355?

 Allow running Maven or SBT in run-tests
 ---

 Key: SPARK-5541
 URL: https://issues.apache.org/jira/browse/SPARK-5541
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Patrick Wendell
Assignee: Nicholas Chammas

 It would be nice if we had a hook for the Spark test scripts to run with 
 Maven in addition to running with SBT. Right now it is difficult for us to 
 test pull requests with Maven, and we get master build breaks because of it. A 
 simple first step is to modify run-tests to allow building with Maven. Then 
 we can add a second PRB that invokes this Maven build. I would just add an 
 env var called SPARK_BUILD_TOOL that can be set to sbt or mvn, and make 
 sure the associated logic works in either case. If we don't want to have the 
 fancy SQL-only stuff in Maven, that's fine too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


