[jira] [Commented] (SPARK-5207) StandardScalerModel mean and variance re-use
[ https://issues.apache.org/jira/browse/SPARK-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285960#comment-14285960 ] Apache Spark commented on SPARK-5207: - User 'ogeagla' has created a pull request for this issue: https://github.com/apache/spark/pull/4140 StandardScalerModel mean and variance re-use Key: SPARK-5207 URL: https://issues.apache.org/jira/browse/SPARK-5207 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Octavian Geagla Assignee: Octavian Geagla From this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-use-scaling-means-and-variances-from-StandardScalerModel-td10073.html Changing the constructor to public would be a simple change, but a discussion is needed to determine what args are necessary for this change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
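The issue asks for a way to re-use a fitted scaler's statistics without re-fitting. A minimal plain-Scala sketch of what that enables; the names `standardize`, `savedMean`, and `savedStd` are illustrative only, not the StandardScalerModel API under discussion:

```scala
// Apply mean/std saved from a previously fitted model to new data.
// (Hypothetical helper, not Spark API.)
def standardize(x: Array[Double], mean: Array[Double], std: Array[Double]): Array[Double] =
  x.indices.map(i => if (std(i) == 0.0) 0.0 else (x(i) - mean(i)) / std(i)).toArray

// Statistics "saved" from an earlier fit (made-up values).
val savedMean = Array(2.0, 10.0)
val savedStd  = Array(1.0, 5.0)
val scaled = standardize(Array(3.0, 20.0), savedMean, savedStd)
```

A public constructor taking mean and variance vectors would let users build exactly this kind of serving-time scaler from persisted statistics.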
[jira] [Resolved] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
[ https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5301. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4089 [https://github.com/apache/spark/pull/4089] Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix - Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Fix For: 1.3.0 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
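Both utilities are simple at the entry level. A plain-Scala sketch, assuming a CoordinateMatrix is essentially a collection of (row, col, value) entries; `Entry`, `transpose`, and `rowToEntries` are illustrative names, not the MLlib API:

```scala
case class Entry(i: Long, j: Long, value: Double)

// 1) Transpose: swap row and column indices, one pass over the entries
//    (which is why it is cheap to compute).
def transpose(entries: Seq[Entry]): Seq[Entry] =
  entries.map(e => Entry(e.j, e.i, e.value))

// 2) Indexed row -> coordinate entries: explode a (rowIndex, vector)
//    pair into one entry per non-zero element.
def rowToEntries(rowIndex: Long, vector: Array[Double]): Seq[Entry] =
  vector.zipWithIndex.collect {
    case (v, j) if v != 0.0 => Entry(rowIndex, j.toLong, v)
  }.toSeq
```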
[jira] [Created] (SPARK-5352) Add getPartitionStrategy in Graph
Takeshi Yamamuro created SPARK-5352: --- Summary: Add getPartitionStrategy in Graph Key: SPARK-5352 URL: https://issues.apache.org/jira/browse/SPARK-5352 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Graph remembers an applied partition strategy in paritionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modifiy (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getParitionStrategy, 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286069#comment-14286069 ] Tobias Schlatter commented on SPARK-2620: - It is a hack in an attempt to have anonymous functions (and other classes) close over REPL state and send them over the wire (rather than re-running the static initialization code on the executor). It's up to you if you want to take the [red pill|https://github.com/apache/spark/blob/b6cf1348170951396a6a5d8a65fb670382304f5b/repl/src/main/scala/org/apache/spark/repl/ExecutorClassLoader.scala#L100]. case class cannot be used as key for reduce --- Key: SPARK-2620 URL: https://issues.apache.org/jira/browse/SPARK-2620 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 1.0.0, 1.1.0 Environment: reproduced on spark-shell local[4] Reporter: Gerard Maas Assignee: Tobias Schlatter Priority: Critical Labels: case-class, core Using a case class as a key doesn't seem to work properly on Spark 1.0.0. A minimal example: case class P(name: String) val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) sc.parallelize(ps).map(x => (x, 1)).reduceByKey((x, y) => x + y).collect [Spark shell local mode] res: Array[(P, Int)] = Array((P(bob),1), (P(bob),1), (P(abe),1), (P(charly),1)) In contrast, the expected behavior should be equivalent to: sc.parallelize(ps).map(x => (x.name, 1)).reduceByKey((x, y) => x + y).collect Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
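Outside the REPL, case classes get structural equals/hashCode, so keying on them behaves as the reporter expected; the bug is specific to how the shell wraps and loads REPL-defined classes. A local (non-Spark) sketch of the expected semantics:

```scala
// A case class gets equals/hashCode derived from its fields,
// so two P("bob") instances compare equal and hash identically.
case class P(name: String)

val ps = Seq(P("alice"), P("bob"), P("charly"), P("bob"))
// Grouping on the case class itself merges the duplicate keys,
// which is what reduceByKey should also do.
val counts = ps.groupBy(identity).map { case (k, v) => (k, v.size) }
```

In the REPL bug, the two P("bob") keys end up in distinct partitions with class instances loaded by different class loaders, so their hash codes disagree and they are never merged.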
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4793: - Target Version/s: 1.3.0 (was: 1.3.0, 1.1.2, 1.2.1) way to find assembly jar is too strict -- Key: SPARK-4793 URL: https://issues.apache.org/jira/browse/SPARK-4793 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0 Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4759: - Fix Version/s: (was: 1.2.0) 1.2.1 Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Labels: backport-needed Fix For: 1.3.0, 1.1.2, 1.2.1 Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data and merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5355) SparkConf is not thread-safe
Davies Liu created SPARK-5355: - Summary: SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker SparkConf is not thread-safe, but it is accessed by many threads. getAll() could return only part of the configs if another thread is modifying it at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
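One possible fix direction is to back the settings with a concurrent map, so concurrent set() calls are safe and getAll() iterates over a consistent view instead of failing midway. A minimal sketch (illustrative only, not the actual SparkConf patch; `SafeConf` is a hypothetical name):

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical thread-safe config: ConcurrentHashMap gives atomic
// put/get and a weakly consistent iterator for snapshots.
class SafeConf {
  private val settings = new ConcurrentHashMap[String, String]()

  def set(key: String, value: String): this.type = {
    settings.put(key, value); this
  }

  def get(key: String): Option[String] = Option(settings.get(key))

  // Snapshot all entries without Java<->Scala converter dependencies.
  def getAll: Array[(String, String)] = {
    val buf = scala.collection.mutable.ArrayBuffer.empty[(String, String)]
    val it = settings.entrySet.iterator
    while (it.hasNext) {
      val e = it.next()
      buf += ((e.getKey, e.getValue))
    }
    buf.toArray
  }
}
```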
[jira] [Commented] (SPARK-5352) Add getPartitionStrategy in Graph
[ https://issues.apache.org/jira/browse/SPARK-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285898#comment-14285898 ] Apache Spark commented on SPARK-5352: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/4138 Add getPartitionStrategy in Graph - Key: SPARK-5352 URL: https://issues.apache.org/jira/browse/SPARK-5352 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Graph remembers an applied partition strategy in partitionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modify (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getPartitionStrategy, 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5346: -- Description: When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be used to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics from the web UI. When filter pushdown is enabled, the second query reads less data. Notice that {{parquet.task.side.metadata}} must be set in the _Hadoop_ configuration (either via {{core-site.xml}} or {{SparkConf.hadoopConfiguration.set()}}); setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work.
was: When computing Parquet splits, reading Parquet metadata from executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To workaround this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from driver side eats lots of memory. The following Spark shell snippet can be useful to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics from web UI. When filter pushdown is enabled, the second query reads fewer data. Notice that {{parquet.task.side.metadata}} must be set in _Hadoop_ configuration files (e.g. core-site.xml), setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work.
Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value) - Key: SPARK-5346 URL: https://issues.apache.org/jira/browse/SPARK-5346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Cheng Lian Priority: Critical When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be used to reproduce this issue: {code} import org.apache.spark.sql.SQLContext val sqlContext = new SQLContext(sc) import sqlContext._ case class KeyValue(key: Int, value: String) sc. parallelize(1 to 1024). flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))). saveAsParquetFile("large.parquet") parquetFile("large.parquet").registerTempTable("large") sql("SET spark.sql.parquet.filterPushdown=true")
[jira] [Closed] (SPARK-4215) Allow requesting executors only on Yarn (for now)
[ https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4215. Resolution: Fixed Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) Allow requesting executors only on Yarn (for now) - Key: SPARK-4215 URL: https://issues.apache.org/jira/browse/SPARK-4215 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.3.0 Currently if the user attempts to call `sc.requestExecutors` or enables dynamic allocation on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
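The proposed behavior is to fail loudly instead of silently when executor requests are made on a non-YARN master. A minimal sketch of such a guard; the method name and message are illustrative, not the actual Spark code:

```scala
// Hypothetical guard: reject dynamic executor requests unless the
// master is a YARN deployment (e.g. "yarn-client", "yarn-cluster").
def checkExecutorRequestSupport(master: String): Unit = {
  if (!master.startsWith("yarn")) {
    throw new UnsupportedOperationException(
      s"Requesting executors is currently only supported on YARN (master was '$master')")
  }
}
```

An alternative, also floated in the issue, is to merely log a warning; throwing makes the limitation impossible to miss.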
[jira] [Closed] (SPARK-4759) Deadlock in complex spark job in local mode
[ https://issues.apache.org/jira/browse/SPARK-4759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4759. Resolution: Fixed Fix Version/s: 1.2.0 Deadlock in complex spark job in local mode --- Key: SPARK-4759 URL: https://issues.apache.org/jira/browse/SPARK-4759 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.1, 1.2.0, 1.3.0 Environment: Java version 1.7.0_51 Java(TM) SE Runtime Environment (build 1.7.0_51-b13) Java HotSpot(TM) 64-Bit Server VM (build 24.51-b03, mixed mode) Mac OSX 10.10.1 Using local spark context Reporter: Davis Shepherd Assignee: Andrew Or Priority: Critical Labels: backport-needed Fix For: 1.3.0, 1.1.2, 1.2.0 Attachments: SparkBugReplicator.scala The attached test class runs two identical jobs that perform some iterative computation on an RDD[(Int, Int)]. This computation involves # taking new data and merging it with the previous result # caching and checkpointing the new result # rinse and repeat The first time the job is run, it runs successfully, and the spark context is shut down. The second time the job is run with a new spark context in the same process, the job hangs indefinitely, only having scheduled a subset of the necessary tasks for the final stage. I've been able to produce a test case that reproduces the issue, and I've added some comments where some knockout experimentation has left some breadcrumbs as to where the issue might be. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285980#comment-14285980 ] RJ Nowling commented on SPARK-4894: --- [~mengxr] Since [~lmcguire] has submitted the patch, can we assign the JIRA to her so she gets credit for it? Thanks! Add Bernoulli-variant of Naive Bayes Key: SPARK-4894 URL: https://issues.apache.org/jira/browse/SPARK-4894 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: RJ Nowling Assignee: RJ Nowling MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4569) Rename externalSorting in Aggregator
[ https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4569. Resolution: Fixed Fix Version/s: 1.2.1 1.1.2 Assignee: Ilya Ganelin Rename externalSorting in Aggregator -- Key: SPARK-4569 URL: https://issues.apache.org/jira/browse/SPARK-4569 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Ilya Ganelin Priority: Trivial Labels: backport-needed Fix For: 1.3.0, 1.1.2, 1.2.1 While technically all spilling in Spark does result in sorting, calling this variable externalSorting makes it seem like ExternalSorter will be used, when in fact it just means whether spilling is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286153#comment-14286153 ] Apache Spark commented on SPARK-2669: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/4142 Hadoop configuration is not localised when submitting job in yarn-cluster mode -- Key: SPARK-2669 URL: https://issues.apache.org/jira/browse/SPARK-2669 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Maxim Ivanov I'd like to propose a fix for a problem where the Hadoop configuration is not localized when a job is submitted in yarn-cluster mode. Here is a description from github pull request https://github.com/apache/spark/pull/1574 This patch fixes a problem where, when the Spark driver runs in a container managed by the YARN ResourceManager, it inherits its configuration from the NodeManager process, which can be different from the Hadoop configuration present on the client (submitting machine). The problem is most visible when the fs.defaultFS property differs between these two. Hadoop MR solves it by serializing the client's Hadoop configuration into job.xml in the application staging directory and then making the Application Master use it. That guarantees that, regardless of the execution nodes' configurations, all application containers use the same config, identical to the one on the client side. This patch uses a similar approach. YARN ClientBase serializes the configuration and adds it to ClientDistributedCacheManager under the job.xml link name. ClientDistributedCacheManager then utilizes the Hadoop localizer to deliver it to whatever container is started by this application, including the one running the Spark driver.
YARN ClientBase also adds a SPARK_LOCAL_HADOOPCONF env variable to the AM container request, which is then used by SparkHadoopUtil.newConfiguration to trigger the new behavior, in which the machine-wide Hadoop configuration is merged with the application-specific job.xml (exactly as it is done in Hadoop MR). SparkContext then follows the same approach, adding the SPARK_LOCAL_HADOOPCONF env to all spawned containers to make them use the client-side Hadoop configuration. Also, all the references to new Configuration() which might be executed on the YARN cluster side are changed to use SparkHadoopUtil.get.conf Please note that it fixes only core Spark, the part which I am comfortable to test and verify the result. I didn't descend into the streaming/shark directories, so things might need to be changed there too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4337) Add ability to cancel pending requests to YARN
[ https://issues.apache.org/jira/browse/SPARK-4337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286142#comment-14286142 ] Apache Spark commented on SPARK-4337: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4141 Add ability to cancel pending requests to YARN -- Key: SPARK-4337 URL: https://issues.apache.org/jira/browse/SPARK-4337 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza This will be useful for things like SPARK-4136 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4793: - Labels: (was: backport-needed) way to find assembly jar is too strict -- Key: SPARK-4793 URL: https://issues.apache.org/jira/browse/SPARK-4793 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0 Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5301) Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix
[ https://issues.apache.org/jira/browse/SPARK-5301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5301: - Assignee: Reza Zadeh Add missing linear algebra utilities to IndexedRowMatrix and CoordinateMatrix - Key: SPARK-5301 URL: https://issues.apache.org/jira/browse/SPARK-5301 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Reza Zadeh Assignee: Reza Zadeh Fix For: 1.3.0 1) Transpose is missing from CoordinateMatrix (this is cheap to compute, so it should be there) 2) IndexedRowMatrix should be convertible to CoordinateMatrix (conversion method to be added) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4215) Allow requesting executors only on Yarn (for now)
[ https://issues.apache.org/jira/browse/SPARK-4215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-4215: - Labels: (was: backport-needed) Allow requesting executors only on Yarn (for now) - Key: SPARK-4215 URL: https://issues.apache.org/jira/browse/SPARK-4215 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical Fix For: 1.3.0 Currently if the user attempts to call `sc.requestExecutors` or enables dynamic allocation on, say, standalone mode, it just fails silently. We must at the very least log a warning to say it's only available for Yarn currently, or maybe even throw an exception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5353) Log failures in ExecutorClassLoader
[ https://issues.apache.org/jira/browse/SPARK-5353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285901#comment-14285901 ] Apache Spark commented on SPARK-5353: - User 'gzm0' has created a pull request for this issue: https://github.com/apache/spark/pull/4130 Log failures in ExecutorClassLoader --- Key: SPARK-5353 URL: https://issues.apache.org/jira/browse/SPARK-5353 Project: Spark Issue Type: Improvement Components: Spark Shell Reporter: Tobias Schlatter Priority: Minor When the ExecutorClassLoader tries to load classes compiled in the Spark Shell and fails, it silently passes loading to the parent ClassLoader. It should log these failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-5346: -- Priority: Blocker (was: Critical) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value) - Key: SPARK-5346 URL: https://issues.apache.org/jira/browse/SPARK-5346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Cheng Lian Priority: Blocker When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be used to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics from the web UI. When filter pushdown is enabled, the second query reads less data.
Notice that {{parquet.task.side.metadata}} must be set in _Hadoop_ configuration (either via {{core-site.xml}} or {{SparkConf.hadoopConfiguration.set()}}), setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
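Concretely, the workaround described above can be expressed as a Hadoop configuration fragment. A sketch of the setting (placement of the file depends on your Hadoop setup); this disables the memory-efficient task-side metadata path, so treat it as a workaround rather than a recommended default:

```
<!-- core-site.xml (Hadoop configuration, NOT spark-defaults.conf):
     read Parquet metadata on the driver side so that Parquet filter
     pushdown takes effect, per the workaround in this issue. -->
<property>
  <name>parquet.task.side.metadata</name>
  <value>false</value>
</property>
```

The same effect can be achieved programmatically by setting the key on the SparkContext's Hadoop configuration before the query runs, as noted in the description.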
[jira] [Closed] (SPARK-4793) way to find assembly jar is too strict
[ https://issues.apache.org/jira/browse/SPARK-4793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4793. Resolution: Fixed way to find assembly jar is too strict -- Key: SPARK-4793 URL: https://issues.apache.org/jira/browse/SPARK-4793 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.1.0 Reporter: Adrian Wang Assignee: Adrian Wang Priority: Minor Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4749) Allow initializing KMeans clusters using a seed
[ https://issues.apache.org/jira/browse/SPARK-4749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-4749. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3610 [https://github.com/apache/spark/pull/3610] Allow initializing KMeans clusters using a seed --- Key: SPARK-4749 URL: https://issues.apache.org/jira/browse/SPARK-4749 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.1.0 Reporter: Nate Crosswhite Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h Add an optional seed to MLlib KMeans clustering to allow initial cluster choices to be deterministic. Update the pyspark mllib interface to also allow an optional seed parameter to be supplied. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
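What a seed buys you, in plain Scala: the same seed yields the same "random" choice of initial centers, so runs are reproducible. This is an illustration of the idea only, not the MLlib k-means code; `pickInitialCenters` is a hypothetical helper:

```scala
// Seeded selection of k initial centers from a point set.
// Identical seeds produce identical selections.
def pickInitialCenters(points: Seq[(Double, Double)], k: Int, seed: Long): Seq[(Double, Double)] = {
  val rng = new scala.util.Random(seed)
  rng.shuffle(points).take(k)
}
```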
[jira] [Created] (SPARK-5354) When possible, correctly set outputPartitioning for leaf SparkPlans
Yin Huai created SPARK-5354: --- Summary: When possible, correctly set outputPartitioning for leaf SparkPlans Key: SPARK-5354 URL: https://issues.apache.org/jira/browse/SPARK-5354 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Right now, Spark SQL is not aware of the partitioning scheme of a leaf SparkPlan (e.g. an input table). So, even if users re-partition the data in advance, Exchange operators will still be used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5352) Add getPartitionStrategy in Graph
[ https://issues.apache.org/jira/browse/SPARK-5352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-5352: Description: Graph remembers an applied partition strategy in partitionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modify (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getPartitionStrategy, 2) was: Graph remembers an applied partition strategy in paritionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation; val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modifiy (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getParitionStrategy, 2) Add getPartitionStrategy in Graph - Key: SPARK-5352 URL: https://issues.apache.org/jira/browse/SPARK-5352 Project: Spark Issue Type: Improvement Components: GraphX Reporter: Takeshi Yamamuro Priority: Minor Graph remembers an applied partition strategy in partitionBy() and returns it via getPartitionStrategy(). This is useful in case of the following situation: val g1 = GraphLoader.edgeListFile(sc, "graph.txt") val g2 = g1.partitionBy(EdgePartition2D, 2) // Modify (e.g., add, contract, ...) edges in g2 val newEdges = ... // Re-build a new graph based on g2 val g3 = Graph(g1.vertices, newEdges) // Partition edges in a similar way of g2 val g4 = g3.partitionBy(g2.getPartitionStrategy, 2) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5309) Reduce Binary/String conversion overhead when reading/writing Parquet files
[ https://issues.apache.org/jira/browse/SPARK-5309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285899#comment-14285899 ] Apache Spark commented on SPARK-5309: - User 'MickDavies' has created a pull request for this issue: https://github.com/apache/spark/pull/4139 Reduce Binary/String conversion overhead when reading/writing Parquet files --- Key: SPARK-5309 URL: https://issues.apache.org/jira/browse/SPARK-5309 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.0 Reporter: Michael Davies Priority: Minor Converting between Parquet Binary and Java Strings can form a significant proportion of query times. For columns which have repeated String values (which is common), the same Binary is repeatedly converted. A simple change to cache the last converted String per column was shown to reduce query times by 25% when grouping on a data set of 66M rows on a column with many repeated Strings. A possible optimisation would be to hand responsibility for Binary encoding/decoding over to Parquet so that it could ensure that this was done only once per Binary value. The next step is to look at the Parquet code and discuss this with that project, which I will do. More details are available on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/Optimize-encoding-decoding-strings-when-using-Parquet-td10141.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
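The caching idea described above is easy to sketch: keep the last raw value and its decoded String per column, and only decode on a miss. A toy model with illustrative names (the real change lives in Spark's Parquet converters):

```python
class CachedStringConverter:
    """Per-column cache of the most recently decoded value. Hits occur on
    consecutive repeats, which is the common case for sorted or clustered
    string columns."""
    def __init__(self):
        self.last_raw = None
        self.last_str = None
        self.decodes = 0  # count of actual UTF-8 decodes, for illustration

    def convert(self, raw_bytes):
        if raw_bytes != self.last_raw:
            self.last_raw = raw_bytes
            self.last_str = raw_bytes.decode("utf-8")
            self.decodes += 1
        return self.last_str

conv = CachedStringConverter()
values = [b"us", b"us", b"us", b"uk", b"uk", b"us"]
decoded = [conv.convert(v) for v in values]
# Six values pass through, but only three actual decodes are performed.
```

Note the cache only helps for consecutive repeats; the fuller optimisation mentioned in the issue (letting Parquet decode each distinct Binary once) would also cover interleaved values.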
[jira] [Updated] (SPARK-4569) Rename externalSorting in Aggregator
[ https://issues.apache.org/jira/browse/SPARK-4569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4569: -- Labels: (was: backport-needed) It looks like all backports have been completed, so I'm removing the {{backport-needed}} label. Rename externalSorting in Aggregator -- Key: SPARK-4569 URL: https://issues.apache.org/jira/browse/SPARK-4569 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Ilya Ganelin Priority: Trivial Fix For: 1.3.0, 1.1.2, 1.2.1 While technically all spilling in Spark does result in sorting, calling this variable externalSorting makes it seem like ExternalSorter will be used, when in fact it just means whether spilling is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-1714. -- Resolution: Fixed Fix Version/s: 1.3.0 Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5349) Multiple spark shells should be able to share resources
[ https://issues.apache.org/jira/browse/SPARK-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tobias Bertelsen updated SPARK-5349: Description: The resource requirements of an interactive shell vary heavily. Sometimes heavy commands are executed, and sometimes the user is thinking, getting coffee, interrupted, etc. A Spark shell allocates a fixed number of worker cores (at least in standalone mode). A user thus has the choice to either block other users from the cluster by allocating all cores (the default behavior), or restrict him/herself to only a few cores using the option {{--total-executor-cores}}. Either way, the cores allocated to the shell have low utilization, since they will be waiting for the user a lot. Instead, the Spark shell should allocate only the resources required to run the driver, and request worker cores only when computation is performed on RDDs. This should allow multiple users to use interactive shells concurrently while still utilizing the entire cluster when performing heavy operations. was: The documentation states Multiple spark shells should be able to share resources --- Key: SPARK-5349 URL: https://issues.apache.org/jira/browse/SPARK-5349 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Tobias Bertelsen The resource requirements of an interactive shell vary heavily. Sometimes heavy commands are executed, and sometimes the user is thinking, getting coffee, interrupted, etc. A Spark shell allocates a fixed number of worker cores (at least in standalone mode). A user thus has the choice to either block other users from the cluster by allocating all cores (the default behavior), or restrict him/herself to only a few cores using the option {{--total-executor-cores}}. Either way, the cores allocated to the shell have low utilization, since they will be waiting for the user a lot.
Instead, the Spark shell should allocate only the resources required to run the driver, and request worker cores only when computation is performed on RDDs. This should allow multiple users to use interactive shells concurrently while still utilizing the entire cluster when performing heavy operations. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
[ https://issues.apache.org/jira/browse/SPARK-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-5351: Description: If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
  at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
  ...
was: If the value of 'spark.default.parallelism' do not match the number of partitoins in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx = { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, graph.txt) val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.ShuffleDependency.init(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) ... 
Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() --- Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
[ https://issues.apache.org/jira/browse/SPARK-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285820#comment-14285820 ] Apache Spark commented on SPARK-5351: - User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/4136 Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() --- Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
  at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
  ...
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5176) Thrift server fails with confusing error message when deploy-mode is cluster
[ https://issues.apache.org/jira/browse/SPARK-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14285843#comment-14285843 ] Apache Spark commented on SPARK-5176: - User 'tpanningnextcen' has created a pull request for this issue: https://github.com/apache/spark/pull/4137 Thrift server fails with confusing error message when deploy-mode is cluster Key: SPARK-5176 URL: https://issues.apache.org/jira/browse/SPARK-5176 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0, 1.2.0 Reporter: Tom Panning Labels: starter With Spark 1.2.0, when I try to run {noformat} $SPARK_HOME/sbin/start-thriftserver.sh --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 {noformat} The log output is {noformat} Spark assembly has been built with Hive, including Datanucleus jars on classpath Spark Command: /usr/java/latest/bin/java -cp ::/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/sbin/../conf:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/spark-assembly-1.2.0-hadoop2.4.0.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar:/home/tpanning/Projects/spark/spark-1.2.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar -XX:MaxPermSize=128m -Xms512m -Xmx512m org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 --deploy-mode cluster --master spark://xd-spark.xdata.data-tactics-corp.com:7077 spark-internal Jar url 'spark-internal' is not in valid format. Must be a jar file path in URL format (e.g. 
hdfs://host:port/XX.jar, file:///XX.jar)
Usage: DriverClient [options] launch active-master jar-url main-class [driver options]
Usage: DriverClient kill active-master driver-id
Options:
  -c CORES, --cores CORES    Number of cores to request (default: 1)
  -m MEMORY, --memory MEMORY Megabytes of memory to request (default: 512)
  -s, --supervise            Whether to restart the driver on failure
  -v, --verbose              Print more debugging output
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties {noformat} I do not get this error if deploy-mode is set to client. The --deploy-mode option is described by the --help output, so I expected it to work. I checked, and this behavior seems to be present in Spark 1.1.0 as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
Takeshi Yamamuro created SPARK-5351: --- Summary: Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages[Int](
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}
val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
  at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at scala.Option.getOrElse(Option.scala:120)
  at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
  at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
  at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
  at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
  ...
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
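The failure above is the generic zip precondition: two RDDs can only be zipped partition-by-partition when their partition counts are identical. A toy model of the check that ZippedPartitionsBaseRDD.getPartitions performs (names simplified; not Spark's actual code):

```python
def zip_partitions(a_parts, b_parts):
    """Pair up partitions of two datasets, refusing mismatched counts,
    mirroring the IllegalArgumentException in the stack trace above."""
    if len(a_parts) != len(b_parts):
        raise ValueError("Can't zip RDDs with unequal numbers of partitions")
    return list(zip(a_parts, b_parts))
```

So if 'spark.default.parallelism' forces one side (e.g. the replicated vertex data) to a different partition count than the edge RDD, the zip inside ReplicatedVertexView.upgrade() fails exactly like this.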
[jira] [Created] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
Kay Ousterhout created SPARK-5360: - Summary: For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to fix this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example:
$ val a = sc.parallelize(1 to 10).map((_, 1))
$ val b = sc.parallelize(1 to 2).map(x => (x, 2*x))
$ a.cogroup(b).collect()
the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example:
$ val sortedA = a.sortByKey()
$ val sortedB = b.sortByKey()
$ sortedA.cogroup(sortedB).collect()
the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
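The reference-identity problem described above can be reproduced with any serializer: deserializing the same object from two separate byte streams yields two distinct copies, while objects serialized together in one stream keep their shared identity. A small illustration using Python's pickle as a stand-in for Java serialization:

```python
import pickle

class Handle:
    """Stand-in for a ShuffleHandle-like object referenced from two places."""
    pass

h = Handle()

# Serialized together in one stream: pickle's memo preserves identity.
a, b = pickle.loads(pickle.dumps((h, h)))
together_same = a is b        # the two references still point at one object

# Serialized separately (two streams), as the RDD and its partition are:
c = pickle.loads(pickle.dumps(h))
d = pickle.loads(pickle.dumps(h))
separate_same = c is d        # two distinct copies arrive on the "worker"
```

This is exactly why worker-side state stored via one reference is invisible through the other, and why the fix (serializing the shared objects only once, in one stream) also shrinks the closure.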
[jira] [Commented] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286179#comment-14286179 ] Apache Spark commented on SPARK-5355: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4143 SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker SparkConf is not thread-safe, but it is accessed by many threads. getAll() could return only part of the configs if another thread is modifying the conf concurrently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
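A common fix for this class of bug is to guard reads and writes with one lock and have getAll() return a consistent snapshot rather than a live view. A hedged sketch of that pattern (illustrative only; not the actual patch in the linked PR):

```python
import threading

class Conf:
    """Toy thread-safe config: set() and get_all() synchronize on one
    lock, and get_all() copies the map so callers can iterate it safely
    while other threads keep writing."""
    def __init__(self):
        self._lock = threading.Lock()
        self._settings = {}

    def set(self, key, value):
        with self._lock:
            self._settings[key] = value

    def get_all(self):
        with self._lock:
            return dict(self._settings)  # snapshot, not a live view
```

Because get_all() returns a copy taken under the lock, a caller can never observe a half-updated set of configs, which is the symptom the issue describes.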
[jira] [Updated] (SPARK-4959) Attributes are case sensitive when using a select query from a projection
[ https://issues.apache.org/jira/browse/SPARK-4959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-4959: Fix Version/s: 1.2.1 Attributes are case sensitive when using a select query from a projection - Key: SPARK-4959 URL: https://issues.apache.org/jira/browse/SPARK-4959 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: Andy Konwinski Assignee: Cheng Hao Priority: Blocker Labels: backport-needed Fix For: 1.3.0, 1.2.1 Per [~marmbrus], see this line of code, where we should be using an attribute map: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L147 To reproduce, I ran the following in the Spark shell:
{code}
import sqlContext._
sql("drop table if exists test")
sql("create table test (col1 string)")
sql("insert into table test select hi from prejoined limit 1")
val projection = 'col1.attr.as(Symbol("CaseSensitiveColName")) :: 'col1.attr.as(Symbol("CaseSensitiveColName2")) :: Nil
sqlContext.table("test").select(projection:_*).registerTempTable("test2")
// This succeeds.
sql("select CaseSensitiveColName from test2").first()
// This fails with java.util.NoSuchElementException: key not found: casesensitivecolname#23046
sql("select casesensitivecolname from test2").first()
{code}
The full stack trace printed for the final command that is failing:
{code}
java.util.NoSuchElementException: key not found: casesensitivecolname#23046
  at scala.collection.MapLike$class.default(MapLike.scala:228)
  at org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29)
  at scala.collection.MapLike$class.apply(MapLike.scala:141)
  at org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
  at scala.collection.AbstractTraversable.map(Traversable.scala:105)
  at org.apache.spark.sql.hive.execution.HiveTableScan.<init>(HiveTableScan.scala:57)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$$anonfun$14.apply(HiveStrategies.scala:221)
  at org.apache.spark.sql.SQLContext$SparkPlanner.pruneFilterProject(SQLContext.scala:378)
  at org.apache.spark.sql.hive.HiveStrategies$HiveTableScans$.apply(HiveStrategies.scala:217)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
  at org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:285)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.apply(QueryPlanner.scala:59)
  at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:418)
  at org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:416)
  at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:422)
  at org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:422)
  at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
  at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:446)
  at org.apache.spark.sql.SchemaRDD.take(SchemaRDD.scala:108)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1093)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
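The fix suggested in the issue is to resolve attributes through a map that is keyed case-insensitively instead of by exact-name lookup, so that both spellings of the column resolve to the same attribute. A minimal sketch (illustrative; not Catalyst's actual AttributeMap):

```python
class CaseInsensitiveAttributeMap:
    """Toy attribute map that normalises keys to lower case, so
    'CaseSensitiveColName' and 'casesensitivecolname' both resolve to
    the same underlying attribute."""
    def __init__(self, pairs):
        self._m = {name.lower(): attr for name, attr in pairs}

    def __getitem__(self, name):
        # A plain dict keyed on the original spelling would raise
        # KeyError here, mirroring the NoSuchElementException above.
        return self._m[name.lower()]

attrs = CaseInsensitiveAttributeMap([("CaseSensitiveColName", "col1#1")])
```

With exact-key lookup, the second query in the reproduction fails; with normalised keys, both queries resolve.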
[jira] [Commented] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286250#comment-14286250 ] Yin Huai commented on SPARK-5260: - [~sonixbp] Unfortunately, I failed to come up with a proper name. Will try again :) Expose JsonRDD.allKeysWithValueTypes() in a utility class -- Key: SPARK-5260 URL: https://issues.apache.org/jira/browse/SPARK-5260 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5009) allCaseVersions function in SqlLexical leads to StackOverflow Exception
[ https://issues.apache.org/jira/browse/SPARK-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5009. - Resolution: Fixed Fix Version/s: (was: 1.2.1) Issue resolved by pull request 3926 [https://github.com/apache/spark/pull/3926] allCaseVersions function in SqlLexical leads to StackOverflow Exception - Key: SPARK-5009 URL: https://issues.apache.org/jira/browse/SPARK-5009 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.1, 1.2.0 Reporter: shengli Fix For: 1.3.0 Original Estimate: 96h Remaining Estimate: 96h Recently I found a bug while adding a new feature to SqlParser: if I define a Keyword that has a long name, like
```protected val SERDEPROPERTIES = Keyword("SERDEPROPERTIES")```
then, since the all-case-versions expansion is implemented as a recursive function, the implicit asParser conversion can exhaust a small stack and throw a StackOverflowError:
java.lang.StackOverflowError
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at scala.util.parsing.combinator.Parsers$$anon$3.apply(Parsers.scala:222)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
  at scala.util.parsing.combinator.Parsers$Parser$$anonfun$append$1.apply(Parsers.scala:254)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
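For context, allCaseVersions enumerates every upper/lower-case variant of a keyword, so a 15-letter keyword like SERDEPROPERTIES expands to 2^15 = 32768 variants; building that set (or the parser alternation over it) recursively is what can blow the stack. An iterative enumeration sidesteps the recursion entirely. A sketch in Python (the real fix lives in Scala's SqlLexical):

```python
from itertools import product

def all_case_versions(word):
    """Yield every case variant of `word` iteratively, with no recursion.
    Assumes alphabetic keywords, where lower() and upper() differ per char."""
    choices = [(c.lower(), c.upper()) for c in word]
    for combo in product(*choices):
        yield "".join(combo)
```

itertools.product walks the 2^n combinations with a constant-depth loop, so the keyword length only affects running time, never stack depth.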
[jira] [Resolved] (SPARK-5064) GraphX rmatGraph hangs
[ https://issues.apache.org/jira/browse/SPARK-5064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-5064. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 3950 [https://github.com/apache/spark/pull/3950] GraphX rmatGraph hangs -- Key: SPARK-5064 URL: https://issues.apache.org/jira/browse/SPARK-5064 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Environment: CentOS 7 REPL (no HDFS). Also tried Cloudera 5.2.0 QuickStart standalone compiled Scala with spark-submit. Reporter: Michael Malak Fix For: 1.3.0, 1.2.1 org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8) It just outputs 0 edges and then locks up. A spark-user message reports similar behavior: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408617621830-12570.p...@n3.nabble.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5275) pyspark.streaming is not included in assembly jar
[ https://issues.apache.org/jira/browse/SPARK-5275?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-5275: --- Fix Version/s: 1.2.1 1.3.0 pyspark.streaming is not included in assembly jar - Key: SPARK-5275 URL: https://issues.apache.org/jira/browse/SPARK-5275 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The pyspark.streaming module is not included in Spark's assembly jar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-4939: --- Target Version/s: 1.3.0 (was: 1.3.0, 1.2.1) Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5244) add parser for COALESCE()
[ https://issues.apache.org/jira/browse/SPARK-5244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5244. - Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4040 [https://github.com/apache/spark/pull/4040] add parser for COALESCE() - Key: SPARK-5244 URL: https://issues.apache.org/jira/browse/SPARK-5244 Project: Spark Issue Type: New Feature Components: SQL Reporter: Adrian Wang Fix For: 1.3.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286234#comment-14286234 ] Patrick Wendell commented on SPARK-4939: [~tdas] [~davies] [~kayousterhout] Because there is still discussion about this, and this is modifying a very complex component in Spark, I'm not going to block on this for 1.2.1. Once we merge a patch we can decide whether to put it into 1.2 based on what the final patch looks like. It is definitely inconvenient that this doesn't work in local mode, but much less of a problem than introducing a bug in the scheduler for production cluster workloads. As a workaround we could suggest running this example with local-cluster. Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5357) Upgrade from commons-codec 1.5
Matthew Whelan created SPARK-5357: - Summary: Upgrade from commons-codec 1.5 Key: SPARK-5357 URL: https://issues.apache.org/jira/browse/SPARK-5357 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0, 1.1.0 Reporter: Matthew Whelan Spark uses commons-codec 1.5, which has a race condition in Base64. That race was introduced in commons-codec 1.4 and resolved in 1.7. The current version of commons-codec is 1.10. Code that runs in Workers and assumes that Base64 is thread-safe will break because Spark is using a non-thread-safe version. See CODEC-96. In addition, the spark.files.userClassPathFirst mechanism is currently broken (bug to come), so there isn't a viable workaround for this issue.
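Until the commons-codec dependency is upgraded past 1.6, one way user code can sidestep the race is to avoid the shared-state codec altogether. The sketch below is illustrative only, not this ticket's fix: it uses the JDK's own java.util.Base64 (Java 8+), whose encoder and decoder hold no mutable state and are therefore safe to share across worker threads.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class Base64ThreadSafety {
    // java.util.Base64 encoders/decoders hold no mutable state, so one
    // shared instance is safe across threads -- unlike the shared internal
    // buffers that caused the race in commons-codec 1.4-1.6 (CODEC-96).
    private static final Base64.Encoder ENCODER = Base64.getEncoder();
    private static final Base64.Decoder DECODER = Base64.getDecoder();

    static String roundTrip(String s) {
        String encoded = ENCODER.encodeToString(s.getBytes(StandardCharsets.UTF_8));
        return new String(DECODER.decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Boolean>> results = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final String payload = "payload-" + i;
            Callable<Boolean> task = () -> roundTrip(payload).equals(payload);
            results.add(pool.submit(task));
        }
        for (Future<Boolean> f : results) {
            if (!f.get()) throw new AssertionError("corrupted round trip");
        }
        pool.shutdown();
        System.out.println("all round trips OK");
    }
}
```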
[jira] [Updated] (SPARK-5006) spark.port.maxRetries doesn't work
[ https://issues.apache.org/jira/browse/SPARK-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5006: -- Target Version/s: 1.3.0, 1.2.1 (was: 1.3.0) spark.port.maxRetries doesn't work -- Key: SPARK-5006 URL: https://issues.apache.org/jira/browse/SPARK-5006 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.3.0, 1.2.1 We normally configure spark.port.maxRetries in a properties file or via SparkConf, but Utils.scala reads it from SparkEnv's conf. Since SparkEnv is an object whose env is only set after the JVM launches, and Utils is also an object, in most cases portMaxRetries gets the default value of 16.
[jira] [Updated] (SPARK-5006) spark.port.maxRetries doesn't work
[ https://issues.apache.org/jira/browse/SPARK-5006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5006: -- Fix Version/s: 1.2.1 spark.port.maxRetries doesn't work -- Key: SPARK-5006 URL: https://issues.apache.org/jira/browse/SPARK-5006 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.1.0 Reporter: WangTaoTheTonic Assignee: WangTaoTheTonic Fix For: 1.3.0, 1.2.1 We normally configure spark.port.maxRetries in a properties file or via SparkConf, but Utils.scala reads it from SparkEnv's conf. Since SparkEnv is an object whose env is only set after the JVM launches, and Utils is also an object, in most cases portMaxRetries gets the default value of 16.
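The failure mode described above is an initialization-order problem: a value read eagerly when a singleton is initialized captures the pre-configuration default. A minimal Java sketch of the same mechanics (names are hypothetical, not Spark's actual code; a Scala `object` field behaves like the static field here):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a config value read into a static field captures
// whatever is set at class-initialization time, which may be before the
// configuration is populated -- so the default "wins" even when the user
// configured something else.
public class EagerConfigRead {
    static final Map<String, String> CONF = new HashMap<>();

    // BAD: evaluated once, when this class is first loaded (CONF still empty).
    static final int EAGER_MAX_RETRIES =
        Integer.parseInt(CONF.getOrDefault("port.maxRetries", "16"));

    // BETTER: read at call time, after the configuration has been populated.
    static int maxRetries() {
        return Integer.parseInt(CONF.getOrDefault("port.maxRetries", "16"));
    }

    public static void main(String[] args) {
        CONF.put("port.maxRetries", "100"); // configuration arrives "late"
        System.out.println("eager = " + EAGER_MAX_RETRIES); // eager = 16
        System.out.println("lazy = " + maxRetries());       // lazy = 100
    }
}
```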
[jira] [Updated] (SPARK-4587) Model export/import
[ https://issues.apache.org/jira/browse/SPARK-4587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4587: - Description: This is an umbrella JIRA for one of the most requested features on the user mailing list. Model export/import can be done via Java serialization. But it doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we should provide save/load methods to every model. PMML is an option but it has its limitations. There are couple things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML ** This is needed since it is the only thing approaching an industry standard. was:This is an umbrella JIRA for one of the most requested features on the user mailing list. Model export/import can be done via Java serialization. But it doesn't work for models stored distributively, e.g., ALS and LDA. Ideally, we should provide save/load methods to every model. PMML is an option but it has its limitations. There are couple things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. Model export/import --- Key: SPARK-4587 URL: https://issues.apache.org/jira/browse/SPARK-4587 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Priority: Critical This is an umbrella JIRA for one of the most requested features on the user mailing list. 
Model export/import can be done via Java serialization, but that doesn't work for models stored in a distributed fashion, e.g., ALS and LDA. Ideally, we should provide save/load methods for every model. PMML is an option, but it has its limitations. There are a couple of things we need to discuss: 1) data format, 2) how to preserve partitioning, 3) data compatibility between versions and language APIs, etc. UPDATE: [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML ** This is needed since it is the only thing approaching an industry standard.
[jira] [Commented] (SPARK-5142) Possibly data may be ruined in Spark Streaming's WAL mechanism.
[ https://issues.apache.org/jira/browse/SPARK-5142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286387#comment-14286387 ] Tathagata Das commented on SPARK-5142: -- This is definitely a tricky issue. One thing we could try is to stop writing to the current log file, open a new log file, and then try again. The data may have been written to the previous log file, but that does not matter: if the second attempt succeeds, the metadata will reference the segment in the new log file. The data in the old log file can stick around; it does not matter much. Possibly data may be ruined in Spark Streaming's WAL mechanism. --- Key: SPARK-5142 URL: https://issues.apache.org/jira/browse/SPARK-5142 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Saisai Shao Currently in Spark Streaming's WAL manager, data is written into HDFS with multiple tries on failure. Because there is no transactional guarantee, previously partially-written data is not rolled back and the retried data is appended after it; this corrupts the file and makes WriteAheadLogReader fail when reading the data. Firstly, I think this problem is hard to fix because HDFS does not support a truncate operation (HDFS-3107) or random writes at a specific offset. Secondly, I think if we meet such a write exception, it is better not to try again; retrying corrupts the file and makes reads abnormal. Sorry if I misunderstand anything.
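The roll-to-a-new-file retry strategy suggested above can be sketched as follows. This is a hedged illustration with hypothetical names, not Spark's actual WAL manager: on a failed append the writer abandons the possibly-corrupt file and retries in a fresh one, so the segment metadata returned to the caller only ever references a clean write.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of the retry strategy discussed above: on a failed
// append, abandon the current log file and retry the record in a fresh
// file, so the returned "segment" only ever points at a clean write.
public class RollOnFailureLog {
    private final Path dir;
    private Path current;
    private int fileIndex = 0;

    RollOnFailureLog(Path dir) throws IOException {
        this.dir = dir;
        roll();
    }

    private void roll() throws IOException {
        current = dir.resolve("wal-" + (fileIndex++) + ".log");
        Files.createFile(current);
    }

    /** Returns the file the record actually landed in. */
    Path write(byte[] record, boolean failFirstAttempt) throws IOException {
        try {
            if (failFirstAttempt) throw new IOException("simulated partial write");
            Files.write(current, record, StandardOpenOption.APPEND);
            return current;
        } catch (IOException e) {
            roll();                 // abandon the possibly-corrupt file
            Files.write(current, record, StandardOpenOption.APPEND);
            return current;         // metadata references the new file
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wal");
        RollOnFailureLog log = new RollOnFailureLog(dir);
        Path ok = log.write("a".getBytes(), false);
        Path retried = log.write("b".getBytes(), true);
        System.out.println(ok.equals(retried) ? "same file" : "rolled to new file");
    }
}
```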
[jira] [Commented] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
[ https://issues.apache.org/jira/browse/SPARK-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286386#comment-14286386 ] Apache Spark commented on SPARK-5360: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/4145 For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is how I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to justify fixing this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. 
For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). However, the difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures.
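The underlying mechanism (the partition and the RDD are serialized in separate streams, so their shared references come back as distinct objects on the worker) can be demonstrated with plain Java serialization. The class names below are illustrative, not Spark's:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Demonstrates the issue's mechanism: an object referenced from two
// structures that are serialized *separately* (like the RDD and its
// partition, which travel in different streams) deserializes into two
// distinct copies, not one shared instance. Within a single stream,
// Java serialization would preserve the back-reference instead.
public class SeparateStreamIdentity {
    static class Handle implements Serializable { int id = 42; }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Handle shared = new Handle();
        // Two separate streams: the handle is serialized (and allocated) twice.
        Handle a = (Handle) deserialize(serialize(shared));
        Handle b = (Handle) deserialize(serialize(shared));
        System.out.println("same instance after deserialization? " + (a == b)); // false
    }
}
```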
[jira] [Commented] (SPARK-3424) KMeans Plus Plus is too slow
[ https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286227#comment-14286227 ] Apache Spark commented on SPARK-3424: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4144 KMeans Plus Plus is too slow Key: SPARK-3424 URL: https://issues.apache.org/jira/browse/SPARK-3424 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Derrick Burns The KMeansPlusPlus algorithm is implemented in time O(m k^2), where m is the number of rounds of the KMeansParallel algorithm and k is the number of clusters. This can be dramatically improved by maintaining the distance to the closest cluster center from round to round and incrementally updating that value for each point. The incremental update is O(1) per point, which reduces the running time of KMeansPlusPlus to O(m k). For large k, this is significant.
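A minimal sketch of the proposed improvement, using 1-D points and hypothetical names for brevity: k-means++ seeding that keeps each point's squared distance to its nearest chosen center and updates it in O(1) per point when a new center is added, instead of rescanning all centers each round.

```java
import java.util.Random;

// Sketch of k-means++ seeding with the incremental trick described above:
// minDistSq[i] holds point i's squared distance to the nearest chosen
// center, and after each new center is picked we only compare against
// that one new center -- O(1) per point per round.
public class KMeansPlusPlusSeed {
    static double[] chooseCenters(double[] points, int k, Random rnd) {
        int n = points.length;
        double[] centers = new double[k];
        centers[0] = points[rnd.nextInt(n)];
        double[] minDistSq = new double[n];
        for (int i = 0; i < n; i++) {
            double d = points[i] - centers[0];
            minDistSq[i] = d * d;
        }
        for (int c = 1; c < k; c++) {
            // D^2-weighted sampling of the next center.
            double total = 0;
            for (double d2 : minDistSq) total += d2;
            double r = rnd.nextDouble() * total, acc = 0;
            int chosen = n - 1;
            for (int i = 0; i < n; i++) {
                acc += minDistSq[i];
                if (acc >= r) { chosen = i; break; }
            }
            centers[c] = points[chosen];
            // Incremental update: compare each point only against the
            // newly added center, not all c centers chosen so far.
            for (int i = 0; i < n; i++) {
                double d = points[i] - centers[c];
                minDistSq[i] = Math.min(minDistSq[i], d * d);
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] pts = {0, 0.1, 0.2, 10, 10.1, 10.2, 20, 20.1};
        double[] centers = chooseCenters(pts, 3, new Random(7));
        System.out.println("chose " + centers.length + " centers");
    }
}
```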
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286236#comment-14286236 ] Kay Ousterhout commented on SPARK-4939: --- [~pwendell] just want to make sure you understand that there's a simple but slightly hacky change here that only modifies the local scheduler. Just wanted to point that out in case that changes the dynamics of whether this should be fixed for 1.2. Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5346) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value)
[ https://issues.apache.org/jira/browse/SPARK-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5346: Target Version/s: 1.3.0, 1.2.2 (was: 1.3.0, 1.2.1) Parquet filter pushdown is not enabled when parquet.task.side.metadata is set to true (default value) - Key: SPARK-5346 URL: https://issues.apache.org/jira/browse/SPARK-5346 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Reporter: Cheng Lian Priority: Blocker When computing Parquet splits, reading Parquet metadata from the executor side is more memory efficient, thus Spark SQL [sets {{parquet.task.side.metadata}} to {{true}} by default|https://github.com/apache/spark/blob/v1.2.0/sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala#L437]. However, somehow this disables filter pushdown. To work around this issue and enable Parquet filter pushdown, users can set {{spark.sql.parquet.filterPushdown}} to {{true}} and {{parquet.task.side.metadata}} to {{false}}. However, for large Parquet files with a large number of part-files and/or columns, reading metadata from the driver side eats lots of memory. The following Spark shell snippet can be useful to reproduce this issue:
{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext._

case class KeyValue(key: Int, value: String)

sc.
  parallelize(1 to 1024).
  flatMap(i => Seq.fill(1024)(KeyValue(i, i.toString))).
  saveAsParquetFile("large.parquet")

parquetFile("large.parquet").registerTempTable("large")

sql("SET spark.sql.parquet.filterPushdown=true")
sql("SELECT * FROM large").collect()
sql("SELECT * FROM large WHERE key < 200").collect()
{code}
Users can verify this issue by checking the input size metrics in the web UI. When filter pushdown is enabled, the second query reads less data. 
Notice that {{parquet.task.side.metadata}} must be set in the _Hadoop_ configuration (either via {{core-site.xml}} or {{SparkContext.hadoopConfiguration.set()}}); setting it in {{spark-defaults.conf}} or via {{SparkConf}} does NOT work.
[jira] [Updated] (SPARK-5360) For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task
[ https://issues.apache.org/jira/browse/SPARK-5360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-5360: -- Description: CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is how I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to justify fixing this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). 
However, the difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures. was: CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is how I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to justify fixing this old and central part of the code, so I'm hoping to get input from others here. I did some simple experiments to see how much this affects closure size. For this example: $ val a = sc.parallelize(1 to 10).map((_, 1)) $ val b = sc.parallelize(1 to 2).map(x => (x, 2*x)) $ a.cogroup(b).collect() the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle. For this example: $ val sortedA = a.sortByKey() $ val sortedB = b.sortByKey() $ sortedA.cogroup(sortedB).collect() the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies. 
For CoGroupedRDD, rdds for narrow dependencies and shuffle handles are included twice in serialized task Key: SPARK-5360 URL: https://issues.apache.org/jira/browse/SPARK-5360 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object. This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran
[jira] [Closed] (SPARK-5359) ML model import/export
[ https://issues.apache.org/jira/browse/SPARK-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-5359. Resolution: Duplicate ML model import/export -- Key: SPARK-5359 URL: https://issues.apache.org/jira/browse/SPARK-5359 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley ML model import/export is a key component of any ML library. This JIRA is for creating a long-term plan to support model import/export. _From the design doc linked below:_ This document sketches machine learning model import/export plans, including goals, an API, and development plans. The design doc proposes: * Support our own Spark-specific format. ** This is needed to (a) support distributed models and (b) get model import/export support into Spark quickly (while avoiding new dependencies). * Also support PMML ** This is needed since it is the only thing approaching an industry standard. [Design doc for model import/export | https://docs.google.com/document/d/1kABFz1ssKJxLGMkboreSl3-I2CdLAOjNh5IQCrnDN3g/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3702) Standardize MLlib classes for learners, models
[ https://issues.apache.org/jira/browse/SPARK-3702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3702: - Description: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] was: Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1I-8PD0DSLEZzzXURYZwmqAFn_OMBc08hgDL1FZnVBmw/] Standardize MLlib classes for learners, models -- Key: SPARK-3702 URL: https://issues.apache.org/jira/browse/SPARK-3702 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Blocker Summary: Create a class hierarchy for learning algorithms and the models those algorithms produce. This is a super-task of several sub-tasks (but JIRA does not allow subtasks of subtasks). See the requires links below for subtasks. 
Goals: * give intuitive structure to API, both for developers and for generated documentation * support meta-algorithms (e.g., boosting) * support generic functionality (e.g., evaluation) * reduce code duplication across classes [Design doc for class hierarchy | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286407#comment-14286407 ] Ted Yu commented on SPARK-1714: --- {code} if (completedContainer.getExitStatus == -103) { // vmem limit exceeded {code} Should ContainerExitStatus#KILLED_EXCEEDED_VMEM be referenced above? Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0
[jira] [Commented] (SPARK-1714) Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler
[ https://issues.apache.org/jira/browse/SPARK-1714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286416#comment-14286416 ] Ted Yu commented on SPARK-1714: --- allocatedHostToContainersMap.synchronized is absent for the following operation in runAllocatedContainers(): {code} val containerSet = allocatedHostToContainersMap.getOrElseUpdate(executorHostname, new HashSet[ContainerId]) containerSet += containerId allocatedContainerToHostMap.put(containerId, executorHostname) {code} Is that intentional? Take advantage of AMRMClient APIs to simplify logic in YarnAllocationHandler Key: SPARK-1714 URL: https://issues.apache.org/jira/browse/SPARK-1714 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 1.2.0 Reporter: Sandy Ryza Assignee: Sandy Ryza Fix For: 1.3.0
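The concern above is about a compound update to two related maps. A minimal, hypothetical Java sketch of why such updates normally go under a single lock (illustrative names, not the YARN allocator's actual code): if only one of the two puts is guarded, a concurrent reader can observe a container in one map but not the other.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: both maps must mutate under one lock so readers never observe
// a container recorded in containerToHost but missing from
// hostToContainers (or vice versa). Names are illustrative only.
public class GuardedTwoMapUpdate {
    private final Map<String, Set<String>> hostToContainers = new HashMap<>();
    private final Map<String, String> containerToHost = new HashMap<>();
    private final Object lock = new Object();

    void recordAllocation(String host, String containerId) {
        synchronized (lock) { // the two updates are atomic w.r.t. readers
            hostToContainers.computeIfAbsent(host, h -> new HashSet<>()).add(containerId);
            containerToHost.put(containerId, host);
        }
    }

    boolean consistent(String containerId) {
        synchronized (lock) {
            String host = containerToHost.get(containerId);
            return host != null
                && hostToContainers.getOrDefault(host, Collections.emptySet()).contains(containerId);
        }
    }

    public static void main(String[] args) {
        GuardedTwoMapUpdate m = new GuardedTwoMapUpdate();
        m.recordAllocation("host-1", "container_0001");
        System.out.println("consistent = " + m.consistent("container_0001"));
    }
}
```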
[jira] [Resolved] (SPARK-3958) Possible stream-corruption issues in TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-3958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3958. Resolution: Fixed Target Version/s: (was: 1.2.1) At this point I'm not aware of people still hitting this set of issues in newer releases, so per discussion with [~joshrosen], I'd like to close this. Please comment on this JIRA if you are having some variant of this issue in a newer version of Spark, and we'll continue to investigate. Possible stream-corruption issues in TorrentBroadcast - Key: SPARK-3958 URL: https://issues.apache.org/jira/browse/SPARK-3958 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker Attachments: spark_ex.logs TorrentBroadcast deserialization sometimes fails with decompression errors, which are most likely caused by stream-corruption exceptions. For example, this can manifest itself as a Snappy PARSING_ERROR when deserializing a broadcasted task: {code} 14/10/14 17:20:55.016 DEBUG BlockManager: Getting local block broadcast_8 14/10/14 17:20:55.016 DEBUG BlockManager: Block broadcast_8 not registered locally 14/10/14 17:20:55.016 INFO TorrentBroadcast: Started reading broadcast variable 8 14/10/14 17:20:55.017 INFO TorrentBroadcast: Reading broadcast variable 8 took 5.3433E-5 s 14/10/14 17:20:55.017 ERROR Executor: Exception in task 2.0 in stage 8.0 (TID 18) java.io.IOException: PARSING_ERROR(2) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:84) at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method) at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:594) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:125) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at 
org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216) at org.apache.spark.broadcast.TorrentBroadcast.readObject(TorrentBroadcast.scala:170) at sun.reflect.GeneratedMethodAccessor92.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:164) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} SPARK-3630 is an umbrella ticket for investigating all causes of these Kryo and Snappy deserialization errors. This ticket is for a more narrowly-focused exploration of the TorrentBroadcast version of these errors, since the similar errors that we've seen in sort-based shuffle seem to be explained by a different cause (see SPARK-3948). 
[jira] [Resolved] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-4105. Resolution: Fixed Target Version/s: (was: 1.2.1) At this point I'm not aware of people still hitting this set of issues in newer releases, so per discussion with [~joshrosen], I'd like to close this. Please comment on this JIRA if you are having some variant of this issue in a newer version of Spark, and we'll continue to investigate. FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code}
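Snappy is not in the Python standard library, so the sketch below uses `zlib` as a stdlib-only analog to illustrate the general failure mode discussed above: when fetched shuffle bytes are truncated or corrupted in transit, the error surfaces inside the decompression codec on the reading side, far from the actual fault. The payload bytes here are invented for illustration.

```python
import zlib

# Stand-in for a compressed shuffle block (illustrative data only).
block = zlib.compress(b"shuffle block bytes" * 100)

# Simulate a fetch that drops the tail of the stream.
truncated = block[:-4]

error_message = None
try:
    zlib.decompress(truncated)
except zlib.error as exc:
    # Analogous to Snappy's FAILED_TO_UNCOMPRESS(5): the reader, not the
    # writer or the network layer, is where the corruption is reported.
    error_message = str(exc)
```

This is why the stack traces above point at the codec's input stream rather than at whatever originally corrupted the block.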
[jira] [Updated] (SPARK-5361) add in tuple handling for converting python RDD back to JavaRDD
[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Winston Chen updated SPARK-5361: Description: Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull-request for this bug: https://github.com/apache/spark/pull/4146 was: Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. 
So with the following data: ``` [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] ``` Exceptions happen with the `genres` part: ``` 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) ``` This pull request adds tuple handling in both `SerDeUtil.pythonToJava` and `JavaToWritableConverter.convertToWritable`. add in tuple handling for converting python RDD back to JavaRDD --- Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. 
So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull-request for this bug: https://github.com/apache/spark/pull/4146 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
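Until the fix in pull request 4146, one client-side workaround was to normalize tuples to lists before handing a PySpark RDD back to the JVM, since Pyrolite pickles a Python `list` to a `java.util.ArrayList` but a `tuple` to `Object[]`. The helper below is an illustrative sketch, not part of any Spark API:

```python
def tuples_to_lists(obj):
    """Recursively replace tuples with lists so that Pyrolite produces
    java.util.ArrayList on the JVM side instead of Object[]."""
    if isinstance(obj, (tuple, list)):
        return [tuples_to_lists(x) for x in obj]
    if isinstance(obj, dict):
        return {k: tuples_to_lists(v) for k, v in obj.items()}
    return obj

# One record shaped like the data from the issue report.
record = (u'2', {u'director': u'David Lean',
                 u'genres': (u'Adventure', u'Biography', u'Drama'),
                 u'title': u'Lawrence of Arabia',
                 u'year': 1962})

clean = tuples_to_lists(record)
# In real code this would run before the Java conversion,
# e.g. rdd.map(tuples_to_lists) prior to crossing into the JVM.
```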
[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286465#comment-14286465 ] Victor Tso commented on SPARK-4105: --- What's the fix version? FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Here's another occurrence of a similar error: {code} java.io.IOException: failed to read chunk org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:348)
[jira] [Closed] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das closed SPARK-944. --- Resolution: Not a Problem Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5361) python tuple not supported while converting PythonRDD back to JavaRDD
[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Winston Chen updated SPARK-5361: Summary: python tuple not supported while converting PythonRDD back to JavaRDD (was: add in tuple handling for converting python RDD back to JavaRDD) python tuple not supported while converting PythonRDD back to JavaRDD - Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen Existing `SerDeUtil.pythonToJava` implementation does not account for tuples: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull-request for this bug: https://github.com/apache/spark/pull/4146 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4520) SparkSQL exception when reading certain columns from a parquet file
[ https://issues.apache.org/jira/browse/SPARK-4520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286578#comment-14286578 ] Apache Spark commented on SPARK-4520: - User 'sadhan' has created a pull request for this issue: https://github.com/apache/spark/pull/4148 SparkSQL exception when reading certain columns from a parquet file --- Key: SPARK-4520 URL: https://issues.apache.org/jira/browse/SPARK-4520 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: sadhan sood Assignee: sadhan sood Priority: Critical Attachments: part-r-0.parquet I am seeing this issue with spark sql throwing an exception when trying to read selective columns from a thrift parquet file and also when caching them. On some further digging, I was able to narrow it down to at least one particular column type, map<string, set<string>>, causing this issue. To reproduce this I created a test thrift file with a very basic schema and stored some sample data in a parquet file: Test.thrift === {code} typedef binary SomeId enum SomeExclusionCause { WHITELIST = 1, HAS_PURCHASE = 2, } struct SampleThriftObject { 10: string col_a; 20: string col_b; 30: string col_c; 40: optional map<SomeExclusionCause, set<SomeId>> col_d; } {code} = And loading the data in spark through schemaRDD: {code} import org.apache.spark.sql.SchemaRDD val sqlContext = new org.apache.spark.sql.SQLContext(sc); val parquetFile = "/path/to/generated/parquet/file" val parquetFileRDD = sqlContext.parquetFile(parquetFile) parquetFileRDD.printSchema root |-- col_a: string (nullable = true) |-- col_b: string (nullable = true) |-- col_c: string (nullable = true) |-- col_d: map (nullable = true) ||-- key: string ||-- value: array (valueContainsNull = true) |||-- element: string (containsNull = false) parquetFileRDD.registerTempTable("test") sqlContext.cacheTable("test") sqlContext.sql("select col_a from test").collect() -- see the exception stack here {code} {code} 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file file:/tmp/xyz/part-r-0.parquet at parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213) at parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204) at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1223) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) at 
org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:195) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.ArrayIndexOutOfBoundsException: -1 at java.util.ArrayList.elementData(ArrayList.java:418) at
[jira] [Commented] (SPARK-5347) InputMetrics bug when inputSplit is not instanceOf FileSplit
[ https://issues.apache.org/jira/browse/SPARK-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286688#comment-14286688 ] Apache Spark commented on SPARK-5347: - User 'shenh062326' has created a pull request for this issue: https://github.com/apache/spark/pull/4150 InputMetrics bug when inputSplit is not instanceOf FileSplit Key: SPARK-5347 URL: https://issues.apache.org/jira/browse/SPARK-5347 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen When inputFormatClass is set to CombineFileInputFormat, input metrics show that input is empty. This did not occur in spark-1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
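The reported behavior reduces to a small, purely illustrative model (the class and function names below are invented stand-ins, not Spark code): if bytes-read is computed only for a plain `FileSplit`, every other split type, including the splits produced by `CombineFileInputFormat`, silently contributes zero to the input metrics.

```python
class FileSplit:
    """Toy stand-in for Hadoop's FileSplit."""
    def __init__(self, length):
        self.length = length

class CombineFileSplit:
    """Toy stand-in for the multi-file splits CombineFileInputFormat produces."""
    def __init__(self, lengths):
        self.lengths = lengths

def bytes_read_buggy(split):
    # Mirrors the bug: anything that is not a plain FileSplit
    # contributes nothing to the input metrics.
    return split.length if isinstance(split, FileSplit) else 0

def bytes_read_fixed(split):
    # One possible fix: handle each known split type explicitly.
    if isinstance(split, FileSplit):
        return split.length
    if isinstance(split, CombineFileSplit):
        return sum(split.lengths)
    return 0

combined = CombineFileSplit([128, 256, 512])
```

Under this toy model the buggy path reports zero bytes for the combined split while the fixed path sums its constituent file lengths.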
[jira] [Updated] (SPARK-5063) Display more helpful error messages for several invalid operations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Description: Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. 
Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the {{__getnewargs__}} method to throw a more useful UnsupportedOperation exception with a more helpful error message. Users may also see confusing NPEs when calling methods on stopped SparkContexts, so I've added checks for that as well. was: Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. 
Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at
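The Python-side improvement described above can be sketched in plain Python with no Spark required: because pickle protocol 2 and newer consults `__getnewargs__` while serializing, overriding it makes any attempt to pickle the object fail fast with a readable message instead of the opaque Py4J error. The class below is a hypothetical stand-in for PySpark's `RDD`, and the message text is illustrative:

```python
import pickle

class FakeRDD:
    """Hypothetical stand-in for pyspark.RDD."""
    def __getnewargs__(self):
        # Pickle calls this during serialization (protocol >= 2), so
        # referencing an RDD inside a transformation or broadcasting one
        # now fails immediately with guidance instead of a Py4JError.
        raise RuntimeError(
            "It appears that you are attempting to reference an RDD "
            "from inside a transformation or to broadcast an RDD, "
            "which is not supported.")

message = None
try:
    pickle.dumps(FakeRDD())
except Exception as exc:
    message = str(exc)
```

The same trick gives a clear error for the broadcast case mentioned above, since broadcasting also pickles its argument.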
[jira] [Commented] (SPARK-4506) Update documentation to clarify whether standalone-cluster mode is now officially supported
[ https://issues.apache.org/jira/browse/SPARK-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286466#comment-14286466 ] Asim Jalis commented on SPARK-4506: --- This bug should be reopened. The doc needs some more changes. Doc: https://github.com/apache/spark/blob/master/docs/submitting-applications.md Current text: Note that cluster mode is currently not supported for standalone clusters, Mesos clusters, or Python applications. Proposed text: Note that cluster mode is currently not supported for Mesos clusters, or Python applications. Update documentation to clarify whether standalone-cluster mode is now officially supported --- Key: SPARK-4506 URL: https://issues.apache.org/jira/browse/SPARK-4506 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.1.0, 1.1.1, 1.2.0 Reporter: Josh Rosen Assignee: Andrew Or Fix For: 1.1.1, 1.2.0 The Launching Compiled Spark Applications section of the Spark Standalone docs claims that standalone mode only supports {{client}} deploy mode: {quote} The spark-submit script provides the most straightforward way to submit a compiled Spark application to the cluster. For standalone clusters, Spark currently only supports deploying the driver inside the client process that is submitting the application (client deploy mode). {quote} It looks like {{standalone-cluster}} mode actually works (I've used it and have heard from users that are successfully using it, too). The current line was added in SPARK-2259 when {{standalone-cluster}} mode wasn't officially supported. It looks like SPARK-2260 fixed a number of bugs in {{standalone-cluster}} mode, so we should update the documentation if we're now ready to officially support it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4984) add a pop-up containing the full job description when it is very long
[ https://issues.apache.org/jira/browse/SPARK-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4984. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3819 [https://github.com/apache/spark/pull/3819] add a pop-up containing the full job description when it is very long - Key: SPARK-4984 URL: https://issues.apache.org/jira/browse/SPARK-4984 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: wangfei Fix For: 1.3.0 add a pop-up containing the full job description when it is very long -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4984) add a pop-up containing the full job description when it is very long
[ https://issues.apache.org/jira/browse/SPARK-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4984: -- Assignee: wangfei add a pop-up containing the full job description when it is very long - Key: SPARK-4984 URL: https://issues.apache.org/jira/browse/SPARK-4984 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: wangfei Assignee: wangfei Fix For: 1.3.0 add a pop-up containing the full job description when it is very long -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-4984) add a pop-up containing the full job description when it is very long
[ https://issues.apache.org/jira/browse/SPARK-4984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4984: -- Component/s: (was: Spark Core) Web UI add a pop-up containing the full job description when it is very long - Key: SPARK-4984 URL: https://issues.apache.org/jira/browse/SPARK-4984 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: wangfei Assignee: wangfei Fix For: 1.3.0 add a pop-up containing the full job description when it is very long -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5227: -- Priority: Blocker (was: Major) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite input metrics when reading text file with multiple splits test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message {code} ArrayBuffer(32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 32, 
32, 32, 32, [...] {code}
[jira] [Commented] (SPARK-5227) InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles
[ https://issues.apache.org/jira/browse/SPARK-5227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286681#comment-14286681 ] Josh Rosen commented on SPARK-5227: --- I've bumped this up to a 1.2.1 blocker to see if we can find a fix, since this is preventing the unit tests from running in SBT under certain Hadoop configurations. InputOutputMetricsSuite input metrics when reading text file with multiple splits test fails in branch-1.2 SBT Jenkins build w/hadoop1.0 and hadoop2.0 profiles - Key: SPARK-5227 URL: https://issues.apache.org/jira/browse/SPARK-5227 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.1 Reporter: Josh Rosen Priority: Blocker Labels: flaky-test The InputOutputMetricsSuite input metrics when reading text file with multiple splits test has been failing consistently in our new {{branch-1.2}} Jenkins SBT build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.2-SBT/14/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/testReport/junit/org.apache.spark.metrics/InputOutputMetricsSuite/input_metrics_when_reading_text_file_with_multiple_splits/ Here's the error message: {code} ArrayBuffer(32, 32, 32, ...) [a long run of identical 32-byte input-metric values, elided] {code}
[jira] [Created] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
Alexander Ulanov created SPARK-5362: --- Summary: Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network requires a Vector as output (instead of a label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5361) add in tuple handling for converting python RDD back to JavaRDD
Winston Chen created SPARK-5361: --- Summary: add in tuple handling for converting python RDD back to JavaRDD Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen The existing `SerDeUtil.pythonToJava` implementation does not account for the tuple case: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: ``` [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] ``` Exceptions happen with the `genres` part: ``` 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) ``` The pull request adds tuple handling in both `SerDeUtil.pythonToJava` and `JavaToWritableConverter.convertToWritable`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
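Until a fix like the one above lands, a possible user-side workaround is to convert tuples to lists before the RDD crosses back into the JVM, since Pyrolite maps Python lists to `java.util.ArrayList` (which `SerDeUtil.pythonToJava` handles) but tuples to `Object[]`. A minimal sketch, with a hypothetical helper name:

```python
# Hypothetical workaround sketch: recursively replace Python tuples with
# lists so that Pyrolite serializes them as java.util.ArrayList instead
# of Object[]. In a real job this would run as rdd.map(tuples_to_lists)
# before handing the RDD back to Java.
def tuples_to_lists(obj):
    if isinstance(obj, (tuple, list)):
        return [tuples_to_lists(x) for x in obj]
    if isinstance(obj, dict):
        return {k: tuples_to_lists(v) for k, v in obj.items()}
    return obj

record = (u'2', {u'director': u'David Lean',
                 u'genres': (u'Adventure', u'Biography', u'Drama'),
                 u'title': u'Lawrence of Arabia',
                 u'year': 1962})
safe = tuples_to_lists(record)
```

This only works around the serialization path; the actual fix in the PR handles tuples inside `SerDeUtil` itself.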
[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286543#comment-14286543 ] Hari Shreedharan commented on SPARK-5342: - Looks like SPARK-3883 is adding SSL support even to Akka, in which case we could simply use Akka for sending the new delegation tokens (and avoid the HTTP server route). Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, Spark Streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286596#comment-14286596 ] Apache Spark commented on SPARK-5147: - User 'tdas' has created a pull request for this issue: https://github.com/apache/spark/pull/4149 write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called -- Key: SPARK-5147 URL: https://issues.apache.org/jira/browse/SPARK-5147 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.2.0 Reporter: Max Xu Priority: Blocker Hi all, We are running a Spark streaming application with ReliableKafkaReceiver. We have spark.streaming.receiver.writeAheadLog.enable set to true, so write ahead logs (WALs) for received data are created under the receivedData/streamId folder in the checkpoint directory. However, old WALs are never purged over time. receivedBlockMetadata and checkpoint files are purged correctly though. I went through the code: the WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is responsible for cleaning up the old blocks. It has a method, cleanupOldBlocks, which is never called by any class. The ReceiverSupervisorImpl class holds a WriteAheadLogBasedBlockHandler instance; however, it only calls the storeBlock method to create WALs and never calls the cleanupOldBlocks method to purge old WALs. The size of the WAL folder increases constantly on HDFS. This is preventing us from running the ReliableKafkaReceiver 24x7. Can somebody please take a look? Thanks, Max -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
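The cleanup that cleanupOldBlocks is meant to perform amounts to a time-threshold purge of old log segments. The sketch below illustrates that policy in plain Python; the file names and data layout are illustrative, not Spark's actual WAL format:

```python
# Illustrative sketch (not Spark code) of a time-threshold WAL purge:
# any log segment whose end timestamp falls before the threshold is
# eligible for deletion. Segment names here are hypothetical.
def purge_old_logs(logs, threshold_time):
    """Split log segments into (kept, purged) by their end timestamp."""
    kept = [log for log in logs if log["end_time"] >= threshold_time]
    purged = [log for log in logs if log["end_time"] < threshold_time]
    return kept, purged

logs = [{"path": "log-0-100", "end_time": 100},
        {"path": "log-100-200", "end_time": 200},
        {"path": "log-200-300", "end_time": 300}]
kept, purged = purge_old_logs(logs, threshold_time=150)
```

The actual fix is simply to wire a call like this into the receiver supervisor so it runs periodically, rather than never.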
[jira] [Resolved] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5355. --- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 4143 [https://github.com/apache/spark/pull/4143] SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The SparkConf is not thread-safe, but it is accessed by many threads. getAll() can return a partial view of the configs if another thread is modifying it concurrently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5355) SparkConf is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5355: -- Assignee: Davies Liu SparkConf is not thread-safe Key: SPARK-5355 URL: https://issues.apache.org/jira/browse/SPARK-5355 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Fix For: 1.3.0, 1.2.1 The SparkConf is not thread-safe, but it is accessed by many threads. getAll() can return a partial view of the configs if another thread is modifying it concurrently. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
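The usual remedy for this class of bug is to snapshot the settings under a lock, so that a getAll()-style reader never iterates a map another thread is mutating. A minimal sketch of that copy-under-lock pattern (not Spark's actual implementation):

```python
import threading

# Minimal sketch of the copy-under-lock pattern: reads return an
# immutable snapshot taken while holding the same lock writers use,
# so a reader can never observe a half-updated map.
class SafeConf:
    def __init__(self):
        self._lock = threading.Lock()
        self._settings = {}

    def set(self, key, value):
        with self._lock:
            self._settings[key] = value

    def get_all(self):
        # Snapshot atomically; callers get a stable, sorted copy.
        with self._lock:
            return tuple(sorted(self._settings.items()))

conf = SafeConf()
conf.set("spark.app.name", "demo")
conf.set("spark.master", "local[2]")
```

An alternative with the same effect is a copy-on-write map, which trades write cost for lock-free reads.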
[jira] [Commented] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286701#comment-14286701 ] Apache Spark commented on SPARK-5362: - User 'avulanov' has created a pull request for this issue: https://github.com/apache/spark/pull/4152 Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network requires a Vector as output (instead of a label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
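The shape of the proposed generalization can be illustrated with a vector-valued loss gradient. The function below is a hypothetical sketch, not the PR's actual code: once the target is a Vector instead of a Double label, the same signature covers regression targets, one-hot class indicators, and whole neural-network output layers.

```python
# Hypothetical sketch of a gradient that takes a Vector output instead
# of a Double label: the squared-error gradient of a prediction.
def least_squares_gradient(prediction, output):
    """Gradient of 0.5 * ||prediction - output||^2 w.r.t. prediction.

    Both arguments are vectors (plain lists here), so a one-hot class
    indicator or an entire output layer fits the same API, which a
    single Double label cannot express.
    """
    return [p - o for p, o in zip(prediction, output)]

# A 3-class one-hot target, impossible to encode as one Double label:
grad = least_squares_gradient([0.2, 0.5, 0.3], [0.0, 1.0, 0.0])
```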
[jira] [Commented] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches
[ https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286703#comment-14286703 ] Alexander Ulanov commented on SPARK-5362: - https://github.com/apache/spark/pull/4152 Gradient and Optimizer to support generic output (instead of label) and data batches Key: SPARK-5362 URL: https://issues.apache.org/jira/browse/SPARK-5362 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Alexander Ulanov Fix For: 1.3.0 Original Estimate: 24h Remaining Estimate: 24h Currently, the Gradient and Optimizer interfaces support data in the form of RDD[(Double, Vector)], i.e. label and features. This limits their application to classification problems. For example, an artificial neural network requires a Vector as output (instead of a label: Double). Moreover, the current interface does not support data batches. I propose to replace label: Double with output: Vector. This enables passing a generic output instead of a label, and also passing data and output batches stored in corresponding vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5361) python tuple not supported while converting PythonRDD back to JavaRDD
[ https://issues.apache.org/jira/browse/SPARK-5361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286459#comment-14286459 ] Apache Spark commented on SPARK-5361: - User 'wingchen' has created a pull request for this issue: https://github.com/apache/spark/pull/4146 python tuple not supported while converting PythonRDD back to JavaRDD - Key: SPARK-5361 URL: https://issues.apache.org/jira/browse/SPARK-5361 Project: Spark Issue Type: Bug Components: PySpark Reporter: Winston Chen The existing `SerDeUtil.pythonToJava` implementation does not account for the tuple case: Pyrolite maps a `python tuple` to a `java Object[]`. So with the following data: {noformat} [ (u'2', {u'director': u'David Lean', u'genres': (u'Adventure', u'Biography', u'Drama'), u'title': u'Lawrence of Arabia', u'year': 1962}), (u'7', {u'director': u'Andrew Dominik', u'genres': (u'Biography', u'Crime', u'Drama'), u'title': u'The Assassination of Jesse James by the Coward Robert Ford', u'year': 2007}) ] {noformat} Exceptions happen with the `genres` part: {noformat} 15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7) java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157) at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) {noformat} There is already a pull request for this bug: https://github.com/apache/spark/pull/4146 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4631) Add real unit test for MQTT
[ https://issues.apache.org/jira/browse/SPARK-4631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-4631. -- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Add real unit test for MQTT Key: SPARK-4631 URL: https://issues.apache.org/jira/browse/SPARK-4631 Project: Spark Issue Type: Test Components: Streaming Reporter: Tathagata Das Priority: Critical Fix For: 1.3.0, 1.2.1 A real unit test that actually transfers data to ensure that the MQTTUtil is functional -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5063) Display more helpful error messages for several invalid operations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Target Version/s: 1.2.1 Display more helpful error messages for several invalid operations -- Key: SPARK-5063 URL: https://issues.apache.org/jira/browse/SPARK-5063 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] 
File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the {{__getnewargs__}} method to throw a more useful UnsupportedOperation exception with a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286447#comment-14286447 ] Davies Liu commented on SPARK-4939: --- Sent out a PR to change the local scheduler to revive offers periodically, it should be safer to be merged into 1.2. Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4939) Python updateStateByKey example hang in local mode
[ https://issues.apache.org/jira/browse/SPARK-4939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286445#comment-14286445 ] Apache Spark commented on SPARK-4939: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/4147 Python updateStateByKey example hang in local mode -- Key: SPARK-4939 URL: https://issues.apache.org/jira/browse/SPARK-4939 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Streaming Affects Versions: 1.2.0, 1.3.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286639#comment-14286639 ] Tathagata Das commented on SPARK-944: - I am closing this JIRA because it is no longer relevant. For examples, any reader can take a look at https://github.com/cloudera-labs/SparkOnHBase/blob/cdh5-0.0.1/src/main/java/com/cloudera/spark/hbase/example/JavaHBaseStreamingBulkPutExample.java Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4586) Python API for ML Pipeline
[ https://issues.apache.org/jira/browse/SPARK-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286693#comment-14286693 ] Apache Spark commented on SPARK-4586: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4151 Python API for ML Pipeline -- Key: SPARK-4586 URL: https://issues.apache.org/jira/browse/SPARK-4586 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Add Python API to the newly added ML pipeline and parameters. The initial design doc is posted here: https://docs.google.com/document/d/1vL-4f5Xm-7t-kwVSaBylP_ZPrktPZjaOb2dWONtZU2s/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5063) Display more helpful error messages for several invalid operations
[ https://issues.apache.org/jira/browse/SPARK-5063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5063: -- Summary: Display more helpful error messages for several invalid operations (was: Raise more helpful errors when RDD actions or transformations are called inside of transformations) Display more helpful error messages for several invalid operations -- Key: SPARK-5063 URL: https://issues.apache.org/jira/browse/SPARK-5063 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Josh Rosen Assignee: Josh Rosen Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow: - https://stackoverflow.com/questions/13770218/call-of-distinct-and-map-together-throws-npe-in-spark-library/14130534#14130534 - https://stackoverflow.com/questions/23793117/nullpointerexception-in-scala-spark-appears-to-be-caused-be-collection-type/23793399#23793399 - https://stackoverflow.com/questions/25997558/graphx-ive-got-nullpointerexception-inside-mapvertices/26003674#26003674 (those are just a sample of the ones that I've answered personally; there are many others). I think we can detect these errors by adding logic to {{RDD}} to check whether {{sc}} is null (e.g. turn {{sc}} into a getter function); we can use this to add a better error message. In PySpark, these errors manifest themselves slightly differently. Attempting to nest RDDs or perform actions inside of transformations results in pickle-time errors: {code} rdd1 = sc.parallelize(range(100)) rdd2 = sc.parallelize(range(100)) rdd1.mapPartitions(lambda x: [rdd2.map(lambda x: x)]) {code} produces {code} [...] 
File /Users/joshrosen/anaconda/lib/python2.7/pickle.py, line 306, in save rv = reduce(self.proto) File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py, line 538, in __call__ File /Users/joshrosen/Documents/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py, line 304, in get_return_value py4j.protocol.Py4JError: An error occurred while calling o21.__getnewargs__. Trace: py4j.Py4JException: Method __getnewargs__([]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342) at py4j.Gateway.invoke(Gateway.java:252) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:745) {code} We get the same error when attempting to broadcast an RDD in PySpark. For Python, improved error reporting could be as simple as overriding the {{__getnewargs__}} method to throw a more useful UnsupportedOperation exception with a more helpful error message. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
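The suggested guard (turning {{sc}} into a getter that fails loudly instead of producing an NPE) can be sketched as follows; the class and message here are illustrative, not Spark's actual wording:

```python
# Sketch of the guard described above: make the context field a getter
# that raises a descriptive error when an RDD is used inside another
# RDD's transformation (where its SparkContext reference is gone).
class FakeRDD:
    def __init__(self, sc):
        self._sc = sc  # becomes None when serialized into a task

    @property
    def context(self):
        if self._sc is None:
            raise RuntimeError(
                "RDD transformations and actions can only be invoked "
                "by the driver, not inside of other transformations.")
        return self._sc

rdd = FakeRDD(sc=None)  # simulates an RDD deserialized inside a task
try:
    rdd.context
except RuntimeError as e:
    message = str(e)
```

The point of the pattern is that the failure names the actual mistake (nesting RDD operations) rather than surfacing as a bare NullPointerException.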
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286706#comment-14286706 ] Alexander Ulanov commented on SPARK-5256: - I've implemented my proposition with Vector as output in https://issues.apache.org/jira/browse/SPARK-5362 Improving MLlib optimization APIs - Key: SPARK-5256 URL: https://issues.apache.org/jira/browse/SPARK-5256 Project: Spark Issue Type: Umbrella Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley *Goal*: Improve APIs for optimization *Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs. Eventually, I hope this leads to a design doc outlining: * current issues * requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms * ideal API * breakdown of smaller JIRAs needed to achieve that API I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2546) Configuration object thread safety issue
[ https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286971#comment-14286971 ] Tsuyoshi OZAWA commented on SPARK-2546: --- Now HADOOP-11209, the problem reported by [~joshrosen], is resolved by [~varun_saxena]'s contribution. Thanks for your reporting. Configuration object thread safety issue Key: SPARK-2546 URL: https://issues.apache.org/jira/browse/SPARK-2546 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.2, 1.1.0, 1.2.0 Reporter: Andrew Ash Assignee: Josh Rosen Priority: Critical Fix For: 1.1.1, 1.2.0, 1.0.3 // observed in 0.9.1 but expected to exist in 1.0.1 as well This ticket is copy-pasted from a thread on the dev@ list: {quote} We discovered a very interesting bug in Spark at work last week in Spark 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to thread safety issues. I believe it still applies in Spark 1.0.1 as well. Let me explain: Observations - Was running a relatively simple job (read from Avro files, do a map, do another map, write back to Avro files) - 412 of 413 tasks completed, but the last task was hung in RUNNING state - The 412 successful tasks completed in median time 3.4s - The last hung task didn't finish even in 20 hours - The executor with the hung task was responsible for 100% of one core of CPU usage - Jstack of the executor attached (relevant thread pasted below) Diagnosis After doing some code spelunking, we determined the issue was concurrent use of a Configuration object for each task on an executor. In Hadoop each task runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so the single-threaded access assumptions of the Configuration object no longer hold in Spark. The specific issue is that the AvroRecordReader actually _modifies_ the JobConf it's given when it's instantiated! 
It adds a key for the RPC protocol engine in the process of connecting to the Hadoop FileSystem. When many tasks start at the same time (like at the start of a job), many tasks are adding this configuration item to the one Configuration object at once. Internally Configuration uses a java.util.HashMap, which isn't thread-safe. The below post is an excellent explanation of what happens in the situation where multiple threads insert into a HashMap at the same time. http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html The gist is that you have a thread following a cycle of linked list nodes indefinitely. This exactly matches our observations of the 100% CPU core and also the final location in the stack trace. So it seems the way Spark shares a Configuration object between task threads in an executor is incorrect. We need some way to prevent concurrent access to a single Configuration object. Proposed fix We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets its own JobConf object (and thus Configuration object). The optimization of broadcasting the Configuration object across the cluster can remain, but on the other side I think it needs to be cloned for each task to allow for concurrent access. I'm not sure of the performance implications, but the comments suggest that the Configuration object is ~10KB so I would expect a clone on the object to be relatively speedy. Has this been observed before? Does my suggested fix make sense? I'd be happy to file a Jira ticket and continue discussion there for the right way to fix. Thanks! Andrew P.S. For others seeing this issue, our temporary workaround is to enable spark.speculation, which retries failed (or hung) tasks on other machines. 
{noformat} Executor task launch worker-6 daemon prio=10 tid=0x7f91f01fe000 nid=0x54b1 runnable [0x7f92d74f1000] java.lang.Thread.State: RUNNABLE at java.util.HashMap.transfer(HashMap.java:601) at java.util.HashMap.resize(HashMap.java:581) at java.util.HashMap.addEntry(HashMap.java:879) at java.util.HashMap.put(HashMap.java:505) at org.apache.hadoop.conf.Configuration.set(Configuration.java:803) at org.apache.hadoop.conf.Configuration.set(Configuration.java:783) at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662) at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193) at org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:436) at
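The clone-per-task fix proposed above can be illustrated with a small, self-contained sketch. Note this uses a plain java.util.HashMap standing in for Hadoop's Configuration, and getJobConf and the task body are hypothetical stand-ins, not Spark's actual code:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PerTaskConfig {
    // Shared, broadcast-style configuration: the read-only master copy.
    private static final Map<String, String> sharedConf = new HashMap<>();

    // Analogue of the proposed HadoopRDD.getJobConf() fix: hand each task its
    // own clone, so concurrent tasks never mutate the same HashMap.
    static Map<String, String> getJobConf() {
        synchronized (sharedConf) {           // guard the clone itself
            return new HashMap<>(sharedConf);
        }
    }

    public static void main(String[] args) throws Exception {
        sharedConf.put("fs.defaultFS", "hdfs://namenode:8020");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 8; i++) {
            final int task = i;
            pool.submit(() -> {
                Map<String, String> conf = getJobConf();
                // Each task may freely add keys (as AvroRecordReader does)
                // without racing against the other tasks.
                conf.put("rpc.engine.task", Integer.toString(task));
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(sharedConf.size()); // master copy untouched: 1
    }
}
```

The ~10KB-per-clone cost mentioned in the thread is the price of this isolation; the broadcast of the master copy across the cluster is unaffected.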
[jira] [Resolved] (SPARK-3424) KMeans Plus Plus is too slow
[ https://issues.apache.org/jira/browse/SPARK-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-3424. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4144 [https://github.com/apache/spark/pull/4144] KMeans Plus Plus is too slow Key: SPARK-3424 URL: https://issues.apache.org/jira/browse/SPARK-3424 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.2 Reporter: Derrick Burns Assignee: Derrick Burns Fix For: 1.3.0 The KMeansPlusPlus algorithm is implemented in time O(m k^2), where m is the number of rounds of the KMeansParallel algorithm and k is the number of clusters. This can be dramatically improved by maintaining the distance to the closest cluster center from round to round and incrementally updating that value for each point. The incremental update takes O(1) time per point, which reduces the running time of KMeansPlusPlus to O(m k). For large k, this is significant. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
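The incremental trick described above can be sketched as follows. This is a minimal standalone illustration, not Spark's actual MLlib code: since only the newest center can lower a point's nearest-center distance, each selection round costs O(1) per point instead of O(k).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class KMeansPlusPlusSketch {
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    // minDist[p] caches the squared distance from point p to its closest
    // chosen center. Each round only compares against the NEWEST center,
    // so the update is O(1) per point rather than an O(k) rescan.
    static List<double[]> chooseCenters(double[][] points, int k, Random rnd) {
        double[] minDist = new double[points.length];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        List<double[]> centers = new ArrayList<>();
        centers.add(points[rnd.nextInt(points.length)]);
        while (centers.size() < k) {
            double[] latest = centers.get(centers.size() - 1);
            double total = 0;
            for (int p = 0; p < points.length; p++) {
                // incremental update against the newest center only
                minDist[p] = Math.min(minDist[p], dist2(points[p], latest));
                total += minDist[p];
            }
            // standard k-means++ sampling, proportional to minDist
            double r = rnd.nextDouble() * total, acc = 0;
            int chosen = points.length - 1;
            for (int p = 0; p < points.length; p++) {
                acc += minDist[p];
                if (acc >= r) { chosen = p; break; }
            }
            centers.add(points[chosen]);
        }
        return centers;
    }
}
```

Over m rounds this is O(m k) distance updates per point in total, matching the bound claimed in the ticket.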
[jira] [Commented] (SPARK-5297) File Streams do not work with custom key/values
[ https://issues.apache.org/jira/browse/SPARK-5297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286750#comment-14286750 ] Apache Spark commented on SPARK-5297: - User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/4154 File Streams do not work with custom key/values --- Key: SPARK-5297 URL: https://issues.apache.org/jira/browse/SPARK-5297 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Reporter: Leonidas Fegaras Assignee: Saisai Shao Labels: backport-needed Fix For: 1.3.0 The following code: {code} stream_context.<K,V,SequenceFileInputFormat<K,V>>fileStream(directory) .foreachRDD(new Function<JavaPairRDD<K,V>,Void>() { public Void call ( JavaPairRDD<K,V> rdd ) throws Exception { for ( Tuple2<K,V> x: rdd.collect() ) System.out.println("# "+x._1+" "+x._2); return null; } }); stream_context.start(); stream_context.awaitTermination(); {code} for custom (serializable) classes K and V compiles fine but gives an error when I drop a new hadoop sequence file in the directory: {quote} 15/01/17 09:13:59 ERROR scheduler.JobScheduler: Error generating jobs for time 1421507639000 ms java.lang.ClassCastException: java.lang.Object cannot be cast to org.apache.hadoop.mapreduce.InputFormat at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:91) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:236) at org.apache.spark.streaming.dstream.FileInputDStream$$anonfun$3.apply(FileInputDStream.scala:234) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.streaming.dstream.FileInputDStream.org$apache$spark$streaming$dstream$FileInputDStream$$filesToRDD(FileInputDStream.scala:234) at org.apache.spark.streaming.dstream.FileInputDStream.compute(FileInputDStream.scala:128) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:296) at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:288) at scala.Option.orElse(Option.scala:257) {quote} The same classes K and V work fine for non-streaming Spark: {code} spark_context.newAPIHadoopFile(path,F.class,K.class,SequenceFileInputFormat.class,conf) {code} also streaming works fine for TextFileInputFormat. The issue is that class manifests are erased to object in the Java file stream constructor, but those are relied on downstream when creating the Hadoop RDD that backs each batch of the file stream. https://github.com/apache/spark/blob/v1.2.0/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala#L263 https://github.com/apache/spark/blob/v1.2.0/core/src/main/scala/org/apache/spark/SparkContext.scala#L753 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
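The erasure problem behind this bug can be reproduced in isolation. In the sketch below (hypothetical names; array creation stands in for the InputFormat instantiation), a generic method cannot recover its type parameter at runtime, so anything built from it collapses to Object, while threading an explicit Class token through the API, as the non-streaming newAPIHadoopFile overloads do, preserves the real type:

```java
import java.lang.reflect.Array;

public class ErasureDemo {
    // Erased: the JVM cannot recover K here, so the array is really an
    // Object[] whatever K is -- analogous to the fileStream manifests
    // collapsing to Object and causing the ClassCastException above.
    @SuppressWarnings("unchecked")
    static <K> K[] erasedArray(int n) {
        return (K[]) new Object[n];
    }

    // Fix pattern: pass an explicit Class<K> token so the runtime type
    // survives erasure.
    @SuppressWarnings("unchecked")
    static <K> K[] tokenArray(Class<K> cls, int n) {
        return (K[]) Array.newInstance(cls, n);   // a genuine K[]
    }

    public static void main(String[] args) {
        Object a = erasedArray(3);
        Object b = tokenArray(String.class, 3);
        System.out.println(a instanceof String[]); // false
        System.out.println(b instanceof String[]); // true
    }
}
```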
[jira] [Commented] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types
[ https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287004#comment-14287004 ] Yash Datta commented on SPARK-4786: --- https://github.com/apache/spark/pull/4156 Parquet filter pushdown for BYTE and SHORT types Key: SPARK-4786 URL: https://issues.apache.org/jira/browse/SPARK-4786 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Among all integral types, currently only INT and LONG predicates can be converted to Parquet filter predicate. BYTE and SHORT predicates can be covered by INT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
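Why can INT pushdown cover BYTE and SHORT? Because widening to int is lossless and order-preserving, a predicate on the widened values accepts exactly the same rows. A quick self-contained check (illustrative only, not Spark or Parquet code) confirms this exhaustively for bytes:

```java
public class WideningPredicate {
    static boolean byteLess(byte v, byte bound) { return v < bound; }

    // The widened form that could be handed to an INT32 Parquet filter.
    static boolean intLess(int v, int bound) { return v < bound; }

    public static void main(String[] args) {
        // Exhaustively confirm that a BYTE predicate and its widened INT
        // form agree on every possible byte value.
        byte bound = 42;
        boolean agree = true;
        for (int i = Byte.MIN_VALUE; i <= Byte.MAX_VALUE; i++) {
            byte v = (byte) i;
            agree &= byteLess(v, bound) == intLess(v, bound);
        }
        System.out.println(agree); // true
    }
}
```

The same argument applies to SHORT, since Parquet stores both logical types in an INT32 physical column.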
[jira] [Commented] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types
[ https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287003#comment-14287003 ] Apache Spark commented on SPARK-4786: - User 'saucam' has created a pull request for this issue: https://github.com/apache/spark/pull/4156 Parquet filter pushdown for BYTE and SHORT types Key: SPARK-4786 URL: https://issues.apache.org/jira/browse/SPARK-4786 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Among all integral types, currently only INT and LONG predicates can be converted to Parquet filter predicate. BYTE and SHORT predicates can be covered by INT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-4786) Parquet filter pushdown for BYTE and SHORT types
[ https://issues.apache.org/jira/browse/SPARK-4786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yash Datta updated SPARK-4786: -- Comment: was deleted (was: https://github.com/apache/spark/pull/4156) Parquet filter pushdown for BYTE and SHORT types Key: SPARK-4786 URL: https://issues.apache.org/jira/browse/SPARK-4786 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Lian Among all integral types, currently only INT and LONG predicates can be converted to Parquet filter predicate. BYTE and SHORT predicates can be covered by INT. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5342) Allow long running Spark apps to run on secure YARN/HDFS
[ https://issues.apache.org/jira/browse/SPARK-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14287018#comment-14287018 ] Hari Shreedharan commented on SPARK-5342: - I am considering just copying the keytab and principal to the application's staging directory, instead of using the distributed cache. Any suggestions on which is better? Allow long running Spark apps to run on secure YARN/HDFS Key: SPARK-5342 URL: https://issues.apache.org/jira/browse/SPARK-5342 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Hari Shreedharan Attachments: SparkYARN.pdf Currently, Spark apps cannot write to HDFS after the delegation tokens reach their expiry, which maxes out at 7 days. We must find a way to ensure that we can run applications for longer - for example, Spark Streaming apps are expected to run forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5363) Spark 1.2 freeze without error notification
Tassilo Klein created SPARK-5363: Summary: Spark 1.2 freeze without error notification Key: SPARK-5363 URL: https://issues.apache.org/jira/browse/SPARK-5363 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Tassilo Klein Priority: Critical Fix For: 1.2.1, 1.2.2 After a number of calls to a map().collect() statement, Spark freezes without reporting any error. Within the map a large broadcast variable is used. The freeze can be avoided by setting 'spark.python.worker.reuse = false' (Spark 1.2) or by using an earlier version, but at the price of lower speed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5347) InputMetrics bug when inputSplit is not instanceOf FileSplit
[ https://issues.apache.org/jira/browse/SPARK-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286738#comment-14286738 ] Hong Shen commented on SPARK-5347: -- In addition, we sometimes use other inputFormats whose inputSplit is not an instance of FileSplit, for example CombineFileSplit: {code} public class CombineFileSplit implements InputSplit {code} In this case, we can simply read bytesRead when the reader is closed. InputMetrics bug when inputSplit is not instanceOf FileSplit Key: SPARK-5347 URL: https://issues.apache.org/jira/browse/SPARK-5347 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Hong Shen When inputFormatClass is set to CombineFileInputFormat, input metrics show that input is empty. This did not occur in Spark 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
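The guard-and-fallback pattern the comment suggests might be sketched as follows. The types here are hypothetical stand-ins for the Hadoop classes, not Spark's actual implementation: report the split length up front only when the split is a FileSplit, and otherwise fall back to the byte counter observed when the record reader is closed.

```java
public class InputMetricsSketch {
    // Stand-ins for the Hadoop split types discussed above (for
    // illustration only).
    interface InputSplit {}
    static class FileSplit implements InputSplit {
        final long length;
        FileSplit(long length) { this.length = length; }
    }
    static class CombineFileSplit implements InputSplit {}

    // Only FileSplit carries a usable length up front; for anything else
    // (e.g. CombineFileSplit) use the bytes-read counter taken at close.
    static long bytesRead(InputSplit split, long bytesReadAtClose) {
        if (split instanceof FileSplit) {
            return ((FileSplit) split).length;
        }
        return bytesReadAtClose;
    }

    public static void main(String[] args) {
        System.out.println(bytesRead(new FileSplit(100), 0));       // 100
        System.out.println(bytesRead(new CombineFileSplit(), 7));   // 7
    }
}
```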
[jira] [Issue Comment Deleted] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Victor Tso updated SPARK-4105: -- Comment: was deleted (was: What's the fix version?) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle - Key: SPARK-4105 URL: https://issues.apache.org/jira/browse/SPARK-4105 Project: Spark Issue Type: Bug Components: Shuffle, Spark Core Affects Versions: 1.2.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Blocker We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during shuffle read. Here's a sample stacktrace from an executor: {code} 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 33053) java.io.IOException: FAILED_TO_UNCOMPRESS(5) at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78) at org.xerial.snappy.SnappyNative.rawUncompress(Native Method) at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391) at org.xerial.snappy.Snappy.uncompress(Snappy.java:427) at org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127) at org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88) at org.xerial.snappy.SnappyInputStream.init(SnappyInputStream.java:58) at org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128) at org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116) at org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52) at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159) at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158) at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} Here's another occurrence of a similar error: {code} java.io.IOException: failed to read chunk org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:348)
[jira] [Updated] (SPARK-5351) Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade()
[ https://issues.apache.org/jira/browse/SPARK-5351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-5351: Description: If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx => { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, "graph.txt") val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) ... 
was: If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx => { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, "graph.txt") val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82) at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193) at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191) ... 
Can't zip RDDs with unequal numbers of partitions in ReplicatedVertexView.upgrade() --- Key: SPARK-5351 URL: https://issues.apache.org/jira/browse/SPARK-5351 Project: Spark Issue Type: Bug Components: GraphX Reporter: Takeshi Yamamuro If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72; object GraphTest extends Logging { def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = { graph.aggregateMessages[Int]( ctx => { ctx.sendToSrc(1) ctx.sendToDst(2) }, _ + _) } } val g = GraphLoader.edgeListFile(sc, "graph.txt") val rdd = GraphTest.run(g) java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:204) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at
[jira] [Commented] (SPARK-5357) Upgrade from commons-codec 1.5
[ https://issues.apache.org/jira/browse/SPARK-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14286728#comment-14286728 ] Apache Spark commented on SPARK-5357: - User 'MattWhelan' has created a pull request for this issue: https://github.com/apache/spark/pull/4153 Upgrade from commons-codec 1.5 -- Key: SPARK-5357 URL: https://issues.apache.org/jira/browse/SPARK-5357 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Matthew Whelan Original Estimate: 24h Remaining Estimate: 24h Spark uses commons-codec 1.5, which has a race condition in Base64. That race was introduced in commons-codec 1.4 and resolved in 1.7. The current version of commons-codec is 1.10. Code that runs in Workers and assumes that Base64 is thread-safe will break because Spark is using a non-thread-safe version. See CODEC-96. In addition, the spark.files.userClassPathFirst mechanism is currently broken (bug to come), so there isn't a viable workaround for this issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
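For application code that cannot wait for the dependency upgrade, one possible workaround on JDK 8+ is java.util.Base64, whose encoders and decoders are immutable and safe to share across threads, unlike the racy commons-codec 1.4-1.6 Base64. A sketch (independent of Spark's own internal use of commons-codec):

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SafeBase64 {
    // java.util.Base64 encoders/decoders are immutable; a single shared
    // instance is safe for concurrent worker tasks.
    static final Base64.Encoder ENC = Base64.getEncoder();
    static final Base64.Decoder DEC = Base64.getDecoder();

    static String roundTrip(String s) {
        String encoded = ENC.encodeToString(s.getBytes(StandardCharsets.UTF_8));
        return new String(DEC.decode(encoded), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Hammer the shared encoder/decoder from many threads at once.
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<String>> out = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            final String msg = "task-" + i;
            out.add(pool.submit(() -> roundTrip(msg)));
        }
        for (int i = 0; i < 100; i++) {
            if (!out.get(i).get().equals("task-" + i))
                throw new AssertionError("corrupted round trip");
        }
        pool.shutdown();
        System.out.println("ok");
    }
}
```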