[jira] [Commented] (SPARK-3481) HiveComparisonTest throws exception of "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default"

2014-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132567#comment-14132567
 ] 

Apache Spark commented on SPARK-3481:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2377

> HiveComparisonTest throws exception of 
> "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
> default"
> ---
>
> Key: SPARK-3481
> URL: https://issues.apache.org/jira/browse/SPARK-3481
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Minor
>
> In local test, lots of exception raised like:
> {panel}
> 11:08:01.746 ERROR hive.ql.exec.DDLTask: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: 
> default
>   at 
> org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480)
>   at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237)
>   at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151)
>   at 
> org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65)
>   at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414)
>   at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192)
>   at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020)
>   at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888)
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272)
>   at 
> org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88)
>   at 
> org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158)
>   at org.scalatest.Suite$class.withFixture(Suite.scala:1121)
>   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167)
>   at org.scalatest.FunSuite.runTest(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
>   at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
>   at 
> org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
>   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
>   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200)
>   at org.scalatest.FunSuite.runTests(FunSuite.scala:1559)
>   at org.scalatest.Suite$class.run(Suite.scala:1423)
>   at 
> org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at 
> org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204)
>   at org.scalatest.SuperEngine.runImpl(Engine.scala:545)
>   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204)
>   at 
> org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(HiveComparisonTest.scala:41)
>   at 
> org.scalatest.Before

[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-09-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2883:

Target Version/s: 1.2.0

> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
>Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
> pm jobtracker.png
>
>
> Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add 
> documentation of its usage.






[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-09-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2883:

Priority: Blocker  (was: Major)

> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
>Priority: Blocker
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
> pm jobtracker.png
>
>
> Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add 
> documentation of its usage.






[jira] [Resolved] (SPARK-3455) **HotFix** Unit test failed due to can not resolve the attribute references

2014-09-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3455.
-
Resolution: Fixed

> **HotFix** Unit test failed due to can not resolve the attribute references
> ---
>
> Key: SPARK-3455
> URL: https://issues.apache.org/jira/browse/SPARK-3455
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Priority: Blocker
>
> The test case "SPARK-3349 partitioning after limit" failed, the exception as :
> {panel}
> 23:10:04.117 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in stage 
> 274.0 failed 1 times; aborting job
> [info] - SPARK-3349 partitioning after limit *** FAILED ***
> [info]   Exception thrown while executing query:
> [info]   == Parsed Logical Plan ==
> [info]   Project [*]
> [info]Join Inner, Some(('subset1.n = 'lowerCaseData.n))
> [info] UnresolvedRelation None, lowerCaseData, None
> [info] UnresolvedRelation None, subset1, None
> [info]   
> [info]   == Analyzed Logical Plan ==
> [info]   Project [n#605,l#606,n#12]
> [info]Join Inner, Some((n#12 = n#605))
> [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] 
> at mapPartitions at basicOperators.scala:219)
> [info] Limit 2
> [info]  Sort [n#12 DESC]
> [info]   Distinct 
> [info]Project [n#12]
> [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
> MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
> [info]   
> [info]   == Optimized Logical Plan ==
> [info]   Project [n#605,l#606,n#12]
> [info]Join Inner, Some((n#12 = n#605))
> [info] SparkLogicalPlan (ExistingRdd [n#605,l#606], MapPartitionsRDD[13] 
> at mapPartitions at basicOperators.scala:219)
> [info] Limit 2
> [info]  Sort [n#12 DESC]
> [info]   Distinct 
> [info]Project [n#12]
> [info] SparkLogicalPlan (ExistingRdd [n#607,l#608], 
> MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:219)
> [info]   
> [info]   == Physical Plan ==
> [info]   Project [n#605,l#606,n#12]
> [info]ShuffledHashJoin [n#605], [n#12], BuildRight
> [info] Exchange (HashPartitioning [n#605], 10)
> [info]  ExistingRdd [n#605,l#606], MapPartitionsRDD[13] at mapPartitions 
> at basicOperators.scala:219
> [info] Exchange (HashPartitioning [n#12], 10)
> [info]  TakeOrdered 2, [n#12 DESC]
> [info]   Distinct false
> [info]Exchange (HashPartitioning [n#12], 10)
> [info] Distinct true
> [info]  Project [n#12]
> [info]   ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
> mapPartitions at basicOperators.scala:219
> [info]   
> [info]   Code Generation: false
> [info]   == RDD ==
> [info]   == Exception ==
> [info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> execute, tree:
> [info]   Exchange (HashPartitioning [n#12], 10)
> [info]TakeOrdered 2, [n#12 DESC]
> [info] Distinct false
> [info]  Exchange (HashPartitioning [n#12], 10)
> [info]   Distinct true
> [info]Project [n#12]
> [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
> mapPartitions at basicOperators.scala:219
> [info]   
> [info]   org.apache.spark.sql.catalyst.errors.package$TreeNodeException: 
> execute, tree:
> [info]   Exchange (HashPartitioning [n#12], 10)
> [info]TakeOrdered 2, [n#12 DESC]
> [info] Distinct false
> [info]  Exchange (HashPartitioning [n#12], 10)
> [info]   Distinct true
> [info]Project [n#12]
> [info] ExistingRdd [n#607,l#608], MapPartitionsRDD[13] at 
> mapPartitions at basicOperators.scala:219
> [info]   
> [info]at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47)
> [info]at 
> org.apache.spark.sql.execution.Exchange.execute(Exchange.scala:44)
> [info]at 
> org.apache.spark.sql.execution.ShuffledHashJoin.execute(joins.scala:354)
> [info]at 
> org.apache.spark.sql.execution.Project.execute(basicOperators.scala:42)
> [info]at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:85)
> [info]at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
> [info]at 
> org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:40)
> [info]at 
> org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply$mcV$sp(SQLQuerySuite.scala:369)
> [info]at 
> org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
> [info]at 
> org.apache.spark.sql.SQLQuerySuite$$anonfun$31.apply(SQLQuerySuite.scala:362)
> [info]at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
> [info]at 
> org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
> [info]at org

[jira] [Updated] (SPARK-3500) coalesce() and repartition() of SchemaRDD is broken

2014-09-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3500:
--
Description: 
{code}
>>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
>>> '{"foo":"baz"}'])).coalesce(1)
Py4JError: An error occurred while calling o94.coalesce. Trace:
py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
java.lang.Boolean]) does not exist
{code}

repartition() is also missing.

  was:
{code}
>>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
>>> '{"foo":"baz"}'])).coalesce(1)
Py4JError: An error occurred while calling o94.coalesce. Trace:
py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
java.lang.Boolean]) does not exist
{code}

repartition() and distinct(N) are also missing.


> coalesce() and repartition() of SchemaRDD is broken
> ---
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}
> repartition() is also missing.






[jira] [Updated] (SPARK-3500) coalesce() and repartition() of SchemaRDD is broken

2014-09-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3500:
--
Summary: coalesce() and repartition() of SchemaRDD is broken  (was: 
SchemaRDD from jsonRDD() has not coalesce() method)

> coalesce() and repartition() of SchemaRDD is broken
> ---
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.1.1, 1.2.0
>
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}
> repartition() and distinct(N) are also missing.






[jira] [Resolved] (SPARK-3469) All TaskCompletionListeners should be called even if some of them fail

2014-09-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3469.

  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0  (was: 1.1.1, 1.2.0)

> All TaskCompletionListeners should be called even if some of them fail
> --
>
> Key: SPARK-3469
> URL: https://issues.apache.org/jira/browse/SPARK-3469
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.2.0
>
>
> If there are multiple TaskCompletionListeners and any one of them misbehaves 
> (e.g. throws an exception), then we will skip executing the rest of them.
> As we are increasingly relying on TaskCompletionListener for cleaning up 
> resources, we should make sure they are always called, even if the previous 
> ones fail.
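
A minimal sketch of the behavior being asked for, assuming the TaskCompletionListener interface in org.apache.spark.util (not the actual patch; the aggregated exception below is hypothetical):

{code}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.TaskContext
import org.apache.spark.util.TaskCompletionListener

// Run every listener even if earlier ones throw, then surface the collected
// failures once at the end instead of aborting on the first error.
def runAllCompletionListeners(
    listeners: Seq[TaskCompletionListener],
    context: TaskContext): Unit = {
  val errors = ArrayBuffer.empty[Throwable]
  listeners.reverse.foreach { listener =>      // last registered runs first
    try {
      listener.onTaskCompletion(context)
    } catch {
      case t: Throwable => errors += t         // remember the failure and keep going
    }
  }
  if (errors.nonEmpty) {
    // Hypothetical aggregation; the real fix may use a dedicated exception type.
    throw new RuntimeException(
      errors.size + " TaskCompletionListener(s) failed: " +
        errors.map(_.getMessage).mkString("; "))
  }
}
{code}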






[jira] [Commented] (SPARK-3517) mapPartitions is not correct clearing up the closure

2014-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132489#comment-14132489
 ] 

Apache Spark commented on SPARK-3517:
-

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/2376

> mapPartitions is not correct clearing up the closure
> 
>
> Key: SPARK-3517
> URL: https://issues.apache.org/jira/browse/SPARK-3517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Priority: Blocker
>
> {code}
>  for (iter <- 1 to totalIter) {
>   logInfo("Start Gibbs sampling (Iteration %d/%d)".format(iter, 
> totalIter))
>   val broadcastModel = data.context.broadcast(topicModel)
>   val previousCorpus = corpus
>   corpus = corpus.mapPartitions { docs =>
> val rand = new Random
> val topicModel = broadcastModel.value
> val topicThisTerm = BDV.zeros[Double](numTopics)
> docs.map { doc =>
>   val content = doc.content
>   val topics = doc.topics
>   val topicsDist = doc.topicsDist
>   for (i <- 0 until content.length) {
> val term = content(i)
> val topic = topics(i)
> val chosenTopic = topicModel.dropOneDistSampler(topicsDist, 
> topicThisTerm,
>   rand, term, topic)
> if (topic != chosenTopic) {
>   topics(i) = chosenTopic
>   topicsDist(topic) += -1
>   topicsDist(chosenTopic) += 1
>   topicModel.update(term, topic, -1)
>   topicModel.update(term, chosenTopic, 1)
> }
>   }
>   doc
> }
>   }.setName(s"LDA-$iter").persist(StorageLevel.MEMORY_AND_DISK)
>   }
> {code}
> The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.
> {cat spark.log | grep 'stored as values in memory'} =>
> {noformat}
> .
> 14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in 
> memory (estimated size 68.6 KB, free 2.8 GB)
> 14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in 
> memory (estimated size 41.7 KB, free 2.8 GB)
> 14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in 
> memory (estimated size 197.5 MB, free 2.6 GB)
> 14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in 
> memory (estimated size 197.7 MB, free 2.3 GB)
> 14/09/13 00:53:25 INFO MemoryStore: Block broadcast_13 stored as values in 
> memory (estimated size 163.9 MB, free 2.1 GB)
> 14/09/13 00:53:28 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 164.0 MB, free 1878.0 MB)
> 14/09/13 00:57:34 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 149.7 MB, free 1658.5 MB)
> 14/09/13 00:57:36 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 150.0 MB, free 1444.0 MB)
> 14/09/13 01:01:34 INFO MemoryStore: Block broadcast_17 stored as values in 
> memory (estimated size 141.1 MB, free 1238.3 MB)
> 14/09/13 01:01:36 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 141.2 MB, free 1036.2 MB)
> 14/09/13 01:05:12 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 134.5 MB, free 840.7 MB)
> 14/09/13 01:05:14 INFO MemoryStore: Block broadcast_20 stored as values in 
> memory (estimated size 134.7 MB, free 647.8 MB)
> 14/09/13 01:08:39 INFO MemoryStore: Block broadcast_21 stored as values in 
> memory (estimated size 218.3 KB, free 589.5 MB)
> 14/09/13 01:08:39 INFO MemoryStore: Block broadcast_22 stored as values in 
> memory (estimated size 218.3 KB, free 589.2 MB)
> 14/09/13 01:08:40 INFO MemoryStore: Block broadcast_23 stored as values in 
> memory (estimated size 134.6 MB, free 454.6 MB)
> 14/09/13 01:08:53 INFO MemoryStore: Block broadcast_24 stored as values in 
> memory (estimated size 129.3 MB, free 267.1 MB)
> 14/09/13 01:08:55 INFO MemoryStore: Block broadcast_25 stored as values in 
> memory (estimated size 129.4 MB, free 82.0 MB)
> {noformat}
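
One way to sanity-check what a closure actually drags along is to serialize it the same way Spark serializes task closures; a rough diagnostic sketch, assuming a live SparkEnv (i.e. an active SparkContext):

{code}
import org.apache.spark.SparkEnv

// Serialize an arbitrary function with Spark's closure serializer and report its
// size in bytes, to see whether something large (e.g. the topic model) was
// captured alongside the broadcast handle.
def serializedClosureSize(closure: AnyRef): Int = {
  val ser = SparkEnv.get.closureSerializer.newInstance()
  ser.serialize(closure).limit()   // ByteBuffer limit == number of serialized bytes
}
{code}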






[jira] [Resolved] (SPARK-3500) SchemaRDD from jsonRDD() has not coalesce() method

2014-09-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3500.
---
   Resolution: Fixed
Fix Version/s: 1.1.1
   1.2.0

Issue resolved by pull request 2369
[https://github.com/apache/spark/pull/2369]

> SchemaRDD from jsonRDD() has not coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.2.0, 1.1.1
>
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}
> repartition() and distinct(N) are also missing.






[jira] [Updated] (SPARK-2883) Spark Support for ORCFile format

2014-09-12 Thread Fi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fi updated SPARK-2883:
--
Attachment: 2014-09-12 07.07.19 pm jobtracker.png
2014-09-12 07.05.24 pm Spark UI.png

> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
> Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
> pm jobtracker.png
>
>
> Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add 
> documentation of its usage.






[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-09-12 Thread Fi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132472#comment-14132472
 ] 

Fi commented on SPARK-2883:
---

I was able to run a simple query and access my ORC hive table through the 
PySpark shell.

Happily, things seemed to work.

However, the problem is I/O efficiency.

Looking at the Spark UI (port 4040), it is clear that Spark is reading the entire 
ORC file from HDFS instead of taking advantage of the columnar format and 
reading just the columns I was requesting.

The partition I am querying is 57GB on disk.
The Spark UI shows that it read the full 57GB, even though I queried only a handful 
of columns.

I tried the equivalent query in standard Hive. The Job Tracker showed that it 
only read 1GB of data from HDFS, which is closer to what I was expecting (and 
results in 57x less network/disk I/O).

It would be great if Spark SQL performed similarly to regular Hive here.

I will attach a couple of screenshots from the Spark UI and the Job Tracker.

Here is my PySpark command:

sqlc = HiveContext(sc)

rdd = sqlc.sql("SELECT "
               "  r.ts, "
               "  r.pid, r.sid, r.aid, r.netid "
               "FROM "
               "  orc_rt.orc_ai_6000 r "
               "WHERE "
               "  r.year=2014 and r.MONTH=9 and r.DAY=11 and hour(ts) == 12 "
               "LIMIT 1000")
rdd.collect()

and here is what I ran in Hive:

select ts, pid, sid, aid, netid from orc_rt.orc_ai_6000 where year=2014 and 
month=9 and day=11 and hour(ts) == 12 limit 1000;


Software
===
Spark 1.1.0
Mesos 0.18.2
Hive 0.12
IPython 1.2.1
Python 2.7.6
MaprV3
Docker 1.1.2   (mesos/spark running in a docker container)
Kernel: Linux 2.6.32-431.23.3.el6.x86_64 #1 SMP Thu Jul 31 17:20:51 UTC 2014 
x86_64 x86_64 x86_64 GNU/Linux
Host OS: CentOS release 6.5 (Final)  (running under a XEN Hypervisor)


> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
>
> Verify the support of OrcInputFormat in Spark, fix issues if they exist, and add 
> documentation of its usage.






[jira] [Updated] (SPARK-3517) mapPartitions is not correct clearing up the closure

2014-09-12 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3517:
---
Summary: mapPartitions is not correct clearing up the closure  (was: 
mapPartitions is not correct clearing closure)

> mapPartitions is not correct clearing up the closure
> 
>
> Key: SPARK-3517
> URL: https://issues.apache.org/jira/browse/SPARK-3517
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Guoqiang Li
>Priority: Blocker
>
> {code}
>  for (iter <- 1 to totalIter) {
>   logInfo("Start Gibbs sampling (Iteration %d/%d)".format(iter, 
> totalIter))
>   val broadcastModel = data.context.broadcast(topicModel)
>   val previousCorpus = corpus
>   corpus = corpus.mapPartitions { docs =>
> val rand = new Random
> val topicModel = broadcastModel.value
> val topicThisTerm = BDV.zeros[Double](numTopics)
> docs.map { doc =>
>   val content = doc.content
>   val topics = doc.topics
>   val topicsDist = doc.topicsDist
>   for (i <- 0 until content.length) {
> val term = content(i)
> val topic = topics(i)
> val chosenTopic = topicModel.dropOneDistSampler(topicsDist, 
> topicThisTerm,
>   rand, term, topic)
> if (topic != chosenTopic) {
>   topics(i) = chosenTopic
>   topicsDist(topic) += -1
>   topicsDist(chosenTopic) += 1
>   topicModel.update(term, topic, -1)
>   topicModel.update(term, chosenTopic, 1)
> }
>   }
>   doc
> }
>   }.setName(s"LDA-$iter").persist(StorageLevel.MEMORY_AND_DISK)
>   }
> {code}
> The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.
> {cat spark.log | grep 'stored as values in memory'} =>
> {noformat}
> .
> 14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in 
> memory (estimated size 68.6 KB, free 2.8 GB)
> 14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in 
> memory (estimated size 41.7 KB, free 2.8 GB)
> 14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in 
> memory (estimated size 197.5 MB, free 2.6 GB)
> 14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in 
> memory (estimated size 197.7 MB, free 2.3 GB)
> 14/09/13 00:53:25 INFO MemoryStore: Block broadcast_13 stored as values in 
> memory (estimated size 163.9 MB, free 2.1 GB)
> 14/09/13 00:53:28 INFO MemoryStore: Block broadcast_14 stored as values in 
> memory (estimated size 164.0 MB, free 1878.0 MB)
> 14/09/13 00:57:34 INFO MemoryStore: Block broadcast_15 stored as values in 
> memory (estimated size 149.7 MB, free 1658.5 MB)
> 14/09/13 00:57:36 INFO MemoryStore: Block broadcast_16 stored as values in 
> memory (estimated size 150.0 MB, free 1444.0 MB)
> 14/09/13 01:01:34 INFO MemoryStore: Block broadcast_17 stored as values in 
> memory (estimated size 141.1 MB, free 1238.3 MB)
> 14/09/13 01:01:36 INFO MemoryStore: Block broadcast_18 stored as values in 
> memory (estimated size 141.2 MB, free 1036.2 MB)
> 14/09/13 01:05:12 INFO MemoryStore: Block broadcast_19 stored as values in 
> memory (estimated size 134.5 MB, free 840.7 MB)
> 14/09/13 01:05:14 INFO MemoryStore: Block broadcast_20 stored as values in 
> memory (estimated size 134.7 MB, free 647.8 MB)
> 14/09/13 01:08:39 INFO MemoryStore: Block broadcast_21 stored as values in 
> memory (estimated size 218.3 KB, free 589.5 MB)
> 14/09/13 01:08:39 INFO MemoryStore: Block broadcast_22 stored as values in 
> memory (estimated size 218.3 KB, free 589.2 MB)
> 14/09/13 01:08:40 INFO MemoryStore: Block broadcast_23 stored as values in 
> memory (estimated size 134.6 MB, free 454.6 MB)
> 14/09/13 01:08:53 INFO MemoryStore: Block broadcast_24 stored as values in 
> memory (estimated size 129.3 MB, free 267.1 MB)
> 14/09/13 01:08:55 INFO MemoryStore: Block broadcast_25 stored as values in 
> memory (estimated size 129.4 MB, free 82.0 MB)
> {noformat}






[jira] [Updated] (SPARK-3517) mapPartitions is not correct clearing closure

2014-09-12 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3517:
---
Description: 
{code}
 for (iter <- 1 to totalIter) {
  logInfo("Start Gibbs sampling (Iteration %d/%d)".format(iter, totalIter))
  val broadcastModel = data.context.broadcast(topicModel)
  val previousCorpus = corpus
  corpus = corpus.mapPartitions { docs =>
val rand = new Random
val topicModel = broadcastModel.value
val topicThisTerm = BDV.zeros[Double](numTopics)
docs.map { doc =>
  val content = doc.content
  val topics = doc.topics
  val topicsDist = doc.topicsDist
  for (i <- 0 until content.length) {
val term = content(i)
val topic = topics(i)
val chosenTopic = topicModel.dropOneDistSampler(topicsDist, 
topicThisTerm,
  rand, term, topic)
if (topic != chosenTopic) {
  topics(i) = chosenTopic
  topicsDist(topic) += -1
  topicsDist(chosenTopic) += 1
  topicModel.update(term, topic, -1)
  topicModel.update(term, chosenTopic, 1)
}
  }
  doc
}
  }.setName(s"LDA-$iter").persist(StorageLevel.MEMORY_AND_DISK)
{code}
The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.


{cat spark.log | grep 'stored as values in memory'} =>
{noformat}
.
14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in 
memory (estimated size 68.6 KB, free 2.8 GB)
14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in 
memory (estimated size 41.7 KB, free 2.8 GB)
14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in 
memory (estimated size 197.5 MB, free 2.6 GB)
14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in 
memory (estimated size 197.7 MB, free 2.3 GB)
14/09/13 00:53:25 INFO MemoryStore: Block broadcast_13 stored as values in 
memory (estimated size 163.9 MB, free 2.1 GB)
14/09/13 00:53:28 INFO MemoryStore: Block broadcast_14 stored as values in 
memory (estimated size 164.0 MB, free 1878.0 MB)
14/09/13 00:57:34 INFO MemoryStore: Block broadcast_15 stored as values in 
memory (estimated size 149.7 MB, free 1658.5 MB)
14/09/13 00:57:36 INFO MemoryStore: Block broadcast_16 stored as values in 
memory (estimated size 150.0 MB, free 1444.0 MB)
14/09/13 01:01:34 INFO MemoryStore: Block broadcast_17 stored as values in 
memory (estimated size 141.1 MB, free 1238.3 MB)
14/09/13 01:01:36 INFO MemoryStore: Block broadcast_18 stored as values in 
memory (estimated size 141.2 MB, free 1036.2 MB)
14/09/13 01:05:12 INFO MemoryStore: Block broadcast_19 stored as values in 
memory (estimated size 134.5 MB, free 840.7 MB)
14/09/13 01:05:14 INFO MemoryStore: Block broadcast_20 stored as values in 
memory (estimated size 134.7 MB, free 647.8 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_21 stored as values in 
memory (estimated size 218.3 KB, free 589.5 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_22 stored as values in 
memory (estimated size 218.3 KB, free 589.2 MB)
14/09/13 01:08:40 INFO MemoryStore: Block broadcast_23 stored as values in 
memory (estimated size 134.6 MB, free 454.6 MB)
14/09/13 01:08:53 INFO MemoryStore: Block broadcast_24 stored as values in 
memory (estimated size 129.3 MB, free 267.1 MB)
14/09/13 01:08:55 INFO MemoryStore: Block broadcast_25 stored as values in 
memory (estimated size 129.4 MB, free 82.0 MB)
{noformat}

  was:
{code}
 val broadcastModel = data.context.broadcast(topicModel)
  val previousCorpus = corpus
  corpus = corpus.mapPartitions { docs =>
val rand = new Random
val topicModel = broadcastModel.value
val topicThisTerm = BDV.zeros[Double](numTopics)
docs.map { doc =>
  val content = doc.content
  val topics = doc.topics
  val topicsDist = doc.topicsDist
  for (i <- 0 until content.length) {
val term = content(i)
val topic = topics(i)
val chosenTopic = topicModel.dropOneDistSampler(topicsDist, 
topicThisTerm,
  rand, term, topic)
if (topic != chosenTopic) {
  topics(i) = chosenTopic
  topicsDist(topic) += -1
  topicsDist(chosenTopic) += 1
  topicModel.update(term, topic, -1)
  topicModel.update(term, chosenTopic, 1)
}
  }
  doc
}
  }.setName(s"LDA-$iter").persist(StorageLevel.MEMORY_AND_DISK)
{code}
The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.


{cat spark.log | grep 'stored as values in memory'} =>
{noformat}
.
14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in 
memory (estimated size 68.6 KB, free 2.8 GB)
14/09/13 00:48:45 INFO MemoryStore: Block broadcas

[jira] [Updated] (SPARK-3517) mapPartitions is not correct clearing closure

2014-09-12 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-3517:
---
Description: 
{code}
 for (iter <- 1 to totalIter) {
  logInfo("Start Gibbs sampling (Iteration %d/%d)".format(iter, totalIter))
  val broadcastModel = data.context.broadcast(topicModel)
  val previousCorpus = corpus
  corpus = corpus.mapPartitions { docs =>
val rand = new Random
val topicModel = broadcastModel.value
val topicThisTerm = BDV.zeros[Double](numTopics)
docs.map { doc =>
  val content = doc.content
  val topics = doc.topics
  val topicsDist = doc.topicsDist
  for (i <- 0 until content.length) {
val term = content(i)
val topic = topics(i)
val chosenTopic = topicModel.dropOneDistSampler(topicsDist, 
topicThisTerm,
  rand, term, topic)
if (topic != chosenTopic) {
  topics(i) = chosenTopic
  topicsDist(topic) += -1
  topicsDist(chosenTopic) += 1
  topicModel.update(term, topic, -1)
  topicModel.update(term, chosenTopic, 1)
}
  }
  doc
}
  }.setName(s"LDA-$iter").persist(StorageLevel.MEMORY_AND_DISK)
  }
{code}
The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.


{cat spark.log | grep 'stored as values in memory'} =>
{noformat}
.
14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in 
memory (estimated size 68.6 KB, free 2.8 GB)
14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in 
memory (estimated size 41.7 KB, free 2.8 GB)
14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in 
memory (estimated size 197.5 MB, free 2.6 GB)
14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in 
memory (estimated size 197.7 MB, free 2.3 GB)
14/09/13 00:53:25 INFO MemoryStore: Block broadcast_13 stored as values in 
memory (estimated size 163.9 MB, free 2.1 GB)
14/09/13 00:53:28 INFO MemoryStore: Block broadcast_14 stored as values in 
memory (estimated size 164.0 MB, free 1878.0 MB)
14/09/13 00:57:34 INFO MemoryStore: Block broadcast_15 stored as values in 
memory (estimated size 149.7 MB, free 1658.5 MB)
14/09/13 00:57:36 INFO MemoryStore: Block broadcast_16 stored as values in 
memory (estimated size 150.0 MB, free 1444.0 MB)
14/09/13 01:01:34 INFO MemoryStore: Block broadcast_17 stored as values in 
memory (estimated size 141.1 MB, free 1238.3 MB)
14/09/13 01:01:36 INFO MemoryStore: Block broadcast_18 stored as values in 
memory (estimated size 141.2 MB, free 1036.2 MB)
14/09/13 01:05:12 INFO MemoryStore: Block broadcast_19 stored as values in 
memory (estimated size 134.5 MB, free 840.7 MB)
14/09/13 01:05:14 INFO MemoryStore: Block broadcast_20 stored as values in 
memory (estimated size 134.7 MB, free 647.8 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_21 stored as values in 
memory (estimated size 218.3 KB, free 589.5 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_22 stored as values in 
memory (estimated size 218.3 KB, free 589.2 MB)
14/09/13 01:08:40 INFO MemoryStore: Block broadcast_23 stored as values in 
memory (estimated size 134.6 MB, free 454.6 MB)
14/09/13 01:08:53 INFO MemoryStore: Block broadcast_24 stored as values in 
memory (estimated size 129.3 MB, free 267.1 MB)
14/09/13 01:08:55 INFO MemoryStore: Block broadcast_25 stored as values in 
memory (estimated size 129.4 MB, free 82.0 MB)
{noformat}

  was:
{code}
 for (iter <- 1 to totalIter) {
  logInfo("Start Gibbs sampling (Iteration %d/%d)".format(iter, totalIter))
  val broadcastModel = data.context.broadcast(topicModel)
  val previousCorpus = corpus
  corpus = corpus.mapPartitions { docs =>
val rand = new Random
val topicModel = broadcastModel.value
val topicThisTerm = BDV.zeros[Double](numTopics)
docs.map { doc =>
  val content = doc.content
  val topics = doc.topics
  val topicsDist = doc.topicsDist
  for (i <- 0 until content.length) {
val term = content(i)
val topic = topics(i)
val chosenTopic = topicModel.dropOneDistSampler(topicsDist, 
topicThisTerm,
  rand, term, topic)
if (topic != chosenTopic) {
  topics(i) = chosenTopic
  topicsDist(topic) += -1
  topicsDist(chosenTopic) += 1
  topicModel.update(term, topic, -1)
  topicModel.update(term, chosenTopic, 1)
}
  }
  doc
}
  }.setName(s"LDA-$iter").persist(StorageLevel.MEMORY_AND_DISK)
{code}
The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.


{cat spark.log | grep 'stored as values in memory'} =>
{noformat}
.
14/09/13 00:48:44 INFO MemoryStore: Block broadca

[jira] [Created] (SPARK-3517) mapPartitions is not correct clearing closure

2014-09-12 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-3517:
--

 Summary: mapPartitions is not correct clearing closure
 Key: SPARK-3517
 URL: https://issues.apache.org/jira/browse/SPARK-3517
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.1.0
Reporter: Guoqiang Li
Priority: Blocker


{code}
 val broadcastModel = data.context.broadcast(topicModel)
  val previousCorpus = corpus
  corpus = corpus.mapPartitions { docs =>
val rand = new Random
val topicModel = broadcastModel.value
val topicThisTerm = BDV.zeros[Double](numTopics)
docs.map { doc =>
  val content = doc.content
  val topics = doc.topics
  val topicsDist = doc.topicsDist
  for (i <- 0 until content.length) {
val term = content(i)
val topic = topics(i)
val chosenTopic = topicModel.dropOneDistSampler(topicsDist, 
topicThisTerm,
  rand, term, topic)
if (topic != chosenTopic) {
  topics(i) = chosenTopic
  topicsDist(topic) += -1
  topicsDist(chosenTopic) += 1
  topicModel.update(term, topic, -1)
  topicModel.update(term, chosenTopic, 1)
}
  }
  doc
}
  }.setName(s"LDA-$iter").persist(StorageLevel.MEMORY_AND_DISK)
{code}
The serialized corpus RDD and the serialized topicModel broadcast are almost the same size.


{cat spark.log | grep 'stored as values in memory'} =>
{noformat}
.
14/09/13 00:48:44 INFO MemoryStore: Block broadcast_9 stored as values in 
memory (estimated size 68.6 KB, free 2.8 GB)
14/09/13 00:48:45 INFO MemoryStore: Block broadcast_10 stored as values in 
memory (estimated size 41.7 KB, free 2.8 GB)
14/09/13 00:49:21 INFO MemoryStore: Block broadcast_11 stored as values in 
memory (estimated size 197.5 MB, free 2.6 GB)
14/09/13 00:49:24 INFO MemoryStore: Block broadcast_12 stored as values in 
memory (estimated size 197.7 MB, free 2.3 GB)
14/09/13 00:53:25 INFO MemoryStore: Block broadcast_13 stored as values in 
memory (estimated size 163.9 MB, free 2.1 GB)
14/09/13 00:53:28 INFO MemoryStore: Block broadcast_14 stored as values in 
memory (estimated size 164.0 MB, free 1878.0 MB)
14/09/13 00:57:34 INFO MemoryStore: Block broadcast_15 stored as values in 
memory (estimated size 149.7 MB, free 1658.5 MB)
14/09/13 00:57:36 INFO MemoryStore: Block broadcast_16 stored as values in 
memory (estimated size 150.0 MB, free 1444.0 MB)
14/09/13 01:01:34 INFO MemoryStore: Block broadcast_17 stored as values in 
memory (estimated size 141.1 MB, free 1238.3 MB)
14/09/13 01:01:36 INFO MemoryStore: Block broadcast_18 stored as values in 
memory (estimated size 141.2 MB, free 1036.2 MB)
14/09/13 01:05:12 INFO MemoryStore: Block broadcast_19 stored as values in 
memory (estimated size 134.5 MB, free 840.7 MB)
14/09/13 01:05:14 INFO MemoryStore: Block broadcast_20 stored as values in 
memory (estimated size 134.7 MB, free 647.8 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_21 stored as values in 
memory (estimated size 218.3 KB, free 589.5 MB)
14/09/13 01:08:39 INFO MemoryStore: Block broadcast_22 stored as values in 
memory (estimated size 218.3 KB, free 589.2 MB)
14/09/13 01:08:40 INFO MemoryStore: Block broadcast_23 stored as values in 
memory (estimated size 134.6 MB, free 454.6 MB)
14/09/13 01:08:53 INFO MemoryStore: Block broadcast_24 stored as values in 
memory (estimated size 129.3 MB, free 267.1 MB)
14/09/13 01:08:55 INFO MemoryStore: Block broadcast_25 stored as values in 
memory (estimated size 129.4 MB, free 82.0 MB)
{noformat}






[jira] [Resolved] (SPARK-3094) Support run pyspark in PyPy

2014-09-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3094.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2144
[https://github.com/apache/spark/pull/2144]

> Support run pyspark in PyPy
> ---
>
> Key: SPARK-3094
> URL: https://issues.apache.org/jira/browse/SPARK-3094
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.2.0
>
>
> PyPy is much faster than CPython (about 5x); running PySpark in PyPy will also be 
> useful for computation-heavy pure-Python applications.






[jira] [Resolved] (SPARK-3456) YarnAllocator can lose container requests to RM

2014-09-12 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3456.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Thomas Graves

> YarnAllocator can lose container requests to RM
> ---
>
> Key: SPARK-3456
> URL: https://issues.apache.org/jira/browse/SPARK-3456
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
> Fix For: 1.2.0
>
>
> I haven't actually tested this yet, but I believe that Spark on YARN can lose 
> container requests to the RM. The reason is that we ask for the total number 
> upfront (say x), but then we don't ask for any more unless some are missing, and 
> if we do, we could erase the original request.
> For example:
> - ask for 3 containers
> - 1 is allocated
> - ask for 0 containers, since we asked for 3 originally (2 left)
> - the 1 allocated container dies
> - we now ask for 1 since it's missing; this will override whatever is on the 
> YARN side (in this case 2).
> Then we lose the 2 more we need.
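
To make the arithmetic above concrete, a hypothetical sketch (names made up, not YarnAllocator's API) of why the ask has to be the full shortfall rather than only the newly missing containers:

{code}
// Hypothetical arithmetic for the scenario above, not actual YarnAllocator code.
// If each request sent to the RM replaces the outstanding ask, the ask must always
// be the full shortfall (target - running), never just the newly missing containers.
object ContainerAskMath {
  val targetTotal = 3

  def askToSend(running: Int): Int = math.max(targetTotal - running, 0)

  // ask for 3, one gets allocated:  askToSend(1) == 2  (2 still outstanding)
  // that one container then dies:   askToSend(0) == 3  (not 1)
  // Sending only the 1 "missing" container would overwrite the 2 still pending
  // on the RM side, which is exactly the loss described in this ticket.
}
{code}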






[jira] [Created] (SPARK-3516) DecisionTree Python support for params maxInstancesPerNode, maxInfoGain

2014-09-12 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-3516:


 Summary: DecisionTree Python support for params 
maxInstancesPerNode, maxInfoGain
 Key: SPARK-3516
 URL: https://issues.apache.org/jira/browse/SPARK-3516
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley
Priority: Minor


Add DecisionTree parameters to Python API:
* maxInstancesPerNode
* maxInfoGain






[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has not coalesce() method

2014-09-12 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132420#comment-14132420
 ] 

Josh Rosen commented on SPARK-3500:
---

This feels like a bug, not a missing feature, since SchemaRDD instances have a 
public method that always throws an exception when called.  It seems fair to 
include this fix in 1.1.1.

> SchemaRDD from jsonRDD() has not coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}
> repartition() and distinct(N) are also missing.






[jira] [Commented] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven

2014-09-12 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132406#comment-14132406
 ] 

Cheng Lian commented on SPARK-3515:
---

The bug fixed by SPARK-3481 had actually covered up the bug described in this ticket.

> ParquetMetastoreSuite fails when executed together with other suites under 
> Maven
> 
>
> Key: SPARK-3515
> URL: https://issues.apache.org/jira/browse/SPARK-3515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
>
> Reproduction step:
> {code}
> mvn -Phive,hadoop-2.4 
> -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite
>  -pl core,sql/catalyst,sql/core,sql/hive test
> {code}
> Maven instantiates all discovered test suite objects first, and then starts 
> executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary 
> tables in its constructor, but these tables are deleted immediately because 
> {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}.
> To fix this issue, we shouldn't put this kind of side effect in the constructor, 
> but in {{beforeAll}}.
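
A minimal ScalaTest sketch of the suggested change, with a hypothetical table name (the real suite's setup is more involved):

{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}
import org.apache.spark.sql.hive.test.TestHive

// Create the temporary tables in beforeAll instead of the constructor, so another
// suite's constructor-time TestHive.reset() cannot wipe them before the tests run.
class ParquetMetastoreSuiteSketch extends FunSuite with BeforeAndAfterAll {

  override def beforeAll(): Unit = {
    // Hypothetical table standing in for the suite's real partitioned Parquet tables.
    TestHive.sql("CREATE TABLE IF NOT EXISTS parquet_metastore_sketch (key INT, value STRING)")
  }

  override def afterAll(): Unit = {
    TestHive.sql("DROP TABLE IF EXISTS parquet_metastore_sketch")
  }

  test("tables created in beforeAll are visible to the tests") {
    assert(TestHive.sql("SELECT COUNT(*) FROM parquet_metastore_sketch").collect().nonEmpty)
  }
}
{code}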






[jira] [Commented] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven

2014-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132378#comment-14132378
 ] 

Apache Spark commented on SPARK-3515:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2375

> ParquetMetastoreSuite fails when executed together with other suites under 
> Maven
> 
>
> Key: SPARK-3515
> URL: https://issues.apache.org/jira/browse/SPARK-3515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.1
>Reporter: Cheng Lian
>
> Reproduction step:
> {code}
> mvn -Phive,hadoop-2.4 
> -DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite
>  -pl core,sql/catalyst,sql/core,sql/hive test
> {code}
> Maven instantiates all discovered test suite objects first, and then starts 
> executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary 
> tables in its constructor, but these tables are deleted immediately because 
> {{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}.
> To fix this issue, we shouldn't put this kind of side effect in the constructor, 
> but in {{beforeAll}}.






[jira] [Created] (SPARK-3515) ParquetMetastoreSuite fails when executed together with other suites under Maven

2014-09-12 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3515:
-

 Summary: ParquetMetastoreSuite fails when executed together with 
other suites under Maven
 Key: SPARK-3515
 URL: https://issues.apache.org/jira/browse/SPARK-3515
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1
Reporter: Cheng Lian


Reproduction step:
{code}
mvn -Phive,hadoop-2.4 
-DwildcardSuites=org.apache.spark.sql.parquet.ParquetMetastoreSuite,org.apache.spark.sql.hive.StatisticsSuite
 -pl core,sql/catalyst,sql/core,sql/hive test
{code}
Maven instantiates all discovered test suite objects first, and then starts 
executing all test cases. {{ParquetMetastoreSuite}} sets up several temporary 
tables in its constructor, but these tables are deleted immediately because 
{{StatisticsSuite}}'s constructor calls {{TestHiveContext.reset()}}.

To fix this issue, we shouldn't put this kind of side effect in the constructor, 
but in {{beforeAll}}.






[jira] [Updated] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't

2014-09-12 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-1021:

Assignee: Erik Erlandson  (was: Mark Hamstra)

> sortByKey() launches a cluster job when it shouldn't
> 
>
> Key: SPARK-1021
> URL: https://issues.apache.org/jira/browse/SPARK-1021
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 0.8.0, 0.9.0, 1.0.0, 1.1.0
>Reporter: Andrew Ash
>Assignee: Erik Erlandson
>  Labels: starter
>
> The sortByKey() method is listed as a transformation, not an action, in the 
> documentation.  But it launches a cluster job regardless.
> http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html
> Some discussion on the mailing list suggested that this is a problem with the 
> rdd.count() call inside Partitioner.scala's rangeBounds method.
> https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102
> Josh Rosen suggests that rangeBounds should be made into a lazy variable:
> {quote}
> I wonder whether making RangePartitioner.rangeBounds into a lazy val would 
> fix this 
> (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95).
>   We'd need to make sure that rangeBounds() is never called before an action 
> is performed.  This could be tricky because it's called in the 
> RangePartitioner.equals() method.  Maybe it's sufficient to just compare the 
> number of partitions, the ids of the RDDs used to create the 
> RangePartitioner, and the sort ordering.  This still supports the case where 
> I range-partition one RDD and pass the same partitioner to a different RDD.  
> It breaks support for the case where two range partitioners created on 
> different RDDs happened to have the same rangeBounds(), but it seems unlikely 
> that this would really harm performance since it's probably unlikely that the 
> range partitioners are equal by chance.
> {quote}
> Can we please make this happen?  I'll send a PR on GitHub to start the 
> discussion and testing.
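
A hedged sketch of the lazy-val idea against a stripped-down range partitioner (simplified: it takes an RDD of keys and samples naively; the real RangePartitioner does reservoir sampling and binary search):

{code}
import scala.reflect.ClassTag
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Sketch of the suggestion above: defer the bounds computation (which launches a
// job) until the partitioner is first used, and keep equals() away from the bounds
// so that comparing partitioners does not force the job either.
class LazyRangePartitionerSketch[K: Ordering: ClassTag](
    val partitions: Int,
    val keys: RDD[K])                 // simplified; the real class takes an RDD of pairs
  extends Partitioner {

  // Computed on first use instead of at construction time.
  private lazy val rangeBounds: Array[K] = {
    val sample = keys.sample(withReplacement = false, fraction = 0.1).collect().sorted
    if (sample.isEmpty) Array.empty[K]
    else Array.tabulate(partitions - 1) { i =>
      sample(math.min(((i + 1) * sample.length) / partitions, sample.length - 1))
    }
  }

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = {
    val ord = implicitly[Ordering[K]]
    rangeBounds.count(b => ord.gt(key.asInstanceOf[K], b))  // linear scan for brevity
  }

  // Compare only cheap fields (partition count and source RDD identity), per the
  // quoted suggestion, so equals() never touches rangeBounds.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitionerSketch[_] => o.partitions == partitions && (o.keys eq keys)
    case _ => false
  }

  override def hashCode: Int = 31 * partitions + keys.id
}
{code}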






[jira] [Commented] (SPARK-1449) Please delete old releases from mirroring system

2014-09-12 Thread Sebb (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132323#comment-14132323
 ] 

Sebb commented on SPARK-1449:
-

Is no-one able to deal with this please?

> Please delete old releases from mirroring system
> 
>
> Key: SPARK-1449
> URL: https://issues.apache.org/jira/browse/SPARK-1449
> Project: Spark
>  Issue Type: Task
>Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1, 1.0.2
>Reporter: Sebb
>
> To reduce the load on the ASF mirrors, projects are required to delete old 
> releases [1]
> Please can you remove all non-current releases?
> Thanks!
> [Note that older releases are always available from the ASF archive server]
> Any links to older releases on download pages should first be adjusted to 
> point to the archive server.
> [1] http://www.apache.org/dev/release.html#when-to-archive






[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system

2014-09-12 Thread Sebb (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebb updated SPARK-1449:

Affects Version/s: 1.0.1
   0.9.2
   1.0.0
   1.0.2

> Please delete old releases from mirroring system
> 
>
> Key: SPARK-1449
> URL: https://issues.apache.org/jira/browse/SPARK-1449
> Project: Spark
>  Issue Type: Task
>Affects Versions: 0.8.1, 0.9.1, 0.9.2, 1.0.0, 1.0.1, 1.0.2
>Reporter: Sebb
>
> To reduce the load on the ASF mirrors, projects are required to delete old 
> releases [1]
> Please can you remove all non-current releases?
> Thanks!
> [Note that older releases are always available from the ASF archive server]
> Any links to older releases on download pages should first be adjusted to 
> point to the archive server.
> [1] http://www.apache.org/dev/release.html#when-to-archive






[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system

2014-09-12 Thread Sebb (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebb updated SPARK-1449:

Affects Version/s: (was: 0.9.0)

> Please delete old releases from mirroring system
> 
>
> Key: SPARK-1449
> URL: https://issues.apache.org/jira/browse/SPARK-1449
> Project: Spark
>  Issue Type: Task
>Affects Versions: 0.8.1, 0.9.1
>Reporter: Sebb
>
> To reduce the load on the ASF mirrors, projects are required to delete old 
> releases [1]
> Please can you remove all non-current releases?
> Thanks!
> [Note that older releases are always available from the ASF archive server]
> Any links to older releases on download pages should first be adjusted to 
> point to the archive server.
> [1] http://www.apache.org/dev/release.html#when-to-archive






[jira] [Updated] (SPARK-1449) Please delete old releases from mirroring system

2014-09-12 Thread Sebb (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebb updated SPARK-1449:

Affects Version/s: (was: 0.8.0)

> Please delete old releases from mirroring system
> 
>
> Key: SPARK-1449
> URL: https://issues.apache.org/jira/browse/SPARK-1449
> Project: Spark
>  Issue Type: Task
>Affects Versions: 0.8.1, 0.9.0, 0.9.1
>Reporter: Sebb
>
> To reduce the load on the ASF mirrors, projects are required to delete old 
> releases [1]
> Please can you remove all non-current releases?
> Thanks!
> [Note that older releases are always available from the ASF archive server]
> Any links to older releases on download pages should first be adjusted to 
> point to the archive server.
> [1] http://www.apache.org/dev/release.html#when-to-archive






[jira] [Updated] (SPARK-3464) Graceful decommission of executors

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3464:
-
 Target Version/s: 1.2.0
Affects Version/s: 1.1.0
 Assignee: Andrew Or

> Graceful decommission of executors
> --
>
> Key: SPARK-3464
> URL: https://issues.apache.org/jira/browse/SPARK-3464
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Assignee: Andrew Or
>
> In most cases, even when an application is utilizing only a small fraction of 
> its available resources, executors will still have tasks running or blocks 
> cached.  It would be useful to have a mechanism for waiting for running tasks 
> on an executor to finish and migrating its cached blocks elsewhere before 
> discarding it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3217) Shaded Guava jar doesn't play well with Maven build when SPARK_PREPEND_CLASSES is set

2014-09-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3217.

   Resolution: Fixed
Fix Version/s: 1.2.0

Fixed by https://github.com/apache/spark/pull/2141/

> Shaded Guava jar doesn't play well with Maven build when 
> SPARK_PREPEND_CLASSES is set
> -
>
> Key: SPARK-3217
> URL: https://issues.apache.org/jira/browse/SPARK-3217
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Cheng Lian
>Assignee: Marcelo Vanzin
> Fix For: 1.2.0
>
>
> PR [#1813|https://github.com/apache/spark/pull/1813] shaded the Guava jar file 
> and moved Guava classes to the package {{org.spark-project.guava}} when Spark is 
> built by Maven. But if developers set the environment variable 
> {{SPARK_PREPEND_CLASSES}} to {{true}}, commands like {{bin/spark-shell}} 
> throw a {{ClassNotFoundException}}:
> {code}
> # Set the env var
> $ export SPARK_PREPEND_CLASSES=true
> # Build Spark with Maven
> $ mvn clean package -Phive,hadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests
> ...
> # Then spark-shell complains
> $ ./bin/spark-shell
> Spark assembly has been built with Hive, including Datanucleus jars on 
> classpath
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> com/google/common/util/concurrent/ThreadFactoryBuilder
> at org.apache.spark.util.Utils$.(Utils.scala:636)
> at org.apache.spark.util.Utils$.(Utils.scala)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:134)
> at org.apache.spark.repl.SparkILoop.(SparkILoop.scala:65)
> at org.apache.spark.repl.Main$.main(Main.scala:30)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:317)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:73)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> com.google.common.util.concurrent.ThreadFactoryBuilder
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 13 more
> # Check the assembly jar file
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop2.3.0.jar | 
> grep -i ThreadFactoryBuilder
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder$1.class
> org/spark-project/guava/common/util/concurrent/ThreadFactoryBuilder.class
> {code}
> The SBT build is fine since we don't shade Guava with SBT right now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1131) Better document the --args option for yarn-standalone mode

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-1131.

Resolution: Fixed

--args is now deprecated. We use --arg instead.

> Better document the --args option for yarn-standalone mode
> --
>
> Key: SPARK-1131
> URL: https://issues.apache.org/jira/browse/SPARK-1131
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Sandy Pérez González
>Assignee: Karthik Kambatla
>
> It took me a while to figure out that the correct way to use it with multiple 
> arguments was to include the option multiple times.
> I.e.
> --args arg1
> --args arg2
> instead of
> --args "arg1 arg2" 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1909) "--jars" is not supported in standalone cluster mode

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-1909.

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Andrew Or

Also fixed in https://github.com/apache/spark/pull/1538

> "--jars" is not supported in standalone cluster mode
> 
>
> Key: SPARK-1909
> URL: https://issues.apache.org/jira/browse/SPARK-1909
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.1.0
>
>
> "--jars" is not processed in `spark-submit` for standalone cluster mode. The 
> workaround is building an assembly app jar. It might be easy to support user 
> jars that is accessible from a cluster node by setting `spark.jars` property 
> correctly and passing it to `DriverWrapper`.
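
As a rough illustration of the `spark.jars` idea above (an assumption about how 
it could look, not the eventual fix), the application would list jars that every 
cluster node can already reach; the HDFS paths below are made up:

{code}
import org.apache.spark.SparkConf

// Sketch only: point spark.jars at locations every node can reach, so no
// assembly jar is needed. The paths are placeholders.
val conf = new SparkConf()
  .setAppName("standalone-cluster-app")
  .set("spark.jars", "hdfs:///libs/dep1.jar,hdfs:///libs/dep2.jar")
{code}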



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1908) Support local app jar in standalone cluster mode

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-1908.

Resolution: Won't Fix
  Assignee: Andrew Or

As described in the comment of https://github.com/apache/spark/pull/1538, this 
is not an issue. (We shouldn't rely on a distributed cache for standalone mode)

> Support local app jar in standalone cluster mode
> 
>
> Key: SPARK-1908
> URL: https://issues.apache.org/jira/browse/SPARK-1908
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Xiangrui Meng
>Assignee: Andrew Or
>Priority: Minor
>
> Standalone cluster mode only supports an app jar with a URL that is accessible 
> from all cluster nodes, e.g., a jar on HDFS or a local file that is available 
> on each cluster node. It would be nice to support an app jar that is only 
> available on the client.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3014) Log more informative messages in a couple of failure scenarios

2014-09-12 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-3014.
--
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Sandy Ryza

> Log more informative messages in a couple of failure scenarios
> -
>
> Key: SPARK-3014
> URL: https://issues.apache.org/jira/browse/SPARK-3014
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Minor
> Fix For: 1.2.0
>
>
> This is what shows up currently when the user code fails to initialize a 
> SparkContext when running in yarn-cluster mode:
> {code}
> Exception in thread "Thread-4" java.lang.NullPointerException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:187)
> Exception in thread "main" java.lang.AssertionError: assertion failed
>   at scala.Predef$.assert(Predef.scala:165)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkContextInitialized(ApplicationMaster.scala:223)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:112)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:470)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:53)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:52)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1554)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:52)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:469)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
> This is what shows up when the main method isn't static:
> {code}
> Exception in thread "main" java.lang.NullPointerException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-1906) spark-submit doesn't send master URL to Driver in standalone cluster mode

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-1906.

   Resolution: Fixed
Fix Version/s: 1.1.0
 Assignee: Andrew Or

This is fixed in https://github.com/apache/spark/pull/1538 as part of the 
broader fix for standalone cluster mode.

> spark-submit doesn't send master URL to Driver in standalone cluster mode
> -
>
> Key: SPARK-1906
> URL: https://issues.apache.org/jira/browse/SPARK-1906
> Project: Spark
>  Issue Type: Sub-task
>  Components: Deploy
>Reporter: Xiangrui Meng
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.1.0
>
>
> `spark-submit` doesn't send `spark.master` to DriverWrapper. So the latter 
> creates an empty SparkConf and throws an exception:
> {code}
> A master URL must be set in your configuration
> {code}
> The workaround is to set the master explicitly in the user application.
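
A minimal sketch of the workaround mentioned above, assuming a Scala application 
and a placeholder master URL; the point is only that the application sets the 
master itself, so the empty SparkConf created by DriverWrapper no longer matters:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Workaround sketch: set the master explicitly in the user application.
// "spark://master-host:7077" is a placeholder, not a real cluster.
val conf = new SparkConf()
  .setAppName("standalone-cluster-app")
  .setMaster("spark://master-host:7077")
val sc = new SparkContext(conf)
{code}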



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3107) Don't pass null jar to executor in yarn-client mode

2014-09-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132164#comment-14132164
 ] 

Andrew Or commented on SPARK-3107:
--

I see. Yes, setting a property to an empty value is semantically different from 
not setting it at all, which is exactly why we shouldn't set these to empty 
strings when we really mean to leave them unset. I think we still do that in a 
few places.
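
A minimal sketch of the "leave it unset" idea, with hypothetical names; the 
pattern is simply to guard the assignment with an Option instead of writing an 
empty string:

{code}
// Sketch: only set a property when there is actually a value to set.
// Both the env var and the property name below are hypothetical.
val userJar: Option[String] =
  Option(System.getenv("SPARK_USER_JAR")).filter(_.trim.nonEmpty)
userJar.foreach(jar => System.setProperty("example.user.jar", jar))
{code}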

> Don't pass null jar to executor in yarn-client mode
> ---
>
> Key: SPARK-3107
> URL: https://issues.apache.org/jira/browse/SPARK-3107
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> In the following line, ExecutorLauncher's `--jar` takes in null.
> {code}
> 14/08/18 20:52:43 INFO yarn.Client:   command: $JAVA_HOME/bin/java -server 
> -Xmx512m ... org.apache.spark.deploy.yarn.ExecutorLauncher --class 'notused' 
> --jar null  --arg  'ip-172-31-0-12.us-west-2.compute.internal:56838' 
> --executor-memory 1024 --executor-cores 1 --num-executors  2
> {code}
> Also it appears that we set a bunch of system properties to empty strings 
> (not shown). We should avoid setting these if they don't actually contain 
> values.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3187) Refactor and cleanup Yarn allocator code

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3187:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-3492

> Refactor and cleanup Yarn allocator code
> 
>
> Key: SPARK-3187
> URL: https://issues.apache.org/jira/browse/SPARK-3187
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.2.0
>
>
> This is a follow-up to SPARK-2933, which dealt with the ApplicationMaster 
> code.
> There's a lot of logic in the container allocation code in alpha/stable that 
> could probably be merged.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3492) Clean up Yarn integration code

2014-09-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132147#comment-14132147
 ] 

Andrew Or commented on SPARK-3492:
--

Thanks, I've added it to the list.

> Clean up Yarn integration code
> --
>
> Key: SPARK-3492
> URL: https://issues.apache.org/jira/browse/SPARK-3492
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Critical
>
> This is the parent umbrella for cleaning up the Yarn integration code in 
> general. This is a broad effort, and each individual cleanup should be opened 
> as a sub-issue against this one.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3465) Task metrics are not aggregated correctly in local mode

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3465.

   Resolution: Fixed
Fix Version/s: 1.2.0
   1.1.1

> Task metrics are not aggregated correctly in local mode
> ---
>
> Key: SPARK-3465
> URL: https://issues.apache.org/jira/browse/SPARK-3465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.1.1, 1.2.0
>
>
> In local mode, after onExecutorMetricsUpdate(), t.taskMetrics will be the 
> same object as the one in TaskContext (because there is no serialization of 
> MetricsUpdate in local mode), so all subsequent changes to the metrics will be 
> lost, because updateAggregateMetrics() only counts the difference between the 
> two.
> This bug was introduced in https://issues.apache.org/jira/browse/SPARK-2099, 
> cc [~sandyr]
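
A tiny, self-contained illustration of the aliasing problem described above, 
using a toy class rather than Spark's TaskMetrics:

{code}
// When the "snapshot" and the live metrics are the same object, any delta
// computed between them is always zero, so subsequent updates are lost.
class ToyMetrics(var bytesRead: Long)

val live = new ToyMetrics(0L)
val snapshot = live                               // no serialization => same reference
live.bytesRead += 100L
val delta = live.bytesRead - snapshot.bytesRead   // 0, the 100 bytes are never counted
{code}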



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-3498) Block always replicated to the same node

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-3498.

Resolution: Duplicate
  Assignee: Tathagata Das

This is the cause of SPARK-3495, and the fix for both issues is the same: 
https://github.com/apache/spark/pull/2366

> Block always replicated to the same node
> 
>
> Key: SPARK-3498
> URL: https://issues.apache.org/jira/browse/SPARK-3498
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: shenhong
>Assignee: Tathagata Das
>
> When running a Spark Streaming job, we should replicate receiver blocks, but 
> all the blocks are replicated to the same node. Here is the log:
> 14/09/10 19:55:16 INFO BlockManagerInfo: Added input-0-1410350117000 in 
> memory on 10.196.131.19:42261 (size: 8.9 MB, free: 1050.3 MB)
> 14/09/10 19:55:16 INFO BlockManagerInfo: Added input-0-1410350117000 in 
> memory on tdw-10-196-130-155:51155 (size: 8.9 MB, free: 879.3 MB)
> 14/09/10 19:55:17 INFO BlockManagerInfo: Added input-0-1410350118000 in 
> memory on 10.196.131.19:42261 (size: 7.7 MB, free: 1042.6 MB)
> 14/09/10 19:55:17 INFO BlockManagerInfo: Added input-0-1410350118000 in 
> memory on tdw-10-196-130-155:51155 (size: 7.7 MB, free: 871.6 MB)
> 14/09/10 19:55:18 INFO BlockManagerInfo: Added input-0-1410350119000 in 
> memory on 10.196.131.19:42261 (size: 7.3 MB, free: 1035.3 MB)
> 14/09/10 19:55:18 INFO BlockManagerInfo: Added input-0-1410350119000 in 
> memory on tdw-10-196-130-155:51155 (size: 7.3 MB, free: 864.3 MB)
> The reason is that when a blockManagerSlave asks the blockManagerMaster for 
> peers, the blockManagerMaster always returns the same blockManagerIds. Here is 
> the code:
> {code}
> private def getPeers(blockManagerId: BlockManagerId, size: Int): Seq[BlockManagerId] = {
>   val peers: Array[BlockManagerId] = blockManagerInfo.keySet.toArray
>   val selfIndex = peers.indexOf(blockManagerId)
>   if (selfIndex == -1) {
>     throw new SparkException("Self index for " + blockManagerId + " not found")
>   }
>   // Note that this logic will select the same node multiple times if there aren't enough peers
>   Array.tabulate[BlockManagerId](size) { i => peers((selfIndex + i + 1) % peers.length) }.toSeq
> }
> {code}
> I think the blockManagerMaster should return the `size` blockManagerIds with 
> the most remaining memory.
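
A hedged sketch of that suggestion only (an illustration, not the actual fix), 
using a stand-in `Peer` type instead of BlockManagerId plus its memory status:

{code}
// Prefer the `size` peers with the most remaining memory over a fixed rotation.
case class Peer(id: String, remainingMemory: Long)

def getPeersByFreeMemory(self: Peer, size: Int, peers: Seq[Peer]): Seq[Peer] =
  peers.filterNot(_.id == self.id).sortBy(-_.remainingMemory).take(size)
{code}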



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3513) Provide a utility for running a function once on each executor

2014-09-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3513:
---
Issue Type: Improvement  (was: Bug)

> Provide a utility for running a function once on each executor
> --
>
> Key: SPARK-3513
> URL: https://issues.apache.org/jira/browse/SPARK-3513
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Minor
>
> This is minor, but it would be nice to have a utility where you can pass in a 
> function and have it run once on each executor, returning the results to you 
> (e.g. you could perform a jstack from within the JVM). You could probably hack 
> it together with custom locality preferences, accessing the list of live 
> executors, and mapPartitions.
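
A rough sketch of the "hack it together" approach described above, assuming 
`sc.getExecutorMemoryStatus` is used to count executors (the driver may be 
counted in that map); without custom locality preferences this is only 
best-effort, since nothing forces one partition onto every executor:

{code}
import org.apache.spark.SparkContext
import scala.reflect.ClassTag

// Best-effort sketch: run f once per partition and hope the partitions spread
// across all executors. Not a guaranteed once-per-executor utility.
def runOnExecutors[T: ClassTag](sc: SparkContext)(f: () => T): Array[T] = {
  val numExecutors = math.max(sc.getExecutorMemoryStatus.size, 1)
  sc.parallelize(1 to numExecutors, numExecutors)
    .mapPartitions(_ => Iterator(f()))
    .collect()
}
{code}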



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3514) Provide a utility function for returning the hosts (and number) of live executors

2014-09-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3514:
---
Issue Type: Improvement  (was: Bug)

> Provide a utility function for returning the hosts (and number) of live 
> executors
> -
>
> Key: SPARK-3514
> URL: https://issues.apache.org/jira/browse/SPARK-3514
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Minor
>
> It would be nice to tell user applications how many executors they have 
> currently running in their application. Also, we could give them the host 
> names on which the executors are running.
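
For reference, a hedged sketch of how an application can approximate this today 
with `sc.getExecutorMemoryStatus` (its keys are "host:port" strings, and the 
driver's own block manager may be included in the map):

{code}
// Approximation sketch, not a proposed API.
val hostPorts = sc.getExecutorMemoryStatus.keys.toSeq   // e.g. "host:port"
val numExecutors = hostPorts.size
val hosts = hostPorts.map(_.split(":")(0)).distinct
{code}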



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3514) Provide a utility function for returning the hosts (and number) of live executors

2014-09-12 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3514:
--

 Summary: Provide a utility function for returning the hosts (and 
number) of live executors
 Key: SPARK-3514
 URL: https://issues.apache.org/jira/browse/SPARK-3514
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Minor


It would be nice to tell user applications how many executors they have 
currently running in their application. Also, we could give them the host names 
on which the executors are running.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3513) Provide a utility for running a function once on each executor

2014-09-12 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3513:
--

 Summary: Provide a utility for running a function once on each 
executor
 Key: SPARK-3513
 URL: https://issues.apache.org/jira/browse/SPARK-3513
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Priority: Minor
 Fix For: 1.2.0


This is minor, but it would be nice to have a utility where you can pass in a 
function and have it run once on each executor, returning the results to you 
(e.g. you could perform a jstack from within the JVM). You could probably hack 
it together with custom locality preferences, accessing the list of live 
executors, and mapPartitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3513) Provide a utility for running a function once on each executor

2014-09-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3513:
---
Target Version/s: 1.2.0

> Provide a utility for running a function once on each executor
> --
>
> Key: SPARK-3513
> URL: https://issues.apache.org/jira/browse/SPARK-3513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Minor
>
> This is minor, but it would be nice to have a utility where you can pass in a 
> function and have it run once on each executor, returning the results to you 
> (e.g. you could perform a jstack from within the JVM). You could probably hack 
> it together with custom locality preferences, accessing the list of live 
> executors, and mapPartitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3513) Provide a utility for running a function once on each executor

2014-09-12 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3513:
---
Fix Version/s: (was: 1.2.0)

> Provide a utility for running a function once on each executor
> --
>
> Key: SPARK-3513
> URL: https://issues.apache.org/jira/browse/SPARK-3513
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Minor
>
> This is minor, but it would be nice to have a utility where you can pass in a 
> function and have it run once on each executor, returning the results to you 
> (e.g. you could perform a jstack from within the JVM). You could probably hack 
> it together with custom locality preferences, accessing the list of live 
> executors, and mapPartitions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2699) Improve compatibility with parquet file/table

2014-09-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2699.
-
Resolution: Duplicate

Closing as already fixed duplicate :)

> Improve compatibility with parquet file/table
> -
>
> Key: SPARK-2699
> URL: https://issues.apache.org/jira/browse/SPARK-2699
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Teng Qiu
>
> After SPARK-2446, compatibility with parquet files created by older Spark 
> releases (Spark 1.0.x) and by Impala (all versions so far: 1.4.x-cdh5) is 
> broken.
> Strings in those parquet files are not annotated with UTF8, or use only the 
> ASCII char set (Impala doesn't support UTF8 yet).
> This ticket aims to add a configuration option or a version check to support 
> those parquet files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2992) The transforms formerly known as non-lazy

2014-09-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2992:
---
Affects Version/s: 1.1.0

> The transforms formerly known as non-lazy
> -
>
> Key: SPARK-2992
> URL: https://issues.apache.org/jira/browse/SPARK-2992
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>
> An umbrella for a grab-bag of tickets involving lazy implementations of 
> transforms formerly thought to be non-lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3427) Avoid active vertex tracking in static PageRank

2014-09-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3427.

   Resolution: Fixed
Fix Version/s: 1.2.0

> Avoid active vertex tracking in static PageRank
> ---
>
> Key: SPARK-3427
> URL: https://issues.apache.org/jira/browse/SPARK-3427
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 1.2.0
>
>
> GraphX's current implementation of static (fixed iteration count) PageRank 
> uses the Pregel API. This unnecessarily tracks active vertices, even though 
> in static PageRank all vertices are always active. Active vertex tracking 
> incurs the following costs:
> 1. A shuffle per iteration to ship the active sets to the edge partitions.
> 2. A hash table creation per iteration at each partition to index the active 
> sets for lookup.
> 3. A hash lookup per edge to check whether the source vertex is active.
> I reimplemented static PageRank using the lower-level GraphX API instead of 
> the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided 
> a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of 
> PageRank on a synthetic graph with 10M vertices and 1.27B edges.
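
To make the "no active set needed" point concrete, here is a small, GraphX-free 
sketch of fixed-iteration PageRank in plain Scala; every vertex is recomputed in 
every iteration, and normalization details differ from GraphX's implementation:

{code}
// Illustrative only: static PageRank over an in-memory edge list.
def staticPageRank(edges: Seq[(Long, Long)], numIter: Int,
                   resetProb: Double = 0.15): Map[Long, Double] = {
  val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
  val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
  var ranks = vertices.map(_ -> 1.0).toMap
  for (_ <- 1 to numIter) {
    // Each vertex contributes rank / outDegree to every out-neighbor.
    val contribs = edges.groupBy(_._2).map { case (dst, es) =>
      dst -> es.map { case (src, _) => ranks(src) / outDeg(src) }.sum
    }
    ranks = vertices.map { v =>
      v -> (resetProb + (1 - resetProb) * contribs.getOrElse(v, 0.0))
    }.toMap
  }
  ranks
}
{code}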



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2992) The transforms formerly known as non-lazy

2014-09-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2992:
---
Priority: Major  (was: Minor)

> The transforms formerly known as non-lazy
> -
>
> Key: SPARK-2992
> URL: https://issues.apache.org/jira/browse/SPARK-2992
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>
> An umbrella for a grab-bag of tickets involving lazy implementations of 
> transforms formerly thought to be non-lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1579) PySpark should distinguish expected IOExceptions from unexpected ones in the worker

2014-09-12 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1579:
--
Fix Version/s: (was: 1.1.0)
   1.0.0

> PySpark should distinguish expected IOExceptions from unexpected ones in the 
> worker
> ---
>
> Key: SPARK-1579
> URL: https://issues.apache.org/jira/browse/SPARK-1579
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Patrick Wendell
>Assignee: Aaron Davidson
> Fix For: 1.0.0
>
>
> I chatted with [~adav] a bit about this. Right now we drop IOExceptions 
> because they are (in some cases) expected if a Python worker returns before 
> consuming its entire input. The issue is this swallows legitimate IO 
> exceptions when they occur.
> One thought we had was to change the daemon.py file to, instead of closing 
> the socket when the function is over, simply busy-wait on the socket being 
> closed. We'd transfer the responsibility for closing the socket to the Java 
> reader. The Java reader could, when it has finished consuming output from 
> Python, set a flag on a volatile variable to indicate that Python has fully 
> returned, and then close the socket. Then if an IOException is thrown in the 
> write thread, it only swallows the exception if we are expecting it.
> This would also let us remove the warning message right now.
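
A hedged sketch of the volatile-flag idea from the paragraph above; the class 
and method names are hypothetical, and this is not the actual PythonRDD or 
daemon.py change:

{code}
import java.io.IOException
import java.net.Socket

// Hypothetical sketch: the reader owns socket shutdown and records that Python
// has fully returned; the writer only swallows IOExceptions in that state.
class PythonConnection(socket: Socket) {
  @volatile private var pythonFullyReturned = false

  // Called by the Java-side reader once all output from Python is consumed.
  def onOutputFullyConsumed(): Unit = {
    pythonFullyReturned = true
    socket.close()
  }

  // Called by the write thread when sending input to Python fails.
  def onWriteFailure(e: IOException): Unit = {
    if (!pythonFullyReturned) throw e  // unexpected IO error: surface it
    // otherwise Python returned early and the reader closed the socket: ignore
  }
}
{code}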



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3339) Support for skipping json lines that fail to parse

2014-09-12 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3339:

Priority: Critical  (was: Major)

> Support for skipping json lines that fail to parse
> --
>
> Key: SPARK-3339
> URL: https://issues.apache.org/jira/browse/SPARK-3339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Yin Huai
>Priority: Critical
>
> When dealing with large datasets there is always some data that fails to 
> parse. It would be nice to handle this instead of throwing an exception and 
> requiring the user to filter it out manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2083) Allow local task to retry after failure.

2014-09-12 Thread Radim Kolar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132005#comment-14132005
 ] 

Radim Kolar commented on SPARK-2083:


I took a look at the patch. Why add new settings and make things more 
complicated? Just respect _spark.task.maxFailures_ when running in local mode.

> Allow local task to retry after failure.
> 
>
> Key: SPARK-2083
> URL: https://issues.apache.org/jira/browse/SPARK-2083
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Peng Cheng
>Priority: Trivial
>  Labels: easyfix
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> If a job is submitted to run locally using masterURL = "local[X]", Spark will 
> not retry a failed task regardless of your "spark.task.maxFailures" setting. 
> This design is meant to facilitate debugging and QA of Spark applications 
> where all tasks are expected to succeed and yield a result. Unfortunately, 
> this setting will prevent a local job from finishing if any of its tasks 
> cannot guarantee a result (e.g. one visiting an external resource/API), and 
> retrying inside the task is less favoured (e.g. the task needs to be executed 
> on a different computer in production).
> Users can, however, still set masterURL = "local[X,Y]" to override this (where 
> Y is the local maxFailures), but this is not documented and is hard to manage. 
> A quick fix could be to add a new configuration property 
> "spark.local.maxFailures" with a default value of 1, so users know exactly 
> what to change when reading the documentation.
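
For reference, a minimal sketch of the undocumented override mentioned above; 
the exact numbers are illustrative, with Y playing the role of the local 
maxFailures:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: "local[X,Y]" sets the local maxFailures (Y) alongside the thread count (X).
val conf = new SparkConf().setAppName("local-with-retries").setMaster("local[4,3]")
val sc = new SparkContext(conf)
{code}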



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3456) YarnAllocator can lose container requests to RM

2014-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131993#comment-14131993
 ] 

Apache Spark commented on SPARK-3456:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/2373

> YarnAllocator can lose container requests to RM
> ---
>
> Key: SPARK-3456
> URL: https://issues.apache.org/jira/browse/SPARK-3456
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.2.0
>Reporter: Thomas Graves
>Priority: Critical
>
> I haven't actually tested this yet, but I believe that Spark on YARN can lose 
> container requests to the RM. The reason is that we ask for the total number 
> upfront (say x), but then we don't ask for any more unless some are missing, 
> and if we do, we could erase the original request.
> For example:
> - ask for 3 containers
> - 1 is allocated
> - ask for 0 containers since we asked for 3 originally (2 left)
> - the 1 allocated container dies
> - we now ask for 1 since it is missing; this will override whatever is on the 
> YARN side (in this case 2)
> Then we lose the 2 more we need.
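
A toy model of the accounting above (an assumption-laden illustration, not 
Spark's or YARN's actual code); the key property is that a new ask replaces, 
rather than adds to, the outstanding ask:

{code}
// Toy illustration: the RM stores only the most recent ask.
final class ToyRM { var outstandingAsk = 0; def ask(n: Int): Unit = outstandingAsk = n }

val rm = new ToyRM
rm.ask(3)                 // initial request for 3 containers
rm.outstandingAsk = 2     // 1 container is granted; 2 remain outstanding on the RM
// The granted container dies. Asking only for the 1 "missing" container...
rm.ask(1)
// ...replaces the outstanding 2, so only 1 of the 3 needed containers stays pending.
{code}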



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has not coalesce() method

2014-09-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131933#comment-14131933
 ] 

Nicholas Chammas commented on SPARK-3500:
-

[~davies] - PySpark doesn't seem to support {{distinct(N)}} even on a regular 
RDD. Should it?

{code}
>>> sc.parallelize([1,2,3]).distinct(2)
Traceback (most recent call last):
  File "", line 1, in 
TypeError: distinct() takes exactly 1 argument (2 given)
{code}

This sounds like it's a separate issue. Maybe it should be tracked in a 
separate JIRA issue?

Also, could we edit the title of this JIRA issue to read something like 
"SchemaRDDs are missing these methods: ..."? The problem is not limited to 
SchemaRDDs created by jsonRDD().

> SchemaRDD from jsonRDD() has not coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}
> repartition() and distinct(N) are also missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe this would be a common scenario that the yarn 
cluster runs behind a firewall, while people want to run spark driver locally 
for best interactivity experience. You would have full control of local 
resource that can be accessed by the client as opposed to be limited to the 
spark-shell if you would do the conventional way to ssh to the remote host 
inside the firewall. For example, using ipython notebook, or more fancy IDEs, 
etc. Installing anything you want on the remote host is usually not an option. 
A potential solution is to setup socks proxy on your local machine outside of 
the firewall through ssh tunneling (ssh -D  
@) into some work station inside the firewall. Then the 
spark yarn-client only needs to talk to the cluster through this proxy without 
the need of changing any configurations. Does this sound feasible? Maybe VPN is 
the right solution?  (was: I believe this would be a common scenario that the 
yarn cluster runs behind a firewall, while people want to run spark driver 
locally for best interactivity experience. You would have full control of local 
resource that can be accessed by the client as opposed to be limited to the 
spark-shell if you would do the conventional way to ssh to the remote host 
inside the firewall. For example, using ipython notebook, or more fancy IDEs, 
etc. Installing anything you want on the remote host is usually not an option. 
A potential solution is to setup socks proxy on your local machine outside of 
the firewall through ssh tunneling (ssh -D  
@) into some work station inside the firewall. Then the 
spark yarn-client only needs to talk to the cluster through this proxy without 
the need of changing any configurations. Does this sound feasible?)

> yarn-client through socks proxy
> ---
>
> Key: SPARK-3512
> URL: https://issues.apache.org/jira/browse/SPARK-3512
> Project: Spark
>  Issue Type: Wish
>  Components: YARN
>Reporter: Yongjia Wang
>
> I believe this would be a common scenario that the yarn cluster runs behind a 
> firewall, while people want to run spark driver locally for best 
> interactivity experience. You would have full control of local resource that 
> can be accessed by the client as opposed to be limited to the spark-shell if 
> you would do the conventional way to ssh to the remote host inside the 
> firewall. For example, using ipython notebook, or more fancy IDEs, etc. 
> Installing anything you want on the remote host is usually not an option. A 
> potential solution is to setup socks proxy on your local machine outside of 
> the firewall through ssh tunneling (ssh -D  
> @) into some work station inside the firewall. Then the 
> spark yarn-client only needs to talk to the cluster through this proxy 
> without the need of changing any configurations. Does this sound feasible? 
> Maybe VPN is the right solution?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3500) SchemaRDD from jsonRDD() has not coalesce() method

2014-09-12 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3500:
--
Description: 
{code}
>>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
>>> '{"foo":"baz"}'])).coalesce(1)
Py4JError: An error occurred while calling o94.coalesce. Trace:
py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
java.lang.Boolean]) does not exist
{code}

repartition() and distinct(N) are also missing.

  was:
{code}
>>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
>>> '{"foo":"baz"}'])).coalesce(1)
Py4JError: An error occurred while calling o94.coalesce. Trace:
py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
java.lang.Boolean]) does not exist
{code}


> SchemaRDD from jsonRDD() has not coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}
> repartition() and distinct(N) are also missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has not coalesce() method

2014-09-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131869#comment-14131869
 ] 

Davies Liu commented on SPARK-3500:
---

repartition() and distinct(N) are also missing.

> SchemaRDD from jsonRDD() has not coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}
> repartition() and distinct(N) are also missing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3490) Alleviate port collisions during tests

2014-09-12 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3490:
-
Fix Version/s: (was: 1.1.1)

> Alleviate port collisions during tests
> --
>
> Key: SPARK-3490
> URL: https://issues.apache.org/jira/browse/SPARK-3490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.2.0
>
>
> A few tests, notably SparkSubmitSuite and DriverSuite, have been failing 
> intermittently because we open too many ephemeral ports and occasionally 
> can't bind to new ones.
> We should minimize the use of ports during tests where possible. A great 
> candidate is the SparkUI, which is not needed for most tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3490) Alleviate port collisions during tests

2014-09-12 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131844#comment-14131844
 ] 

Andrew Or commented on SPARK-3490:
--

This still needs to be backported to branch-1.1.

> Alleviate port collisions during tests
> --
>
> Key: SPARK-3490
> URL: https://issues.apache.org/jira/browse/SPARK-3490
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.2.0
>
>
> A few tests, notably SparkSubmitSuite and DriverSuite, have been failing 
> intermittently because we open too many ephemeral ports and occasionally 
> can't bind to new ones.
> We should minimize the use of ports during tests where possible. A great 
> candidate is the SparkUI, which is not needed for most tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131823#comment-14131823
 ] 

Reynold Xin commented on SPARK-2926:


Do you mind creating a separate branch that's based on 
https://github.com/rxin/spark/tree/netty-blockTransferService ?

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly improves 
> IO performance and reduces memory consumption when the number of reducers is 
> very large. But the reducer side still uses the hash-based shuffle reader 
> implementation, which neglects the ordering attributes of the map output data 
> in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, once 
> some unit test bugs are fixed.
> Any comments would be greatly appreciated. Thanks a lot.
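
To illustrate the sort-merge idea only (not the proposed implementation), here 
is a small k-way merge of already-sorted (key, value) iterators using a priority 
queue keyed on each iterator's head:

{code}
import scala.collection.mutable

// Core of a sort-merge read: lazily merge per-map sorted streams.
def mergeSorted[K, V](parts: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // A max-heap with a reversed key ordering behaves as a min-heap on head keys.
  implicit val byHead: Ordering[BufferedIterator[(K, V)]] =
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1)(ord.reverse)
  val heap = mutable.PriorityQueue(parts.map(_.buffered).filter(_.hasNext): _*)
  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert under its new head key
      kv
    }
  }
}
{code}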



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has not coalesce() method

2014-09-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131757#comment-14131757
 ] 

Nicholas Chammas commented on SPARK-3500:
-

Btw, this seems like the same type of problem reported in [SPARK-2797].

> SchemaRDD from jsonRDD() has not coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe this would be a common scenario that the yarn 
cluster runs behind a firewall, while people want to run spark driver locally 
for best interactivity experience. You would have full control of local 
resource that can be accessed by the client as opposed to be limited to the 
spark-shell if you would do the conventional way to ssh to the remote host 
inside the firewall. For example, using ipython notebook, or more fancy IDEs, 
etc. Installing anything you want on the remote host is usually not an option. 
A potential solution is to setup socks proxy on your local machine outside of 
the firewall through ssh tunneling (ssh -D  
@) into some work station inside the firewall. Then the 
spark yarn-client only needs to talk to the cluster through this proxy without 
the need of changing any configurations. Does this sound feasible?  (was: I 
believe this would be a common scenario that the yarn cluster runs behind a 
firewall, while people want to run spark driver locally for best interactivity 
experience. You would have full control of local resource that can be accessed 
by the client as opposed to be limited to the spark-shell if you would do the 
conventional way to ssh to the remote host inside the firewall. For example, 
using ipython notebook, or more fancy IDEs, etc. Installing anything you want 
on the remote host is usually not an option. A potential solution is to setup 
socks proxy on the local machine outside of the firewall through ssh tunneling 
(ssh -D port user@remote-host) into some work station inside the firewall. Then 
the spark yarn-client only needs to talk to the cluster through this proxy 
without the need to change any configurations. Does this sound feasible?)

> yarn-client through socks proxy
> ---
>
> Key: SPARK-3512
> URL: https://issues.apache.org/jira/browse/SPARK-3512
> Project: Spark
>  Issue Type: Wish
>  Components: YARN
>Reporter: Yongjia Wang
>
> I believe this would be a common scenario that the yarn cluster runs behind a 
> firewall, while people want to run spark driver locally for best 
> interactivity experience. You would have full control of local resource that 
> can be accessed by the client as opposed to be limited to the spark-shell if 
> you would do the conventional way to ssh to the remote host inside the 
> firewall. For example, using ipython notebook, or more fancy IDEs, etc. 
> Installing anything you want on the remote host is usually not an option. A 
> potential solution is to setup socks proxy on your local machine outside of 
> the firewall through ssh tunneling (ssh -D  
> @) into some work station inside the firewall. Then the 
> spark yarn-client only needs to talk to the cluster through this proxy 
> without the need of changing any configurations. Does this sound feasible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has not coalesce() method

2014-09-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131752#comment-14131752
 ] 

Nicholas Chammas commented on SPARK-3500:
-

Hmm, you _could_ perhaps consider this a missing feature, though since all base 
RDD operations should also be valid SchemaRDD operations (right?), it 
definitely feels like a bug. And it's not just for SchemaRDDs created by 
jsonRDD (as noted in the title).

It looks like {{repartition}} is missing, too.

{code}
from pyspark.sql import SQLContext
from pyspark.sql import Row
sqlContext = SQLContext(sc)

a = sc.parallelize([Row(field1=1, field2="row1")])
sqlContext.inferSchema(a).coalesce(1)  # Method coalesce does not exist
sqlContext.inferSchema(a).repartition(1)  # Method repartition does not exist
{code}

> SchemaRDD from jsonRDD() has not coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe this would be a common scenario that the yarn 
cluster runs behind a firewall, while people want to run spark driver locally 
for best interactivity experience. You would have full control of local 
resource that can be accessed by the client as opposed to be limited to the 
spark-shell if you would do the conventional way to ssh to the remote host 
inside the firewall. For example, using ipython notebook, or more fancy IDEs, 
etc. Installing anything you want on the remote host is usually not an option. 
A potential solution is to setup socks proxy on the local machine outside of 
the firewall through ssh tunneling (ssh -D port user@remote-host) into some 
work station inside the firewall. Then the spark yarn-client only needs to talk 
to the cluster through this proxy without the need to change any 
configurations. Does this sound feasible?  (was: I believe this would be a 
common scenario that the yarn cluster runs behind a firewall, while people want 
to run spark driver locally for best interactivity experience. For example, 
using ipython notebook, or more fancy IDEs, etc. A potential solution is to 
setup socks proxy on the local machine outside of the firewall through ssh 
tunneling into some work station inside the firewall. Then the spark 
yarn-client only needs to talk to the cluster through this proxy without 
changing any configurations.)

> yarn-client through socks proxy
> ---
>
> Key: SPARK-3512
> URL: https://issues.apache.org/jira/browse/SPARK-3512
> Project: Spark
>  Issue Type: Wish
>  Components: YARN
>Reporter: Yongjia Wang
>
> I believe this is a common scenario: the YARN cluster runs behind a firewall, 
> while people want to run the Spark driver locally for the best interactive 
> experience. You would have full control of the local resources available to 
> the client, as opposed to being limited to spark-shell when you ssh to a 
> remote host inside the firewall in the conventional way. For example, you 
> could use an IPython notebook or a more capable IDE; installing whatever you 
> want on the remote host is usually not an option. A potential solution is to 
> set up a SOCKS proxy on the local machine outside the firewall through ssh 
> tunneling (ssh -D port user@remote-host) into some workstation inside the 
> firewall. Then the Spark yarn-client only needs to talk to the cluster 
> through this proxy, without changing any configuration. Does this sound 
> feasible?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe this is a common scenario: the YARN cluster runs behind 
a firewall, while people want to run the Spark driver locally for the best 
interactive experience. For example, using an IPython notebook or a more 
capable IDE. A potential solution is to set up a SOCKS proxy on the local 
machine outside the firewall through ssh tunneling into some workstation 
inside the firewall. Then the Spark yarn-client only needs to talk through 
this proxy.  (was: I believe this is a common scenario: the YARN cluster runs 
behind a firewall, while people want to run the Spark driver locally for the 
best interactive experience. For example, using an IPython notebook or a more 
capable IDE. A potential solution is to set up a SOCKS proxy on the local 
machine outside the firewall through ssh tunneling into some workstation 
inside the firewall. Then the client only needs to talk through this proxy.)

> yarn-client through socks proxy
> ---
>
> Key: SPARK-3512
> URL: https://issues.apache.org/jira/browse/SPARK-3512
> Project: Spark
>  Issue Type: Wish
>  Components: YARN
>Reporter: Yongjia Wang
>
> I believe this is a common scenario: the YARN cluster runs behind a firewall, 
> while people want to run the Spark driver locally for the best interactive 
> experience. For example, using an IPython notebook or a more capable IDE. A 
> potential solution is to set up a SOCKS proxy on the local machine outside 
> the firewall through ssh tunneling into some workstation inside the firewall. 
> Then the Spark yarn-client only needs to talk through this proxy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yongjia Wang updated SPARK-3512:

Description: I believe this is a common scenario: the YARN cluster runs behind 
a firewall, while people want to run the Spark driver locally for the best 
interactive experience. For example, using an IPython notebook or a more 
capable IDE. A potential solution is to set up a SOCKS proxy on the local 
machine outside the firewall through ssh tunneling into some workstation 
inside the firewall. Then the Spark yarn-client only needs to talk to the 
cluster through this proxy without changing any configuration.  (was: I 
believe this is a common scenario: the YARN cluster runs behind a firewall, 
while people want to run the Spark driver locally for the best interactive 
experience. For example, using an IPython notebook or a more capable IDE. A 
potential solution is to set up a SOCKS proxy on the local machine outside the 
firewall through ssh tunneling into some workstation inside the firewall. Then 
the Spark yarn-client only needs to talk through this proxy.)

> yarn-client through socks proxy
> ---
>
> Key: SPARK-3512
> URL: https://issues.apache.org/jira/browse/SPARK-3512
> Project: Spark
>  Issue Type: Wish
>  Components: YARN
>Reporter: Yongjia Wang
>
> I believe this is a common scenario: the YARN cluster runs behind a firewall, 
> while people want to run the Spark driver locally for the best interactive 
> experience. For example, using an IPython notebook or a more capable IDE. A 
> potential solution is to set up a SOCKS proxy on the local machine outside 
> the firewall through ssh tunneling into some workstation inside the firewall. 
> Then the Spark yarn-client only needs to talk to the cluster through this 
> proxy without changing any configuration.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3512) yarn-client through socks proxy

2014-09-12 Thread Yongjia Wang (JIRA)
Yongjia Wang created SPARK-3512:
---

 Summary: yarn-client through socks proxy
 Key: SPARK-3512
 URL: https://issues.apache.org/jira/browse/SPARK-3512
 Project: Spark
  Issue Type: Wish
  Components: YARN
Reporter: Yongjia Wang


I believe this is a common scenario: the YARN cluster runs behind a firewall, 
while people want to run the Spark driver locally for the best interactive 
experience. For example, using an IPython notebook or a more capable IDE. A 
potential solution is to set up a SOCKS proxy on the local machine outside the 
firewall through ssh tunneling into some workstation inside the firewall. Then 
the client only needs to talk through this proxy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-12 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131714#comment-14131714
 ] 

Davies Liu commented on SPARK-3500:
---

I think it's a bug, but there is a workaround for it:

{code}
srdd2 = SchemaRDD(srdd._jschema_rdd.coalesce(N, False, None), sqlCtx)
{code} 

> SchemaRDD from jsonRDD() has no coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131706#comment-14131706
 ] 

Patrick Wendell commented on SPARK-3500:


If it's just a missing feature, we tend to be conservative and only put it in 
the next minor release. [~davies] - is there code that people can use in 1.1 
to work around this (i.e. if they add a conversion themselves)? If there is a 
workaround for 1.1, I think we should just publish that and target this for 1.2. 

> SchemaRDD from jsonRDD() has no coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3500) SchemaRDD from jsonRDD() has no coalesce() method

2014-09-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131688#comment-14131688
 ] 

Nicholas Chammas commented on SPARK-3500:
-

[~davies] - Shouldn't the target version for this bugfix be 1.1.1?

> SchemaRDD from jsonRDD() has no coalesce() method
> --
>
> Key: SPARK-3500
> URL: https://issues.apache.org/jira/browse/SPARK-3500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.1.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>
> {code}
> >>> sqlCtx.jsonRDD(sc.parallelize(['{"foo":"bar"}', 
> >>> '{"foo":"baz"}'])).coalesce(1)
> Py4JError: An error occurred while calling o94.coalesce. Trace:
> py4j.Py4JException: Method coalesce([class java.lang.Integer, class 
> java.lang.Boolean]) does not exist
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3511) Create a RELEASE-NOTES.txt file in the repo

2014-09-12 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3511:
--

 Summary: Create a RELEASE-NOTES.txt file in the repo
 Key: SPARK-3511
 URL: https://issues.apache.org/jira/browse/SPARK-3511
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Patrick Wendell
Priority: Blocker


There are a few different things we need to do a better job of tracking. This 
file would allow us to track things:

1. When we want to give credit to secondary people for contributing to a patch
2. Changes to default configuration values w/ how to restore legacy options
3. New features that are disabled by default
4. Known API breaks (if any) along w/ explanation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3507) Create RegressionLearner trait and make some current code implement it

2014-09-12 Thread Egor Pakhomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131658#comment-14131658
 ] 

Egor Pakhomov commented on SPARK-3507:
--

https://github.com/apache/spark/pull/2371

> Create RegressionLearner trait and make some current code implement it
> --
>
> Key: SPARK-3507
> URL: https://issues.apache.org/jira/browse/SPARK-3507
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Egor Pakhomov
>Assignee: Egor Pakhomov
>Priority: Minor
> Fix For: 1.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Here at Yandex, while implementing gradient boosting in Spark and building 
> our ML tool for internal use, we found the following serious problems in 
> MLlib:
> There is no Regression/Classification learner model abstraction. We were 
> building abstract data-processing pipelines that should work with any 
> regression, with the exact algorithm specified outside this code. There is 
> no abstraction that allows us to do that. (This is the main reason for all 
> the further problems.)
> There is no common practice in MLlib for testing algorithms: every model 
> generates its own random test data. There are no easily extractable test 
> cases applicable to other algorithms, and there are no benchmarks for 
> comparing algorithms. After implementing a new algorithm, it is very hard to 
> understand how it should be tested.
> Lack of serialization testing: MLlib algorithms don't contain tests 
> verifying that a model still works after serialization.
> While implementing a new algorithm, it is hard to understand what API you 
> should create and which interface to implement.
> The starting point for solving all these problems is to create common 
> interfaces for the typical algorithms/models: regression, classification, 
> clustering, collaborative filtering.
> All main tests should be written against these interfaces, so that when a 
> new algorithm is implemented, all it has to do is pass the already written 
> tests. This lets us maintain consistent quality across the whole library.
> There should be a couple of benchmarks that help a new Spark user get a 
> feeling for which algorithm to use.
> The test set for these abstractions should contain serialization tests; in 
> production, a model that can't be stored is rarely useful.
> As the first step of this roadmap, I'd like to create a RegressionLearner 
> trait, add methods to current algorithms to implement this trait, and create 
> some tests against it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3507) Create RegressionLearner trait and make some current code implement it

2014-09-12 Thread Egor Pakhomov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Egor Pakhomov updated SPARK-3507:
-
Comment: was deleted

(was: https://github.com/apache/spark/pull/2371)

> Create RegressionLearner trait and make some current code implement it
> --
>
> Key: SPARK-3507
> URL: https://issues.apache.org/jira/browse/SPARK-3507
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Egor Pakhomov
>Assignee: Egor Pakhomov
>Priority: Minor
> Fix For: 1.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Here at Yandex, while implementing gradient boosting in Spark and building 
> our ML tool for internal use, we found the following serious problems in 
> MLlib:
> There is no Regression/Classification learner model abstraction. We were 
> building abstract data-processing pipelines that should work with any 
> regression, with the exact algorithm specified outside this code. There is 
> no abstraction that allows us to do that. (This is the main reason for all 
> the further problems.)
> There is no common practice in MLlib for testing algorithms: every model 
> generates its own random test data. There are no easily extractable test 
> cases applicable to other algorithms, and there are no benchmarks for 
> comparing algorithms. After implementing a new algorithm, it is very hard to 
> understand how it should be tested.
> Lack of serialization testing: MLlib algorithms don't contain tests 
> verifying that a model still works after serialization.
> While implementing a new algorithm, it is hard to understand what API you 
> should create and which interface to implement.
> The starting point for solving all these problems is to create common 
> interfaces for the typical algorithms/models: regression, classification, 
> clustering, collaborative filtering.
> All main tests should be written against these interfaces, so that when a 
> new algorithm is implemented, all it has to do is pass the already written 
> tests. This lets us maintain consistent quality across the whole library.
> There should be a couple of benchmarks that help a new Spark user get a 
> feeling for which algorithm to use.
> The test set for these abstractions should contain serialization tests; in 
> production, a model that can't be stored is rarely useful.
> As the first step of this roadmap, I'd like to create a RegressionLearner 
> trait, add methods to current algorithms to implement this trait, and create 
> some tests against it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3507) Create RegressionLearner trait and make some current code implement it

2014-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131650#comment-14131650
 ] 

Apache Spark commented on SPARK-3507:
-

User 'epahomov' has created a pull request for this issue:
https://github.com/apache/spark/pull/2371

> Create RegressionLearner trait and make some current code implement it
> --
>
> Key: SPARK-3507
> URL: https://issues.apache.org/jira/browse/SPARK-3507
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Egor Pakhomov
>Assignee: Egor Pakhomov
>Priority: Minor
> Fix For: 1.2.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Here at Yandex, while implementing gradient boosting in Spark and building 
> our ML tool for internal use, we found the following serious problems in 
> MLlib:
> There is no Regression/Classification learner model abstraction. We were 
> building abstract data-processing pipelines that should work with any 
> regression, with the exact algorithm specified outside this code. There is 
> no abstraction that allows us to do that. (This is the main reason for all 
> the further problems.)
> There is no common practice in MLlib for testing algorithms: every model 
> generates its own random test data. There are no easily extractable test 
> cases applicable to other algorithms, and there are no benchmarks for 
> comparing algorithms. After implementing a new algorithm, it is very hard to 
> understand how it should be tested.
> Lack of serialization testing: MLlib algorithms don't contain tests 
> verifying that a model still works after serialization.
> While implementing a new algorithm, it is hard to understand what API you 
> should create and which interface to implement.
> The starting point for solving all these problems is to create common 
> interfaces for the typical algorithms/models: regression, classification, 
> clustering, collaborative filtering.
> All main tests should be written against these interfaces, so that when a 
> new algorithm is implemented, all it has to do is pass the already written 
> tests. This lets us maintain consistent quality across the whole library.
> There should be a couple of benchmarks that help a new Spark user get a 
> feeling for which algorithm to use.
> The test set for these abstractions should contain serialization tests; in 
> production, a model that can't be stored is rarely useful.
> As the first step of this roadmap, I'd like to create a RegressionLearner 
> trait, add methods to current algorithms to implement this trait, and create 
> some tests against it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3499) Create Spark-based distcp utility

2014-09-12 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131054#comment-14131054
 ] 

Nicholas Chammas edited comment on SPARK-3499 at 9/12/14 2:27 PM:
--

I'm not sure if this type of request should be tracked in the Spark project, 
since it's an ecosystem tool rather than a Spark feature. Furthermore, I don't 
know if there is already work to port {{distcp}} to Spark or the like.

In any case, here is the request for tracking purposes.


was (Author: nchammas):
I'm not sure if this type of request should be track in the Spark project, 
since it's an ecosystem tool rather than a Spark feature. Furthermore, I don't 
know if there is already work to port {{distcp}} to Spark or the like.

In any case, here is the request for tracking purposes.

> Create Spark-based distcp utility
> -
>
> Key: SPARK-3499
> URL: https://issues.apache.org/jira/browse/SPARK-3499
> Project: Spark
>  Issue Type: Wish
>  Components: Input/Output
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Create a {{distcp}} clone that runs on Spark as opposed to MapReduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3510) Create method for calculating error between expected result and actual

2014-09-12 Thread Egor Pakhomov (JIRA)
Egor Pakhomov created SPARK-3510:


 Summary: Create method for calculating error between expected 
result and actual
 Key: SPARK-3510
 URL: https://issues.apache.org/jira/browse/SPARK-3510
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Egor Pakhomov
Priority: Minor


I could not implement this properly right now, because the zip operation needs 
the same number of elements in each partition, which I cannot guarantee.
Right now I am using the following code: 
{code:title=Bar.scala|borderStyle=solid}
import org.apache.spark.rdd.DoubleRDDFunctions

// Repartition to a single partition so that zip sees the same number of
// elements on both sides (the limitation mentioned above).
val expectedRepartitioned = expected.cache().repartition(1)
val actualRepartitioned = actual.cache().repartition(1)
new DoubleRDDFunctions(
  expectedRepartitioned
    .zip(actualRepartitioned)
    .map { case (e, a) => math.pow(e - a, 2) })
  .mean()
{code}
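
For reference, here is a sketch of one way to avoid collapsing everything into 
a single partition, assuming {{expected}} and {{actual}} are {{RDD[Double]}} 
with the same number of elements; the {{meanSquaredError}} helper name is just 
illustrative. It keys both RDDs by position via {{zipWithIndex}} and joins on 
that index:

{code}
import org.apache.spark.SparkContext._  // pair-RDD and double-RDD implicits (Spark 1.x)
import org.apache.spark.rdd.RDD

def meanSquaredError(expected: RDD[Double], actual: RDD[Double]): Double = {
  // Pair each element with its global position, then join on that position.
  val indexedExpected = expected.zipWithIndex().map(_.swap)
  val indexedActual = actual.zipWithIndex().map(_.swap)
  indexedExpected
    .join(indexedActual)
    .values
    .map { case (e, a) => math.pow(e - a, 2) }
    .mean()
}
{code}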



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3509) Method for generating random LabeledPoints for testing

2014-09-12 Thread Egor Pakhomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131554#comment-14131554
 ] 

Egor Pakhomov commented on SPARK-3509:
--

So far I only have bad code for my own usage and need a proper implementation. 
The bad code (it assumes a SparkContext {{sc}} is in scope):

{code}
import scala.util.Random

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Generates `size` LabeledPoints with `featureNumber` random features; the
// label is derived from the features so there is some correlation in the data.
def randomRegressionLabeledFeatureSet(size: Int, featureNumber: Int) = {
  // bad code. task for better code - SPARK-3509
  val seed = Random.nextLong()
  sc.parallelize(1 to size, 10).map { i =>
    val features = (1 to featureNumber).map(_ => Random.nextDouble()).toArray
    var seedCopy = seed
    val result = features.reduceLeft { (a, b) =>
      if (seedCopy % 3 == 0) {
        seedCopy = seedCopy / 3
        a * b
      } else {
        seedCopy = seedCopy / 2
        a + b
      }
    }
    new LabeledPoint(result, Vectors.dense(features))
  }
}
{code}
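
Not part of this ticket, but one possible shape for a cleaner generator is 
sketched below: derive the label from a fixed linear model plus Gaussian 
noise, seeded per partition so the data is reproducible. All names and 
parameters here are illustrative.

{code}
import scala.util.Random

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// label = weights . features + noise, with a deterministic seed per partition.
def randomLinearLabeledPoints(
    sc: SparkContext,
    size: Int,
    numFeatures: Int,
    seed: Long = 42L,
    numPartitions: Int = 10): RDD[LabeledPoint] = {
  val weights = Array.tabulate(numFeatures)(i => (i % 3) - 1.0) // fixed linear model
  sc.parallelize(1 to size, numPartitions).mapPartitionsWithIndex { (part, iter) =>
    val rnd = new Random(seed + part)
    iter.map { _ =>
      val features = Array.fill(numFeatures)(rnd.nextDouble())
      val label =
        weights.zip(features).map { case (w, x) => w * x }.sum + 0.1 * rnd.nextGaussian()
      LabeledPoint(label, Vectors.dense(features))
    }
  }
}
{code}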

> Method for generating random LabeledPoints for testing
> --
>
> Key: SPARK-3509
> URL: https://issues.apache.org/jira/browse/SPARK-3509
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Egor Pakhomov
>Priority: Minor
> Fix For: 1.2.0
>
>
> During testing I need random LabeledPoints with some correlation behind them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3509) Method for generating random LabeledPoints for testing

2014-09-12 Thread Egor Pakhomov (JIRA)
Egor Pakhomov created SPARK-3509:


 Summary: Method for generating random LabeledPoints for testing
 Key: SPARK-3509
 URL: https://issues.apache.org/jira/browse/SPARK-3509
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Egor Pakhomov
Priority: Minor
 Fix For: 1.2.0


During testing I need random LabeledPoints with some correlation behind them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3508) annotate the Spark configs to indicate which ones are meant for the end user

2014-09-12 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3508:


 Summary: annotate the Spark configs to indicate which ones are 
meant for the end user
 Key: SPARK-3508
 URL: https://issues.apache.org/jira/browse/SPARK-3508
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Thomas Graves


Spark has lots of configs floating around.  To me, configs are like APIs, and 
we should make it clear which ones are meant for the end user and which ones 
are only used internally.  We should decide exactly how we want to do this.

In the past I've seen users look at the code, use a config that was meant to 
be internal, and then file a JIRA to document it.  Since there are many 
committers, it's easy for someone who doesn't have the history with that 
config to think we simply forgot to document it, and then it becomes public.

Perhaps we need to name internal configs specially (spark.internal.), annotate 
them, or do something else.

Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2558) Mention --queue argument in YARN documentation

2014-09-12 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-2558.
--
  Resolution: Fixed
   Fix Version/s: 1.2.0
Target Version/s: 1.2.0  (was: 1.1.0, 1.0.3)

> Mention --queue argument in YARN documentation 
> ---
>
> Key: SPARK-2558
> URL: https://issues.apache.org/jira/browse/SPARK-2558
> Project: Spark
>  Issue Type: Documentation
>  Components: YARN
>Reporter: Matei Zaharia
>Priority: Trivial
>  Labels: Starter
> Fix For: 1.2.0
>
>
> The docs about it went away when we updated the page to spark-submit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3507) Create RegressionLearner trait and make some current code implement it

2014-09-12 Thread Egor Pakhomov (JIRA)
Egor Pakhomov created SPARK-3507:


 Summary: Create RegressionLearner trait and make some current code 
implement it
 Key: SPARK-3507
 URL: https://issues.apache.org/jira/browse/SPARK-3507
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.2.0
Reporter: Egor Pakhomov
Priority: Minor
 Fix For: 1.2.0


Here at Yandex, while implementing gradient boosting in Spark and building our 
ML tool for internal use, we found the following serious problems in MLlib:

There is no Regression/Classification learner model abstraction. We were 
building abstract data-processing pipelines that should work with any 
regression, with the exact algorithm specified outside this code. There is no 
abstraction that allows us to do that. (This is the main reason for all the 
further problems.)
There is no common practice in MLlib for testing algorithms: every model 
generates its own random test data. There are no easily extractable test cases 
applicable to other algorithms, and there are no benchmarks for comparing 
algorithms. After implementing a new algorithm, it is very hard to understand 
how it should be tested.
Lack of serialization testing: MLlib algorithms don't contain tests verifying 
that a model still works after serialization.
While implementing a new algorithm, it is hard to understand what API you 
should create and which interface to implement.
The starting point for solving all these problems is to create common 
interfaces for the typical algorithms/models: regression, classification, 
clustering, collaborative filtering.

All main tests should be written against these interfaces, so that when a new 
algorithm is implemented, all it has to do is pass the already written tests. 
This lets us maintain consistent quality across the whole library.

There should be a couple of benchmarks that help a new Spark user get a 
feeling for which algorithm to use.

The test set for these abstractions should contain serialization tests; in 
production, a model that can't be stored is rarely useful.

As the first step of this roadmap, I'd like to create a RegressionLearner 
trait, add methods to current algorithms to implement this trait, and create 
some tests against it. 
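
As a rough illustration only (not a design decision from this ticket), such a 
trait could look something like the sketch below; the {{RegressionLearner}} 
and {{RegressionModel}} names and methods here are placeholders:

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical sketch: a minimal learner/model pair that a pipeline could be
// written against, with the concrete algorithm plugged in from outside.
trait RegressionModel extends Serializable {
  def predict(features: Vector): Double
  def predict(data: RDD[Vector]): RDD[Double] = data.map(v => predict(v))
}

trait RegressionLearner extends Serializable {
  def train(data: RDD[LabeledPoint]): RegressionModel
}
{code}

A pipeline could then accept any {{RegressionLearner}}, and shared tests 
(including serialization tests) could be written once against these traits.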




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2182) Scalastyle rule blocking unicode operators

2014-09-12 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131451#comment-14131451
 ] 

Prashant Sharma commented on SPARK-2182:


Found this SO link useful, 
http://stackoverflow.com/questions/23224219/does-the-scala-compiler-work-with-utf-8-encoded-source-files.
 

> Scalastyle rule blocking unicode operators
> --
>
> Key: SPARK-2182
> URL: https://issues.apache.org/jira/browse/SPARK-2182
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Andrew Ash
>Assignee: Prashant Sharma
> Attachments: Screen Shot 2014-06-18 at 3.28.44 PM.png
>
>
> Some IDEs don't support Scala's [unicode 
> operators|http://www.scala-lang.org/old/node/4723] so we should consider 
> adding a scalastyle rule to block them for wider compatibility among 
> contributors.
> See this PR for a place we reverted a unicode operator: 
> https://github.com/apache/spark/pull/1119



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3506) 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest

2014-09-12 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-3506:
--

 Summary: 1.1.0-SNAPSHOT in docs for 1.1.0 under docs/latest
 Key: SPARK-3506
 URL: https://issues.apache.org/jira/browse/SPARK-3506
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Jacek Laskowski
Priority: Trivial


In https://spark.apache.org/docs/latest/ there are references to 1.1.0-SNAPSHOT:

* This documentation is for Spark version 1.1.0-SNAPSHOT.
* For the Scala API, Spark 1.1.0-SNAPSHOT uses Scala 2.10.

It should say version 1.1.0, since that's the latest released version and the 
page header says so, too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark

2014-09-12 Thread Ryan D Braley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131333#comment-14131333
 ] 

Ryan D Braley commented on SPARK-2593:
--

This would be quite useful. It is hard to use actorStream with Spark Streaming 
when you have remote actors sending to Spark, because we need two actor 
systems. Right now the name of the actor system in Spark seems to be hardcoded 
to "spark". For actors to join an Akka cluster, the actor systems need to 
share the same name, so it is currently difficult to distribute work from an 
external actor system to the Spark cluster without this change.

> Add ability to pass an existing Akka ActorSystem into Spark
> ---
>
> Key: SPARK-2593
> URL: https://issues.apache.org/jira/browse/SPARK-2593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Helena Edelson
>
> As a developer I want to pass an existing ActorSystem into StreamingContext 
> in load-time so that I do not have 2 actor systems running on a node in an 
> Akka application.
> This would mean having spark's actor system on its own named-dispatchers as 
> well as exposing the new private creation of its own actor system.
>  
> I would like to create an Akka Extension that wraps around Spark/Spark 
> Streaming and Cassandra. So the programmatic creation would simply be this 
> for a user
> val extension = SparkCassandra(system)
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131314#comment-14131314
 ] 

Saisai Shao commented on SPARK-2926:


Hi Reynold, thanks a lot for taking a look at this. Here is the branch 
(https://github.com/jerryshao/apache-spark/tree/sort-based-shuffle-read), 
though the code is not yet rebased onto the latest master branch.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves IO performance and reduces memory consumption when the 
> number of reducers is very large. But on the reducer side, it still uses the 
> hash-based shuffle reader, which ignores the ordering of the map output data 
> in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-09-12 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131299#comment-14131299
 ] 

Xiangrui Meng commented on SPARK-1405:
--

[~xusen] and [~gq] Thanks for working on LDA! The major feedback on your 
implementations concerns how the models are stored.

[~josephkb] and I had an offline discussion with Evan and Joey (AMPLab) on 
LDA's interface and implementation. For the input data, we recommend `RDD[(Int, 
Vector)]`, where each pair consists of the document id and its word 
distribution, which may come from a text vectorizer. For the output model, 
because the LDA model is huge (W*K + D*K), where W is the number of words, D is 
the number of documents, and K is the number of topics, we should store the 
model distributively for better scalability, e.g., in RDD[(Int, Vector)], or 
using Long for ids. Joey already had an LDA implementation using GraphX: 

https://github.com/jegonzal/graphx/blob/LDA/graph/src/main/scala/org/apache/spark/graph/algorithms/TopicModeling.scala

With GraphX, we can treat documents and words as graph nodes and topic 
assignments as edges. The code is easy to understand.

There is also a paper describing a distributed implementation of LDA on Spark 
that uses a DGSD-like partitioning of the doc-word matrix:

http://jmlr.org/proceedings/papers/v36/qiu14.pdf

Anyone interested in helping test those implementations?
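
To make the suggested input format concrete, here is a tiny sketch (the toy 
corpus and the {{toLdaInput}} helper are made up for illustration): each 
document id is paired with a sparse vector of word counts over the vocabulary.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Build RDD[(Int, Vector)] pairs of (document id, word-count vector).
def toLdaInput(sc: SparkContext, vocabSize: Int): RDD[(Int, Vector)] = {
  val docs = Seq(
    0 -> Seq(0 -> 2.0, 3 -> 1.0), // doc 0: word 0 twice, word 3 once
    1 -> Seq(1 -> 1.0, 2 -> 4.0)  // doc 1: word 1 once, word 2 four times
  )
  sc.parallelize(docs).map { case (docId, counts) =>
    (docId, Vectors.sparse(vocabSize, counts))
  }
}
{code}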

> parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
> -
>
> Key: SPARK-1405
> URL: https://issues.apache.org/jira/browse/SPARK-1405
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: features
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Latent Dirichlet Allocation (a.k.a. LDA) is a topic model that extracts 
> topics from a text corpus. Unlike the current machine learning algorithms in 
> MLlib, which use optimization algorithms such as gradient descent, LDA uses 
> expectation algorithms such as Gibbs sampling. 
> In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
> wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), 
> and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2838) performance tests for feature transformations

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2838:
-
Assignee: (was: Xiangrui Meng)

> performance tests for feature transformations
> -
>
> Key: SPARK-2838
> URL: https://issues.apache.org/jira/browse/SPARK-2838
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Priority: Minor
>
> 1. TF-IDF
> 2. StandardScaler
> 3. Normalizer
> 4. Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3436) [MLlib]Streaming SVM

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3436:
-
Assignee: Liquan Pei

> [MLlib]Streaming SVM 
> -
>
> Key: SPARK-3436
> URL: https://issues.apache.org/jira/browse/SPARK-3436
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Liquan Pei
>Assignee: Liquan Pei
>
> Implement online learning with kernels according to 
> http://users.cecs.anu.edu.au/~williams/papers/P172.pdf
> The algorithms proposed in the above paper are implemented in R 
> (http://users.cecs.anu.edu.au/~williams/papers/P172.pdf) and MADlib 
> (http://doc.madlib.net/latest/group__grp__kernmach.html)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3249) Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3249:
-
Target Version/s: 1.2.0  (was: 1.1.0)

> Fix links in ScalaDoc that cause warning messages in `sbt/sbt unidoc`
> -
>
> Key: SPARK-3249
> URL: https://issues.apache.org/jira/browse/SPARK-3249
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> If there are multiple overloaded versions of a method, we should make the 
> links more specific. Otherwise, `sbt/sbt unidoc` generates warning messages 
> like the following:
> {code}
> [warn] 
> mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala:305: The 
> link target "org.apache.spark.mllib.tree.DecisionTree$#trainClassifier" is 
> ambiguous. Several members fit the target:
> [warn] (input: 
> org.apache.spark.api.java.JavaRDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: java.util.Map[Integer,Integer],impurity: 
> String,maxDepth: Int,maxBins: Int): 
> org.apache.spark.mllib.tree.model.DecisionTreeModel in object DecisionTree 
> [chosen]
> [warn] (input: 
> org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint],numClassesForClassification:
>  Int,categoricalFeaturesInfo: Map[Int,Int],impurity: String,maxDepth: 
> Int,maxBins: Int): org.apache.spark.mllib.tree.model.DecisionTreeModel in 
> object DecisionTree
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2838) performance tests for feature transformations

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2838:
-
Target Version/s: 1.2.0  (was: 1.1.0)

> performance tests for feature transformations
> -
>
> Key: SPARK-2838
> URL: https://issues.apache.org/jira/browse/SPARK-2838
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>
> 1. TF-IDF
> 2. StandardScaler
> 3. Normalizer
> 4. Word2Vec



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2830) MLlib v1.1 documentation

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-2830.
--
   Resolution: Fixed
Fix Version/s: 1.1.0

> MLlib v1.1 documentation
> 
>
> Key: SPARK-2830
> URL: https://issues.apache.org/jira/browse/SPARK-2830
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Ameet Talwalkar
> Fix For: 1.1.0
>
>
> This is an umbrella JIRA for MLlib v1.1 documentation. Tasks are
> 1. write docs for new features
> 2. add code examples for Python/Java
> 3. migration guide (if there are API changes, which I don't remember)
> 4. update MLlib's webpage



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3160) Simplify DecisionTree data structure for training

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3160.
--
   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2341
[https://github.com/apache/spark/pull/2341]

> Simplify DecisionTree data structure for training
> -
>
> Key: SPARK-3160
> URL: https://issues.apache.org/jira/browse/SPARK-3160
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.2.0
>
>
> Improvement: code clarity
> Currently, we maintain a tree structure, a flat array of nodes, and a 
> parentImpurities array.
> Proposed fix: Maintain everything within a growing tree structure.
> This would let us eliminate the flat array of nodes, thus saving storage when 
> we do not grow a full tree.  It would also potentially make it easier to pass 
> subtrees to compute nodes for local training.
> Note:
> * This JIRA used to have this item as well: We could have a “LearningNode 
> extends Node” setup where the LearningNode holds metadata for learning (such 
> as impurities).  The test-time model could be extracted from this 
> training-time model, so that extra information (such as impurities) does not 
> have to be kept after training.
> * However, this is really a separate issue, so I removed it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3494) DecisionTree overflow error in calculating maxMemoryUsage

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3494:
-
Fix Version/s: 1.2.0

> DecisionTree overflow error in calculating maxMemoryUsage
> -
>
> Key: SPARK-3494
> URL: https://issues.apache.org/jira/browse/SPARK-3494
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.2.0
>
>
> maxMemoryUsage can easily overflow.  It needs to use long ints, and also 
> check for overflows afterwards.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3494) DecisionTree overflow error in calculating maxMemoryUsage

2014-09-12 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3494.
--
Resolution: Fixed
  Assignee: Joseph K. Bradley

https://github.com/apache/spark/pull/2341

> DecisionTree overflow error in calculating maxMemoryUsage
> -
>
> Key: SPARK-3494
> URL: https://issues.apache.org/jira/browse/SPARK-3494
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> maxMemoryUsage can easily overflow.  It needs to use long ints, and also 
> check for overflows afterwards.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3393) Align the log4j configuration for Spark & SparkSQLCLI

2014-09-12 Thread Cheng Hao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Hao updated SPARK-3393:
-
Summary: Align the log4j configuration for Spark & SparkSQLCLI  (was: Add 
configuration templates for HQL user)

> Align the log4j configuration for Spark & SparkSQLCLI
> -
>
> Key: SPARK-3393
> URL: https://issues.apache.org/jira/browse/SPARK-3393
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Cheng Hao
>Priority: Minor
>
> Users may be confused by the HQL logging & configuration; we'd better 
> provide a default template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3084) Collect broadcasted tables in parallel in joins

2014-09-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131258#comment-14131258
 ] 

Reynold Xin commented on SPARK-3084:


Note that the current fix actually launches jobs immediately in explain ... we 
should fix that.

> Collect broadcasted tables in parallel in joins
> ---
>
> Key: SPARK-3084
> URL: https://issues.apache.org/jira/browse/SPARK-3084
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Fix For: 1.1.0
>
>
> BroadcastHashJoin has a broadcastFuture variable that tries to collect the 
> broadcasted table in a separate thread, but this doesn't help because it's a 
> lazy val that only gets initialized when you attempt to build the RDD. Thus 
> queries that broadcast multiple tables would collect and broadcast them 
> sequentially.
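
To illustrate the lazy-val pitfall described above with a standalone sketch 
(this is deliberately not the BroadcastHashJoin code; it only shows why a lazy 
val future does not start work ahead of time):

{code}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object LazyFutureDemo extends App {
  def collectTable(name: String): String = { Thread.sleep(1000); s"table $name" }

  def timed[A](label: String)(body: => A): A = {
    val start = System.nanoTime()
    val result = body
    println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
    result
  }

  // Lazy futures are only created when first referenced, so awaiting one after
  // the other serializes the two "collections" (~2 s).
  lazy val lazyA = Future(collectTable("a"))
  lazy val lazyB = Future(collectTable("b"))
  timed("lazy") { Await.result(lazyA, 5.seconds); Await.result(lazyB, 5.seconds) }

  // Plain vals kick off both futures immediately, so (on a multi-core machine)
  // the waits overlap (~1 s).
  val eagerA = Future(collectTable("a"))
  val eagerB = Future(collectTable("b"))
  timed("eager") { Await.result(eagerA, 5.seconds); Await.result(eagerB, 5.seconds) }
}
{code}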



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2992) The transforms formerly known as non-lazy

2014-09-12 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2992:
---
Target Version/s: 1.2.0  (was: 1.1.1)

> The transforms formerly known as non-lazy
> -
>
> Key: SPARK-2992
> URL: https://issues.apache.org/jira/browse/SPARK-2992
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>
> An umbrella for a grab-bag of tickets involving lazy implementations of 
> transforms formerly thought to be non-lazy.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3499) Create Spark-based distcp utility

2014-09-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131255#comment-14131255
 ] 

Reynold Xin commented on SPARK-3499:


Would be pretty cool to have actually. I don't know if it belongs in Spark 
directly. Maybe examples?

> Create Spark-based distcp utility
> -
>
> Key: SPARK-3499
> URL: https://issues.apache.org/jira/browse/SPARK-3499
> Project: Spark
>  Issue Type: Wish
>  Components: Input/Output
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Create a {{distcp}} clone that runs on Spark as opposed to MapReduce.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3498) Block always replicated to the same node

2014-09-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131256#comment-14131256
 ] 

Reynold Xin commented on SPARK-3498:


cc [~tdas]

> Block always replicated to the same node
> 
>
> Key: SPARK-3498
> URL: https://issues.apache.org/jira/browse/SPARK-3498
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: shenhong
>
> When running a Spark Streaming job, receiver blocks should be replicated, 
> but all the blocks are replicated to the same node. Here is the log:
> 14/09/10 19:55:16 INFO BlockManagerInfo: Added input-0-1410350117000 in 
> memory on 10.196.131.19:42261 (size: 8.9 MB, free: 1050.3 MB)
> 14/09/10 19:55:16 INFO BlockManagerInfo: Added input-0-1410350117000 in 
> memory on tdw-10-196-130-155:51155 (size: 8.9 MB, free: 879.3 MB)
> 14/09/10 19:55:17 INFO BlockManagerInfo: Added input-0-1410350118000 in 
> memory on 10.196.131.19:42261 (size: 7.7 MB, free: 1042.6 MB)
> 14/09/10 19:55:17 INFO BlockManagerInfo: Added input-0-1410350118000 in 
> memory on tdw-10-196-130-155:51155 (size: 7.7 MB, free: 871.6 MB)
> 14/09/10 19:55:18 INFO BlockManagerInfo: Added input-0-1410350119000 in 
> memory on 10.196.131.19:42261 (size: 7.3 MB, free: 1035.3 MB)
> 14/09/10 19:55:18 INFO BlockManagerInfo: Added input-0-1410350119000 in 
> memory on tdw-10-196-130-155:51155 (size: 7.3 MB, free: 864.3 MB)
> The reason is that when a BlockManager slave asks the BlockManager master 
> for peer BlockManagerIds, the master always returns the same ones. Here is 
> the code:
> {code}
> private def getPeers(blockManagerId: BlockManagerId, size: Int): Seq[BlockManagerId] = {
>   val peers: Array[BlockManagerId] = blockManagerInfo.keySet.toArray
>   val selfIndex = peers.indexOf(blockManagerId)
>   if (selfIndex == -1) {
>     throw new SparkException("Self index for " + blockManagerId + " not found")
>   }
>   // Note that this logic will select the same node multiple times if there aren't enough peers
>   Array.tabulate[BlockManagerId](size) { i => peers((selfIndex + i + 1) % peers.length) }.toSeq
> }
> {code}
> I think the BlockManager master should instead return `size` 
> BlockManagerIds, preferring peers with more remaining memory.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14131248#comment-14131248
 ] 

Reynold Xin commented on SPARK-2926:


Do you have a branch that I can test with? I'm running some sorting tests and 
can test this out also on some dataset.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
> Report(contd).pdf, Spark Shuffle Test Report.pdf
>
>
> Currently Spark has already integrated sort-based shuffle write, which 
> greatly improves IO performance and reduces memory consumption when the 
> number of reducers is very large. But on the reducer side, it still uses the 
> hash-based shuffle reader, which ignores the ordering of the map output data 
> in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted later, 
> once some unit test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


