[jira] [Commented] (SPARK-3481) HiveComparisonTest throws exception of "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default"
[ https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144421#comment-14144421 ] Apache Spark commented on SPARK-3481: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2505 > HiveComparisonTest throws exception of > "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: > default" > --- > > Key: SPARK-3481 > URL: https://issues.apache.org/jira/browse/SPARK-3481 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > Fix For: 1.2.0 > > > In local test, lots of exception raised like: > {panel} > 11:08:01.746 ERROR hive.ql.exec.DDLTask: > org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: > default > at > org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480) > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237) > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) > at > org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) > at > org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88) > at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) > at > org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at > org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) > at org.scalatest.Suite$class.withFixture(Suite.scala:1121) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) > at org.scalatest.Suite$class.run(Suite.scala:1423) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(Hive
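A hedged sketch of the kind of guard that would avoid this failure mode in TestHive.reset(); this is an illustration only, not taken from the linked pull request:
{code:scala}
// Ensure the default database exists before switching to it, so
// DDLTask.switchDatabase cannot throw "Database does not exist: default".
// runSqlHive here stands for HiveContext.runSqlHive(sql: String): Seq[String].
def resetDatabases(runSqlHive: String => Seq[String]): Unit = {
  runSqlHive("CREATE DATABASE IF NOT EXISTS default")
  runSqlHive("USE default")
}
{code}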
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144412#comment-14144412 ] Apache Spark commented on SPARK-3577: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2504 > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
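A minimal sketch of the bookkeeping the ticket asks for; the class and field names below are illustrative, not Spark's actual TaskMetrics API:
{code:scala}
// Accumulate the time spent spilling in a dedicated task-level counter,
// instead of folding it into shuffle write time.
class SpillTimer {
  private var _spillTimeNanos = 0L
  def spillTimeNanos: Long = _spillTimeNanos

  // Wrap each spill call, e.g. timeSpill { sorter.spillToMergeableFile(...) }
  def timeSpill[T](body: => T): T = {
    val start = System.nanoTime()
    try body finally { _spillTimeNanos += System.nanoTime() - start }
  }
}
{code}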
[jira] [Commented] (SPARK-3172) Distinguish between shuffle spill on the map and reduce side
[ https://issues.apache.org/jira/browse/SPARK-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144411#comment-14144411 ] Apache Spark commented on SPARK-3172: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2504 > Distinguish between shuffle spill on the map and reduce side > > > Key: SPARK-3172 > URL: https://issues.apache.org/jira/browse/SPARK-3172 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
[ https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144326#comment-14144326 ] Apache Spark commented on SPARK-3649: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/2503 > ClassCastException in GraphX custom serializers when sort-based shuffle spills > -- > > Key: SPARK-3649 > URL: https://issues.apache.org/jira/browse/SPARK-3649 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > As > [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] > on the mailing list, GraphX throws > {code} > java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 > at > org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) > > at > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) > > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) > {code} > when sort-based shuffle attempts to spill to disk. This is because GraphX > defines custom serializers for shuffling pair RDDs that assume Spark will > always serialize the entire pair object rather than breaking it up into its > components. However, the spill code path in sort-based shuffle [violates this > assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > GraphX uses the custom serializers to compress vertex ID keys using > variable-length integer encoding. However, since the serializer can no longer > rely on the key and value being serialized and deserialized together, > performing such encoding would require writing a tag byte. Therefore it may > be better to simply remove the custom serializers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
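A simplified sketch of why the spill path breaks the assumption; the stream class below is illustrative, not GraphX's actual Serializers.scala:
{code:scala}
import java.io.DataOutputStream

// A pair-oriented serialization stream that assumes every writeObject call
// receives a whole (VertexId, message) Tuple2.
class PairAssumingStream(out: DataOutputStream) {
  def writeObject[T](t: T): Unit = {
    val (vid, _) = t.asInstanceOf[(Long, Any)] // ClassCastException if t is a bare Long
    out.writeLong(vid)
    // ... variable-length encode the message ...
  }
}

// The sort-based spill path instead writes the key and the value separately,
// roughly: writer.write(key); writer.write(value)
// so writeObject is handed a bare Long key and the cast above fails.
{code}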
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144258#comment-14144258 ] Saisai Shao commented on SPARK-3032: Hi Matei, thanks for your reply, I will try again using your comments. > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3212) Improve the clarity of caching semantics
[ https://issues.apache.org/jira/browse/SPARK-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144245#comment-14144245 ] Apache Spark commented on SPARK-3212: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2501 > Improve the clarity of caching semantics > > > Key: SPARK-3212 > URL: https://issues.apache.org/jira/browse/SPARK-3212 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > Right now there are a bunch of different ways to cache tables in Spark SQL. > For example: > - tweets.cache() > - sql("SELECT * FROM tweets").cache() > - table("tweets").cache() > - tweets.cache().registerTempTable(tweets) > - sql("CACHE TABLE tweets") > - cacheTable("tweets") > Each of the above commands has subtly different semantics, leading to a very > confusing user experience. Ideally, we would stop doing caching based on > simple tables names and instead have a phase of optimization that does > intelligent matching of query plans with available cached data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3032: - Assignee: Saisai Shao > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144243#comment-14144243 ] Matei Zaharia commented on SPARK-3032: -- I'm not completely sure that this is because hashCode provides a partial ordering, because I believe TimSort is supposed to work on partial orderings as well. I believe the problem is an integer over flow when we subtract key1.hashCode - key2.hashCode. Can you try replacing the line that returns h1 - h2 in keyComparator with returning Integer.compare(h1, h2)? This will properly deal with overflow. Returning h1 - h2 is definitely wrong: for example suppose that h1 = Int.MaxValue and h2 = Int.MinValue, then h1 - h2 = -1. Please add a unit test for this case as well. > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
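A sketch of the comparator change Matei suggests, assuming a comparator shaped like ExternalSorter's key comparator (the surrounding names are illustrative):
{code:scala}
import java.util.Comparator

// Subtracting hash codes can overflow Int; Integer.compare gives a
// contract-respecting ordering over hash codes.
val keyComparator: Comparator[Any] = new Comparator[Any] {
  def compare(a: Any, b: Any): Int = {
    val h1 = if (a == null) 0 else a.hashCode()
    val h2 = if (b == null) 0 else b.hashCode()
    // h1 - h2 is wrong: h1 = Int.MaxValue, h2 = Int.MinValue yields -1.
    java.lang.Integer.compare(h1, h2)
  }
}
{code}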
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144244#comment-14144244 ] Matei Zaharia commented on SPARK-3032: -- Yeah actually I'm sure TimSort works fine with a partial ordering, I read through the contract of Comparable. We also use it that way all the time when we only sort by partition ID. > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3655) Secondary sort
koert kuipers created SPARK-3655: Summary: Secondary sort Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: koert kuipers Priority: Minor Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
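Until such an API exists, a hedged sketch of the usual workaround: sort by a composite key while partitioning on the primary key only (repartitionAndSortWithinPartitions is assumed available; everything else below is illustrative):
{code:scala}
import org.apache.spark.{HashPartitioner, Partitioner, SparkContext}
import org.apache.spark.SparkContext._

// Route each ((k, v), _) record by k alone, so all values of a key land in
// one partition and arrive sorted by (k, v).
class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  def numPartitions: Int = delegate.numPartitions
  def getPartition(key: Any): Int = key match {
    case (primary, _) => delegate.getPartition(primary)
  }
}

def example(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
  val sorted = pairs
    .map { case (k, v) => ((k, v), ()) }                               // composite key
    .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(4))  // sort within partitions
    .map { case ((k, v), _) => (k, v) }                                // values now ordered per key
  sorted.collect().foreach(println)
}
{code}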
[jira] [Created] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
Cheng Lian created SPARK-3654: - Summary: Implement all extended HiveQL statements/commands with a separate parser combinator Key: SPARK-3654 URL: https://issues.apache.org/jira/browse/SPARK-3654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Statements and commands such as {{SET}}, {{CACHE TABLE}}, and {{ADD JAR}} are currently parsed in quite a hacky way, like this: {code} if (sql.trim.toLowerCase.startsWith("cache table")) { sql.trim.toLowerCase.startsWith("cache table") match { ... } } {code} It would be much better to add an extra parser combinator that parses these syntax extensions first, and then falls back to the normal Hive parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
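A hedged sketch of what such a combinator could look like; the command case classes and the fallback hook are illustrative, not Spark SQL's actual classes:
{code:scala}
import scala.util.parsing.combinator.RegexParsers

// Parse the Spark-specific extensions first; anything that does not match
// falls through to the normal Hive parser.
object ExtendedHiveQlParser extends RegexParsers {
  sealed trait Command
  case class SetCommand(body: Option[String]) extends Command
  case class CacheCommand(table: String) extends Command
  case class AddJar(path: String) extends Command
  case class HiveNativeCommand(sql: String) extends Command

  private def SET   = "(?i)SET".r
  private def CACHE = "(?i)CACHE".r
  private def TABLE = "(?i)TABLE".r
  private def ADD   = "(?i)ADD".r
  private def JAR   = "(?i)JAR".r

  private def set: Parser[Command]    = SET ~> opt(".+".r) ^^ SetCommand
  private def cache: Parser[Command]  = CACHE ~ TABLE ~> "\\S+".r ^^ CacheCommand
  private def addJar: Parser[Command] = ADD ~ JAR ~> ".+".r ^^ AddJar
  private def extension: Parser[Command] = cache | addJar | set

  def parse(sql: String): Command = parseAll(extension, sql) match {
    case Success(cmd, _) => cmd
    case _               => HiveNativeCommand(sql) // fall back to the Hive parser
  }
}
{code}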
[jira] [Commented] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144192#comment-14144192 ] SK commented on SPARK-3610: --- Today I found that the same issue occurs with GraphX application logs as well (basically, the log file name includes parentheses and commas), and the history server gets messed up and needs to be restarted every time. Thanks. > History server log name should not be based on user input > - > > Key: SPARK-3610 > URL: https://issues.apache.org/jira/browse/SPARK-3610 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: SK >Priority: Critical > > Right now we use the user-defined application name when creating the logging > file for the history server. We should use some type of GUID generated from > inside of Spark instead of allowing user input here. It can cause errors if > users provide characters that are not valid in filesystem paths. > Original bug report: > {quote} > The default log files for the MLlib examples use a rather long naming > convention that includes special characters like parentheses and comma. For > e.g. one of my log files is named > "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032". > When I click on the program on the history server page (at port 18080), to > view the detailed application logs, the history server crashes and I need to > restart it. I am using Spark 1.1 on a Mesos cluster. > I renamed the log file by removing the special characters and then it loads > up correctly. I am not sure which program is creating the log files. Can it > be changed so that the default log file naming convention does not include > special characters? > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
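A hedged sketch of the direction the ticket suggests; the helper below is illustrative, not Spark's actual event-log naming code:
{code:scala}
import java.util.UUID

// Build the event-log file name from a sanitized application name plus a
// generated id, so user-supplied characters such as '(', ')' or ',' never
// end up in a filesystem path.
def eventLogFileName(appName: String): String = {
  val sanitized = appName.toLowerCase.replaceAll("[^a-z0-9\\-_]", "-")
  s"$sanitized-${UUID.randomUUID()}"
}
{code}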
[jira] [Commented] (SPARK-3653) SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144155#comment-14144155 ] Apache Spark commented on SPARK-3653: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2500 > SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode > - > > Key: SPARK-3653 > URL: https://issues.apache.org/jira/browse/SPARK-3653 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > We only check for these in the spark-class but not in SparkSubmit. For client > mode, this is OK because the driver can read directly from these environment > variables. > For cluster mode however, SPARK_EXECUTOR_MEMORY is not set on the node that > starts the driver, and SPARK_DRIVER_MEMORY is simply not propagated to the > driver JVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144150#comment-14144150 ] Apache Spark commented on SPARK-3652: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2499 > upgrade spark sql hive version to 0.13.1 > > > Key: SPARK-3652 > URL: https://issues.apache.org/jira/browse/SPARK-3652 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > > now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3653) SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode
Andrew Or created SPARK-3653: Summary: SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode Key: SPARK-3653 URL: https://issues.apache.org/jira/browse/SPARK-3653 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical We only check for these in the spark-class but not in SparkSubmit. For client mode, this is OK because the driver can read directly from these environment variables. For cluster mode however, SPARK_EXECUTOR_MEMORY is not set on the node that starts the driver, and SPARK_DRIVER_MEMORY is simply not propagated to the driver JVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
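A hedged sketch of the fallback order a fix could apply when resolving memory settings in SparkSubmit; the method and default below are illustrative, not the actual SparkSubmitArguments code:
{code:scala}
// Prefer the explicit CLI flag, then spark-defaults, then the environment,
// so cluster mode also honors SPARK_DRIVER_MEMORY.
def resolveDriverMemory(cliValue: Option[String],
                        defaults: Map[String, String],
                        env: Map[String, String] = sys.env): String = {
  cliValue
    .orElse(defaults.get("spark.driver.memory"))
    .orElse(env.get("SPARK_DRIVER_MEMORY"))
    .getOrElse("512m") // illustrative default
}
{code}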
[jira] [Updated] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3652: --- Description: now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. > upgrade spark sql hive version to 0.13.1 > > > Key: SPARK-3652 > URL: https://issues.apache.org/jira/browse/SPARK-3652 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > > now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3652) upgrade spark sql hive version to 0.13.1
wangfei created SPARK-3652: -- Summary: upgrade spark sql hive version to 0.13.1 Key: SPARK-3652 URL: https://issues.apache.org/jira/browse/SPARK-3652 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144100#comment-14144100 ] Apache Spark commented on SPARK-3606: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2497 > Spark-on-Yarn AmIpFilter does not work with Yarn HA. > > > Key: SPARK-3606 > URL: https://issues.apache.org/jira/browse/SPARK-3606 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > The current IP filter only considers one of the RMs in an HA setup. If the > active RM is not the configured one, you get a "connection refused" error > when clicking on the Spark AM links in the RM UI. > Similar to YARN-1811, but for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
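A hedged sketch of how the proxy hosts for all RMs in an HA setup could be gathered; the wiring into AmIpFilter is omitted, and only the standard YARN config keys are assumed:
{code:scala}
import org.apache.hadoop.conf.Configuration

// Collect the web-app address of every configured RM, not just the first one,
// so the filter accepts requests proxied by whichever RM is currently active.
def allRmWebAppAddresses(conf: Configuration): Seq[String] = {
  val rmIds = Option(conf.get("yarn.resourcemanager.ha.rm-ids"))
    .map(_.split(",").map(_.trim).toSeq)
    .getOrElse(Seq.empty)
  if (rmIds.isEmpty) {
    Seq(conf.get("yarn.resourcemanager.webapp.address"))
  } else {
    rmIds.map(id => conf.get(s"yarn.resourcemanager.webapp.address.$id"))
  }
}
{code}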
[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144095#comment-14144095 ] Dale Richardson commented on SPARK-3620: Because Typesafe Config is based on a JSON-like tree structure of config values, it will never support non-common prefixes on config variables, so I've gone back to using property objects. > Refactor config option handling code for spark-submit > - > > Key: SPARK-3620 > URL: https://issues.apache.org/jira/browse/SPARK-3620 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 >Reporter: Dale Richardson >Assignee: Dale Richardson >Priority: Minor > > I'm proposing it's time to refactor the configuration argument handling code > in spark-submit. The code has grown organically in a short period of time, > handles a pretty complicated logic flow, and is now pretty fragile. Some > issues that have been identified: > 1. Hand-crafted property file readers that do not support the property file > format as specified in > http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) > 2. ResolveURI not called on paths read from conf/prop files > 3. Inconsistent means of merging / overriding values from different sources > (some get overridden by file, others by manual settings of a field on an object, > some by properties) > 4. Argument validation should be done after combining config files, system > properties and command line arguments > 5. Alternate conf file location not handled in shell scripts > 6. Some options can only be passed as command line arguments > 7. Defaults for options are hard-coded (and sometimes overridden multiple > times) in many places throughout the code, e.g. master = local[*] > Initial proposal is to use Typesafe Config to read in the config information > and merge the various config sources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
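A minimal sketch of the merge order argued for above, using java.util.Properties so the full property-file format is honored; the file names and precedence here are illustrative:
{code:scala}
import java.io.FileReader
import java.util.Properties
import scala.collection.JavaConverters._

// Load the properties file with the standard reader, then let system
// properties and command-line arguments override it; validate the combined
// result once, after merging.
def mergedConf(propFile: String, cliOverrides: Map[String, String]): Map[String, String] = {
  val props = new Properties()
  val reader = new FileReader(propFile)
  try props.load(reader) finally reader.close()

  val fromFile = props.asScala.toMap
  val fromSystem = sys.props.toMap.filter { case (k, _) => k.startsWith("spark.") }
  fromFile ++ fromSystem ++ cliOverrides // later sources win
}
{code}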
[jira] [Created] (SPARK-3651) Consolidate executor maps in CoarseGrainedSchedulerBackend
Andrew Or created SPARK-3651: Summary: Consolidate executor maps in CoarseGrainedSchedulerBackend Key: SPARK-3651 URL: https://issues.apache.org/jira/browse/SPARK-3651 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or In CoarseGrainedSchedulerBackend, we have: {code} private val executorActor = new HashMap[String, ActorRef] private val executorAddress = new HashMap[String, Address] private val executorHost = new HashMap[String, String] private val freeCores = new HashMap[String, Int] private val totalCores = new HashMap[String, Int] {code} We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
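A hedged sketch of the proposed consolidation; the case class name and fields are illustrative:
{code:scala}
import akka.actor.{ActorRef, Address}
import scala.collection.mutable.HashMap

// One record per executor keeps inserts and removals atomic across all of
// the per-executor state that the five maps above track separately.
private case class ExecutorData(
    actor: ActorRef,
    address: Address,
    host: String,
    var freeCores: Int,
    totalCores: Int)

private val executorDataMap = new HashMap[String, ExecutorData]
{code}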
[jira] [Updated] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3032: --- Priority: Critical (was: Major) > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3032: --- Target Version/s: 1.2.0 > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Target Version/s: 1.2.0 > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Critical > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Priority: Blocker (was: Critical) > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Fix Version/s: (was: 1.2.0) > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Critical > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3647) Shaded Guava patch causes access issues with package private classes
[ https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143968#comment-14143968 ] Apache Spark commented on SPARK-3647: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2496 > Shaded Guava patch causes access issues with package private classes > > > Key: SPARK-3647 > URL: https://issues.apache.org/jira/browse/SPARK-3647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > > The patch that introduced shading to Guava (SPARK-2848) tried to maintain > backwards compatibility in the Java API by not relocating the "Optional" > class. That causes problems when that class references package private > members in the Absent and Present classes, which are now in a different > package: > {noformat} > Exception in thread "main" java.lang.IllegalAccessError: tried to access > class org.spark-project.guava.common.base.Present from class > com.google.common.base.Optional > at com.google.common.base.Optional.of(Optional.java:86) > at > org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25) > at > org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
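The call in the stack trace is enough to reproduce the error against an affected build; a minimal trigger, assuming a running JavaSparkContext:
{code:scala}
import org.apache.spark.api.java.JavaSparkContext

// getSparkHome() returns a com.google.common.base.Optional; constructing it
// touches the package-private Present class, which shading relocated to
// org.spark-project.guava.common.base, hence the IllegalAccessError.
def repro(jsc: JavaSparkContext): Unit = {
  jsc.getSparkHome()
}
{code}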
[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143954#comment-14143954 ] Apache Spark commented on SPARK-3650: - User 'jegonzal' has created a pull request for this issue: https://github.com/apache/spark/pull/2495 > Triangle Count handles reverse edges incorrectly > > > Key: SPARK-3650 > URL: https://issues.apache.org/jira/browse/SPARK-3650 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Joseph E. Gonzalez > > The triangle count implementation assumes that edges are aligned in a > canonical direction. As stated in the documentation: > bq. Note that the input graph should have its edges in canonical direction > (i.e. the `sourceId` less than `destId`) > However the TriangleCount algorithm does not verify that this condition holds > and indeed even the unit tests exploits this functionality: > {code:scala} > val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ > Array(0L -> -1L, -1L -> -2L, -2L -> 0L) > val rawEdges = sc.parallelize(triangles, 2) > val graph = Graph.fromEdgeTuples(rawEdges, true).cache() > val triangleCount = graph.triangleCount() > val verts = triangleCount.vertices > verts.collect().foreach { case (vid, count) => > if (vid == 0) { > assert(count === 4) // <-- Should be 2 > } else { > assert(count === 2) // <-- Should be 1 > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph E. Gonzalez updated SPARK-3650: -- Description: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code} was: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code:scala} > Triangle Count handles reverse edges incorrectly > > > Key: SPARK-3650 > URL: https://issues.apache.org/jira/browse/SPARK-3650 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Joseph E. Gonzalez > > The triangle count implementation assumes that edges are aligned in a > canonical direction. As stated in the documentation: > bq. Note that the input graph should have its edges in canonical direction > (i.e. the `sourceId` less than `destId`) > However the TriangleCount algorithm does not verify that this condition holds > and indeed even the unit tests exploits this functionality: > {code:scala} > val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ > Array(0L -> -1L, -1L -> -2L, -2L -> 0L) > val rawEdges = sc.parallelize(triangles, 2) > val graph = Graph.fromEdgeTuples(rawEdges, true).cache() > val triangleCount = graph.triangleCount() > val verts = triangleCount.vertices > verts.collect().foreach { case (vid, count) => > if (vid == 0) { > assert(count === 4) // <-- Should be 2 > } else { > assert(count === 2) // <-- Should be 1 > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph E. Gonzalez updated SPARK-3650: -- Description: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code:scala} was: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: ``` Note that the input graph should have its edges in canonical direction * (i.e. the `sourceId` less than `destId`) ``` However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: ~~~ val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } ~~~ > Triangle Count handles reverse edges incorrectly > > > Key: SPARK-3650 > URL: https://issues.apache.org/jira/browse/SPARK-3650 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Joseph E. Gonzalez > > The triangle count implementation assumes that edges are aligned in a > canonical direction. As stated in the documentation: > bq. Note that the input graph should have its edges in canonical direction > (i.e. the `sourceId` less than `destId`) > However the TriangleCount algorithm does not verify that this condition holds > and indeed even the unit tests exploits this functionality: > {code:scala} > val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ > Array(0L -> -1L, -1L -> -2L, -2L -> 0L) > val rawEdges = sc.parallelize(triangles, 2) > val graph = Graph.fromEdgeTuples(rawEdges, true).cache() > val triangleCount = graph.triangleCount() > val verts = triangleCount.vertices > verts.collect().foreach { case (vid, count) => > if (vid == 0) { > assert(count === 4) // <-- Should be 2 > } else { > assert(count === 2) // <-- Should be 1 > } > } > {code:scala} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3650) Triangle Count handles reverse edges incorrectly
Joseph E. Gonzalez created SPARK-3650: - Summary: Triangle Count handles reverse edges incorrectly Key: SPARK-3650 URL: https://issues.apache.org/jira/browse/SPARK-3650 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Joseph E. Gonzalez The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: ``` Note that the input graph should have its edges in canonical direction * (i.e. the `sourceId` less than `destId`) ``` However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: ~~~ val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } ~~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
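Until the algorithm enforces its precondition, a hedged workaround sketch (this is not the fix in the linked pull request): canonicalize and deduplicate the edges before counting.
{code:scala}
import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Orient every edge so srcId < dstId and drop duplicates, so the documented
// precondition of triangleCount actually holds for the input graph.
def canonicalTriangleCount[VD, ED](graph: Graph[VD, ED]): Graph[Int, Int] = {
  val canonicalEdges = graph.edges
    .map(e => if (e.srcId < e.dstId) (e.srcId, e.dstId) else (e.dstId, e.srcId))
    .distinct()
  Graph.fromEdgeTuples(canonicalEdges, defaultValue = 1)
    .partitionBy(PartitionStrategy.RandomVertexCut)
    .triangleCount()
}
{code}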
[jira] [Commented] (SPARK-1720) use LD_LIBRARY_PATH instead of -Djava.library.path
[ https://issues.apache.org/jira/browse/SPARK-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143930#comment-14143930 ] Patrick Wendell commented on SPARK-1720: Another user reported this issue, so let's try to get it into Spark 1.2. > use LD_LIBRARY_PATH instead of -Djava.library.path > -- > > Key: SPARK-1720 > URL: https://issues.apache.org/jira/browse/SPARK-1720 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Guoqiang Li >Priority: Critical > > I think it would be better to use LD_LIBRARY_PATH rather than > -Djava.library.path. Once java.library.path is set, it doesn't search > LD_LIBRARY_PATH. In Hadoop we switched to using LD_LIBRARY_PATH instead of > java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
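A hedged sketch of the launcher-side change being proposed; the helper below is illustrative, not the actual spark-class or YARN launcher code:
{code:scala}
// Prepend the native-library directory to the child's LD_LIBRARY_PATH rather
// than passing -Djava.library.path, so a user-provided LD_LIBRARY_PATH is
// still searched by the JVM.
def childEnvWithNativeLibs(baseEnv: Map[String, String], nativeLibDir: String): Map[String, String] = {
  val merged = baseEnv.get("LD_LIBRARY_PATH") match {
    case Some(existing) if existing.nonEmpty => s"$nativeLibDir:$existing"
    case _ => nativeLibDir
  }
  baseEnv + ("LD_LIBRARY_PATH" -> merged)
}
{code}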
[jira] [Updated] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
[ https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3649: -- Description: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would require writing a tag byte. Therefore it may be better to simply remove the custom serializers. was: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would require writing a tag byte. 
> ClassCastException in GraphX custom serializers when sort-based shuffle spills > -- > > Key: SPARK-3649 > URL: https://issues.apache.org/jira/browse/SPARK-3649 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > As > [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] > on the mailing list, GraphX throws > {code} > java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 > at > org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) > > at > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) > > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) > {code} > when sort-based shuffle attempts to spill to disk. This is because GraphX > defines custom serializers for shuffling pair RDDs that assume Spark will > always serialize the entire pair object rather than breaking it up into its > components. However, the spill code path in sort-based shuffle [violates this > assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > GraphX uses the custom serializers to compress vertex ID keys using > variable-length integer encoding. However, since the serializer can no longer > rely on the key and value being serialized and deserialized together, > performing such encoding would require writing a tag byte. Therefore it may > be better to simply remove the custom serializers. -- This message was sent by
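The assumption being violated is easier to see in code. Below is a simplified, hypothetical sketch of a pair-oriented write path like the one GraphX's custom serializers assume, contrasted with what the spill path effectively does; it is not GraphX's actual Serializers.scala.

{code}
// A simplified, hypothetical sketch of the mismatch described above -- not GraphX's
// actual serializer code. The pair-oriented stream assumes every writeObject call
// receives a whole (key, value) tuple, so a spill path that writes the key and value
// separately hands it a bare Long key and the cast to Tuple2 fails.
import java.io.{ByteArrayOutputStream, DataOutputStream}

class PairOnlyStream(out: DataOutputStream) {
  def writeObject[T](t: T): Unit = {
    // Assumes t is always a (vertexId, message) pair, mirroring the failing cast
    // at Serializers.scala:39.
    val (vid, msg) = t.asInstanceOf[(Long, Int)]
    out.writeLong(vid)
    out.writeInt(msg)
  }
}

val stream = new PairOnlyStream(new DataOutputStream(new ByteArrayOutputStream()))
stream.writeObject((42L, 7))   // fine: the whole pair is serialized together
// stream.writeObject(42L)     // what the spill path effectively does: ClassCastException
{code}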
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143933#comment-14143933 ] bc Wong commented on SPARK-3621: I think this is for the case of a map-side join where one of the tables is small. [~xuefuz], if the driver is running in the cluster, then RDD.collect() means it reads from HDFS and then broadcasts the data to everyone. Right? That seems reasonable. I don't see another way to "broadcast" something. Alternatively, it's probably better for each executor to individually read that small HDFS file into its memory. > Provide a way to broadcast an RDD (instead of just a variable made of the > RDD) so that a job can access > --- > > Key: SPARK-3621 > URL: https://issues.apache.org/jira/browse/SPARK-3621 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Xuefu Zhang > > In some cases, such as Hive's way of doing map-side join, it would be > beneficial to allow the client program to broadcast RDDs rather than just > variables made of these RDDs. Broadcasting a variable made of RDDs requires > all RDD data be collected to the driver and that the variable be shipped to > the cluster after being made. It would perform better if the driver just > broadcast the RDDs and used the corresponding data in jobs (such as building > hashmaps at executors). > Tez has a broadcast edge which can ship data from the previous stage to the > next stage, which doesn't require driver-side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
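For context, the status quo the issue describes looks roughly like the following sketch of a map-side join: the small table is collected to the driver and re-shipped as a broadcast variable, which is the round trip the proposal wants to avoid. Table paths, schemas, and names are illustrative, not taken from Hive on Spark.

{code}
// A sketch of the current approach the issue describes (paths and types are
// illustrative): the small table is collected to the driver and re-shipped as a
// broadcast variable instead of being broadcast as an RDD directly.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("map-side-join-sketch").setMaster("local[*]"))

val bigTable   = sc.textFile("hdfs:///warehouse/big").map(line => (line.split(",")(0), line))
val smallTable = sc.textFile("hdfs:///warehouse/small").map(line => (line.split(",")(0), line))

// Driver-side collect + broadcast: the map is built once and shipped to all executors.
val smallMap = sc.broadcast(smallTable.collectAsMap())

// Map-side join: bigTable is never shuffled, each executor probes the broadcast map.
val joined = bigTable.flatMap { case (key, row) =>
  smallMap.value.get(key).map(smallRow => (key, (row, smallRow)))
}
{code}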
[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143921#comment-14143921 ] Xuefu Zhang commented on SPARK-3622: They are related but not exactly the same. SPARK-2688 is about branching off the RDD tree with no custom transformation involved. This JIRA is about returning multiple RDDs from a single transformation (branching happening within a transformation). > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return just one RDD at most, even for those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs. > While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot. > This is especially true when the user's existing function is already generating > different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
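The workaround mentioned in the description (transforming the same RDD multiple times) looks roughly like the sketch below; the data and predicates are illustrative only.

{code}
// A sketch of today's workaround: to split one RDD into several, the input is cached
// and filtered once per output, i.e. the user function runs one pass per output RDD
// instead of producing all outputs in a single pass.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("multi-output-sketch").setMaster("local[*]"))
val records = sc.parallelize(1 to 1000000).cache()

// Two passes over the same cached data -- what a multi-output transformation would
// fold into one pass.
val evens = records.filter(_ % 2 == 0)
val odds  = records.filter(_ % 2 != 0)
{code}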
[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3606: - Affects Version/s: 1.1.0 > Spark-on-Yarn AmIpFilter does not work with Yarn HA. > > > Key: SPARK-3606 > URL: https://issues.apache.org/jira/browse/SPARK-3606 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > The current IP filter only considers one of the RMs in an HA setup. If the > active RM is not the configured one, you get a "connection refused" error > when clicking on the Spark AM links in the RM UI. > Similar to YARN-1811, but for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
[ https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3649: -- Description: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would require writing a tag byte. was: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > ClassCastException in GraphX custom serializers when sort-based shuffle spills > -- > > Key: SPARK-3649 > URL: https://issues.apache.org/jira/browse/SPARK-3649 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > As > [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] > on the mailing list, GraphX throws > {code} > java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 > at > org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) > > at > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) > > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) > {code} > when sort-based shuffle attempts to spill to disk. 
This is because GraphX > defines custom serializers for shuffling pair RDDs that assume Spark will > always serialize the entire pair object rather than breaking it up into its > components. However, the spill code path in sort-based shuffle [violates this > assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > GraphX uses the custom serializers to compress vertex ID keys using > variable-length integer encoding. However, since the serializer can no longer > rely on the key and value being serialized and deserialized together, > performing such encoding would require writing a tag byte. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1720) use LD_LIBRARY_PATH instead of -Djava.library.path
[ https://issues.apache.org/jira/browse/SPARK-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1720: --- Priority: Critical (was: Major) Target Version/s: 1.2.0 > use LD_LIBRARY_PATH instead of -Djava.library.path > -- > > Key: SPARK-1720 > URL: https://issues.apache.org/jira/browse/SPARK-1720 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Guoqiang Li >Priority: Critical > > I think it would be better to use LD_LIBRARY_PATH rather than > -Djava.library.path. Once java.library.path is set, it doesn't search > LD_LIBRARY_PATH. In Hadoop we switched to using LD_LIBRARY_PATH instead of > java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
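The behaviour the issue relies on is that once -Djava.library.path is set, the JVM does not consult LD_LIBRARY_PATH. The following is a hypothetical sketch, not Spark's launcher code, of launching a child JVM that inherits an extended LD_LIBRARY_PATH instead; paths and class names are placeholders.

{code}
// A hypothetical sketch (not Spark's launcher code): extend LD_LIBRARY_PATH for the
// child JVM and omit -Djava.library.path entirely, so native libraries can be layered
// from both the parent's path and the extra directory.
import scala.sys.process._

val nativeLibDir = "/opt/hadoop/lib/native"  // placeholder directory
val extendedPath = nativeLibDir + ":" + sys.env.getOrElse("LD_LIBRARY_PATH", "")

// The child JVM inherits LD_LIBRARY_PATH; System.loadLibrary resolves against it.
Process(Seq("java", "-cp", "app.jar", "com.example.Main"), None,
  "LD_LIBRARY_PATH" -> extendedPath).!
{code}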
[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143908#comment-14143908 ] Sandy Ryza commented on SPARK-3622: --- Is this a duplicate of SPARK-2688? > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return just one RDD at most, even for those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs. > While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot. > This is especially true when the user's existing function is already generating > different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143895#comment-14143895 ] Grega Kespret commented on SPARK-2620: -- We have this issue on Spark 1.1.0. > case class cannot be used as key for reduce > --- > > Key: SPARK-2620 > URL: https://issues.apache.org/jira/browse/SPARK-2620 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 > Environment: reproduced on spark-shell local[4] >Reporter: Gerard Maas >Priority: Critical > Labels: case-class, core > > Using a case class as a key doesn't seem to work properly on Spark 1.0.0 > A minimal example: > case class P(name:String) > val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) > sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect > [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), > (P(bob),1), (P(abe),1), (P(charly),1)) > In contrast to the expected behavior, that should be equivalent to: > sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect > Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) > groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143898#comment-14143898 ] RJ Nowling commented on SPARK-3614: --- It could lead to over-fitting and thus mis-predictions. In such cases, it may be valuable to exclude overly-specific terms. > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
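The formula in the description, with the proposed cutoff applied, can be expressed as a small plain-Scala sketch; this is not MLlib's IDF implementation.

{code}
// A plain-Scala sketch of the formula in the description with the proposed cutoff
// applied -- not MLlib's IDF code. Terms appearing in fewer than minimumOccurance
// documents get an IDF of 0 and so drop out of TF-IDF vectors.
def idf(numDocs: Long, docFreq: Long, minimumOccurance: Long): Double =
  if (docFreq >= minimumOccurance) math.log((numDocs + 1.0) / (docFreq + 1.0)) else 0.0

// Example: a term seen in only 2 of 1000 documents is ignored with a cutoff of 5.
val kept    = idf(numDocs = 1000, docFreq = 50, minimumOccurance = 5)  // ~ log(1001/51)
val dropped = idf(numDocs = 1000, docFreq = 2,  minimumOccurance = 5)  // 0.0
{code}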
[jira] [Created] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
Ankur Dave created SPARK-3649: - Summary: ClassCastException in GraphX custom serializers when sort-based shuffle spills Key: SPARK-3649 URL: https://issues.apache.org/jira/browse/SPARK-3649 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3648) Provide a script for fetching remote PR's for review
[ https://issues.apache.org/jira/browse/SPARK-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3648: --- Issue Type: New Feature (was: Bug) > Provide a script for fetching remote PR's for review > > > Key: SPARK-3648 > URL: https://issues.apache.org/jira/browse/SPARK-3648 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > I've found it's useful to have a small utility script for fetching specific > pull requests locally when doing reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3648) Provide a script for fetching remote PR's for review
Patrick Wendell created SPARK-3648: -- Summary: Provide a script for fetching remote PR's for review Key: SPARK-3648 URL: https://issues.apache.org/jira/browse/SPARK-3648 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell I've found it's useful to have a small utility script for fetching specific pull requests locally when doing reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-2848: -- Fix Version/s: (was: 1.1.0) 1.2.0 > Shade Guava in Spark deliverables > - > > Key: SPARK-2848 > URL: https://issues.apache.org/jira/browse/SPARK-2848 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.2.0 > > > As discussed in SPARK-2420, this task covers the work of shading Guava in > Spark deliverables so that they don't conflict with the Hadoop classpath (nor > user's classpath). > Since one Guava class is exposed through Spark's API, that class will be > forked from 14.0.1 (current version used by Spark) and excluded from any > shading. > The end result is that Spark's Guava won't be exposed to users anymore. This > has the side-effect of effectively downgrading to version 11 (the one used by > Hadoop) for those that do not explicitly depend on / package Guava with their > apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143885#comment-14143885 ] Marcelo Vanzin commented on SPARK-2848: --- Yes, that's right, this was pushed onto master after 1.1 branched. > Shade Guava in Spark deliverables > - > > Key: SPARK-2848 > URL: https://issues.apache.org/jira/browse/SPARK-2848 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.2.0 > > > As discussed in SPARK-2420, this task covers the work of shading Guava in > Spark deliverables so that they don't conflict with the Hadoop classpath (nor > user's classpath). > Since one Guava class is exposed through Spark's API, that class will be > forked from 14.0.1 (current version used by Spark) and excluded from any > shading. > The end result is that Spark's Guava won't be exposed to users anymore. This > has the side-effect of effectively downgrading to version 11 (the one used by > Hadoop) for those that do not explicitly depend on / package Guava with their > apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143887#comment-14143887 ] Liquan Pei commented on SPARK-3614: --- To me, the fewer documents a term appears in, the more important the idf part of tf*idf. Why would we ignore these terms in the idf computation? Is there a use case? Thanks! > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143877#comment-14143877 ] Thomas Graves commented on SPARK-2848: -- [~vanzin] [~pwendell] I think the fix version on this is wrong, I don't see this in 1.1.0, I only see it in 1.2.0, can you confirm? > Shade Guava in Spark deliverables > - > > Key: SPARK-2848 > URL: https://issues.apache.org/jira/browse/SPARK-2848 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.1.0 > > > As discussed in SPARK-2420, this task covers the work of shading Guava in > Spark deliverables so that they don't conflict with the Hadoop classpath (nor > user's classpath). > Since one Guava class is exposed through Spark's API, that class will be > forked from 14.0.1 (current version used by Spark) and excluded from any > shading. > The end result is that Spark's Guava won't be exposed to users anymore. This > has the side-effect of effectively downgrading to version 11 (the one used by > Hadoop) for those that do not explicitly depend on / package Guava with their > apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143835#comment-14143835 ] Sean Owen commented on SPARK-3431: -- For your experiments, scalatest just copies an old subset of surefire's config: http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin vs http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html You can see discussion of how forkMode works: http://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html Bad news is that scalatest's support is much more limited, but parallel=true and forkMode=once might do the trick. Otherwise... I guess we can figure out if it's realistic to use standard surefire instead of scalatest. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143820#comment-14143820 ] Apache Spark commented on SPARK-3614: - User 'rnowling' has created a pull request for this issue: https://github.com/apache/spark/pull/2494 > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143814#comment-14143814 ] Mark Hamstra commented on SPARK-2321: - Which would be kind of the opposite half of the SparkListenerJobStart event, which includes an array of the StageIds in a Job. I included that way back when as a suggestion of at least some of what might be needed to implement better job-based progress reporting. I'd have to look, but I don't believe anything is actually using that stage-reporting on JobStart right now. In any event, any proper progress reporting should rationalize, extend or eliminate that part of SparkListenerJobStart. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result
[ https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph E. Gonzalez resolved SPARK-3578. --- Resolution: Fixed Fix Version/s: 1.2.0 Resolved by https://github.com/apache/spark/pull/2439 > GraphGenerators.sampleLogNormal sometimes returns too-large result > -- > > Key: SPARK-3578 > URL: https://issues.apache.org/jira/browse/SPARK-3578 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Minor > Fix For: 1.2.0 > > > GraphGenerators.sampleLogNormal is supposed to return an integer strictly > less than maxVal. However, it violates this guarantee. It generates its > return value as follows: > {code} > var X: Double = maxVal > while (X >= maxVal) { > val Z = rand.nextGaussian() > X = math.exp(mu + sigma*Z) > } > math.round(X.toFloat) > {code} > When X is sampled to be close to (but less than) maxVal, then it will pass > the while loop condition, but the rounded result will be equal to maxVal, > which will fail the test. > For example, if maxVal is 5 and X is 4.9, then X < maxVal, but > math.round(X.toFloat) is 5. > A solution is to round X down instead of to the nearest integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
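The fix direction described in the report (rounding down rather than to the nearest integer) can be sketched as follows; this is an illustration, not necessarily the exact code merged in pull request 2439.

{code}
// A sketch of the fix direction described above: round down instead of to the nearest
// integer. With flooring, an X of 4.9 and a maxVal of 5 yields 4, so the result stays
// strictly below maxVal.
import java.util.Random

def sampleLogNormal(mu: Double, sigma: Double, maxVal: Int, rand: Random): Int = {
  var x: Double = maxVal
  while (x >= maxVal) {
    val z = rand.nextGaussian()
    x = math.exp(mu + sigma * z)
  }
  math.floor(x).toInt
}
{code}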
[jira] [Commented] (SPARK-3647) Shaded Guava patch causes access issues with package private classes
[ https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143796#comment-14143796 ] Marcelo Vanzin commented on SPARK-3647: --- There are two options I see here: - extend the hack to also not relocate the affected classes (Absent and Present should be enough) - fork some code from Guava and modify it to avoid the issue. I'll go out on a limb and say the first option is easier. > Shaded Guava patch causes access issues with package private classes > > > Key: SPARK-3647 > URL: https://issues.apache.org/jira/browse/SPARK-3647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > > The patch that introduced shading to Guava (SPARK-2848) tried to maintain > backwards compatibility in the Java API by not relocating the "Optional" > class. That causes problems when that class references package private > members in the Absent and Present classes, which are now in a different > package: > {noformat} > Exception in thread "main" java.lang.IllegalAccessError: tried to access > class org.spark-project.guava.common.base.Present from class > com.google.common.base.Optional > at com.google.common.base.Optional.of(Optional.java:86) > at > org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25) > at > org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
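The visibility rule behind the IllegalAccessError can be illustrated with a minimal Scala analogue of Java's package-private access; this is not Guava's source, and the package names are made up.

{code}
// A minimal analogue (not Guava's source) of the rule behind the error: Present's
// constructor is only visible inside its own package, so an Optional left in the
// original, unrelocated package can no longer construct the relocated Present. At
// runtime the same rule surfaces as IllegalAccessError because Optional was compiled
// against the pre-relocation package name.
package org.sparkproject.shaded {
  class Present[T] private[shaded] (val ref: T)   // package-scoped constructor
}

package com.example.unshaded {
  object Optional {
    // Does not compile here -- the constructor is private to org.sparkproject.shaded:
    // def of[T](ref: T) = new org.sparkproject.shaded.Present[T](ref)
  }
}
{code}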
[jira] [Created] (SPARK-3647) Shaded Guava patch causes access issues with package private classes
Marcelo Vanzin created SPARK-3647: - Summary: Shaded Guava patch causes access issues with package private classes Key: SPARK-3647 URL: https://issues.apache.org/jira/browse/SPARK-3647 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Priority: Critical The patch that introduced shading to Guava (SPARK-2848) tried to maintain backwards compatibility in the Java API by not relocating the "Optional" class. That causes problems when that class references package private members in the Absent and Present classes, which are now in a different package: {noformat} Exception in thread "main" java.lang.IllegalAccessError: tried to access class org.spark-project.guava.common.base.Present from class com.google.common.base.Optional at com.google.common.base.Optional.of(Optional.java:86) at org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25) at org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143792#comment-14143792 ] Aaron Davidson commented on SPARK-3032: --- [~matei] any thoughts on this issue? > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > with (String, String) as the key and value data types, we always hit this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > It seems the current partitionKeyComparator, which uses the hashcode of the String > as the key comparator, breaks some sorting contracts. > I also tested using Int as the key data type, and the test passes, since the > hashcode of an Int is itself. So I think partitionDiff + hashcode of String may > potentially break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
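As a generic illustration of what "Comparison method violates its general contract" means (and explicitly not Spark's partitionKeyComparator): TimSort, used by java.util.Arrays.sort and by Spark's Sorter, throws when it detects that a comparator is not a consistent total order.

{code}
// A generic, deliberately broken comparator illustrating the error message above.
// This is not Spark's partitionKeyComparator.
import java.util.{Arrays, Comparator}

val inconsistent = new Comparator[Integer] {
  // Broken on purpose: the sign of compare(a, b) is not always the negation of
  // compare(b, a), which violates the Comparator contract.
  override def compare(a: Integer, b: Integer): Int =
    if ((a + b) % 3 == 0) 1 else a.compareTo(b)
}

val data: Array[Integer] = Array.tabulate(100000)(i => Integer.valueOf((i * 31) % 977))
// May throw "Comparison method violates its general contract!" -- TimSort only detects
// the violation when a merge happens to expose it.
Arrays.sort(data, inconsistent)
{code}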
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143775#comment-14143775 ] Hari Shreedharan commented on SPARK-3129: - I did multiple rounds of testing and it looks like the average total rate for writing and flushing is around 100 MB/s. There are a couple of outliers, but that is likely due to flakey networking on EC2. Barring the one outlier, the least I got was 79 MB/s and the max was 142 MB/s, but most were near 100. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143750#comment-14143750 ] Yin Huai commented on SPARK-3641: - No, I have not started. I can start after your caching stuff is in. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143744#comment-14143744 ] Nicholas Chammas commented on SPARK-3431: - I see. I'll try to look into it then. I don't know much about Maven, frankly, but this sounds doable for the relative n00b. Since for starters we're just gonna try parallelizing the execution of entire test suites, we may not need to make many modifications to the tests upfront. We'll see. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143737#comment-14143737 ] Michael Armbrust commented on SPARK-3641: - Hey [~yhuai] have you started on this yet? I think the addition of a logical plan for existing RDD is going to conflict with some work on caching that I'm doing. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3270) Spark API for Application Extensions
[ https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143732#comment-14143732 ] Michal Malohlava commented on SPARK-3270: - Hi Patrick, you are right - in the case of independent components, we can initialize them lazily with a task. Nevertheless, if all components inside all Executors need to share common knowledge, then lazy initialization is a little bit cumbersome. In this JIRA, we do not want to propose any heavy-weight generic discovery system, but just a lightweight way of running code inside the Spark infrastructure without modifying Spark core code (I would compare it to Linux kernel drivers). > Spark API for Application Extensions > > > Key: SPARK-3270 > URL: https://issues.apache.org/jira/browse/SPARK-3270 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Michal Malohlava > > Any application should be able to enrich the Spark infrastructure with services > which are not available by default. > Hence, to support such application extensions (aka "extensions"/"plugins") > Spark platform should provide: > - an API to register an extension > - an API to register a "service" (meaning provided functionality) > - well-defined points in Spark infrastructure which can be enriched/hooked > by an extension > - a way of deploying an extension (for example, simply putting the extension > on the classpath and using a Java service interface) > - a way to access an extension from the application > Overall proposal is available here: > https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing > Note: In this context, I do not mean reinventing OSGi (or another plugin > platform) but it can serve as a good starting point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
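The "Java service interface" deployment style mentioned in the proposal can be sketched with java.util.ServiceLoader; the SparkExtension trait below is hypothetical and not part of any proposed Spark API.

{code}
// A sketch of the "Java service interface" deployment style. The SparkExtension trait
// is hypothetical; implementations would be listed on the classpath under
// META-INF/services/<fully.qualified.SparkExtension> and discovered via ServiceLoader.
import java.util.ServiceLoader
import scala.collection.JavaConverters._

trait SparkExtension {
  def start(): Unit
}

def loadExtensions(): Seq[SparkExtension] =
  ServiceLoader.load(classOf[SparkExtension]).asScala.toSeq

// At a well-defined hook point inside the Spark infrastructure:
loadExtensions().foreach(_.start())
{code}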
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143724#comment-14143724 ] Sean Owen commented on SPARK-3431: -- It's trivial to configure Maven surefire/failsafe to execute tests in parallel. It can parallelize by class or method, fork or not, control the number of concurrent forks as a multiple of cores, etc. For example, it's no problem to make test classes use their own JVM, and not even reuse JVMs if you don't want. The harder part is making the tests play nice with each other on one machine when it comes to shared resources: files and ports, really. I think the tests have had several passes of improvements to reliably use their own temp space, and try to use an unused port, but this is one typical cause of test breakage. It's not yet clear that tests don't clobber each other by trying to use the same default Spark working dir or something. Finally, some tests that depend on a certain sequence of random numbers may need to be made more robust. But the parallelization is trivial in Maven, at least. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143720#comment-14143720 ] Nicholas Chammas commented on SPARK-2870: - [~marmbrus] - API-wise, how are you thinking of exposing this functionality when implemented? Would it make sense to add an additional input parameter like {{sampleFraction}} to {{SQLContext.inferSchema()}}? So, for example, if you want the inference to run on the whole RDD, you pass {{sampleFraction=1.0}}. And if you don't specify this parameter, it defaults to a very small fraction, or maybe even the current behavior of looking at just the first element. This could perhaps call {{RDD.sample()}} under the sheets. > Thorough schema inference directly on RDDs of Python dictionaries > - > > Key: SPARK-2870 > URL: https://issues.apache.org/jira/browse/SPARK-2870 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Nicholas Chammas > > h4. Background > I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. > They process JSON text directly and infer a schema that covers the entire > source data set. > This is very important with semi-structured data like JSON since individual > elements in the data set are free to have different structures. Matching > fields across elements may even have different value types. > For example: > {code} > {"a": 5} > {"a": "cow"} > {code} > To get a queryable schema that covers the whole data set, you need to infer a > schema by looking at the whole data set. The aforementioned > {{SQLContext.json...()}} methods do this very well. > h4. Feature Request > What we need is for {{SQlContext.inferSchema()}} to do this, too. > Alternatively, we need a new {{SQLContext}} method that works on RDDs of > Python dictionaries and does something functionally equivalent to this: > {code} > SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) > {code} > As of 1.0.2, > [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] > just looks at the first element in the data set. This won't help much when > the structure of the elements in the target RDD is variable. > h4. Example Use Case > * You have some JSON text data that you want to analyze using Spark SQL. > * You would use one of the {{SQLContext.json...()}} methods, but you need to > do some filtering on the data first to remove bad elements--basically, some > minimal schema validation. > * You deserialize the JSON objects to Python {{dict}} s and filter out the > bad ones. You now have an RDD of dictionaries. > * From this RDD, you want a SchemaRDD that captures the schema for the whole > data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143716#comment-14143716 ] Evan Chan commented on SPARK-3298: -- Sounds good, thanks! -Evan "Never doubt that a small group of thoughtful, committed citizens can change the world" - M. Mead > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143709#comment-14143709 ] Andrew Or commented on SPARK-3610: -- Hi all, I don't have the time to fix this, but this is where we generate the name for these event log files: https://github.com/apache/spark/blob/56dae30ca70489a62686cb245728b09b2179bb5a/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L61 I think we should try to keep the name of the application so the user can still associate which logs are with which application, and coming up with a random GUID makes this difficult. Maybe instead we should just escape more characters (there are only so many). > History server log name should not be based on user input > - > > Key: SPARK-3610 > URL: https://issues.apache.org/jira/browse/SPARK-3610 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: SK >Priority: Critical > > Right now we use the user-defined application name when creating the logging > file for the history server. We should use some type of GUID generated from > inside of Spark instead of allowing user input here. It can cause errors if > users provide characters that are not valid in filesystem paths. > Original bug report: > {quote} > The default log files for the MLlib examples use a rather long naming > convention that includes special characters like parentheses and commas. For > example, one of my log files is named > "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032". > When I click on the program on the history server page (at port 18080) to > view the detailed application logs, the history server crashes and I need to > restart it. I am using Spark 1.1 on a Mesos cluster. > I renamed the log file by removing the special characters and then it loads > up correctly. I am not sure which program is creating the log files. Can it > be changed so that the default log file naming convention does not include > special characters? > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
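The "escape more characters" option could look roughly like the hypothetical sketch below; this is not the actual EventLoggingListener code.

{code}
// A hypothetical sketch of the "escape more characters" option, not the actual
// EventLoggingListener code: keep the app name recognizable but replace anything
// unsafe in a filesystem path, then append the timestamp as before.
def sanitizedLogName(appName: String, timestamp: Long): String = {
  val safeName = appName.toLowerCase.replaceAll("[^a-z0-9._-]", "-")
  s"$safeName-$timestamp"
}

// e.g. the parentheses and commas in the problematic
// "binaryclassifier-with-params(...)" name from the report all become dashes.
{code}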
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143707#comment-14143707 ] Nicholas Chammas commented on SPARK-3431: - {quote} Do you know how maven / sbt plugins handle this? {quote} Not really. What I can do for starters is just experiment with GNU parallel and see how it works. {quote} The GNU parallel approach ... has the nice advantage of only affecting Jenkins {quote} Well, if we are modifying {{dev/run-tests}} then developers should also be able to use it locally. The contributing guide recommends running tests using that script. If we do go the GNU parallel route, we can have it trigger only if it detects GNU parallel on the host. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3646) Copy SQL options from the spark context
[ https://issues.apache.org/jira/browse/SPARK-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143693#comment-14143693 ] Apache Spark commented on SPARK-3646: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2493 > Copy SQL options from the spark context > --- > > Key: SPARK-3646 > URL: https://issues.apache.org/jira/browse/SPARK-3646 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143684#comment-14143684 ] Josh Rosen commented on SPARK-3431: --- [~nchammas] I'm not sure. The different test suites depend on the same build artifacts, but it looks like we call {{sbt assembly}} before running any tests. The GNU parallel approach would certainly be easy to implement and it has the nice advantage of only affecting Jenkins, but I have one concern about test reporting. How will output from tests be printed and will the test report XML files be generated at the same locations? It might be confusing to see the output of several test suites interleaved in an arbitrary way. Do you know how maven / sbt plugins handle this? > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3646) Copy SQL options from the spark context
Michael Armbrust created SPARK-3646: --- Summary: Copy SQL options from the spark context Key: SPARK-3646 URL: https://issues.apache.org/jira/browse/SPARK-3646 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2062) VertexRDD.apply does not use the mergeFunc
[ https://issues.apache.org/jira/browse/SPARK-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-2062. --- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Larry Xiao (was: Ankur Dave) Resolved by https://github.com/apache/spark/pull/1903 > VertexRDD.apply does not use the mergeFunc > -- > > Key: SPARK-2062 > URL: https://issues.apache.org/jira/browse/SPARK-2062 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Ankur Dave >Assignee: Larry Xiao > Fix For: 1.1.1, 1.2.0 > > > Here: > https://github.com/apache/spark/blob/b1feb60209174433262de2a26d39616ba00edcc8/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L410 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
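As an aside for readers of this thread, the merge semantics the {{mergeFunc}} parameter is meant to provide can be expressed with plain RDD operations. The sketch below only illustrates those semantics: it uses a bare {{RDD[(Long, VD)]}} rather than GraphX's VertexRDD internals, and the helper name is invented.
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Combine duplicate vertex entries with a user-supplied merge function
// instead of dropping one of them arbitrarily.
def dedupVertices[VD: ClassTag](
    vertices: RDD[(Long, VD)],
    mergeFunc: (VD, VD) => VD): RDD[(Long, VD)] =
  vertices.reduceByKey(mergeFunc)
{code}
For example, {{dedupVertices(verts, (a: Int, b: Int) => a + b)}} sums the attributes of duplicate vertex ids, which is the behavior the {{mergeFunc}} argument is expected to give.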
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143656#comment-14143656 ] Nicholas Chammas commented on SPARK-3431: - [~joshrosen] I can take a crack at this in the next week or so if it's a simple matter of breaking up [this line|https://github.com/apache/spark/blob/56dae30ca70489a62686cb245728b09b2179bb5a/dev/run-tests#L170] into several invocations of {{sbt}} and parallelizing them with [GNU parallel|http://www.gnu.org/software/parallel/]. Would that work? I remember on the dev list we were discussing using some plugin to Maven to parallelize tests, but I don't know much about that at this time. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3645) Make caching using SQL commands eager by default, with the option of being lazy
Michael Armbrust created SPARK-3645: --- Summary: Make caching using SQL commands eager by default, with the option of being lazy Key: SPARK-3645 URL: https://issues.apache.org/jira/browse/SPARK-3645 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3634) Python modules added through addPyFile should take precedence over system modules
[ https://issues.apache.org/jira/browse/SPARK-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143641#comment-14143641 ] Apache Spark commented on SPARK-3634: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2492 > Python modules added through addPyFile should take precedence over system > modules > - > > Key: SPARK-3634 > URL: https://issues.apache.org/jira/browse/SPARK-3634 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.2, 1.1.0 >Reporter: Josh Rosen > > Python modules added through {{SparkContext.addPyFile()}} are currently > _appended_ to {{sys.path}}; this is probably the opposite of the behavior > that we want, since it causes system versions of modules to take precedence > over versions explicitly added by users. > To fix this, we should change the {{sys.path}} manipulation code in > {{context.py}} and {{worker.py}} to prepend files to {{sys.path}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143637#comment-14143637 ] Michael Armbrust commented on SPARK-3298: - I think the plan here is to add an allowExisting flag to registerTempTable that checks to see if the table exists and throws an exception. This flag will default to false. I'll add this as part of the work I'm going to fix our caching behavior. > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
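For readers following along, a minimal sketch of the proposed check. Note that {{allowExisting}} is not an existing parameter of {{registerTempTable}} in Spark 1.1; the helper and its in-memory registry below are invented purely to make the intended control flow concrete.
{code}
import scala.collection.mutable

object TempTableGuard {
  private val registered = mutable.Set[String]()

  // With allowExisting = false (the proposed default), registering a name
  // that is already taken fails loudly instead of silently shadowing the
  // old table.
  def register(name: String, doRegister: () => Unit,
      allowExisting: Boolean = false): Unit = {
    if (!allowExisting && registered.contains(name)) {
      throw new IllegalStateException(s"Temporary table '$name' already exists")
    }
    doRegister()        // e.g. () => schemaRDD.registerTempTable(name)
    registered += name
  }
}
{code}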
[jira] [Assigned] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3298: --- Assignee: Michael Armbrust > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3298: Target Version/s: 1.2.0 > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3644: -- Description: This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger was: This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Reporter: Josh Rosen > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, how > the API will be documented and tested, etc. > Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
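To make the discussion slightly more concrete, here is one possible, purely illustrative shape for a response model; the endpoint path and every field name below are assumptions for the sake of example, not part of any agreed-upon design.
{code}
// e.g. GET /api/applications/{appId}/jobs  (hypothetical endpoint)
case class StageSummary(stageId: Int, name: String, status: String)

case class JobSummary(
    jobId: Int,
    status: String,                 // e.g. "RUNNING", "SUCCEEDED", "FAILED"
    numTasks: Int,
    numCompletedTasks: Int,
    stages: Seq[StageSummary])      // stages contributing to this job
{code}
Questions such as field naming ({{numTasks}} vs {{num_tasks}}) and how the JSON layout is versioned and documented are exactly the points the discussion above is meant to settle.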
[jira] [Updated] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3644: -- Assignee: (was: Josh Rosen) > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Reporter: Josh Rosen > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, etc. > Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
Josh Rosen created SPARK-3644: - Summary: REST API for Spark application info (jobs / stages / tasks / storage info) Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Reporter: Josh Rosen This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3643) Add cluster-specific config settings to configuration page
Matei Zaharia created SPARK-3643: Summary: Add cluster-specific config settings to configuration page Key: SPARK-3643 URL: https://issues.apache.org/jira/browse/SPARK-3643 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Matei Zaharia This would make it easier to search a single page for these options. The downside is that we'd have to maintain them in 2 places (cluster-specific pages and this one). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2373) RDD add span function (split an RDD to two RDD based on user's function)
[ https://issues.apache.org/jira/browse/SPARK-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2373. --- Resolution: Won't Fix Resolving this as "Won't Fix", per discussion on the PR. [Matei said|https://github.com/apache/spark/pull/1306#issuecomment-53838250]: {quote} IMO this is too specialized to include. It's small enough that applications can do it themselves, but also fairly confusing unless your RDD is already sorted in some way. I think we should just leave it for applications to do it. If you are doing a skewed join operator for example, you can do it within the implementation of that but not show it to the user. {quote} > RDD add span function (split an RDD to two RDD based on user's function) > - > > Key: SPARK-2373 > URL: https://issues.apache.org/jira/browse/SPARK-2373 > Project: Spark > Issue Type: New Feature >Reporter: Yanjie Gao > > Splits this RDD into a prefix/suffix pair according to a predicate . > returns > a pair consisting of the longest prefix of this RDD whose elements all > satisfy p, and the rest of this list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
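For anyone who needs this in an application, the "do it yourself" version Matei refers to can be as small as the sketch below (the helper name is illustrative). Note the caveat from the discussion: this matches a true prefix/suffix split only when the elements satisfying the predicate actually form a prefix, e.g. the RDD is sorted on something the predicate is monotone in.
{code}
import org.apache.spark.rdd.RDD

// Two passes over the data: the part that satisfies p and the rest.
def spanRDD[T](rdd: RDD[T], p: T => Boolean): (RDD[T], RDD[T]) =
  (rdd.filter(p), rdd.filter(x => !p(x)))
{code}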
[jira] [Updated] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3629: - Labels: starter (was: ) > Improvements to YARN doc > > > Key: SPARK-3629 > URL: https://issues.apache.org/jira/browse/SPARK-3629 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Reporter: Matei Zaharia > Labels: starter > > Right now this doc starts off with a big list of config options, and only > then tells you how to submit an app. It would be better to put that part and > the packaging part first, and the config options only at the end. > In addition, the doc mentions yarn-cluster vs yarn-client as separate > masters, which is inconsistent with the help output from spark-submit (which > says to always use "yarn"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143566#comment-14143566 ] Yin Huai commented on SPARK-3641: - Sounds good. Let me fix it. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143559#comment-14143559 ] Josh Rosen commented on SPARK-3431: --- It would be great to address this soon, since several open PRs plan to add expensive new test suites (Hive integration tests, Selenium tests for the web UI, etc.). There are some thread-safety issues when running multiple SparkContexts in the same JVM, so for now we're restricted to running one test suite per JVM. However, I think we should be able to parallelize the execution of tests from different subprojects, e.g. by running Spark SQL tests in parallel with Spark Streaming tests (each using its own JVM). Our Jenkins cluster is pretty underutilized, so I don't think this will cause problems. We also recently increased the file descriptor ulimits, so this shouldn't cause any issues with port exhaustion, etc. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143540#comment-14143540 ] Michael Armbrust commented on SPARK-3641: - The idea here is to be able to support more than one SQL context, so I think we will always need to populate this field before constructing physical operators. To avoid bugs like this, it would be good to limit the number of places where physical plans are constructed. Right now it's kind of a hack that we use SparkLogicalPlan as a connector and manually create the physical ExistingRDD operator. If we instead had a true logical concept for ExistingRDDs then this bug would not have occurred. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
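For context, the pattern under discussion is a thread-local "current context" that must be populated before any object capturing it is constructed. The sketch below is a generic illustration with invented names, not Spark SQL's actual code.
{code}
object CurrentContext {
  private val holder = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }
  def set(ctx: String): Unit = holder.set(Some(ctx))
  def get(): String =
    holder.get().getOrElse(sys.error("current context was never populated on this thread"))
}

// Any constructor that captures the context must run after set(); a code path
// that forgets the set() call fails at construction time, which mirrors the
// NPE reported in this issue.
class PlanNode {
  val context: String = CurrentContext.get()
}
{code}
Populating the thread-local once, in the context's own constructor as proposed elsewhere in this thread, would remove the need for every entry point to remember the call.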
[jira] [Commented] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.
[ https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143550#comment-14143550 ] Apache Spark commented on SPARK-1655: - User 'staple' has created a pull request for this issue: https://github.com/apache/spark/pull/2491 > In naive Bayes, store conditional probabilities distributively. > --- > > Key: SPARK-1655 > URL: https://issues.apache.org/jira/browse/SPARK-1655 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng > > In the current implementation, we collect all conditional probabilities to > the driver node. When there are many labels and many features, this puts > heavy load on the driver. For scalability, we should provide a way to store > conditional probabilities distributively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
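A sketch of the idea in the description, not of MLlib's implementation: aggregate per-label feature counts with {{reduceByKey}} and keep the resulting conditional log-probabilities as an RDD keyed by label instead of collecting them to the driver. The input layout (label, per-feature counts) and the Laplace smoothing are assumptions for illustration.
{code}
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

def conditionalLogProbs(
    data: RDD[(Double, Array[Double])]): RDD[(Double, Array[Double])] =
  data
    // sum feature counts per label, still distributed
    .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })
    // turn counts into smoothed conditional log-probabilities, one record per label
    .mapValues { counts =>
      val total = counts.sum
      counts.map(c => math.log((c + 1.0) / (total + counts.length)))
    }
{code}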
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143517#comment-14143517 ] Adam Kawa commented on SPARK-3561: -- We also would be very interested in trying this out (especially for large, batch applications that we wish to run on Spark). > Native Hadoop/YARN integration for batch/ETL workloads > -- > > Key: SPARK-3561 > URL: https://issues.apache.org/jira/browse/SPARK-3561 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Oleg Zhurakousky > Labels: features > Fix For: 1.2.0 > > Attachments: SPARK-3561.pdf > > > Currently Spark provides integration with external resource-managers such as > Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the > current architecture of Spark-on-YARN can be enhanced to provide > significantly better utilization of cluster resources for large scale, batch > and/or ETL applications when run alongside other applications (Spark and > others) and services in YARN. > Proposal: > The proposed approach would introduce a pluggable JobExecutionContext (trait) > - a gateway and a delegate to Hadoop execution environment - as a non-public > api (@DeveloperAPI) not exposed to end users of Spark. > The trait will define 4 only operations: > * hadoopFile > * newAPIHadoopFile > * broadcast > * runJob > Each method directly maps to the corresponding methods in current version of > SparkContext. JobExecutionContext implementation will be accessed by > SparkContext via master URL as > "execution-context:foo.bar.MyJobExecutionContext" with default implementation > containing the existing code from SparkContext, thus allowing current > (corresponding) methods of SparkContext to delegate to such implementation. > An integrator will now have an option to provide custom implementation of > DefaultExecutionContext by either implementing it from scratch or extending > form DefaultExecutionContext. > Please see the attached design doc for more details. > Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
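A rough sketch of the trait described above, just to visualize the shape of the proposal. The signatures are deliberately simplified stand-ins; the real ones mirror the corresponding SparkContext methods and live in the attached design doc and the pull request.
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Pluggable gateway to the execution environment; the default implementation
// would keep SparkContext's existing behavior.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T],
      func: Iterator[T] => U): Array[U]
}
{code}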
[jira] [Comment Edited] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860 ] RJ Nowling edited comment on SPARK-3614 at 9/22/14 5:52 PM: Thanks, Andrew! I'll do that. was (Author: rnowling): Thanks, Andrew! I'll do that. -- em rnowl...@gmail.com c 954.496.2314 > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
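The proposed cutoff from the issue description can be stated compactly in code. This is only a sketch of the formula; {{minDocFreq}} is my name for the reporter's {{minimumOccurance}} parameter, not an existing MLlib API.
{code}
// docFreq(i) = number of documents containing term i; numDocs = |D|
def idfWithCutoff(docFreq: Array[Long], numDocs: Long,
    minDocFreq: Long): Array[Double] =
  docFreq.map { df =>
    if (df >= minDocFreq) math.log((numDocs + 1.0) / (df + 1.0)) else 0.0
  }
{code}
Terms seen in fewer than {{minDocFreq}} documents contribute nothing to the TF-IDF vectors, which is the effect the reporter is after.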
[jira] [Commented] (SPARK-3627) spark on yarn reports success even though job fails
[ https://issues.apache.org/jira/browse/SPARK-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143523#comment-14143523 ] Thomas Graves commented on SPARK-3627: -- this might be the same as SPARK-3293 > spark on yarn reports success even though job fails > --- > > Key: SPARK-3627 > URL: https://issues.apache.org/jira/browse/SPARK-3627 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a wordcount and saving the output to hdfs. If the output > directory already exists, yarn reports success even though the job fails > since it requires the output directory to not be there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143484#comment-14143484 ] Burak Yavuz commented on SPARK-3631: Thanks for setting this up [~aash]! [~pwendell], [~tdas], [~joshrosen] could you please confirm/correct/add to my explanation above. Thanks! > Add docs for checkpoint usage > - > > Key: SPARK-3631 > URL: https://issues.apache.org/jira/browse/SPARK-3631 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Andrew Ash > > We should include general documentation on using checkpoints. Right now the > docs only cover checkpoints in the Spark Streaming use case which is slightly > different from Core. > Some content to consider for inclusion from [~brkyvz]: > {quote} > If you set the checkpointing directory however, the intermediate state of the > RDDs will be saved in HDFS, and the lineage will pick off from there. > You won't need to keep the shuffle data before the checkpointed state, > therefore those can be safely removed (will be removed automatically). > However, checkpoint must be called explicitly as in > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 > ,just setting the directory will not be enough. > {quote} > {quote} > Yes, writing to HDFS is more expensive, but I feel it is still a small price > to pay when compared to having a Disk Space Full error three hours in > and having to start from scratch. > The main goal of checkpointing is to truncate the lineage. Clearing up > shuffle writes come as a bonus to checkpointing, it is not the main goal. The > subtlety here is that .checkpoint() is just like .cache(). Until you call an > action, nothing happens. Therefore, if you're going to do 1000 maps in a > row and you don't want to checkpoint in the meantime until a shuffle happens, > you will still get a StackOverflowError, because the lineage is too long. > I went through some of the code for checkpointing. As far as I can tell, it > materializes the data in HDFS, and resets all its dependencies, so you start > a fresh lineage. My understanding would be that checkpointing still should be > done every N operations to reset the lineage. However, an action must be > performed before the lineage grows too long. > {quote} > A good place to put this information would be at > https://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
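A minimal usage example of the points quoted above, assuming an existing SparkContext {{sc}} and an HDFS path the application can write to (the path below is only an example).
{code}
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()   // like cache(), this only marks the RDD; nothing runs yet
rdd.count()        // an action materializes the checkpoint and truncates the lineage
{code}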
[jira] [Updated] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3588: -- Assignee: Meethu Mathew > Gaussian Mixture Model clustering > - > > Key: SPARK-3588 > URL: https://issues.apache.org/jira/browse/SPARK-3588 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Meethu Mathew >Assignee: Meethu Mathew > Attachments: GMMSpark.py > > > Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM > models the entire data set as a finite mixture of Gaussian distributions, each > parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight > π. In this technique, probability of each point to belong to each cluster is > computed along with the cluster statistics. > We have come up with an initial distributed implementation of GMM in pyspark > where the parameters are estimated using the Expectation-Maximization > algorithm. Our current implementation considers diagonal covariance matrix for > each component. > We did an initial benchmark study on a 2 node Spark standalone cluster setup > where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. > We also evaluated python version of k-means available in spark on the same > datasets. > Below are the results from this benchmark study. The reported stats are > average from 10 runs. Tests were done on multiple datasets with varying number > of features and instances.
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || Kmeans (Python): avg time per iteration || Kmeans (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1475) Drain event logging queue before stopping event logger
[ https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-1475: - Summary: Drain event logging queue before stopping event logger (was: Draining event logging queue before stopping event logger) > Drain event logging queue before stopping event logger > -- > > Key: SPARK-1475 > URL: https://issues.apache.org/jira/browse/SPARK-1475 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Kan Zhang >Assignee: Kan Zhang >Priority: Blocker > Fix For: 1.0.0 > > > When stopping SparkListenerBus, its event queue needs to be drained. And this > needs to happen before event logger is stopped. Otherwise, any event still > waiting to be processed in the queue may be lost and consequently event log > file may be incomplete. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
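The ordering constraint in the description is the classic "drain before stop" pattern. The toy sketch below is a generic illustration with invented names, not Spark's listener-bus code.
{code}
import java.util.concurrent.LinkedBlockingQueue

class ToyEventLogger {
  def log(event: String): Unit = println(event)
  def stop(): Unit = ()          // e.g. flush and close the log file
}

def shutdown(queue: LinkedBlockingQueue[String], logger: ToyEventLogger): Unit = {
  // Drain first: every event still queued is handed to the logger...
  var event = queue.poll()
  while (event != null) {
    logger.log(event)
    event = queue.poll()
  }
  // ...and only then is the logger stopped, so no queued event is lost.
  logger.stop()
}
{code}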
[jira] [Commented] (SPARK-3627) spark on yarn reports success even though job fails
[ https://issues.apache.org/jira/browse/SPARK-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143438#comment-14143438 ] Thomas Graves commented on SPARK-3627: -- We could make this a separate issue, but I've also seen it report failure when it actually succeeded. In that case I believe it did an sc.stop() and System.exit(0). > spark on yarn reports success even though job fails > --- > > Key: SPARK-3627 > URL: https://issues.apache.org/jira/browse/SPARK-3627 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a wordcount and saving the output to hdfs. If the output > directory already exists, yarn reports success even though the job fails > since it requires the output directory to not be there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143425#comment-14143425 ] Josh Rosen commented on SPARK-2321: --- {quote} ... maybe we should redesign the SparkListener event API, and add job id info into Stage/Task event in Scheduler before post it to listener bus. {quote} A stage may be used by multiple jobs, so we'd have to think carefully about how the API should reflect this. It looks like DAGScheduler's internal {{Stage}} class tracks the id of the job that first submitted the stage, and {{activeJobForStage}} finds "the earliest-created active job that needs the stage." It might make sense to associate Stage/Task start events with the list of active jobs that depend on them. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
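One hypothetical way to reflect the "stage shared by multiple jobs" point in an event type; these names are illustrative only and are not part of the existing SparkListener API.
{code}
// A stage-level event carries every active job that currently needs the stage,
// rather than a single job id.
case class StageSubmitted(stageId: Int, attemptId: Int, activeJobIds: Seq[Int])
case class StageCompleted(stageId: Int, attemptId: Int, activeJobIds: Seq[Int])
{code}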
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Description: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} This limit is too strict , This makes it difficult to implement SPARK-3623 . was: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} SPARK-3623 > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} > This limit is too strict , This makes it difficult to implement SPARK-3623 . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3642) Better document the nuances of shared variables
[ https://issues.apache.org/jira/browse/SPARK-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143423#comment-14143423 ] Apache Spark commented on SPARK-3642: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2490 > Better document the nuances of shared variables > --- > > Key: SPARK-3642 > URL: https://issues.apache.org/jira/browse/SPARK-3642 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Description: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} SPARK-3623 was: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} > SPARK-3623 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143410#comment-14143410 ] Guoqiang Li commented on SPARK-3625: OK, it has been changed to an improvement. This limit is too strict; SPARK-3623 depends on relaxing it. > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Priority: Major (was: Blocker) > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Issue Type: Improvement (was: Bug) > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Blocker > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3642) Better document the nuances of shared variables
Sandy Ryza created SPARK-3642: - Summary: Better document the nuances of shared variables Key: SPARK-3642 URL: https://issues.apache.org/jira/browse/SPARK-3642 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143347#comment-14143347 ] Yin Huai commented on SPARK-3641: - [~marmbrus] Can we populate SparkPlan.currentContext in the constructor of SQLContext instead of populating it every time before using ExistingRDD? > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org