[jira] [Commented] (SPARK-3481) HiveComparisonTest throws exception of "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: default"
[ https://issues.apache.org/jira/browse/SPARK-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144421#comment-14144421 ] Apache Spark commented on SPARK-3481: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2505 > HiveComparisonTest throws exception of > "org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: > default" > --- > > Key: SPARK-3481 > URL: https://issues.apache.org/jira/browse/SPARK-3481 > Project: Spark > Issue Type: Test > Components: SQL >Reporter: Cheng Hao >Assignee: Cheng Hao >Priority: Minor > Fix For: 1.2.0 > > > In local test, lots of exception raised like: > {panel} > 11:08:01.746 ERROR hive.ql.exec.DDLTask: > org.apache.hadoop.hive.ql.metadata.HiveException: Database does not exist: > default > at > org.apache.hadoop.hive.ql.exec.DDLTask.switchDatabase(DDLTask.java:3480) > at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:237) > at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:151) > at > org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:65) > at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1414) > at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1192) > at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1020) > at org.apache.hadoop.hive.ql.Driver.run(Driver.java:888) > at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:298) > at > org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:272) > at > org.apache.spark.sql.hive.test.TestHiveContext.runSqlHive(TestHive.scala:88) > at > org.apache.spark.sql.hive.test.TestHiveContext.reset(TestHive.scala:348) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply$mcV$sp(HiveComparisonTest.scala:255) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest$$anonfun$createQueryTest$1.apply(HiveComparisonTest.scala:225) > at > org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at > org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22) > at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:158) > at org.scalatest.Suite$class.withFixture(Suite.scala:1121) > at org.scalatest.FunSuite.withFixture(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:155) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) > at > org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:167) > at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:167) > at org.scalatest.FunSuite.runTest(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) > at > org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:200) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) > at > org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) > at scala.collection.immutable.List.foreach(List.scala:318) > at 
org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > at > org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) > at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) > at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:200) > at org.scalatest.FunSuite.runTests(FunSuite.scala:1559) > at org.scalatest.Suite$class.run(Suite.scala:1423) > at > org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1559) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) > at > org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:204) > at org.scalatest.SuperEngine.runImpl(Engine.scala:545) > at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:204) > at > org.apache.spark.sql.hive.execution.HiveComparisonTest.org$scalatest$BeforeAndAfterAll$$super$run(Hive
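A hedged sketch of the kind of guard that would avoid this failure mode in TestHive.reset(); this is an illustration only, not taken from the linked pull request:
{code:scala}
// Ensure the default database exists before switching to it, so
// DDLTask.switchDatabase cannot throw "Database does not exist: default".
// runSqlHive here stands for HiveContext.runSqlHive(sql: String): Seq[String].
def resetDatabases(runSqlHive: String => Seq[String]): Unit = {
  runSqlHive("CREATE DATABASE IF NOT EXISTS default")
  runSqlHive("USE default")
}
{code}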
[jira] [Commented] (SPARK-3577) Add task metric to report spill time
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144412#comment-14144412 ] Apache Spark commented on SPARK-3577: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2504 > Add task metric to report spill time > > > Key: SPARK-3577 > URL: https://issues.apache.org/jira/browse/SPARK-3577 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > The {{ExternalSorter}} passes its own {{ShuffleWriteMetrics}} into > {{ExternalSorter}}. The write time recorded in those metrics is never used. > We should probably add task metrics to report this spill time, since for > shuffles, this would have previously been reported as part of shuffle write > time (with the original hash-based sorter). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
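A minimal sketch of the bookkeeping the ticket asks for; the class and field names below are illustrative, not Spark's actual TaskMetrics API:
{code:scala}
// Accumulate the time spent spilling in a dedicated task-level counter,
// instead of folding it into shuffle write time.
class SpillTimer {
  private var _spillTimeNanos = 0L
  def spillTimeNanos: Long = _spillTimeNanos

  // Wrap each spill call, e.g. timeSpill { sorter.spillToMergeableFile(...) }
  def timeSpill[T](body: => T): T = {
    val start = System.nanoTime()
    try body finally { _spillTimeNanos += System.nanoTime() - start }
  }
}
{code}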
[jira] [Commented] (SPARK-3172) Distinguish between shuffle spill on the map and reduce side
[ https://issues.apache.org/jira/browse/SPARK-3172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144411#comment-14144411 ] Apache Spark commented on SPARK-3172: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2504 > Distinguish between shuffle spill on the map and reduce side > > > Key: SPARK-3172 > URL: https://issues.apache.org/jira/browse/SPARK-3172 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2 >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
[ https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144326#comment-14144326 ] Apache Spark commented on SPARK-3649: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/2503 > ClassCastException in GraphX custom serializers when sort-based shuffle spills > -- > > Key: SPARK-3649 > URL: https://issues.apache.org/jira/browse/SPARK-3649 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > As > [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] > on the mailing list, GraphX throws > {code} > java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 > at > org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) > > at > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) > > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) > {code} > when sort-based shuffle attempts to spill to disk. This is because GraphX > defines custom serializers for shuffling pair RDDs that assume Spark will > always serialize the entire pair object rather than breaking it up into its > components. However, the spill code path in sort-based shuffle [violates this > assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > GraphX uses the custom serializers to compress vertex ID keys using > variable-length integer encoding. However, since the serializer can no longer > rely on the key and value being serialized and deserialized together, > performing such encoding would require writing a tag byte. Therefore it may > be better to simply remove the custom serializers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
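A simplified sketch of why the spill path breaks the assumption; the stream class below is illustrative, not GraphX's actual Serializers.scala:
{code:scala}
import java.io.DataOutputStream

// A pair-oriented serialization stream that assumes every writeObject call
// receives a whole (VertexId, message) Tuple2.
class PairAssumingStream(out: DataOutputStream) {
  def writeObject[T](t: T): Unit = {
    val (vid, _) = t.asInstanceOf[(Long, Any)] // ClassCastException if t is a bare Long
    out.writeLong(vid)
    // ... variable-length encode the message ...
  }
}

// The sort-based spill path instead writes the key and the value separately,
// roughly: writer.write(key); writer.write(value)
// so writeObject is handed a bare Long key and the cast above fails.
{code}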
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144258#comment-14144258 ] Saisai Shao commented on SPARK-3032: Hi Matei, thanks for your reply, I will try again using your comments. > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3212) Improve the clarity of caching semantics
[ https://issues.apache.org/jira/browse/SPARK-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144245#comment-14144245 ] Apache Spark commented on SPARK-3212: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2501 > Improve the clarity of caching semantics > > > Key: SPARK-3212 > URL: https://issues.apache.org/jira/browse/SPARK-3212 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust >Priority: Blocker > > Right now there are a bunch of different ways to cache tables in Spark SQL. > For example: > - tweets.cache() > - sql("SELECT * FROM tweets").cache() > - table("tweets").cache() > - tweets.cache().registerTempTable(tweets) > - sql("CACHE TABLE tweets") > - cacheTable("tweets") > Each of the above commands has subtly different semantics, leading to a very > confusing user experience. Ideally, we would stop doing caching based on > simple tables names and instead have a phase of optimization that does > intelligent matching of query plans with available cached data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3032: - Assignee: Saisai Shao > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144243#comment-14144243 ] Matei Zaharia commented on SPARK-3032: -- I'm not completely sure that this is because hashCode provides a partial ordering, because I believe TimSort is supposed to work on partial orderings as well. I believe the problem is an integer over flow when we subtract key1.hashCode - key2.hashCode. Can you try replacing the line that returns h1 - h2 in keyComparator with returning Integer.compare(h1, h2)? This will properly deal with overflow. Returning h1 - h2 is definitely wrong: for example suppose that h1 = Int.MaxValue and h2 = Int.MinValue, then h1 - h2 = -1. Please add a unit test for this case as well. > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
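A sketch of the comparator change Matei suggests, assuming a comparator shaped like ExternalSorter's key comparator (the surrounding names are illustrative):
{code:scala}
import java.util.Comparator

// Subtracting hash codes can overflow Int; Integer.compare gives a
// contract-respecting ordering over hash codes.
val keyComparator: Comparator[Any] = new Comparator[Any] {
  def compare(a: Any, b: Any): Int = {
    val h1 = if (a == null) 0 else a.hashCode()
    val h2 = if (b == null) 0 else b.hashCode()
    // h1 - h2 is wrong: h1 = Int.MaxValue, h2 = Int.MinValue yields -1.
    java.lang.Integer.compare(h1, h2)
  }
}
{code}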
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144244#comment-14144244 ] Matei Zaharia commented on SPARK-3032: -- Yeah actually I'm sure TimSort works fine with a partial ordering, I read through the contract of Comparable. We also use it that way all the time when we only sort by partition ID. > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Assignee: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3655) Secondary sort
koert kuipers created SPARK-3655: Summary: Secondary sort Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: koert kuipers Priority: Minor Now that Spark has a sort-based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
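Until such an API exists, a hedged sketch of the usual workaround: sort by a composite key while partitioning on the primary key only (repartitionAndSortWithinPartitions is assumed available; everything else below is illustrative):
{code:scala}
import org.apache.spark.{HashPartitioner, Partitioner, SparkContext}
import org.apache.spark.SparkContext._

// Route each ((k, v), _) record by k alone, so all values of a key land in
// one partition and arrive sorted by (k, v).
class PrimaryKeyPartitioner(partitions: Int) extends Partitioner {
  private val delegate = new HashPartitioner(partitions)
  def numPartitions: Int = delegate.numPartitions
  def getPartition(key: Any): Int = key match {
    case (primary, _) => delegate.getPartition(primary)
  }
}

def example(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 2), ("a", 2)))
  val sorted = pairs
    .map { case (k, v) => ((k, v), ()) }                               // composite key
    .repartitionAndSortWithinPartitions(new PrimaryKeyPartitioner(4))  // sort within partitions
    .map { case ((k, v), _) => (k, v) }                                // values now ordered per key
  sorted.collect().foreach(println)
}
{code}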
[jira] [Created] (SPARK-3654) Implement all extended HiveQL statements/commands with a separate parser combinator
Cheng Lian created SPARK-3654: - Summary: Implement all extended HiveQL statements/commands with a separate parser combinator Key: SPARK-3654 URL: https://issues.apache.org/jira/browse/SPARK-3654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Cheng Lian Statements and commands such as {{SET}}, {{CACHE TABLE}}, and {{ADD JAR}} are currently parsed in quite a hacky way, like this: {code} if (sql.trim.toLowerCase.startsWith("cache table")) { sql.trim.toLowerCase.startsWith("cache table") match { ... } } {code} It would be much better to add an extra parser combinator that parses these syntax extensions first, and then falls back to the normal Hive parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
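A hedged sketch of what such a combinator could look like; the command case classes and the fallback hook are illustrative, not Spark SQL's actual classes:
{code:scala}
import scala.util.parsing.combinator.RegexParsers

// Parse the Spark-specific extensions first; anything that does not match
// falls through to the normal Hive parser.
object ExtendedHiveQlParser extends RegexParsers {
  sealed trait Command
  case class SetCommand(body: Option[String]) extends Command
  case class CacheCommand(table: String) extends Command
  case class AddJar(path: String) extends Command
  case class HiveNativeCommand(sql: String) extends Command

  private def SET   = "(?i)SET".r
  private def CACHE = "(?i)CACHE".r
  private def TABLE = "(?i)TABLE".r
  private def ADD   = "(?i)ADD".r
  private def JAR   = "(?i)JAR".r

  private def set: Parser[Command]    = SET ~> opt(".+".r) ^^ SetCommand
  private def cache: Parser[Command]  = CACHE ~ TABLE ~> "\\S+".r ^^ CacheCommand
  private def addJar: Parser[Command] = ADD ~ JAR ~> ".+".r ^^ AddJar
  private def extension: Parser[Command] = cache | addJar | set

  def parse(sql: String): Command = parseAll(extension, sql) match {
    case Success(cmd, _) => cmd
    case _               => HiveNativeCommand(sql) // fall back to the Hive parser
  }
}
{code}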
[jira] [Commented] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144192#comment-14144192 ] SK commented on SPARK-3610: --- Today I found that the same issue occurs with GraphX application logs as well (basically, the log file name includes parentheses and commas), and the history server gets messed up and needs to be restarted every time. Thanks. > History server log name should not be based on user input > - > > Key: SPARK-3610 > URL: https://issues.apache.org/jira/browse/SPARK-3610 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: SK >Priority: Critical > > Right now we use the user-defined application name when creating the logging > file for the history server. We should use some type of GUID generated from > inside of Spark instead of allowing user input here. It can cause errors if > users provide characters that are not valid in filesystem paths. > Original bug report: > {quote} > The default log files for the MLlib examples use a rather long naming > convention that includes special characters like parentheses and comma. For > e.g. one of my log files is named > "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032". > When I click on the program on the history server page (at port 18080), to > view the detailed application logs, the history server crashes and I need to > restart it. I am using Spark 1.1 on a Mesos cluster. > I renamed the log file by removing the special characters and then it loads > up correctly. I am not sure which program is creating the log files. Can it > be changed so that the default log file naming convention does not include > special characters? > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
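A hedged sketch of the direction the ticket suggests; the helper below is illustrative, not Spark's actual event-log naming code:
{code:scala}
import java.util.UUID

// Build the event-log file name from a sanitized application name plus a
// generated id, so user-supplied characters such as '(', ')' or ',' never
// end up in a filesystem path.
def eventLogFileName(appName: String): String = {
  val sanitized = appName.toLowerCase.replaceAll("[^a-z0-9\\-_]", "-")
  s"$sanitized-${UUID.randomUUID()}"
}
{code}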
[jira] [Commented] (SPARK-3653) SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-3653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144155#comment-14144155 ] Apache Spark commented on SPARK-3653: - User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/2500 > SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode > - > > Key: SPARK-3653 > URL: https://issues.apache.org/jira/browse/SPARK-3653 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Critical > > We only check for these in the spark-class but not in SparkSubmit. For client > mode, this is OK because the driver can read directly from these environment > variables. > For cluster mode however, SPARK_EXECUTOR_MEMORY is not set on the node that > starts the driver, and SPARK_DRIVER_MEMORY is simply not propagated to the > driver JVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144150#comment-14144150 ] Apache Spark commented on SPARK-3652: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2499 > upgrade spark sql hive version to 0.13.1 > > > Key: SPARK-3652 > URL: https://issues.apache.org/jira/browse/SPARK-3652 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > > now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3653) SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode
Andrew Or created SPARK-3653: Summary: SPARK_{DRIVER|EXECUTOR}_MEMORY is ignored in cluster mode Key: SPARK-3653 URL: https://issues.apache.org/jira/browse/SPARK-3653 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Critical We only check for these in the spark-class but not in SparkSubmit. For client mode, this is OK because the driver can read directly from these environment variables. For cluster mode however, SPARK_EXECUTOR_MEMORY is not set on the node that starts the driver, and SPARK_DRIVER_MEMORY is simply not propagated to the driver JVM. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
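A hedged sketch of the fallback order a fix could apply when resolving memory settings in SparkSubmit; the method and default below are illustrative, not the actual SparkSubmitArguments code:
{code:scala}
// Prefer the explicit CLI flag, then spark-defaults, then the environment,
// so cluster mode also honors SPARK_DRIVER_MEMORY.
def resolveDriverMemory(cliValue: Option[String],
                        defaults: Map[String, String],
                        env: Map[String, String] = sys.env): String = {
  cliValue
    .orElse(defaults.get("spark.driver.memory"))
    .orElse(env.get("SPARK_DRIVER_MEMORY"))
    .getOrElse("512m") // illustrative default
}
{code}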
[jira] [Updated] (SPARK-3652) upgrade spark sql hive version to 0.13.1
[ https://issues.apache.org/jira/browse/SPARK-3652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3652: --- Description: now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. > upgrade spark sql hive version to 0.13.1 > > > Key: SPARK-3652 > URL: https://issues.apache.org/jira/browse/SPARK-3652 > Project: Spark > Issue Type: Dependency upgrade > Components: SQL >Affects Versions: 1.1.0 >Reporter: wangfei > > now spark sql hive version is 0.12.0, compile with 0.13.1 will get errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3652) upgrade spark sql hive version to 0.13.1
wangfei created SPARK-3652: -- Summary: upgrade spark sql hive version to 0.13.1 Key: SPARK-3652 URL: https://issues.apache.org/jira/browse/SPARK-3652 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.1.0 Reporter: wangfei -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144100#comment-14144100 ] Apache Spark commented on SPARK-3606: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2497 > Spark-on-Yarn AmIpFilter does not work with Yarn HA. > > > Key: SPARK-3606 > URL: https://issues.apache.org/jira/browse/SPARK-3606 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > The current IP filter only considers one of the RMs in an HA setup. If the > active RM is not the configured one, you get a "connection refused" error > when clicking on the Spark AM links in the RM UI. > Similar to YARN-1811, but for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
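A hedged sketch of how the proxy hosts for all RMs in an HA setup could be gathered; the wiring into AmIpFilter is omitted, and only the standard YARN config keys are assumed:
{code:scala}
import org.apache.hadoop.conf.Configuration

// Collect the web-app address of every configured RM, not just the first one,
// so the filter accepts requests proxied by whichever RM is currently active.
def allRmWebAppAddresses(conf: Configuration): Seq[String] = {
  val rmIds = Option(conf.get("yarn.resourcemanager.ha.rm-ids"))
    .map(_.split(",").map(_.trim).toSeq)
    .getOrElse(Seq.empty)
  if (rmIds.isEmpty) {
    Seq(conf.get("yarn.resourcemanager.webapp.address"))
  } else {
    rmIds.map(id => conf.get(s"yarn.resourcemanager.webapp.address.$id"))
  }
}
{code}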
[jira] [Commented] (SPARK-3620) Refactor config option handling code for spark-submit
[ https://issues.apache.org/jira/browse/SPARK-3620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144095#comment-14144095 ] Dale Richardson commented on SPARK-3620: Because Typesafe Config is based on a JSON-like tree structure of config values, it will never support non-common prefixes on config variables, so I've gone back to using property objects. > Refactor config option handling code for spark-submit > - > > Key: SPARK-3620 > URL: https://issues.apache.org/jira/browse/SPARK-3620 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 1.0.0, 1.1.0 >Reporter: Dale Richardson >Assignee: Dale Richardson >Priority: Minor > > I'm proposing it's time to refactor the configuration argument handling code > in spark-submit. The code has grown organically in a short period of time, > handles a pretty complicated logic flow, and is now pretty fragile. Some > issues that have been identified: > 1. Hand-crafted property file readers that do not support the property file > format as specified in > http://docs.oracle.com/javase/6/docs/api/java/util/Properties.html#load(java.io.Reader) > 2. ResolveURI not called on paths read from conf/prop files > 3. Inconsistent means of merging / overriding values from different sources > (some get overridden by file, others by manual settings of a field on an object, > some by properties) > 4. Argument validation should be done after combining config files, system > properties and command line arguments > 5. Alternate conf file location not handled in shell scripts > 6. Some options can only be passed as command line arguments > 7. Defaults for options are hard-coded (and sometimes overridden multiple > times) in many places throughout the code, e.g. master = local[*] > Initial proposal is to use Typesafe Config to read in the config information > and merge the various config sources -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
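A minimal sketch of the merge order argued for above, using java.util.Properties so the full property-file format is honored; the file names and precedence here are illustrative:
{code:scala}
import java.io.FileReader
import java.util.Properties
import scala.collection.JavaConverters._

// Load the properties file with the standard reader, then let system
// properties and command-line arguments override it; validate the combined
// result once, after merging.
def mergedConf(propFile: String, cliOverrides: Map[String, String]): Map[String, String] = {
  val props = new Properties()
  val reader = new FileReader(propFile)
  try props.load(reader) finally reader.close()

  val fromFile = props.asScala.toMap
  val fromSystem = sys.props.toMap.filter { case (k, _) => k.startsWith("spark.") }
  fromFile ++ fromSystem ++ cliOverrides // later sources win
}
{code}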
[jira] [Created] (SPARK-3651) Consolidate executor maps in CoarseGrainedSchedulerBackend
Andrew Or created SPARK-3651: Summary: Consolidate executor maps in CoarseGrainedSchedulerBackend Key: SPARK-3651 URL: https://issues.apache.org/jira/browse/SPARK-3651 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Or In CoarseGrainedSchedulerBackend, we have: {code} private val executorActor = new HashMap[String, ActorRef] private val executorAddress = new HashMap[String, Address] private val executorHost = new HashMap[String, String] private val freeCores = new HashMap[String, Int] private val totalCores = new HashMap[String, Int] {code} We only ever put / remove stuff from these maps together. It would simplify the code if we consolidate these all into one map as we have done in JobProgressListener in https://issues.apache.org/jira/browse/SPARK-2299. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
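A hedged sketch of the proposed consolidation; the case class name and fields are illustrative:
{code:scala}
import akka.actor.{ActorRef, Address}
import scala.collection.mutable.HashMap

// One record per executor keeps inserts and removals atomic across all of
// the per-executor state that the five maps above track separately.
private case class ExecutorData(
    actor: ActorRef,
    address: Address,
    host: String,
    var freeCores: Int,
    totalCores: Int)

private val executorDataMap = new HashMap[String, ExecutorData]
{code}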
[jira] [Updated] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3032: --- Priority: Critical (was: Major) > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao >Priority: Critical > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3032: --- Target Version/s: 1.2.0 > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > Seems the current partitionKeyComparator which use hashcode of String as key > comparator break some sorting contracts. > Also I tested using data type Int as key, this is OK to pass the test, since > hashcode of Int is its self. So I think potentially partitionDiff + hashcode > of String may break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Target Version/s: 1.2.0 > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Critical > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Priority: Blocker (was: Critical) > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1860: --- Fix Version/s: (was: 1.2.0) > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Critical > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3647) Shaded Guava patch causes access issues with package private classes
[ https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143968#comment-14143968 ] Apache Spark commented on SPARK-3647: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/2496 > Shaded Guava patch causes access issues with package private classes > > > Key: SPARK-3647 > URL: https://issues.apache.org/jira/browse/SPARK-3647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > > The patch that introduced shading to Guava (SPARK-2848) tried to maintain > backwards compatibility in the Java API by not relocating the "Optional" > class. That causes problems when that class references package private > members in the Absent and Present classes, which are now in a different > package: > {noformat} > Exception in thread "main" java.lang.IllegalAccessError: tried to access > class org.spark-project.guava.common.base.Present from class > com.google.common.base.Optional > at com.google.common.base.Optional.of(Optional.java:86) > at > org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25) > at > org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
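The call in the stack trace is enough to reproduce the error against an affected build; a minimal trigger, assuming a running JavaSparkContext:
{code:scala}
import org.apache.spark.api.java.JavaSparkContext

// getSparkHome() returns a com.google.common.base.Optional; constructing it
// touches the package-private Present class, which shading relocated to
// org.spark-project.guava.common.base, hence the IllegalAccessError.
def repro(jsc: JavaSparkContext): Unit = {
  jsc.getSparkHome()
}
{code}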
[jira] [Commented] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143954#comment-14143954 ] Apache Spark commented on SPARK-3650: - User 'jegonzal' has created a pull request for this issue: https://github.com/apache/spark/pull/2495 > Triangle Count handles reverse edges incorrectly > > > Key: SPARK-3650 > URL: https://issues.apache.org/jira/browse/SPARK-3650 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Joseph E. Gonzalez > > The triangle count implementation assumes that edges are aligned in a > canonical direction. As stated in the documentation: > bq. Note that the input graph should have its edges in canonical direction > (i.e. the `sourceId` less than `destId`) > However the TriangleCount algorithm does not verify that this condition holds > and indeed even the unit tests exploits this functionality: > {code:scala} > val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ > Array(0L -> -1L, -1L -> -2L, -2L -> 0L) > val rawEdges = sc.parallelize(triangles, 2) > val graph = Graph.fromEdgeTuples(rawEdges, true).cache() > val triangleCount = graph.triangleCount() > val verts = triangleCount.vertices > verts.collect().foreach { case (vid, count) => > if (vid == 0) { > assert(count === 4) // <-- Should be 2 > } else { > assert(count === 2) // <-- Should be 1 > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph E. Gonzalez updated SPARK-3650: -- Description: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code} was: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code:scala} > Triangle Count handles reverse edges incorrectly > > > Key: SPARK-3650 > URL: https://issues.apache.org/jira/browse/SPARK-3650 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Joseph E. Gonzalez > > The triangle count implementation assumes that edges are aligned in a > canonical direction. As stated in the documentation: > bq. Note that the input graph should have its edges in canonical direction > (i.e. the `sourceId` less than `destId`) > However the TriangleCount algorithm does not verify that this condition holds > and indeed even the unit tests exploits this functionality: > {code:scala} > val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ > Array(0L -> -1L, -1L -> -2L, -2L -> 0L) > val rawEdges = sc.parallelize(triangles, 2) > val graph = Graph.fromEdgeTuples(rawEdges, true).cache() > val triangleCount = graph.triangleCount() > val verts = triangleCount.vertices > verts.collect().foreach { case (vid, count) => > if (vid == 0) { > assert(count === 4) // <-- Should be 2 > } else { > assert(count === 2) // <-- Should be 1 > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3650) Triangle Count handles reverse edges incorrectly
[ https://issues.apache.org/jira/browse/SPARK-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph E. Gonzalez updated SPARK-3650: -- Description: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: bq. Note that the input graph should have its edges in canonical direction (i.e. the `sourceId` less than `destId`) However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: {code:scala} val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } {code:scala} was: The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: ``` Note that the input graph should have its edges in canonical direction * (i.e. the `sourceId` less than `destId`) ``` However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: ~~~ val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } ~~~ > Triangle Count handles reverse edges incorrectly > > > Key: SPARK-3650 > URL: https://issues.apache.org/jira/browse/SPARK-3650 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.1.0 >Reporter: Joseph E. Gonzalez > > The triangle count implementation assumes that edges are aligned in a > canonical direction. As stated in the documentation: > bq. Note that the input graph should have its edges in canonical direction > (i.e. the `sourceId` less than `destId`) > However the TriangleCount algorithm does not verify that this condition holds > and indeed even the unit tests exploits this functionality: > {code:scala} > val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ > Array(0L -> -1L, -1L -> -2L, -2L -> 0L) > val rawEdges = sc.parallelize(triangles, 2) > val graph = Graph.fromEdgeTuples(rawEdges, true).cache() > val triangleCount = graph.triangleCount() > val verts = triangleCount.vertices > verts.collect().foreach { case (vid, count) => > if (vid == 0) { > assert(count === 4) // <-- Should be 2 > } else { > assert(count === 2) // <-- Should be 1 > } > } > {code:scala} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3650) Triangle Count handles reverse edges incorrectly
Joseph E. Gonzalez created SPARK-3650: - Summary: Triangle Count handles reverse edges incorrectly Key: SPARK-3650 URL: https://issues.apache.org/jira/browse/SPARK-3650 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.1.0 Reporter: Joseph E. Gonzalez The triangle count implementation assumes that edges are aligned in a canonical direction. As stated in the documentation: ``` Note that the input graph should have its edges in canonical direction * (i.e. the `sourceId` less than `destId`) ``` However the TriangleCount algorithm does not verify that this condition holds and indeed even the unit tests exploits this functionality: ~~~ val triangles = Array(0L -> 1L, 1L -> 2L, 2L -> 0L) ++ Array(0L -> -1L, -1L -> -2L, -2L -> 0L) val rawEdges = sc.parallelize(triangles, 2) val graph = Graph.fromEdgeTuples(rawEdges, true).cache() val triangleCount = graph.triangleCount() val verts = triangleCount.vertices verts.collect().foreach { case (vid, count) => if (vid == 0) { assert(count === 4) // <-- Should be 2 } else { assert(count === 2) // <-- Should be 1 } } ~~~ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
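Until the algorithm enforces its precondition, a hedged workaround sketch (this is not the fix in the linked pull request): canonicalize and deduplicate the edges before counting.
{code:scala}
import org.apache.spark.graphx.{Graph, PartitionStrategy}

// Orient every edge so srcId < dstId and drop duplicates, so the documented
// precondition of triangleCount actually holds for the input graph.
def canonicalTriangleCount[VD, ED](graph: Graph[VD, ED]): Graph[Int, Int] = {
  val canonicalEdges = graph.edges
    .map(e => if (e.srcId < e.dstId) (e.srcId, e.dstId) else (e.dstId, e.srcId))
    .distinct()
  Graph.fromEdgeTuples(canonicalEdges, defaultValue = 1)
    .partitionBy(PartitionStrategy.RandomVertexCut)
    .triangleCount()
}
{code}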
[jira] [Commented] (SPARK-1720) use LD_LIBRARY_PATH instead of -Djava.library.path
[ https://issues.apache.org/jira/browse/SPARK-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143930#comment-14143930 ] Patrick Wendell commented on SPARK-1720: Another user reported this issue, so let's try to get it into Spark 1.2. > use LD_LIBRARY_PATH instead of -Djava.library.path > -- > > Key: SPARK-1720 > URL: https://issues.apache.org/jira/browse/SPARK-1720 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Guoqiang Li >Priority: Critical > > I think it would be better to use LD_LIBRARY_PATH rather than > -Djava.library.path. Once java.library.path is set, it doesn't search > LD_LIBRARY_PATH. In Hadoop we switched to using LD_LIBRARY_PATH instead of > java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
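A hedged sketch of the launcher-side change being proposed; the helper below is illustrative, not the actual spark-class or YARN launcher code:
{code:scala}
// Prepend the native-library directory to the child's LD_LIBRARY_PATH rather
// than passing -Djava.library.path, so a user-provided LD_LIBRARY_PATH is
// still searched by the JVM.
def childEnvWithNativeLibs(baseEnv: Map[String, String], nativeLibDir: String): Map[String, String] = {
  val merged = baseEnv.get("LD_LIBRARY_PATH") match {
    case Some(existing) if existing.nonEmpty => s"$nativeLibDir:$existing"
    case _ => nativeLibDir
  }
  baseEnv + ("LD_LIBRARY_PATH" -> merged)
}
{code}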
[jira] [Updated] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
[ https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3649: -- Description: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would require writing a tag byte. Therefore it may be better to simply remove the custom serializers. was: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would require writing a tag byte. 
> ClassCastException in GraphX custom serializers when sort-based shuffle spills > -- > > Key: SPARK-3649 > URL: https://issues.apache.org/jira/browse/SPARK-3649 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > As > [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] > on the mailing list, GraphX throws > {code} > java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 > at > org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) > > at > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) > > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) > {code} > when sort-based shuffle attempts to spill to disk. This is because GraphX > defines custom serializers for shuffling pair RDDs that assume Spark will > always serialize the entire pair object rather than breaking it up into its > components. However, the spill code path in sort-based shuffle [violates this > assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > GraphX uses the custom serializers to compress vertex ID keys using > variable-length integer encoding. However, since the serializer can no longer > rely on the key and value being serialized and deserialized together, > performing such encoding would require writing a tag byte. Therefore it may > be better to simply remove the custom serializers. -- This message was sent by
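The assumption being violated is easier to see in code. Below is a simplified, hypothetical sketch of a pair-oriented write path like the one GraphX's custom serializers assume, contrasted with what the spill path effectively does; it is not GraphX's actual Serializers.scala.

{code}
// A simplified, hypothetical sketch of the mismatch described above -- not GraphX's
// actual serializer code. The pair-oriented stream assumes every writeObject call
// receives a whole (key, value) tuple, so a spill path that writes the key and value
// separately hands it a bare Long key and the cast to Tuple2 fails.
import java.io.{ByteArrayOutputStream, DataOutputStream}

class PairOnlyStream(out: DataOutputStream) {
  def writeObject[T](t: T): Unit = {
    // Assumes t is always a (vertexId, message) pair, mirroring the failing cast
    // at Serializers.scala:39.
    val (vid, msg) = t.asInstanceOf[(Long, Int)]
    out.writeLong(vid)
    out.writeInt(msg)
  }
}

val stream = new PairOnlyStream(new DataOutputStream(new ByteArrayOutputStream()))
stream.writeObject((42L, 7))   // fine: the whole pair is serialized together
// stream.writeObject(42L)     // what the spill path effectively does: ClassCastException
{code}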
[jira] [Commented] (SPARK-3621) Provide a way to broadcast an RDD (instead of just a variable made of the RDD) so that a job can access
[ https://issues.apache.org/jira/browse/SPARK-3621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143933#comment-14143933 ] bc Wong commented on SPARK-3621: I think this is for the case of a map-side join where one of the tables is small. [~xuefuz], if the driver is running in the cluster, then RDD.collect() means it reads from HDFS and then broadcasts the data to everyone. Right? That seems reasonable. I don't see another way to "broadcast" something. Alternatively, it's probably better for each executor to individually read that small HDFS file into its memory. > Provide a way to broadcast an RDD (instead of just a variable made of the > RDD) so that a job can access > --- > > Key: SPARK-3621 > URL: https://issues.apache.org/jira/browse/SPARK-3621 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 >Reporter: Xuefu Zhang > > In some cases, such as Hive's way of doing map-side join, it would be > beneficial to allow the client program to broadcast RDDs rather than just > variables made of these RDDs. Broadcasting a variable made of RDDs requires > all RDD data be collected to the driver and that the variable be shipped to > the cluster after being made. It would perform better if the driver just > broadcast the RDDs and used the corresponding data in jobs (such as building > hashmaps at executors). > Tez has a broadcast edge which can ship data from the previous stage to the > next stage, which doesn't require driver-side processing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
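For context, the status quo the issue describes looks roughly like the following sketch of a map-side join: the small table is collected to the driver and re-shipped as a broadcast variable, which is the round trip the proposal wants to avoid. Table paths, schemas, and names are illustrative, not taken from Hive on Spark.

{code}
// A sketch of the current approach the issue describes (paths and types are
// illustrative): the small table is collected to the driver and re-shipped as a
// broadcast variable instead of being broadcast as an RDD directly.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

val sc = new SparkContext(new SparkConf().setAppName("map-side-join-sketch").setMaster("local[*]"))

val bigTable   = sc.textFile("hdfs:///warehouse/big").map(line => (line.split(",")(0), line))
val smallTable = sc.textFile("hdfs:///warehouse/small").map(line => (line.split(",")(0), line))

// Driver-side collect + broadcast: the map is built once and shipped to all executors.
val smallMap = sc.broadcast(smallTable.collectAsMap())

// Map-side join: bigTable is never shuffled, each executor probes the broadcast map.
val joined = bigTable.flatMap { case (key, row) =>
  smallMap.value.get(key).map(smallRow => (key, (row, smallRow)))
}
{code}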
[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143921#comment-14143921 ] Xuefu Zhang commented on SPARK-3622: They are related but not exactly the same. SPARK-2688 is about branching off the RDD tree with no custom transformation involved. This JIRA is about returning multiple RDDs from a single transformation (branching happening within a transformation). > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return just one RDD at most, even for those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs. > While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot. > This is especially true when the user's existing function is already generating > different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
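The workaround mentioned in the description (transforming the same RDD multiple times) looks roughly like the sketch below; the data and predicates are illustrative only.

{code}
// A sketch of today's workaround: to split one RDD into several, the input is cached
// and filtered once per output, i.e. the user function runs one pass per output RDD
// instead of producing all outputs in a single pass.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("multi-output-sketch").setMaster("local[*]"))
val records = sc.parallelize(1 to 1000000).cache()

// Two passes over the same cached data -- what a multi-output transformation would
// fold into one pass.
val evens = records.filter(_ % 2 == 0)
val odds  = records.filter(_ % 2 != 0)
{code}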
[jira] [Updated] (SPARK-3606) Spark-on-Yarn AmIpFilter does not work with Yarn HA.
[ https://issues.apache.org/jira/browse/SPARK-3606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3606: - Affects Version/s: 1.1.0 > Spark-on-Yarn AmIpFilter does not work with Yarn HA. > > > Key: SPARK-3606 > URL: https://issues.apache.org/jira/browse/SPARK-3606 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > > The current IP filter only considers one of the RMs in an HA setup. If the > active RM is not the configured one, you get a "connection refused" error > when clicking on the Spark AM links in the RM UI. > Similar to YARN-1811, but for Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
[ https://issues.apache.org/jira/browse/SPARK-3649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3649: -- Description: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would require writing a tag byte. was: As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > ClassCastException in GraphX custom serializers when sort-based shuffle spills > -- > > Key: SPARK-3649 > URL: https://issues.apache.org/jira/browse/SPARK-3649 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave > > As > [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] > on the mailing list, GraphX throws > {code} > java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 > at > org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) > > at > org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) > > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) > {code} > when sort-based shuffle attempts to spill to disk. 
This is because GraphX > defines custom serializers for shuffling pair RDDs that assume Spark will > always serialize the entire pair object rather than breaking it up into its > components. However, the spill code path in sort-based shuffle [violates this > assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. > GraphX uses the custom serializers to compress vertex ID keys using > variable-length integer encoding. However, since the serializer can no longer > rely on the key and value being serialized and deserialized together, > performing such encoding would require writing a tag byte. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1720) use LD_LIBRARY_PATH instead of -Djava.library.path
[ https://issues.apache.org/jira/browse/SPARK-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-1720: --- Priority: Critical (was: Major) Target Version/s: 1.2.0 > use LD_LIBRARY_PATH instead of -Djava.library.path > -- > > Key: SPARK-1720 > URL: https://issues.apache.org/jira/browse/SPARK-1720 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Thomas Graves >Assignee: Guoqiang Li >Priority: Critical > > I think it would be better to use LD_LIBRARY_PATH rather than > -Djava.library.path. Once java.library.path is set, it doesn't search > LD_LIBRARY_PATH. In Hadoop we switched to using LD_LIBRARY_PATH instead of > java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
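The behaviour the issue relies on is that once -Djava.library.path is set, the JVM does not consult LD_LIBRARY_PATH. The following is a hypothetical sketch, not Spark's launcher code, of launching a child JVM that inherits an extended LD_LIBRARY_PATH instead; paths and class names are placeholders.

{code}
// A hypothetical sketch (not Spark's launcher code): extend LD_LIBRARY_PATH for the
// child JVM and omit -Djava.library.path entirely, so native libraries can be layered
// from both the parent's path and the extra directory.
import scala.sys.process._

val nativeLibDir = "/opt/hadoop/lib/native"  // placeholder directory
val extendedPath = nativeLibDir + ":" + sys.env.getOrElse("LD_LIBRARY_PATH", "")

// The child JVM inherits LD_LIBRARY_PATH; System.loadLibrary resolves against it.
Process(Seq("java", "-cp", "app.jar", "com.example.Main"), None,
  "LD_LIBRARY_PATH" -> extendedPath).!
{code}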
[jira] [Commented] (SPARK-3622) Provide a custom transformation that can output multiple RDDs
[ https://issues.apache.org/jira/browse/SPARK-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143908#comment-14143908 ] Sandy Ryza commented on SPARK-3622: --- Is this a duplicate of SPARK-2688? > Provide a custom transformation that can output multiple RDDs > - > > Key: SPARK-3622 > URL: https://issues.apache.org/jira/browse/SPARK-3622 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Xuefu Zhang > > All existing transformations return just one RDD at most, even for those > which take user-supplied functions such as mapPartitions(). However, > sometimes a user-provided function may need to output multiple RDDs. For > instance, a filter function that divides the input RDD into several RDDs. > While it's possible to get multiple RDDs by transforming the same RDD > multiple times, it may be more efficient to do this concurrently in one shot. > This is especially true when the user's existing function is already generating > different data sets. > This is the case in Hive on Spark, where Hive's map function and reduce function > can output different data sets to be consumed by subsequent stages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2620) case class cannot be used as key for reduce
[ https://issues.apache.org/jira/browse/SPARK-2620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143895#comment-14143895 ] Grega Kespret commented on SPARK-2620: -- We have this issue on Spark 1.1.0. > case class cannot be used as key for reduce > --- > > Key: SPARK-2620 > URL: https://issues.apache.org/jira/browse/SPARK-2620 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0, 1.1.0 > Environment: reproduced on spark-shell local[4] >Reporter: Gerard Maas >Priority: Critical > Labels: case-class, core > > Using a case class as a key doesn't seem to work properly on Spark 1.0.0 > A minimal example: > case class P(name:String) > val ps = Array(P("alice"), P("bob"), P("charly"), P("bob")) > sc.parallelize(ps).map(x=> (x,1)).reduceByKey((x,y) => x+y).collect > [Spark shell local mode] res : Array[(P, Int)] = Array((P(bob),1), > (P(bob),1), (P(abe),1), (P(charly),1)) > In contrast to the expected behavior, that should be equivalent to: > sc.parallelize(ps).map(x=> (x.name,1)).reduceByKey((x,y) => x+y).collect > Array[(String, Int)] = Array((charly,1), (abe,1), (bob,2)) > groupByKey and distinct also present the same behavior. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143898#comment-14143898 ] RJ Nowling commented on SPARK-3614: --- It could lead to over-fitting and thus mis-predictions. In such cases, it may be valuable to exclude overly-specific terms. > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
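The formula in the description, with the proposed cutoff applied, can be expressed as a small plain-Scala sketch; this is not MLlib's IDF implementation.

{code}
// A plain-Scala sketch of the formula in the description with the proposed cutoff
// applied -- not MLlib's IDF code. Terms appearing in fewer than minimumOccurance
// documents get an IDF of 0 and so drop out of TF-IDF vectors.
def idf(numDocs: Long, docFreq: Long, minimumOccurance: Long): Double =
  if (docFreq >= minimumOccurance) math.log((numDocs + 1.0) / (docFreq + 1.0)) else 0.0

// Example: a term seen in only 2 of 1000 documents is ignored with a cutoff of 5.
val kept    = idf(numDocs = 1000, docFreq = 50, minimumOccurance = 5)  // ~ log(1001/51)
val dropped = idf(numDocs = 1000, docFreq = 2,  minimumOccurance = 5)  // 0.0
{code}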
[jira] [Created] (SPARK-3649) ClassCastException in GraphX custom serializers when sort-based shuffle spills
Ankur Dave created SPARK-3649: - Summary: ClassCastException in GraphX custom serializers when sort-based shuffle spills Key: SPARK-3649 URL: https://issues.apache.org/jira/browse/SPARK-3649 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave As [reported|http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501] on the mailing list, GraphX throws {code} java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2 at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39) at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329) {code} when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption|https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3648) Provide a script for fetching remote PR's for review
[ https://issues.apache.org/jira/browse/SPARK-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3648: --- Issue Type: New Feature (was: Bug) > Provide a script for fetching remote PR's for review > > > Key: SPARK-3648 > URL: https://issues.apache.org/jira/browse/SPARK-3648 > Project: Spark > Issue Type: New Feature > Components: Project Infra >Reporter: Patrick Wendell >Assignee: Patrick Wendell > > I've found it's useful to have a small utility script for fetching specific > pull requests locally when doing reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3648) Provide a script for fetching remote PR's for review
Patrick Wendell created SPARK-3648: -- Summary: Provide a script for fetching remote PR's for review Key: SPARK-3648 URL: https://issues.apache.org/jira/browse/SPARK-3648 Project: Spark Issue Type: Bug Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell I've found it's useful to have a small utility script for fetching specific pull requests locally when doing reviews. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-2848: -- Fix Version/s: (was: 1.1.0) 1.2.0 > Shade Guava in Spark deliverables > - > > Key: SPARK-2848 > URL: https://issues.apache.org/jira/browse/SPARK-2848 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.2.0 > > > As discussed in SPARK-2420, this task covers the work of shading Guava in > Spark deliverables so that they don't conflict with the Hadoop classpath (nor > user's classpath). > Since one Guava class is exposed through Spark's API, that class will be > forked from 14.0.1 (current version used by Spark) and excluded from any > shading. > The end result is that Spark's Guava won't be exposed to users anymore. This > has the side-effect of effectively downgrading to version 11 (the one used by > Hadoop) for those that do not explicitly depend on / package Guava with their > apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143885#comment-14143885 ] Marcelo Vanzin commented on SPARK-2848: --- Yes, that's right, this was pushed onto master after 1.1 branched. > Shade Guava in Spark deliverables > - > > Key: SPARK-2848 > URL: https://issues.apache.org/jira/browse/SPARK-2848 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.2.0 > > > As discussed in SPARK-2420, this task covers the work of shading Guava in > Spark deliverables so that they don't conflict with the Hadoop classpath (nor > user's classpath). > Since one Guava class is exposed through Spark's API, that class will be > forked from 14.0.1 (current version used by Spark) and excluded from any > shading. > The end result is that Spark's Guava won't be exposed to users anymore. This > has the side-effect of effectively downgrading to version 11 (the one used by > Hadoop) for those that do not explicitly depend on / package Guava with their > apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143887#comment-14143887 ] Liquan Pei commented on SPARK-3614: --- To me, the fewer documents a term appears in, the more important the idf part of tf*idf. Why would we ignore these terms in the idf computation? Is there a use case? Thanks! > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2848) Shade Guava in Spark deliverables
[ https://issues.apache.org/jira/browse/SPARK-2848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143877#comment-14143877 ] Thomas Graves commented on SPARK-2848: -- [~vanzin] [~pwendell] I think the fix version on this is wrong, I don't see this in 1.1.0, I only see it in 1.2.0, can you confirm? > Shade Guava in Spark deliverables > - > > Key: SPARK-2848 > URL: https://issues.apache.org/jira/browse/SPARK-2848 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 1.1.0 > > > As discussed in SPARK-2420, this task covers the work of shading Guava in > Spark deliverables so that they don't conflict with the Hadoop classpath (nor > user's classpath). > Since one Guava class is exposed through Spark's API, that class will be > forked from 14.0.1 (current version used by Spark) and excluded from any > shading. > The end result is that Spark's Guava won't be exposed to users anymore. This > has the side-effect of effectively downgrading to version 11 (the one used by > Hadoop) for those that do not explicitly depend on / package Guava with their > apps. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143835#comment-14143835 ] Sean Owen commented on SPARK-3431: -- For your experiments, scalatest just copies an old subset of surefire's config: http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin vs http://maven.apache.org/surefire/maven-surefire-plugin/test-mojo.html You can see discussion of how forkMode works: http://maven.apache.org/surefire/maven-surefire-plugin/examples/fork-options-and-parallel-execution.html Bad news is that scalatest's support is much more limited, but parallel=true and forkMode=once might do the trick. Otherwise... I guess we can figure out if it's realistic to use standard surefire instead of scalatest. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143820#comment-14143820 ] Apache Spark commented on SPARK-3614: - User 'rnowling' has created a pull request for this issue: https://github.com/apache/spark/pull/2494 > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143814#comment-14143814 ] Mark Hamstra commented on SPARK-2321: - Which would be kind of the opposite half of the SparkListenerJobStart event, which includes an array of the StageIds in a Job. I included that way back when as a suggestion of at least some of what might be needed to implement better job-based progress reporting. I'd have to look, but I don't believe anything is actually using that stage-reporting on JobStart right now. In any event, any proper progress reporting should rationalize, extend or eliminate that part of SparkListenerJobStart. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result
[ https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph E. Gonzalez resolved SPARK-3578. --- Resolution: Fixed Fix Version/s: 1.2.0 Resolved by https://github.com/apache/spark/pull/2439 > GraphGenerators.sampleLogNormal sometimes returns too-large result > -- > > Key: SPARK-3578 > URL: https://issues.apache.org/jira/browse/SPARK-3578 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 1.2.0 >Reporter: Ankur Dave >Assignee: Ankur Dave >Priority: Minor > Fix For: 1.2.0 > > > GraphGenerators.sampleLogNormal is supposed to return an integer strictly > less than maxVal. However, it violates this guarantee. It generates its > return value as follows: > {code} > var X: Double = maxVal > while (X >= maxVal) { > val Z = rand.nextGaussian() > X = math.exp(mu + sigma*Z) > } > math.round(X.toFloat) > {code} > When X is sampled to be close to (but less than) maxVal, then it will pass > the while loop condition, but the rounded result will be equal to maxVal, > which will fail the test. > For example, if maxVal is 5 and X is 4.9, then X < maxVal, but > math.round(X.toFloat) is 5. > A solution is to round X down instead of to the nearest integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
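The fix direction described in the report (rounding down rather than to the nearest integer) can be sketched as follows; this is an illustration, not necessarily the exact code merged in pull request 2439.

{code}
// A sketch of the fix direction described above: round down instead of to the nearest
// integer. With flooring, an X of 4.9 and a maxVal of 5 yields 4, so the result stays
// strictly below maxVal.
import java.util.Random

def sampleLogNormal(mu: Double, sigma: Double, maxVal: Int, rand: Random): Int = {
  var x: Double = maxVal
  while (x >= maxVal) {
    val z = rand.nextGaussian()
    x = math.exp(mu + sigma * z)
  }
  math.floor(x).toInt
}
{code}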
[jira] [Commented] (SPARK-3647) Shaded Guava patch causes access issues with package private classes
[ https://issues.apache.org/jira/browse/SPARK-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143796#comment-14143796 ] Marcelo Vanzin commented on SPARK-3647: --- There are two options I see here: - extend the hack to also not relocate the affected classes (Absent and Present should be enough) - fork some code from Guava and modify it to avoid the issue. I'll go out on a limb and say the first option is easier. > Shaded Guava patch causes access issues with package private classes > > > Key: SPARK-3647 > URL: https://issues.apache.org/jira/browse/SPARK-3647 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Critical > > The patch that introduced shading to Guava (SPARK-2848) tried to maintain > backwards compatibility in the Java API by not relocating the "Optional" > class. That causes problems when that class references package private > members in the Absent and Present classes, which are now in a different > package: > {noformat} > Exception in thread "main" java.lang.IllegalAccessError: tried to access > class org.spark-project.guava.common.base.Present from class > com.google.common.base.Optional > at com.google.common.base.Optional.of(Optional.java:86) > at > org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25) > at > org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
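The visibility rule behind the IllegalAccessError can be illustrated with a minimal Scala analogue of Java's package-private access; this is not Guava's source, and the package names are made up.

{code}
// A minimal analogue (not Guava's source) of the rule behind the error: Present's
// constructor is only visible inside its own package, so an Optional left in the
// original, unrelocated package can no longer construct the relocated Present. At
// runtime the same rule surfaces as IllegalAccessError because Optional was compiled
// against the pre-relocation package name.
package org.sparkproject.shaded {
  class Present[T] private[shaded] (val ref: T)   // package-scoped constructor
}

package com.example.unshaded {
  object Optional {
    // Does not compile here -- the constructor is private to org.sparkproject.shaded:
    // def of[T](ref: T) = new org.sparkproject.shaded.Present[T](ref)
  }
}
{code}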
[jira] [Created] (SPARK-3647) Shaded Guava patch causes access issues with package private classes
Marcelo Vanzin created SPARK-3647: - Summary: Shaded Guava patch causes access issues with package private classes Key: SPARK-3647 URL: https://issues.apache.org/jira/browse/SPARK-3647 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Marcelo Vanzin Priority: Critical The patch that introduced shading to Guava (SPARK-2848) tried to maintain backwards compatibility in the Java API by not relocating the "Optional" class. That causes problems when that class references package private members in the Absent and Present classes, which are now in a different package: {noformat} Exception in thread "main" java.lang.IllegalAccessError: tried to access class org.spark-project.guava.common.base.Present from class com.google.common.base.Optional at com.google.common.base.Optional.of(Optional.java:86) at org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:25) at org.apache.spark.api.java.JavaSparkContext.getSparkHome(JavaSparkContext.scala:542) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143792#comment-14143792 ] Aaron Davidson commented on SPARK-3032: --- [~matei] any thoughts on this issue? > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > with (String, String) as the key and value data types, we always hit this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > It seems the current partitionKeyComparator, which uses the hashcode of the String > as the key comparator, breaks some sorting contracts. > I also tested using Int as the key data type, and the test passes, since the > hashcode of an Int is itself. So I think partitionDiff + hashcode of String may > potentially break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
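As a generic illustration of what "Comparison method violates its general contract" means (and explicitly not Spark's partitionKeyComparator): TimSort, used by java.util.Arrays.sort and by Spark's Sorter, throws when it detects that a comparator is not a consistent total order.

{code}
// A generic, deliberately broken comparator illustrating the error message above.
// This is not Spark's partitionKeyComparator.
import java.util.{Arrays, Comparator}

val inconsistent = new Comparator[Integer] {
  // Broken on purpose: the sign of compare(a, b) is not always the negation of
  // compare(b, a), which violates the Comparator contract.
  override def compare(a: Integer, b: Integer): Int =
    if ((a + b) % 3 == 0) 1 else a.compareTo(b)
}

val data: Array[Integer] = Array.tabulate(100000)(i => Integer.valueOf((i * 31) % 977))
// May throw "Comparison method violates its general contract!" -- TimSort only detects
// the violation when a merge happens to expose it.
Arrays.sort(data, inconsistent)
{code}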
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143775#comment-14143775 ] Hari Shreedharan commented on SPARK-3129: - I did multiple rounds of testing and it looks like the average total rate for writing and flushing is around 100 MB/s. There are a couple of outliers, but that is likely due to flakey networking on EC2. Barring the one outlier, the least I got was 79 MB/s and the max was 142 MB/s, but most were near 100. > Prevent data loss in Spark Streaming > > > Key: SPARK-3129 > URL: https://issues.apache.org/jira/browse/SPARK-3129 > Project: Spark > Issue Type: New Feature > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf > > > Spark Streaming can lose small amounts of data when the driver goes down - and the > sending system cannot re-send the data (or the data has already expired on > the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143750#comment-14143750 ] Yin Huai commented on SPARK-3641: - No, I have not started. I can start after your caching stuff is in. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143744#comment-14143744 ] Nicholas Chammas commented on SPARK-3431: - I see. I'll try to look into it then. I don't know much about Maven, frankly, but this sounds doable for the relative n00b. Since for starters we're just gonna try parallelizing the execution of entire test suites, we may not need to make many modifications to the tests upfront. We'll see. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143737#comment-14143737 ] Michael Armbrust commented on SPARK-3641: - Hey [~yhuai] have you started on this yet? I think the addition of a logical plan for existing RDD is going to conflict with some work on caching that I'm doing. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3270) Spark API for Application Extensions
[ https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143732#comment-14143732 ] Michal Malohlava commented on SPARK-3270: - Hi Patrick, you are right - in the case of independent components, we can initialize them lazily with a task. Nevertheless, if all components inside all Executors need to share common knowledge, then lazy initialization is a little bit cumbersome. In this JIRA, we do not want to propose any heavy-weight generic discovery system, but just a lightweight way of running code inside the Spark infrastructure without modifying Spark core code (I would compare it to Linux kernel drivers). > Spark API for Application Extensions > > > Key: SPARK-3270 > URL: https://issues.apache.org/jira/browse/SPARK-3270 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Michal Malohlava > > Any application should be able to enrich the Spark infrastructure with services > which are not available by default. > Hence, to support such application extensions (aka "extensions"/"plugins") > Spark platform should provide: > - an API to register an extension > - an API to register a "service" (meaning provided functionality) > - well-defined points in Spark infrastructure which can be enriched/hooked > by an extension > - a way of deploying an extension (for example, simply putting the extension > on the classpath and using a Java service interface) > - a way to access an extension from the application > Overall proposal is available here: > https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing > Note: In this context, I do not mean reinventing OSGi (or another plugin > platform) but it can serve as a good starting point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
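The "Java service interface" deployment style mentioned in the proposal can be sketched with java.util.ServiceLoader; the SparkExtension trait below is hypothetical and not part of any proposed Spark API.

{code}
// A sketch of the "Java service interface" deployment style. The SparkExtension trait
// is hypothetical; implementations would be listed on the classpath under
// META-INF/services/<fully.qualified.SparkExtension> and discovered via ServiceLoader.
import java.util.ServiceLoader
import scala.collection.JavaConverters._

trait SparkExtension {
  def start(): Unit
}

def loadExtensions(): Seq[SparkExtension] =
  ServiceLoader.load(classOf[SparkExtension]).asScala.toSeq

// At a well-defined hook point inside the Spark infrastructure:
loadExtensions().foreach(_.start())
{code}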
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143724#comment-14143724 ] Sean Owen commented on SPARK-3431: -- It's trivial to configure Maven surefire/failsafe to execute tests in parallel. It can parallelize by class or method, fork or not, control the number of concurrent forks as a multiple of cores, etc. For example, it's no problem to make test classes use their own JVM, and not even reuse JVMs if you don't want. The harder part is making the tests play nice with each other on one machine when it comes to shared resources: files and ports, really. I think the tests have had several passes of improvements to reliably use their own temp space, and try to use an unused port, but this is one typical cause of test breakage. It's not yet clear that tests don't clobber each other by trying to use the same default Spark working dir or something. Finally, some tests that depend on a certain sequence of random numbers may need to be made more robust. But the parallelization is trivial in Maven, at least. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries
[ https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143720#comment-14143720 ] Nicholas Chammas commented on SPARK-2870: - [~marmbrus] - API-wise, how are you thinking of exposing this functionality when implemented? Would it make sense to add an additional input parameter like {{sampleFraction}} to {{SQLContext.inferSchema()}}? So, for example, if you want the inference to run on the whole RDD, you pass {{sampleFraction=1.0}}. And if you don't specify this parameter, it defaults to a very small fraction, or maybe even the current behavior of looking at just the first element. This could perhaps call {{RDD.sample()}} under the sheets. > Thorough schema inference directly on RDDs of Python dictionaries > - > > Key: SPARK-2870 > URL: https://issues.apache.org/jira/browse/SPARK-2870 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Reporter: Nicholas Chammas > > h4. Background > I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. > They process JSON text directly and infer a schema that covers the entire > source data set. > This is very important with semi-structured data like JSON since individual > elements in the data set are free to have different structures. Matching > fields across elements may even have different value types. > For example: > {code} > {"a": 5} > {"a": "cow"} > {code} > To get a queryable schema that covers the whole data set, you need to infer a > schema by looking at the whole data set. The aforementioned > {{SQLContext.json...()}} methods do this very well. > h4. Feature Request > What we need is for {{SQlContext.inferSchema()}} to do this, too. > Alternatively, we need a new {{SQLContext}} method that works on RDDs of > Python dictionaries and does something functionally equivalent to this: > {code} > SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x))) > {code} > As of 1.0.2, > [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema] > just looks at the first element in the data set. This won't help much when > the structure of the elements in the target RDD is variable. > h4. Example Use Case > * You have some JSON text data that you want to analyze using Spark SQL. > * You would use one of the {{SQLContext.json...()}} methods, but you need to > do some filtering on the data first to remove bad elements--basically, some > minimal schema validation. > * You deserialize the JSON objects to Python {{dict}} s and filter out the > bad ones. You now have an RDD of dictionaries. > * From this RDD, you want a SchemaRDD that captures the schema for the whole > data set. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143716#comment-14143716 ] Evan Chan commented on SPARK-3298: -- Sounds good, thanks! -Evan "Never doubt that a small group of thoughtful, committed citizens can change the world" - M. Mead > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3610) History server log name should not be based on user input
[ https://issues.apache.org/jira/browse/SPARK-3610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143709#comment-14143709 ] Andrew Or commented on SPARK-3610: -- Hi all, I don't have the time to fix this, but this is where we generate the name for these event log files: https://github.com/apache/spark/blob/56dae30ca70489a62686cb245728b09b2179bb5a/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L61 I think we should try to keep the name of the application so the user can still associate which logs are with which application, and coming up with a random GUID makes this difficult. Maybe instead we should just escape more characters (there are only so many). > History server log name should not be based on user input > - > > Key: SPARK-3610 > URL: https://issues.apache.org/jira/browse/SPARK-3610 > Project: Spark > Issue Type: Bug >Affects Versions: 1.1.0 >Reporter: SK >Priority: Critical > > Right now we use the user-defined application name when creating the logging > file for the history server. We should use some type of GUID generated from > inside of Spark instead of allowing user input here. It can cause errors if > users provide characters that are not valid in filesystem paths. > Original bug report: > {quote} > The default log files for the MLlib examples use a rather long naming > convention that includes special characters like parentheses and commas. For > example, one of my log files is named > "binaryclassifier-with-params(input.txt,100,1.0,svm,l2,0.1)-1410566770032". > When I click on the program on the history server page (at port 18080) to > view the detailed application logs, the history server crashes and I need to > restart it. I am using Spark 1.1 on a Mesos cluster. > I renamed the log file by removing the special characters and then it loads > up correctly. I am not sure which program is creating the log files. Can it > be changed so that the default log file naming convention does not include > special characters? > {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
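The "escape more characters" option could look roughly like the hypothetical sketch below; this is not the actual EventLoggingListener code.

{code}
// A hypothetical sketch of the "escape more characters" option, not the actual
// EventLoggingListener code: keep the app name recognizable but replace anything
// unsafe in a filesystem path, then append the timestamp as before.
def sanitizedLogName(appName: String, timestamp: Long): String = {
  val safeName = appName.toLowerCase.replaceAll("[^a-z0-9._-]", "-")
  s"$safeName-$timestamp"
}

// e.g. the parentheses and commas in the problematic
// "binaryclassifier-with-params(...)" name from the report all become dashes.
{code}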
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143707#comment-14143707 ] Nicholas Chammas commented on SPARK-3431: - {quote} Do you know how maven / sbt plugins handle this? {quote} Not really. What I can do for starters is just experiment with GNU parallel and see how it works. {quote} The GNU parallel approach ... has the nice advantage of only affecting Jenkins {quote} Well, if we are modifying {{dev/run-tests}} then developers should also be able to use it locally. The contributing guide recommends running tests using that script. If we do go the GNU parallel route, we can have it trigger only if it detects GNU parallel on the host. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3646) Copy SQL options from the spark context
[ https://issues.apache.org/jira/browse/SPARK-3646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143693#comment-14143693 ] Apache Spark commented on SPARK-3646: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2493 > Copy SQL options from the spark context > --- > > Key: SPARK-3646 > URL: https://issues.apache.org/jira/browse/SPARK-3646 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143684#comment-14143684 ] Josh Rosen commented on SPARK-3431: --- [~nchammas] I'm not sure. The different test suites depend on the same build artifacts, but it looks like we call {{sbt assembly}} before running any tests. The GNU parallel approach would certainly be easy to implement and it has the nice advantage of only affecting Jenkins, but I have one concern about test reporting. How will output from tests be printed and will the test report XML files be generated at the same locations? It might be confusing to see the output of several test suites interleaved in an arbitrary way. Do you know how maven / sbt plugins handle this? > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3646) Copy SQL options from the spark context
Michael Armbrust created SPARK-3646: --- Summary: Copy SQL options from the spark context Key: SPARK-3646 URL: https://issues.apache.org/jira/browse/SPARK-3646 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2062) VertexRDD.apply does not use the mergeFunc
[ https://issues.apache.org/jira/browse/SPARK-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave resolved SPARK-2062. --- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Assignee: Larry Xiao (was: Ankur Dave) Resolved by https://github.com/apache/spark/pull/1903 > VertexRDD.apply does not use the mergeFunc > -- > > Key: SPARK-2062 > URL: https://issues.apache.org/jira/browse/SPARK-2062 > Project: Spark > Issue Type: Bug > Components: GraphX >Reporter: Ankur Dave >Assignee: Larry Xiao > Fix For: 1.1.1, 1.2.0 > > > Here: > https://github.com/apache/spark/blob/b1feb60209174433262de2a26d39616ba00edcc8/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L410 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
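As an aside for readers of this thread, the merge semantics the {{mergeFunc}} parameter is meant to provide can be expressed with plain RDD operations. The sketch below only illustrates those semantics: it uses a bare {{RDD[(Long, VD)]}} rather than GraphX's VertexRDD internals, and the helper name is invented.
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

// Combine duplicate vertex entries with a user-supplied merge function
// instead of dropping one of them arbitrarily.
def dedupVertices[VD: ClassTag](
    vertices: RDD[(Long, VD)],
    mergeFunc: (VD, VD) => VD): RDD[(Long, VD)] =
  vertices.reduceByKey(mergeFunc)
{code}
For example, {{dedupVertices(verts, (a: Int, b: Int) => a + b)}} sums the attributes of duplicate vertex ids, which is the behavior the {{mergeFunc}} argument is expected to give.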
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143656#comment-14143656 ] Nicholas Chammas commented on SPARK-3431: - [~joshrosen] I can take a crack at this in the next week or so if it's a simple matter of breaking up [this line|https://github.com/apache/spark/blob/56dae30ca70489a62686cb245728b09b2179bb5a/dev/run-tests#L170] into several invocations of {{sbt}} and parallelizing them with [GNU parallel|http://www.gnu.org/software/parallel/]. Would that work? I remember on the dev list we were discussing using some plugin to Maven to parallelize tests, but I don't know much about that at this time. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3645) Make caching using SQL commands eager by default, with the option of being lazy
Michael Armbrust created SPARK-3645: --- Summary: Make caching using SQL commands eager by default, with the option of being lazy Key: SPARK-3645 URL: https://issues.apache.org/jira/browse/SPARK-3645 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3634) Python modules added through addPyFile should take precedence over system modules
[ https://issues.apache.org/jira/browse/SPARK-3634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143641#comment-14143641 ] Apache Spark commented on SPARK-3634: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/2492 > Python modules added through addPyFile should take precedence over system > modules > - > > Key: SPARK-3634 > URL: https://issues.apache.org/jira/browse/SPARK-3634 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 1.0.2, 1.1.0 >Reporter: Josh Rosen > > Python modules added through {{SparkContext.addPyFile()}} are currently > _appended_ to {{sys.path}}; this is probably the opposite of the behavior > that we want, since it causes system versions of modules to take precedence > over versions explicitly added by users. > To fix this, we should change the {{sys.path}} manipulation code in > {{context.py}} and {{worker.py}} to prepend files to {{sys.path}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143637#comment-14143637 ] Michael Armbrust commented on SPARK-3298: - I think the plan here is to add an allowExisting flag to registerTempTable that checks to see if the table exists and throws an exception. This flag will default to false. I'll add this as part of the work I'm going to fix our caching behavior. > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
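For readers following along, a minimal sketch of the proposed check. Note that {{allowExisting}} is not an existing parameter of {{registerTempTable}} in Spark 1.1; the helper and its in-memory registry below are invented purely to make the intended control flow concrete.
{code}
import scala.collection.mutable

object TempTableGuard {
  private val registered = mutable.Set[String]()

  // With allowExisting = false (the proposed default), registering a name
  // that is already taken fails loudly instead of silently shadowing the
  // old table.
  def register(name: String, doRegister: () => Unit,
      allowExisting: Boolean = false): Unit = {
    if (!allowExisting && registered.contains(name)) {
      throw new IllegalStateException(s"Temporary table '$name' already exists")
    }
    doRegister()        // e.g. () => schemaRDD.registerTempTable(name)
    registered += name
  }
}
{code}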
[jira] [Assigned] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3298: --- Assignee: Michael Armbrust > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3298) [SQL] registerAsTable / registerTempTable overwrites old tables
[ https://issues.apache.org/jira/browse/SPARK-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3298: Target Version/s: 1.2.0 > [SQL] registerAsTable / registerTempTable overwrites old tables > --- > > Key: SPARK-3298 > URL: https://issues.apache.org/jira/browse/SPARK-3298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2 >Reporter: Evan Chan >Assignee: Michael Armbrust >Priority: Minor > Labels: newbie > > At least in Spark 1.0.2, calling registerAsTable("a") when "a" had been > registered before does not cause an error. However, there is no way to > access the old table, even though it may be cached and taking up space. > How about at least throwing an error? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3644: -- Description: This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, how the API will be documented and tested, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger was: This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Reporter: Josh Rosen > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, how > the API will be documented and tested, etc. > Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
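To make the discussion slightly more concrete, here is one possible, purely illustrative shape for a response model; the endpoint path and every field name below are assumptions for the sake of example, not part of any agreed-upon design.
{code}
// e.g. GET /api/applications/{appId}/jobs  (hypothetical endpoint)
case class StageSummary(stageId: Int, name: String, status: String)

case class JobSummary(
    jobId: Int,
    status: String,                 // e.g. "RUNNING", "SUCCEEDED", "FAILED"
    numTasks: Int,
    numCompletedTasks: Int,
    stages: Seq[StageSummary])      // stages contributing to this job
{code}
Questions such as field naming ({{numTasks}} vs {{num_tasks}}) and how the JSON layout is versioned and documented are exactly the points the discussion above is meant to settle.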
[jira] [Updated] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
[ https://issues.apache.org/jira/browse/SPARK-3644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3644: -- Assignee: (was: Josh Rosen) > REST API for Spark application info (jobs / stages / tasks / storage info) > -- > > Key: SPARK-3644 > URL: https://issues.apache.org/jira/browse/SPARK-3644 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Reporter: Josh Rosen > > This JIRA is a forum to draft a design proposal for a REST interface for > accessing information about Spark applications, such as job / stage / task / > storage status. > There have been a number of proposals to serve JSON representations of the > information displayed in Spark's web UI. Given that we might redesign the > pages of the web UI (and possibly re-implement the UI as a client of a REST > API), the API endpoints and their responses should be independent of what we > choose to display on particular web UI pages / layouts. > Let's start a discussion of what a good REST API would look like from > first-principles. We can discuss what urls / endpoints expose access to > data, how our JSON responses will be formatted, how fields will be named, etc. > Some links for inspiration: > https://developer.github.com/v3/ > http://developer.netflix.com/docs/REST_API_Reference > https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3644) REST API for Spark application info (jobs / stages / tasks / storage info)
Josh Rosen created SPARK-3644: - Summary: REST API for Spark application info (jobs / stages / tasks / storage info) Key: SPARK-3644 URL: https://issues.apache.org/jira/browse/SPARK-3644 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Reporter: Josh Rosen This JIRA is a forum to draft a design proposal for a REST interface for accessing information about Spark applications, such as job / stage / task / storage status. There have been a number of proposals to serve JSON representations of the information displayed in Spark's web UI. Given that we might redesign the pages of the web UI (and possibly re-implement the UI as a client of a REST API), the API endpoints and their responses should be independent of what we choose to display on particular web UI pages / layouts. Let's start a discussion of what a good REST API would look like from first-principles. We can discuss what urls / endpoints expose access to data, how our JSON responses will be formatted, how fields will be named, etc. Some links for inspiration: https://developer.github.com/v3/ http://developer.netflix.com/docs/REST_API_Reference https://helloreverb.com/developers/swagger -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3643) Add cluster-specific config settings to configuration page
Matei Zaharia created SPARK-3643: Summary: Add cluster-specific config settings to configuration page Key: SPARK-3643 URL: https://issues.apache.org/jira/browse/SPARK-3643 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Matei Zaharia This would make it easier to search a single page for these options. The downside is that we'd have to maintain them in 2 places (cluster-specific pages and this one). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2373) RDD add span function (split an RDD to two RDD based on user's function)
[ https://issues.apache.org/jira/browse/SPARK-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2373. --- Resolution: Won't Fix Resolving this as "Won't Fix", per discussion on the PR. [Matei said|https://github.com/apache/spark/pull/1306#issuecomment-53838250]: {quote} IMO this is too specialized to include. It's small enough that applications can do it themselves, but also fairly confusing unless your RDD is already sorted in some way. I think we should just leave it for applications to do it. If you are doing a skewed join operator for example, you can do it within the implementation of that but not show it to the user. {quote} > RDD add span function (split an RDD to two RDD based on user's function) > - > > Key: SPARK-2373 > URL: https://issues.apache.org/jira/browse/SPARK-2373 > Project: Spark > Issue Type: New Feature >Reporter: Yanjie Gao > > Splits this RDD into a prefix/suffix pair according to a predicate . > returns > a pair consisting of the longest prefix of this RDD whose elements all > satisfy p, and the rest of this list. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
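For anyone who needs this in an application, the "do it yourself" version Matei refers to can be as small as the sketch below (the helper name is illustrative). Note the caveat from the discussion: this matches a true prefix/suffix split only when the elements satisfying the predicate actually form a prefix, e.g. the RDD is sorted on something the predicate is monotone in.
{code}
import org.apache.spark.rdd.RDD

// Two passes over the data: the part that satisfies p and the rest.
def spanRDD[T](rdd: RDD[T], p: T => Boolean): (RDD[T], RDD[T]) =
  (rdd.filter(p), rdd.filter(x => !p(x)))
{code}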
[jira] [Updated] (SPARK-3629) Improvements to YARN doc
[ https://issues.apache.org/jira/browse/SPARK-3629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-3629: - Labels: starter (was: ) > Improvements to YARN doc > > > Key: SPARK-3629 > URL: https://issues.apache.org/jira/browse/SPARK-3629 > Project: Spark > Issue Type: Documentation > Components: Documentation, YARN >Reporter: Matei Zaharia > Labels: starter > > Right now this doc starts off with a big list of config options, and only > then tells you how to submit an app. It would be better to put that part and > the packaging part first, and the config options only at the end. > In addition, the doc mentions yarn-cluster vs yarn-client as separate > masters, which is inconsistent with the help output from spark-submit (which > says to always use "yarn"). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143566#comment-14143566 ] Yin Huai commented on SPARK-3641: - Sounds good. Let me fix it. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3431) Parallelize execution of tests
[ https://issues.apache.org/jira/browse/SPARK-3431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143559#comment-14143559 ] Josh Rosen commented on SPARK-3431: --- It would be great to address this soon, since several open PRs plan to add expensive new test suites (Hive integration tests, Selenium tests for the web UI, etc.). There are some thread-safety issues when running multiple SparkContexts in the same JVM, so for now we're restricted to running one test suite per JVM. However, I think we should be able to parallelize the execution of tests from different subprojects, e.g. by running Spark SQL tests in parallel with Spark Streaming tests (each using its own JVM). Our Jenkins cluster is pretty underutilized, so I don't think this will cause problems. We also recently increased the file descriptor ulimits, so this shouldn't cause any issues with port exhaustion, etc. > Parallelize execution of tests > -- > > Key: SPARK-3431 > URL: https://issues.apache.org/jira/browse/SPARK-3431 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Nicholas Chammas > > Running all the tests in {{dev/run-tests}} takes up to 2 hours. A common > strategy to cut test time down is to parallelize the execution of the tests. > Doing that may in turn require some prerequisite changes to be made to how > certain tests run. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143540#comment-14143540 ] Michael Armbrust commented on SPARK-3641: - The idea here is to be able to support more than one SQL context, so I think we will always need to populate this field before constructing physical operators. To avoid bugs like this, it would be good to limit the number of places where physical plans are constructed. Right now it's kind of a hack that we use SparkLogicalPlan as a connector and manually create the physical ExistingRDD operator. If we instead had a true logical concept for ExistingRDDs then this bug would not have occurred. > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
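For context, the pattern under discussion is a thread-local "current context" that must be populated before any object capturing it is constructed. The sketch below is a generic illustration with invented names, not Spark SQL's actual code.
{code}
object CurrentContext {
  private val holder = new ThreadLocal[Option[String]] {
    override def initialValue(): Option[String] = None
  }
  def set(ctx: String): Unit = holder.set(Some(ctx))
  def get(): String =
    holder.get().getOrElse(sys.error("current context was never populated on this thread"))
}

// Any constructor that captures the context must run after set(); a code path
// that forgets the set() call fails at construction time, which mirrors the
// NPE reported in this issue.
class PlanNode {
  val context: String = CurrentContext.get()
}
{code}
Populating the thread-local once, in the context's own constructor as proposed elsewhere in this thread, would remove the need for every entry point to remember the call.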
[jira] [Commented] (SPARK-1655) In naive Bayes, store conditional probabilities distributively.
[ https://issues.apache.org/jira/browse/SPARK-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143550#comment-14143550 ] Apache Spark commented on SPARK-1655: - User 'staple' has created a pull request for this issue: https://github.com/apache/spark/pull/2491 > In naive Bayes, store conditional probabilities distributively. > --- > > Key: SPARK-1655 > URL: https://issues.apache.org/jira/browse/SPARK-1655 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Xiangrui Meng > > In the current implementation, we collect all conditional probabilities to > the driver node. When there are many labels and many features, this puts > heavy load on the driver. For scalability, we should provide a way to store > conditional probabilities distributively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
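A sketch of the idea in the description, not of MLlib's implementation: aggregate per-label feature counts with {{reduceByKey}} and keep the resulting conditional log-probabilities as an RDD keyed by label instead of collecting them to the driver. The input layout (label, per-feature counts) and the Laplace smoothing are assumptions for illustration.
{code}
import org.apache.spark.SparkContext._   // pair-RDD implicits in Spark 1.x
import org.apache.spark.rdd.RDD

def conditionalLogProbs(
    data: RDD[(Double, Array[Double])]): RDD[(Double, Array[Double])] =
  data
    // sum feature counts per label, still distributed
    .reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y })
    // turn counts into smoothed conditional log-probabilities, one record per label
    .mapValues { counts =>
      val total = counts.sum
      counts.map(c => math.log((c + 1.0) / (total + counts.length)))
    }
{code}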
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143517#comment-14143517 ] Adam Kawa commented on SPARK-3561: -- We also would be very interested in trying this out (especially for large, batch applications that we wish to run on Spark). > Native Hadoop/YARN integration for batch/ETL workloads > -- > > Key: SPARK-3561 > URL: https://issues.apache.org/jira/browse/SPARK-3561 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Oleg Zhurakousky > Labels: features > Fix For: 1.2.0 > > Attachments: SPARK-3561.pdf > > > Currently Spark provides integration with external resource-managers such as > Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the > current architecture of Spark-on-YARN can be enhanced to provide > significantly better utilization of cluster resources for large scale, batch > and/or ETL applications when run alongside other applications (Spark and > others) and services in YARN. > Proposal: > The proposed approach would introduce a pluggable JobExecutionContext (trait) > - a gateway and a delegate to Hadoop execution environment - as a non-public > api (@DeveloperAPI) not exposed to end users of Spark. > The trait will define 4 only operations: > * hadoopFile > * newAPIHadoopFile > * broadcast > * runJob > Each method directly maps to the corresponding methods in current version of > SparkContext. JobExecutionContext implementation will be accessed by > SparkContext via master URL as > "execution-context:foo.bar.MyJobExecutionContext" with default implementation > containing the existing code from SparkContext, thus allowing current > (corresponding) methods of SparkContext to delegate to such implementation. > An integrator will now have an option to provide custom implementation of > DefaultExecutionContext by either implementing it from scratch or extending > form DefaultExecutionContext. > Please see the attached design doc for more details. > Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
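A rough sketch of the trait described above, just to visualize the shape of the proposal. The signatures are deliberately simplified stand-ins; the real ones mirror the corresponding SparkContext methods and live in the attached design doc and the pull request.
{code}
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Pluggable gateway to the execution environment; the default implementation
// would keep SparkContext's existing behavior.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T],
      func: Iterator[T] => U): Array[U]
}
{code}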
[jira] [Comment Edited] (SPARK-3614) Filter on minimum occurrences of a term in IDF
[ https://issues.apache.org/jira/browse/SPARK-3614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14142860#comment-14142860 ] RJ Nowling edited comment on SPARK-3614 at 9/22/14 5:52 PM: Thanks, Andrew! I'll do that. was (Author: rnowling): Thanks, Andrew! I'll do that. -- em rnowl...@gmail.com c 954.496.2314 > Filter on minimum occurrences of a term in IDF > --- > > Key: SPARK-3614 > URL: https://issues.apache.org/jira/browse/SPARK-3614 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Jatinpreet Singh >Assignee: RJ Nowling >Priority: Minor > Labels: TFIDF > > The IDF class in MLlib does not provide the capability of defining a minimum > number of documents a term should appear in the corpus. The idea is to have a > cutoff variable which defines this minimum occurrence value, and the terms > which have lower frequency are ignored. > Mathematically, > IDF(t,D)=log( (|D|+1)/(DF(t,D)+1) ), for DF(t,D) >=minimumOccurance > where, > D is the total number of documents in the corpus > DF(t,D) is the number of documents that contain the term t > minimumOccurance is the minimum number of documents the term appears in the > document corpus > This would have an impact on accuracy as terms that appear in less than a > certain limit of documents, have low or no importance in TFIDF vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
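The proposed cutoff from the issue description can be stated compactly in code. This is only a sketch of the formula; {{minDocFreq}} is my name for the reporter's {{minimumOccurance}} parameter, not an existing MLlib API.
{code}
// docFreq(i) = number of documents containing term i; numDocs = |D|
def idfWithCutoff(docFreq: Array[Long], numDocs: Long,
    minDocFreq: Long): Array[Double] =
  docFreq.map { df =>
    if (df >= minDocFreq) math.log((numDocs + 1.0) / (df + 1.0)) else 0.0
  }
{code}
Terms seen in fewer than {{minDocFreq}} documents contribute nothing to the TF-IDF vectors, which is the effect the reporter is after.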
[jira] [Commented] (SPARK-3627) spark on yarn reports success even though job fails
[ https://issues.apache.org/jira/browse/SPARK-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143523#comment-14143523 ] Thomas Graves commented on SPARK-3627: -- this might be the same as SPARK-3293 > spark on yarn reports success even though job fails > --- > > Key: SPARK-3627 > URL: https://issues.apache.org/jira/browse/SPARK-3627 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a wordcount and saving the output to hdfs. If the output > directory already exists, yarn reports success even though the job fails > since it requires the output directory to not be there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3631) Add docs for checkpoint usage
[ https://issues.apache.org/jira/browse/SPARK-3631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143484#comment-14143484 ] Burak Yavuz commented on SPARK-3631: Thanks for setting this up [~aash]! [~pwendell], [~tdas], [~joshrosen] could you please confirm/correct/add to my explanation above. Thanks! > Add docs for checkpoint usage > - > > Key: SPARK-3631 > URL: https://issues.apache.org/jira/browse/SPARK-3631 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.1.0 >Reporter: Andrew Ash >Assignee: Andrew Ash > > We should include general documentation on using checkpoints. Right now the > docs only cover checkpoints in the Spark Streaming use case which is slightly > different from Core. > Some content to consider for inclusion from [~brkyvz]: > {quote} > If you set the checkpointing directory however, the intermediate state of the > RDDs will be saved in HDFS, and the lineage will pick off from there. > You won't need to keep the shuffle data before the checkpointed state, > therefore those can be safely removed (will be removed automatically). > However, checkpoint must be called explicitly as in > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala#L291 > ,just setting the directory will not be enough. > {quote} > {quote} > Yes, writing to HDFS is more expensive, but I feel it is still a small price > to pay when compared to having a Disk Space Full error three hours in > and having to start from scratch. > The main goal of checkpointing is to truncate the lineage. Clearing up > shuffle writes come as a bonus to checkpointing, it is not the main goal. The > subtlety here is that .checkpoint() is just like .cache(). Until you call an > action, nothing happens. Therefore, if you're going to do 1000 maps in a > row and you don't want to checkpoint in the meantime until a shuffle happens, > you will still get a StackOverflowError, because the lineage is too long. > I went through some of the code for checkpointing. As far as I can tell, it > materializes the data in HDFS, and resets all its dependencies, so you start > a fresh lineage. My understanding would be that checkpointing still should be > done every N operations to reset the lineage. However, an action must be > performed before the lineage grows too long. > {quote} > A good place to put this information would be at > https://spark.apache.org/docs/latest/programming-guide.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
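A minimal usage example of the points quoted above, assuming an existing SparkContext {{sc}} and an HDFS path the application can write to (the path below is only an example).
{code}
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

val rdd = sc.parallelize(1 to 1000).map(_ * 2)
rdd.checkpoint()   // like cache(), this only marks the RDD; nothing runs yet
rdd.count()        // an action materializes the checkpoint and truncates the lineage
{code}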
[jira] [Updated] (SPARK-3588) Gaussian Mixture Model clustering
[ https://issues.apache.org/jira/browse/SPARK-3588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3588: -- Assignee: Meethu Mathew > Gaussian Mixture Model clustering > - > > Key: SPARK-3588 > URL: https://issues.apache.org/jira/browse/SPARK-3588 > Project: Spark > Issue Type: New Feature > Components: MLlib, PySpark >Reporter: Meethu Mathew >Assignee: Meethu Mathew > Attachments: GMMSpark.py > > > Gaussian Mixture Models (GMM) is a popular technique for soft clustering. GMM > models the entire data set as a finite mixture of Gaussian distributions, each > parameterized by a mean vector µ, a covariance matrix ∑ and a mixture weight > π. In this technique, probability of each point to belong to each cluster is > computed along with the cluster statistics. > We have come up with an initial distributed implementation of GMM in pyspark > where the parameters are estimated using the Expectation-Maximization > algorithm. Our current implementation considers diagonal covariance matrix for > each component. > We did an initial benchmark study on a 2 node Spark standalone cluster setup > where each node config is 8 Cores, 8 GB RAM, the spark version used is 1.0.0. > We also evaluated python version of k-means available in spark on the same > datasets. > Below are the results from this benchmark study. The reported stats are > average from 10 runs. Tests were done on multiple datasets with varying number > of features and instances.
> || Instances || Dimensions || GMM: avg time per iteration || GMM: time for 100 iterations || Kmeans (Python): avg time per iteration || Kmeans (Python): time for 100 iterations ||
> | 0.7 million | 13 | 7s | 12min | 13s | 26min |
> | 1.8 million | 11 | 17s | 29min | 33s | 53min |
> | 10 million | 16 | 1.6min | 2.7hr | 1.2min | 2hr |
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1475) Drain event logging queue before stopping event logger
[ https://issues.apache.org/jira/browse/SPARK-1475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-1475: - Summary: Drain event logging queue before stopping event logger (was: Draining event logging queue before stopping event logger) > Drain event logging queue before stopping event logger > -- > > Key: SPARK-1475 > URL: https://issues.apache.org/jira/browse/SPARK-1475 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Kan Zhang >Assignee: Kan Zhang >Priority: Blocker > Fix For: 1.0.0 > > > When stopping SparkListenerBus, its event queue needs to be drained. And this > needs to happen before event logger is stopped. Otherwise, any event still > waiting to be processed in the queue may be lost and consequently event log > file may be incomplete. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
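The ordering constraint in the description is the classic "drain before stop" pattern. The toy sketch below is a generic illustration with invented names, not Spark's listener-bus code.
{code}
import java.util.concurrent.LinkedBlockingQueue

class ToyEventLogger {
  def log(event: String): Unit = println(event)
  def stop(): Unit = ()          // e.g. flush and close the log file
}

def shutdown(queue: LinkedBlockingQueue[String], logger: ToyEventLogger): Unit = {
  // Drain first: every event still queued is handed to the logger...
  var event = queue.poll()
  while (event != null) {
    logger.log(event)
    event = queue.poll()
  }
  // ...and only then is the logger stopped, so no queued event is lost.
  logger.stop()
}
{code}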
[jira] [Commented] (SPARK-3627) spark on yarn reports success even though job fails
[ https://issues.apache.org/jira/browse/SPARK-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143438#comment-14143438 ] Thomas Graves commented on SPARK-3627: -- We could make this a separate issue, but I've also seen it report failure when it actually succeeded. In that case I believe it did an sc.stop() and System.exit(0). > spark on yarn reports success even though job fails > --- > > Key: SPARK-3627 > URL: https://issues.apache.org/jira/browse/SPARK-3627 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.2.0 >Reporter: Thomas Graves >Priority: Critical > > I was running a wordcount and saving the output to hdfs. If the output > directory already exists, yarn reports success even though the job fails > since it requires the output directory to not be there. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting & event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143425#comment-14143425 ] Josh Rosen commented on SPARK-2321: --- {quote} ... maybe we should redesign the SparkListener event API, and add job id info into Stage/Task event in Scheduler before post it to listener bus. {quote} A stage may be used by multiple jobs, so we'd have to think carefully about how the API should reflect this. It looks like DAGScheduler's internal {{Stage}} class tracks the id of the job that first submitted the stage, and {{activeJobForStage}} finds "the earliest-created active job that needs the stage." It might make sense to associate Stage/Task start events with the list of active jobs that depend on them. > Design a proper progress reporting & event listener API > --- > > Key: SPARK-2321 > URL: https://issues.apache.org/jira/browse/SPARK-2321 > Project: Spark > Issue Type: Improvement > Components: Java API, Spark Core >Affects Versions: 1.0.0 >Reporter: Reynold Xin >Assignee: Josh Rosen >Priority: Critical > > This is a ticket to track progress on redesigning the SparkListener and > JobProgressListener API. > There are multiple problems with the current design, including: > 0. I'm not sure if the API is usable in Java (there are at least some enums > we used in Scala and a bunch of case classes that might complicate things). > 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of > attention to it yet. Something as important as progress reporting deserves a > more stable API. > 2. There is no easy way to connect jobs with stages. Similarly, there is no > easy way to connect job groups with jobs / stages. > 3. JobProgressListener itself has no encapsulation at all. States can be > arbitrarily mutated by external programs. Variable names are sort of randomly > decided and inconsistent. > We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
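One hypothetical way to reflect the "stage shared by multiple jobs" point in an event type; these names are illustrative only and are not part of the existing SparkListener API.
{code}
// A stage-level event carries every active job that currently needs the stage,
// rather than a single job id.
case class StageSubmitted(stageId: Int, attemptId: Int, activeJobIds: Seq[Int])
case class StageCompleted(stageId: Int, attemptId: Int, activeJobIds: Seq[Int])
{code}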
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Description: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} This limit is too strict , This makes it difficult to implement SPARK-3623 . was: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} SPARK-3623 > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} > This limit is too strict , This makes it difficult to implement SPARK-3623 . -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3642) Better document the nuances of shared variables
[ https://issues.apache.org/jira/browse/SPARK-3642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143423#comment-14143423 ] Apache Spark commented on SPARK-3642: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/2490 > Better document the nuances of shared variables > --- > > Key: SPARK-3642 > URL: https://issues.apache.org/jira/browse/SPARK-3642 > Project: Spark > Issue Type: Improvement > Components: Documentation >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Description: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} SPARK-3623 was: The reproduce code: {code} sc.setCheckpointDir(checkpointDir) val c = sc.parallelize((1 to 1000)).map(_ + 1) c.count val dep = c.dependencies.head.rdd c.checkpoint() c.count assert(dep != c.dependencies.head.rdd) {code} > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} > SPARK-3623 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143410#comment-14143410 ] Guoqiang Li commented on SPARK-3625: OK, it has been changed to an improvement. This limit is too strict; SPARK-3623 depends on relaxing it. > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Priority: Major (was: Blocker) > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3625) In some cases, the RDD.checkpoint does not work
[ https://issues.apache.org/jira/browse/SPARK-3625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-3625: --- Issue Type: Improvement (was: Bug) > In some cases, the RDD.checkpoint does not work > --- > > Key: SPARK-3625 > URL: https://issues.apache.org/jira/browse/SPARK-3625 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.2, 1.1.0 >Reporter: Guoqiang Li >Assignee: Guoqiang Li >Priority: Blocker > > The reproduce code: > {code} > sc.setCheckpointDir(checkpointDir) > val c = sc.parallelize((1 to 1000)).map(_ + 1) > c.count > val dep = c.dependencies.head.rdd > c.checkpoint() > c.count > assert(dep != c.dependencies.head.rdd) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3642) Better document the nuances of shared variables
Sandy Ryza created SPARK-3642: - Summary: Better document the nuances of shared variables Key: SPARK-3642 URL: https://issues.apache.org/jira/browse/SPARK-3642 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3641) Correctly populate SparkPlan.currentContext
[ https://issues.apache.org/jira/browse/SPARK-3641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143347#comment-14143347 ] Yin Huai commented on SPARK-3641: - [~marmbrus] Can we populate SparkPlan.currentContext in the constructor of SQLContext instead of populating it every time before using ExistingRDD? > Correctly populate SparkPlan.currentContext > --- > > Key: SPARK-3641 > URL: https://issues.apache.org/jira/browse/SPARK-3641 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Yin Huai >Priority: Critical > > After creating a new SQLContext, we need to populate SparkPlan.currentContext > before we create any SparkPlan. Right now, only SQLContext.createSchemaRDD > populate SparkPlan.currentContext. SQLContext.applySchema is missing this > call and we can have NPE as described in > http://qnalist.com/questions/5162981/spark-sql-1-1-0-npe-when-join-two-cached-table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org