[jira] [Updated] (SPARK-3562) Periodic cleanup event logs
[ https://issues.apache.org/jira/browse/SPARK-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xukun updated SPARK-3562: - Summary: Periodic cleanup event logs (was: Periodic cleanup)

Periodic cleanup event logs --- Key: SPARK-3562 URL: https://issues.apache.org/jira/browse/SPARK-3562 Project: Spark Issue Type: New Feature Components: core Affects Versions: 1.1.0 Reporter: xukun

If we run Spark applications frequently, many event logs are written into spark.eventLog.dir. After a long time, spark.eventLog.dir holds many event logs that we no longer care about. A periodic cleanup would ensure that logs older than a configured duration are forgotten, so there is no need to clean the logs up by hand.
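A minimal sketch of the requested behavior, assuming a Hadoop-backed event log directory; the names (EventLogCleaner, maxAgeMs) and the modification-time heuristic are illustrative assumptions, not the patch attached to this issue:

{code}
import java.util.concurrent.{Executors, TimeUnit}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object EventLogCleaner {
  // Periodically delete event logs whose modification time is older than maxAgeMs.
  def schedule(eventLogDir: String, maxAgeMs: Long, intervalMs: Long): Unit = {
    val fs = FileSystem.get(new Configuration())
    val executor = Executors.newSingleThreadScheduledExecutor()
    executor.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = {
        val cutoff = System.currentTimeMillis() - maxAgeMs
        fs.listStatus(new Path(eventLogDir))
          .filter(_.getModificationTime < cutoff) // older than the retention window
          .foreach(status => fs.delete(status.getPath, true))
      }
    }, intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }
}
{code}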
[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shenhong updated SPARK-3563: Affects Version/s: 1.0.2

Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong
[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shenhong updated SPARK-3563: Description: In our cluster, when we run a spark streaming job, after it has been running for many hours, not all of the shuffle data seems to be cleaned; here is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
The shuffle data of stages 63, 65, 84, 132... is not cleaned.

Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong
[jira] [Updated] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shenhong updated SPARK-3563: Description: In our cluster, when we run a spark streaming job, after it has been running for many hours, not all of the shuffle data seems to be cleaned; here is the shuffle data:
-rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
-rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
-rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
-rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
-rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
-rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
-rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
-rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
-rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
-rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
-rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
-rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
-rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
The shuffle data of stages 63, 65, 84, 132... is not cleaned.

ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and Broadcast of interest, to be processed when the associated object goes out of scope of the application. Actual cleanup is performed in a separate daemon thread. There must be some remaining reference to the ShuffleDependency, and it is hard to find.
Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong
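The cleanup mechanism described in the updated description can be sketched as follows. This is a simplified illustration of the weak-reference pattern (names simplified, not Spark source): the reference queue yields a reference only after the GC has decided its referent is unreachable, which is why shuffle files can linger for as long as something still holds the ShuffleDependency.

{code}
import java.lang.ref.{ReferenceQueue, WeakReference}
import java.util.concurrent.ConcurrentLinkedQueue

class CleanerSketch {
  private val queue = new ReferenceQueue[AnyRef]
  // Hold the WeakReference objects themselves so they are not collected.
  private val buffer = new ConcurrentLinkedQueue[WeakReference[AnyRef]]

  def register(obj: AnyRef): Unit =
    buffer.add(new WeakReference[AnyRef](obj, queue))

  // Body of the daemon cleaning thread: blocks until the GC decides a
  // registered object (e.g. a ShuffleDependency) is unreachable.
  def keepCleaning(doCleanup: () => Unit): Unit =
    while (true) {
      val ref = queue.remove() // returns only after a referent is collected
      buffer.remove(ref)
      doCleanup()              // e.g. delete that shuffle's files
    }
}
{code}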
[jira] [Commented] (SPARK-3560) In yarn-cluster mode, jars are distributed through multiple mechanisms.
[ https://issues.apache.org/jira/browse/SPARK-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136882#comment-14136882 ] Sandy Ryza commented on SPARK-3560: --- Right. I believe Min from LinkedIn who discovered the issue is planning to submit a patch.

In yarn-cluster mode, jars are distributed through multiple mechanisms. --- Key: SPARK-3560 URL: https://issues.apache.org/jira/browse/SPARK-3560 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.0 Reporter: Sandy Ryza Priority: Critical

In yarn-cluster mode, jars given to spark-submit's --jars argument should be distributed to executors through the distributed cache, not through fetching. Currently, Spark tries to distribute the jars both ways, which can cause executor errors related to trying to overwrite symlinks without write permissions. It looks like this was introduced by SPARK-2260, which sets spark.jars in yarn-cluster mode. Setting spark.jars is necessary for standalone cluster deploy mode, but harmful for yarn cluster deploy mode.
[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136899#comment-14136899 ] Sean Owen commented on SPARK-3563: -- I am no expert, but I believe this is on purpose, in order to reuse the shuffle if the RDD partition needs to be recomputed. You may need to set a lower spark.cleaner.ttl?

Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong
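As a usage note on the suggestion above: spark.cleaner.ttl is a real Spark 1.x setting, given in seconds on the SparkConf; data older than the TTL is forgotten. The one-hour value below is only an example, not a recommendation.

{code}
import org.apache.spark.SparkConf

// Forget metadata and shuffle data older than one hour (value is illustrative).
val conf = new SparkConf()
  .setAppName("long-running-streaming-job")
  .set("spark.cleaner.ttl", "3600")
{code}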
[jira] [Commented] (SPARK-3550) Disable automatic rdd caching in python api for relevant learners
[ https://issues.apache.org/jira/browse/SPARK-3550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136907#comment-14136907 ] Apache Spark commented on SPARK-3550: - User 'OdinLin' has created a pull request for this issue: https://github.com/apache/spark/pull/2423

Disable automatic rdd caching in python api for relevant learners - Key: SPARK-3550 URL: https://issues.apache.org/jira/browse/SPARK-3550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Reporter: Aaron Staple

The python mllib api automatically caches training rdds. However, the NaiveBayes, ALS, and DecisionTree learners do not require external caching to prevent repeated RDD re-evaluation during learning. NaiveBayes only evaluates its input RDD once, while ALS and DecisionTree internally persist transformations of their input RDDs. For these learners, we should disable the automatic caching in the python mllib api. See discussion here: https://github.com/apache/spark/pull/2362#issuecomment-55637953
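A sketch of the idea behind the change, under the assumption (from the linked discussion) that caching should happen only for learners that would otherwise re-evaluate their input; the helper name and flag are hypothetical, not the actual wrapper code:

{code}
import org.apache.spark.rdd.RDD

// Cache the training RDD only when the learner would otherwise re-evaluate it;
// NaiveBayes reads its input once, and ALS/DecisionTree persist internally.
def trainWithOptionalCache[T, M](data: RDD[T], needsCache: Boolean)(train: RDD[T] => M): M = {
  val input = if (needsCache) data.cache() else data
  try train(input)
  finally if (needsCache) input.unpersist()
}
{code}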
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136920#comment-14136920 ] Evan Chan commented on SPARK-2593: -- [~pwendell] I'd have to agree with Helena and Tupshin; I'd like to see the usage of Akka increase; it is underutilized. Also note that it's not true that Akka is only used for RPC; there are Actors used in several places. Performance and resilience would likely improve significantly in several areas with more usage of Akka; the current threads spun up by each driver for various things like a file server are quite wasteful.

Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson

As a developer I want to pass an existing ActorSystem into StreamingContext at load time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having spark's actor system on its own named dispatchers as well as exposing the new private creation of its own actor system.
[jira] [Commented] (SPARK-3564) Display App ID on HistoryPage
[ https://issues.apache.org/jira/browse/SPARK-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136929#comment-14136929 ] Apache Spark commented on SPARK-3564: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2424

Display App ID on HistoryPage - Key: SPARK-3564 URL: https://issues.apache.org/jira/browse/SPARK-3564 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta

The current HistoryPage doesn't display the App ID, so if there are lots of applications with the same name, it's difficult to find the application whose status we want to check.
[jira] [Commented] (SPARK-3566) .gitignore and .rat-excludes should consider cmd file and Emacs' backup files
[ https://issues.apache.org/jira/browse/SPARK-3566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136947#comment-14136947 ] Apache Spark commented on SPARK-3566: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2426

.gitignore and .rat-excludes should consider cmd file and Emacs' backup files - Key: SPARK-3566 URL: https://issues.apache.org/jira/browse/SPARK-3566 Project: Spark Issue Type: Improvement Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor

The current .gitignore and .rat-excludes do not cover spark-env.cmd. Also, .gitignore doesn't cover Emacs' meta files (backup files, which start and end with #, and lock files, which start with .#) even though it covers vi's meta files (*.swp).
[jira] [Commented] (SPARK-3565) make code consistent with document
[ https://issues.apache.org/jira/browse/SPARK-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136948#comment-14136948 ] Apache Spark commented on SPARK-3565: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2427

make code consistent with document -- Key: SPARK-3565 URL: https://issues.apache.org/jira/browse/SPARK-3565 Project: Spark Issue Type: Bug Components: core Reporter: WangTaoTheTonic Priority: Minor

The configuration item representing the "Default number of retries in binding to a port" is spark.ports.maxRetries in the code but spark.port.maxRetries in the document configuration.md. We need to make them consistent.

In org.apache.spark.util.Utils.scala:

{code}
/**
 * Default number of retries in binding to a port.
 */
val portMaxRetries: Int = {
  if (sys.props.contains("spark.testing")) {
    // Set a higher number of retries for tests...
    sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(100)
  } else {
    Option(SparkEnv.get)
      .flatMap(_.conf.getOption("spark.port.maxRetries"))
      .map(_.toInt)
      .getOrElse(16)
  }
}
{code}

In configuration.md:

{code}
<tr>
  <td><code>spark.port.maxRetries</code></td>
  <td>16</td>
  <td>
    Maximum number of retries when binding to a port before giving up.
  </td>
</tr>
{code}
[jira] [Commented] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs
[ https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14136999#comment-14136999 ] guowei commented on SPARK-3292: --- [~saisai_shao] I tested the scenario with windowing operators, and it seems OK. WindowedDStream computes based on the slice RDDs cached between the from and end times, so it seems it does not need to commit a job.

Shuffle Tasks run incessantly even though there's no inputs --- Key: SPARK-3292 URL: https://issues.apache.org/jira/browse/SPARK-3292 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.2 Reporter: guowei

This affects shuffle operators such as repartition, groupBy, join, and cogroup. For example, if I want to save the shuffle outputs as Hadoop files, many empty files are generated even though there are no inputs. That is too expensive.
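A sketch of the behavior the reporter is asking for, guarding the save so that batches with no input do not produce empty output files; the count()-based guard and the output path scheme are illustrative assumptions, not the fix proposed in the issue:

{code}
import org.apache.spark.streaming.dstream.DStream

def saveNonEmptyBatches(stream: DStream[String], outputDir: String): Unit =
  stream.foreachRDD { (rdd, time) =>
    if (rdd.count() > 0) { // skip batches that received no input
      rdd.saveAsTextFile(s"$outputDir/batch-${time.milliseconds}")
    }
  }
{code}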
[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137014#comment-14137014 ] shenhong commented on SPARK-3563: - Thanks, Sean Owen! I haven't set spark.cleaner.ttl; maybe it will work. But my point is why some stages' shuffle data has been cleaned while the others' has not.

Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong
[jira] [Commented] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137033#comment-14137033 ] Apache Spark commented on SPARK-3567: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2428

appId field in SparkDeploySchedulerBackend should be volatile - Key: SPARK-3567 URL: https://issues.apache.org/jira/browse/SPARK-3567 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta
[jira] [Created] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile
Kousuke Saruta created SPARK-3567: - Summary: appId field in SparkDeploySchedulerBackend should be volatile Key: SPARK-3567 URL: https://issues.apache.org/jira/browse/SPARK-3567 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta

The appId field in SparkDeploySchedulerBackend is set by AppClient.ClientActor#receiveWithLogging, and appId is read via SparkDeploySchedulerBackend#applicationId. The thread that runs AppClient.ClientActor and a thread invoking SparkDeploySchedulerBackend#applicationId can be different threads, so appId should be volatile.
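A minimal illustration of the fix described above (the class body is a placeholder, not Spark source): without @volatile, a write from the actor thread is not guaranteed to be visible to a thread calling applicationId.

{code}
class SparkDeploySchedulerBackendSketch {
  @volatile private var appId: String = _

  // Set from the AppClient.ClientActor thread.
  def connected(id: String): Unit = { appId = id }

  // Read from a different thread; @volatile guarantees it sees the write.
  def applicationId(): String = appId
}
{code}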
[jira] [Commented] (SPARK-1719) spark.executor.extraLibraryPath isn't applied on yarn
[ https://issues.apache.org/jira/browse/SPARK-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137176#comment-14137176 ] Wilfred Spiegelenburg commented on SPARK-1719: -- oops, not sure how I missed that last file change :-(

spark.executor.extraLibraryPath isn't applied on yarn - Key: SPARK-1719 URL: https://issues.apache.org/jira/browse/SPARK-1719 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 1.0.0 Reporter: Thomas Graves Assignee: Guoqiang Li Fix For: 1.2.0

Looking through the code for Spark on YARN, I don't see that spark.executor.extraLibraryPath is being properly applied when it launches executors. It is using spark.driver.libraryPath in ClientBase. Note I didn't actually test it, so it's possible I missed something. I also think it is better to use LD_LIBRARY_PATH rather than -Djava.library.path: once java.library.path is set, it doesn't search LD_LIBRARY_PATH. In Hadoop we switched to using LD_LIBRARY_PATH instead of java.library.path. See https://issues.apache.org/jira/browse/MAPREDUCE-4072. I'll split this into a separate jira.
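A hedged sketch of the approach the description favors, propagating the library path through the executor's LD_LIBRARY_PATH environment rather than -Djava.library.path; the env map and merging rule are illustrative, not the actual ClientBase code:

{code}
import scala.collection.mutable

def addExtraLibraryPath(env: mutable.Map[String, String], extraPath: String): Unit = {
  // Prepend so the configured path wins; per the description, the JVM stops
  // consulting LD_LIBRARY_PATH once -Djava.library.path is set, hence the
  // environment-variable route.
  val existing = env.getOrElse("LD_LIBRARY_PATH", "")
  env("LD_LIBRARY_PATH") = if (existing.isEmpty) extraPath else s"$extraPath:$existing"
}
{code}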
[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137246#comment-14137246 ] Saisai Shao commented on SPARK-3563: I think it relies on the JVM's GC strategy: when an object will be GC-ed, and hence when its shuffle files will be cleaned, is unpredictable. Anyway, the shuffle files will finally be cleaned when the application exits. Just my thought, please correct me if I'm wrong.

Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137326#comment-14137326 ] Oleg Zhurakousky commented on SPARK-3561: - Patrick, thanks for following up. Indeed Spark does provide a first-class extensibility mechanism at many different levels (shuffle, RDD, readers/writers, etc.); however, we believe it is missing a crucial one, and that is the "execution context". While SparkContext itself could easily be extended or mixed in with a custom trait to achieve such customization, that is a less than ideal extension mechanism, since it would require code modification every time a user wants to swap in an execution environment (e.g., from "local" in testing to "yarn" in prod) if such an environment is not supported. In fact, Spark already supports an externally configurable model where the target execution environment is managed through the "master" URL. However, the _nature_, _implementation_ and most importantly _customization_ of these environments are internal to Spark.

{code}
master match {
  case "yarn-client" => ...
  case mesosUrl @ MESOS_REGEX(_) => ...
  ...
}
{code}

Furthermore, any additional integration and/or customization work that may come in the future would require modification of the above _case_ statement, which I am sure you'd agree is a less than ideal integration style, since it would require a new release of Spark every time a new _case_ statement is added. So essentially what we're proposing is to formalize what has always been supported by Spark into an externally configurable model, so that customization around the _*native functionality*_ of the target execution environment can be handled in a flexible and pluggable way. In this model we are simply proposing a variation of the "chain of responsibility" pattern, where DAG execution can be delegated to an _execution context_ with no change to end-user programs or semantics. Based on our investigation we've identified 4 core operations, which you can see in _JobExecutionContext_. Two of them provide access to source RDD creation, thus allowing customization of data _sourcing_ (custom readers, direct block access, etc.). One is for _broadcast_, to integrate with broadcast capabilities provided natively. And last but not least is the main _execution delegate_ for the job - "runJob". While I am sure there will be more questions, I hope the above response clarifies the overall intention of this proposal.

Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf

Currently Spark provides integration with external resource managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN.

Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public api (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations:
* hadoopFile
* newAPIHadoopFile
* broadcast
* runJob

Each method directly maps to the corresponding method in the current version of SparkContext.
The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have the option to provide a custom implementation of DefaultExecutionContext, by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A Pull Request will be posted shortly as well.
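A sketch of the proposed trait, assuming plausible signatures for the four operations named above; the real signatures would mirror SparkContext's corresponding methods, so the parameter lists here are simplified assumptions, not the actual proposal code:

{code}
import scala.reflect.ClassTag
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

trait JobExecutionContext {
  // Source RDD creation, allowing custom readers or direct block access.
  def hadoopFile[K, V](sc: SparkContext, path: String, keyClass: Class[K],
      valueClass: Class[V], minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V](sc: SparkContext, path: String, keyClass: Class[K],
      valueClass: Class[V]): RDD[(K, V)]

  // Integration with natively provided broadcast capabilities.
  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  // The main execution delegate for a job's DAG.
  def runJob[T, U: ClassTag](sc: SparkContext, rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U): Array[U]
}
{code}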
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137338#comment-14137338 ] Helena Edelson commented on SPARK-2593: --- [~pwendell] I forgot to note this in my reply above, but I was offering to do the work myself - or at least start it. What would be ideal is to have all of these:
- Expose the spark actor system, which also requires ensuring all spark actors are on specified dispatchers (very important)
- Optionally allow spark users to pass in their existing actor system
- Add a logical naming convention for spark streaming actors, or a function to get it

I feel these are incredibly important for users.

Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson
[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137348#comment-14137348 ] shenhong commented on SPARK-3563: - Thanks, Saisai. I think you are right; it depends on the JVM's GC. But in my streaming job there is one batch per minute and two stages per batch, including a shuffle stage. In the first hour (60 batches) the shuffle data was cleaned, but after that the shuffle data was not always cleaned. And the streaming job won't stop.

Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: core Affects Versions: 1.0.2 Reporter: shenhong
[jira] [Commented] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378 ] Oleg Zhurakousky commented on SPARK-3561: - Patrick, sorry, as I feel I missed the core emphasis of what we are trying to accomplish with this. Our main goal is to expose Spark to native Hadoop features (i.e., stateless YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities (interactive, in-memory, streaming) to batch and ETL workloads in shared, multi-tenant environments, thereby benefiting the Spark community considerably by allowing Spark to be applied to all use cases and capabilities on and in Hadoop. Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperApi) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will then have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
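[Editor's note] For readers skimming the thread, a minimal sketch of what the proposed trait could look like. The signatures below are assumptions loosely modeled on the corresponding SparkContext methods around Spark 1.1, not the contents of the attached SPARK-3561.pdf:
{code}
import scala.reflect.ClassTag
import org.apache.spark.{SparkContext, TaskContext}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Illustrative sketch only; method shapes are assumptions, not the proposal's.
trait JobExecutionContext {
  def hadoopFile[K, V](
      sc: SparkContext,
      path: String,
      inputFormatClass: Class[_ <: org.apache.hadoop.mapred.InputFormat[K, V]],
      keyClass: Class[K],
      valueClass: Class[V],
      minPartitions: Int): RDD[(K, V)]

  def newAPIHadoopFile[K, V, F <: org.apache.hadoop.mapreduce.InputFormat[K, V]](
      sc: SparkContext,
      path: String,
      fClass: Class[F],
      kClass: Class[K],
      vClass: Class[V],
      conf: org.apache.hadoop.conf.Configuration): RDD[(K, V)]

  def broadcast[T: ClassTag](sc: SparkContext, value: T): Broadcast[T]

  def runJob[T, U: ClassTag](
      sc: SparkContext,
      rdd: RDD[T],
      func: (TaskContext, Iterator[T]) => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit
}
{code}
Under this reading, a default implementation would simply carry the existing SparkContext bodies, while an integrator's implementation (e.g., a Tez-backed one) would be selected through the execution-context: master URL.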
[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137378#comment-14137378 ] Oleg Zhurakousky edited comment on SPARK-3561 at 9/17/14 3:24 PM: -- Patrick, sorry, as I feel I missed the core emphasis of what we are trying to accomplish with this. As described in the design document attached, our main goal is to expose Spark to native Hadoop features (i.e., stateless YARN shuffle, Tez, etc.), thus extending Spark's existing capabilities (interactive, in-memory, streaming) to batch and ETL workloads in shared, multi-tenant environments, thereby benefiting the Spark community considerably by allowing Spark to be applied to all use cases and capabilities on and in Hadoop. Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperApi) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL of the form execution-context:foo.bar.MyJobExecutionContext, with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will then have the option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A Pull Request will be posted shortly as well -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3177) Yarn-alpha ClientBaseSuite Unit test failed
[ https://issues.apache.org/jira/browse/SPARK-3177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-3177. -- Resolution: Fixed Fix Version/s: (was: 1.1.1) 1.2.0 Yarn-alpha ClientBaseSuite Unit test failed --- Key: SPARK-3177 URL: https://issues.apache.org/jira/browse/SPARK-3177 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.1.1 Reporter: Chester Priority: Minor Labels: test Fix For: 1.2.0 Original Estimate: 1h Remaining Estimate: 1h The yarn-alpha ClientBaseSuite unit test failed due to a difference in the MRJobConfig API between yarn-stable and yarn-alpha: the class field MRJobConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH is a String Array in yarn-alpha but a String in yarn-stable, so the method below works for yarn-stable but fails on yarn-alpha as it tries to cast a String Array to a String. val knownDefMRAppCP: Seq[String] = getFieldValue[String, Seq[String]](classOf[MRJobConfig], "DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH", Seq[String]())(a => a.split(",")) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
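[Editor's note] A hedged sketch of one way to make such a lookup shape-agnostic, using plain reflection; this is illustrative only, not the patch that resolved the ticket:
{code}
import org.apache.hadoop.mapreduce.MRJobConfig

// Read the constant reflectively and normalize both shapes
// (String on yarn-stable, Array[String] on yarn-alpha) to Seq[String].
val raw: AnyRef = classOf[MRJobConfig]
  .getField("DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH")
  .get(null)

val defaultMRAppClasspath: Seq[String] = raw match {
  case s: String        => s.split(",").toSeq // yarn-stable
  case a: Array[String] => a.toSeq            // yarn-alpha
}
{code}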
[jira] [Commented] (SPARK-2593) Add ability to pass an existing Akka ActorSystem into Spark
[ https://issues.apache.org/jira/browse/SPARK-2593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137529#comment-14137529 ] kannapiran commented on SPARK-2593: --- Is there a way to add an Akka actor system into the Spark streaming context and write the output to HDFS from Spark Streaming? I tried ActorReceiver rec = new ActorReceiver(props, "TestActor", StorageLevel.MEMORY_AND_DISK_2(), SupervisorStrategy.defaultStrategy(), null); ReceiverInputDStream<Object> receiverStream = context.receiverStream(rec, null); receiverStream.print(); But the above code doesn't stream anything from the actor system. Add ability to pass an existing Akka ActorSystem into Spark --- Key: SPARK-2593 URL: https://issues.apache.org/jira/browse/SPARK-2593 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Helena Edelson As a developer, I want to pass an existing ActorSystem into StreamingContext at load time so that I do not have 2 actor systems running on a node in an Akka application. This would mean having Spark's actor system on its own named dispatchers, as well as exposing the currently private creation of its own actor system. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
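[Editor's note] For reference, a minimal Scala sketch of the documented route (ssc.actorStream plus the ActorHelper trait, both present in the 1.x line), including one way to land output on HDFS. MyActor and the output prefix are illustrative:
{code}
import akka.actor.{Actor, Props}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.receiver.ActorHelper

// Illustrative actor: anything it receives is handed to Spark via store().
class MyActor extends Actor with ActorHelper {
  def receive = { case s: String => store(s) }
}

val ssc = new StreamingContext(sc, Seconds(10))
val stream = ssc.actorStream[String](Props[MyActor], "MyActor",
  StorageLevel.MEMORY_AND_DISK_2)
// Writes one HDFS directory per batch: hdfs:///tmp/actor-output-<batch time>
stream.saveAsTextFiles("hdfs:///tmp/actor-output")
ssc.start()
ssc.awaitTermination()
{code}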
[jira] [Updated] (SPARK-3074) support groupByKey() with hot keys in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-3074: -- Component/s: PySpark support groupByKey() with hot keys in PySpark - Key: SPARK-3074 URL: https://issues.apache.org/jira/browse/SPARK-3074 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Davies Liu Assignee: Davies Liu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3371) Spark SQL: Renaming a function expression with group by gives error
[ https://issues.apache.org/jira/browse/SPARK-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3371: Target Version/s: 1.2.0 Spark SQL: Renaming a function expression with group by gives error --- Key: SPARK-3371 URL: https://issues.apache.org/jira/browse/SPARK-3371 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Pei-Lun Lee
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val rdd = sc.parallelize(List("""{"foo":"bar"}"""))
sqlContext.jsonRDD(rdd).registerAsTable("t1")
sqlContext.registerFunction("len", (s: String) => s.length)
sqlContext.sql("select len(foo) as a, count(1) from t1 group by len(foo)").collect()
{code}
Running the above code in spark-shell gives the following error: {noformat} 14/09/03 17:20:13 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 214) org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: foo#0 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:47) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:43) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:42) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$2.apply(TreeNode.scala:201) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:199) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildrenDown(TreeNode.scala:212) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:168) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:183) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) {noformat} Removing the "as a" alias from the query causes no error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3537) Statistics for cached RDDs
[ https://issues.apache.org/jira/browse/SPARK-3537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3537: Assignee: Cheng Lian Statistics for cached RDDs -- Key: SPARK-3537 URL: https://issues.apache.org/jira/browse/SPARK-3537 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Right now we only have limited statistics for hive tables. We could easily collect this data when caching an RDD as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3377) Metrics can be accidentally aggregated against our intention
[ https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137669#comment-14137669 ] Apache Spark commented on SPARK-3377: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2432 Metrics can be accidentally aggregated against our intention Key: SPARK-3377 URL: https://issues.apache.org/jira/browse/SPARK-3377 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kousuke Saruta Priority: Critical I'm using the Codahale-based MetricsSystem of Spark with JMX or Graphite, and I have seen the following 2 problems. (1) When applications with the same spark.app.name run on the cluster at the same time, some metric names collide. For instance, if 2+ applications are running on the cluster at the same time, each application emits the same-named metric, like SparkPi.DAGScheduler.stage.failedStages, and Graphite cannot distinguish which application a metric belongs to. (2) When 2+ executors run on the same machine, the JVM metrics of the executors are mixed. For instance, 2+ executors running on the same node can emit the same-named metric, jvm.memory, and Graphite cannot distinguish which executor a metric comes from. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3568) Add metrics for ranking algorithms
Shuo Xiang created SPARK-3568: - Summary: Add metrics for ranking algorithms Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2902) Change default options to be more aggressive
[ https://issues.apache.org/jira/browse/SPARK-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2902: Summary: Change default options to be more aggressive (was: Enable compression for in-memory columnar storage by default) Change default options to be more aggressive --- Key: SPARK-2902 URL: https://issues.apache.org/jira/browse/SPARK-2902 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian Compression for in-memory columnar storage is disabled by default; it's time to enable it. Also, it helps alleviate the OOM mentioned in SPARK-2650 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2902) Change default options to be more aggressive
[ https://issues.apache.org/jira/browse/SPARK-2902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137698#comment-14137698 ] Michael Armbrust commented on SPARK-2902: - Things we might consider changing: Parquet auto conversion, increasing the batch size, turning on compression. As part of this, it would be great if we read from Spark defaults at startup so these could be turned off without recompiling. Change default options to be more aggressive --- Key: SPARK-2902 URL: https://issues.apache.org/jira/browse/SPARK-2902 Project: Spark Issue Type: Task Components: SQL Affects Versions: 1.0.1, 1.0.2 Reporter: Cheng Lian Assignee: Cheng Lian Compression for in-memory columnar storage is disabled by default; it's time to enable it. Also, it helps alleviate the OOM mentioned in SPARK-2650 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
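[Editor's note] Until defaults change, a hedged sketch of flipping these knobs per session from a shell; the property names are those believed to exist around Spark 1.1 and the values are illustrative, so treat both as assumptions:
{code}
// Illustrative only: enable in-memory columnar compression, raise the
// columnar batch size, and turn on Parquet conversion for Hive tables.
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.compressed", "true")
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")
sqlContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
{code}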
[jira] [Updated] (SPARK-2271) Use Hive's high performance Decimal128 to replace BigDecimal
[ https://issues.apache.org/jira/browse/SPARK-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2271: Target Version/s: 1.2.0 Use Hive's high performance Decimal128 to replace BigDecimal Key: SPARK-2271 URL: https://issues.apache.org/jira/browse/SPARK-2271 Project: Spark Issue Type: Improvement Components: SQL Reporter: Reynold Xin Assignee: Cheng Lian Hive JIRA: https://issues.apache.org/jira/browse/HIVE-6017 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types
[ https://issues.apache.org/jira/browse/SPARK-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2063. - Resolution: Duplicate Fix Version/s: 1.2.0 Creating a SchemaRDD via sql() does not correctly resolve nested types -- Key: SPARK-2063 URL: https://issues.apache.org/jira/browse/SPARK-2063 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Cheng Lian Fix For: 1.2.0 For example, from the typical twitter dataset: {code} scala> val popularTweets = sql("SELECT retweeted_status.text, MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30") scala> popularTweets.toString 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations 14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifiers on unresolved object, tree: 'retweeted_status.text at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51) at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251) at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64) at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) at
scala.collection.AbstractIterator.to(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray(Iterator.scala:1157) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69) at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:217) at
[jira] [Resolved] (SPARK-1694) Simplify ColumnBuilder/Accessor class hierarchy
[ https://issues.apache.org/jira/browse/SPARK-1694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-1694. - Resolution: Fixed Simplify ColumnBuilder/Accessor class hierarchy --- Key: SPARK-1694 URL: https://issues.apache.org/jira/browse/SPARK-1694 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: Cheng Lian Assignee: Cheng Lian Priority: Minor Fix For: 1.2.0 The current {{ColumnBuilder/Accessor}} class hierarchy design was largely refactored from the in-memory columnar storage component of Shark. Code related to null values and compression was factored into {{NullableColumnBuilder/Accessor}} and {{CompressibleColumnBuilder/Accessor}} and then mixed in as stackable traits. The drawbacks are: # Interactions among these classes are unnecessarily complicated and error prone. # The flexibility provided by this design now seems useless. To simplify this, we can merge {{CompressibleColumnBuilder/Accessor}} and {{NullableColumnBuilder/Accessor}} into {{NativeColumnBuilder/Accessor}}, and simply hard-code null-value processing and compression logic together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3568: - Priority: Minor (was: Major) Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Priority: Minor Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3568: - Assignee: Shuo Xiang Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137719#comment-14137719 ] Matei Zaharia commented on SPARK-3530: -- To comment on the versioning stuff here, deprecated doesn't mean unsupported, it just means we encourage using something else. So the old MLlib API will remain in 1.x, and will continue getting tested and bug-fixed, but it will not get new features. Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3553) Spark Streaming app streams files that have already been streamed in an endless loop
[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ezequiel Bella updated SPARK-3553: -- Description: We have a Spark streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM. The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) The purpose of using fileStream instead of textFileStream is to customize the way that Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, let's say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior: the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these phases endlessly with the same files. Any thoughts what can be causing this? Regards, Easyb Spark Streaming app streams files that have already been streamed in an endless loop Key: SPARK-3553 URL: https://issues.apache.org/jira/browse/SPARK-3553 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.1 Environment: EC2 cluster - YARN Reporter: Ezequiel Bella Labels: S3, Streaming, YARN We have a Spark streaming app deployed in a YARN EC2 cluster with 1 name node and 2 data nodes. We submit the app with 11 executors, each with 1 core and 588 MB of RAM.
The app streams from a directory in S3 which is constantly being written to; this is the line of code that achieves that: val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](Settings.S3RequestsHost, (f: Path) => true, true) The purpose of using fileStream instead of textFileStream is to customize the way that Spark handles existing files when the process starts. We want to process just the new files that are added after the process launches and omit the existing ones. We configured a batch duration of 10 seconds. The process goes fine while we add a small number of files to S3, let's say 4 or 5. We can see in the streaming UI how the stages are executed successfully in the executors, one for each file that is processed. But when we try to add a larger number of files, we face a strange behavior: the application starts streaming files that have already been streamed. For example, I add 20 files to S3. The files are processed in 3 batches. The first batch processes 7 files, the second 8 and the third 5. No more files are added to S3 at this point, but Spark starts repeating these phases endlessly with the same files. Any thoughts what can be causing this? Regards, Easyb -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (SPARK-3530) Pipeline and Parameters
[ https://issues.apache.org/jira/browse/SPARK-3530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137758#comment-14137758 ] Eustache commented on SPARK-3530: - Great to see the design docs! A few questions/remarks: - Big +1 for Pipeline and Dataset as first-class abstractions - as a long-time sklearn user, I find Pipelines a very convenient way to think about many problems, e.g. implementing cascades of models, integrating unsupervised steps for feature transformation into a supervised task, etc. - Isn't the "fit multiple models at once" part a bit of an early optimization? How many users would benefit from it? IMHO it complicates the API for most users. - I'm also wondering if a meta class wouldn't be capable of handling multiple models. AFAICT fitting multiple models at once resembles a parameter grid search, doesn't it? I assume the latter would return evaluation metrics for each parameter set as well as the model itself, right? - It seems to me that multi-task learning would be a good example of fitting multiple models at once, but it is maybe not typical of what most users would want. Also, I'm not 100% sure the implementation would necessarily profit from such an API Pipeline and Parameters --- Key: SPARK-3530 URL: https://issues.apache.org/jira/browse/SPARK-3530 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical This part of the design doc is for pipelines and parameters. I put the design doc at https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing I will copy the proposed interfaces to this JIRA later. Some sample code can be viewed at: https://github.com/mengxr/spark-ml/ Please help review the design and post your comments here. Thanks! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137759#comment-14137759 ] Zhan Zhang commented on SPARK-2883: --- I am starting to prototype the last feature, with OrcFile and saveAsOrcFile. Spark Support for ORCFile format Key: SPARK-2883 URL: https://issues.apache.org/jira/browse/SPARK-2883 Project: Spark Issue Type: Bug Components: Input/Output, SQL Reporter: Zhan Zhang Priority: Blocker Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 pm jobtracker.png Verify the support of OrcInputFormat in Spark, fix issues if any exist, and add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137780#comment-14137780 ] lee mighdoll commented on SPARK-2707: - Now that 1.1 is out, hopefully there will be a solution for the early 1.2 milestone builds? Perhaps offer a contract for someone to make a shader plugin for sbt. I suspect some people still need Scala 2.10.x and some need 2.11, so I'd recommend a cross build: http://www.scala-sbt.org/0.13.5/docs/Detailed-Topics/Cross-Build.html Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
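[Editor's note] For the cross build mentioned above, sbt's standard mechanism is a one-line setting; the exact Scala versions below are illustrative:
{code}
// build.sbt - publish for both Scala lines; sbt's "+compile" / "+publish"
// then run the task once per listed version.
crossScalaVersions := Seq("2.10.4", "2.11.2")
{code}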
[jira] [Created] (SPARK-3569) Add metadata field to StructField
Xiangrui Meng created SPARK-3569: Summary: Add metadata field to StructField Key: SPARK-3569 URL: https://issues.apache.org/jira/browse/SPARK-3569 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng Want to add a metadata field to StructField that can be used by other applications like ML to embed more information about the column.
{code}
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean,
    metadata: Map[String, Any] = Map.empty)
{code}
For ML, we can store feature information like categorical/continuous, number of categories, category-to-index map, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3534) Avoid running MLlib and Streaming tests when testing SQL PRs
[ https://issues.apache.org/jira/browse/SPARK-3534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-3534. - Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Nicholas Chammas Avoid running MLlib and Streaming tests when testing SQL PRs Key: SPARK-3534 URL: https://issues.apache.org/jira/browse/SPARK-3534 Project: Spark Issue Type: Bug Components: Project Infra, SQL Reporter: Michael Armbrust Assignee: Nicholas Chammas Priority: Blocker Fix For: 1.2.0 We are bumping up against the 120 minute time limit for tests pretty regularly now. Since we have decreased the number of shuffle partitions and upped the parallelism, I don't think there is much low-hanging fruit left to speed up the SQL tests. (The tests that are listed as taking 2-3 minutes are actually 100s of tests that I think are valuable.) Instead I propose we avoid running tests that we don't need to. This will have the added benefit of eliminating failures in SQL due to flaky streaming tests. Note that this won't fix the full builds that are run for every commit. There I think we can just up the test timeout. cc: [~joshrosen] [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3569) Add metadata field to StructField
[ https://issues.apache.org/jira/browse/SPARK-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3569: - Component/s: MLlib ML Add metadata field to StructField - Key: SPARK-3569 URL: https://issues.apache.org/jira/browse/SPARK-3569 Project: Spark Issue Type: New Feature Components: ML, MLlib, SQL Reporter: Xiangrui Meng Want to add a metadata field to StructField that can be used by other applications like ML to embed more information about the column.
{code}
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean,
    metadata: Map[String, Any] = Map.empty)
{code}
For ML, we can store feature information like categorical/continuous, number of categories, category-to-index map, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3570) Shuffle write time does not include time to open shuffle files
Kay Ousterhout created SPARK-3570: - Summary: Shuffle write time does not include time to open shuffle files Key: SPARK-3570 URL: https://issues.apache.org/jira/browse/SPARK-3570 Project: Spark Issue Type: Bug Affects Versions: 1.1.0, 1.0.2, 0.9.2 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Currently, the reported shuffle write time does not include time to open the shuffle files. This time can be very significant when the disk is highly utilized and many shuffle files exist on the machine (I'm not sure how severe this is in 1.0 onward -- since shuffle files are automatically deleted, this may be less of an issue because there are fewer old files sitting around). In experiments I did, in extreme cases, adding the time to open files can increase the shuffle write time from 5ms (of a 2 second task) to 1 second. We should fix this for better performance debugging. Thanks [~shivaram] for helping to diagnose this problem. cc [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3570) Shuffle write time does not include time to open shuffle files
[ https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3570: -- Attachment: 3a_1410957857_0_job_log_waterfall.pdf Shuffle write time does not include time to open shuffle files -- Key: SPARK-3570 URL: https://issues.apache.org/jira/browse/SPARK-3570 Project: Spark Issue Type: Bug Affects Versions: 0.9.2, 1.0.2, 1.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 3a_1410943402_0_job_log_waterfall.pdf, 3a_1410957857_0_job_log_waterfall.pdf Currently, the reported shuffle write time does not include time to open the shuffle files. This time can be very significant when the disk is highly utilized and many shuffle files exist on the machine (I'm not sure how severe this is in 1.0 onward -- since shuffle files are automatically deleted, this may be less of an issue because there are fewer old files sitting around). In experiments I did, in extreme cases, adding the time to open files can increase the shuffle write time from 5ms (of a 2 second task) to 1 second. We should fix this for better performance debugging. Thanks [~shivaram] for helping to diagnose this problem. cc [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3570) Shuffle write time does not include time to open shuffle files
[ https://issues.apache.org/jira/browse/SPARK-3570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3570: -- Attachment: (was: 3a_1410943402_0_job_log_waterfall.pdf) Shuffle write time does not include time to open shuffle files -- Key: SPARK-3570 URL: https://issues.apache.org/jira/browse/SPARK-3570 Project: Spark Issue Type: Bug Affects Versions: 0.9.2, 1.0.2, 1.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Attachments: 3a_1410854905_0_job_log_waterfall.pdf, 3a_1410957857_0_job_log_waterfall.pdf Currently, the reported shuffle write time does not include time to open the shuffle files. This time can be very significant when the disk is highly utilized and many shuffle files exist on the machine (I'm not sure how severe this is in 1.0 onward -- since shuffle files are automatically deleted, this may be less of an issue because there are fewer old files sitting around). In experiments I did, in extreme cases, adding the time to open files can increase the shuffle write time from 5ms (of a 2 second task) to 1 second. We should fix this for better performance debugging. Thanks [~shivaram] for helping to diagnose this problem. cc [~pwendell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3571) Spark standalone cluster mode doesn't work.
Kousuke Saruta created SPARK-3571: - Summary: Spark standalone cluster mode doesn't work. Key: SPARK-3571 URL: https://issues.apache.org/jira/browse/SPARK-3571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Blocker Recent changes to Master.scala cause Spark standalone cluster mode to stop working. I think the loop in Master#schedule never assigns a worker to a driver: startPos is captured only after curPos has been advanced, so the while condition (curPos != startPos) is false on entry and the loop body never runs.
{code}
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
  // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
  // start from the last worker that was assigned a driver, and continue onwards until we have
  // explored all alive workers.
  curPos = (curPos + 1) % aliveWorkerNum
  val startPos = curPos
  var launched = false
  while (curPos != startPos && !launched) {
    val worker = shuffledAliveWorkers(curPos)
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % aliveWorkerNum
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
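[Editor's note] One possible shape of a fix, illustrative only and not necessarily what the linked pull request does; it reuses the vars from the quoted fragment and runs the body at least once via do/while:
{code}
// Sketch: a do/while guarantees every alive worker is visited once even
// though curPos == startPos at the start of the scan.
val startPos = curPos
do {
  val worker = shuffledAliveWorkers(curPos)
  if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
    launchDriver(worker, driver)
    waitingDrivers -= driver
    launched = true
  }
  curPos = (curPos + 1) % aliveWorkerNum
} while (curPos != startPos && !launched)
{code}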
[jira] [Commented] (SPARK-3571) Spark standalone cluster mode doesn't work.
[ https://issues.apache.org/jira/browse/SPARK-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14137884#comment-14137884 ] Apache Spark commented on SPARK-3571: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2436 Spark standalone cluster mode doesn't work. --- Key: SPARK-3571 URL: https://issues.apache.org/jira/browse/SPARK-3571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Blocker Recent changes to Master.scala cause Spark standalone cluster mode to stop working. I think the loop in Master#schedule never assigns a worker to a driver: startPos is captured only after curPos has been advanced, so the while condition (curPos != startPos) is false on entry and the loop body never runs.
{code}
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
  // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
  // start from the last worker that was assigned a driver, and continue onwards until we have
  // explored all alive workers.
  curPos = (curPos + 1) % aliveWorkerNum
  val startPos = curPos
  var launched = false
  while (curPos != startPos && !launched) {
    val worker = shuffledAliveWorkers(curPos)
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % aliveWorkerNum
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3572) Support register UserType in SQL
Xiangrui Meng created SPARK-3572: Summary: Support register UserType in SQL Key: SPARK-3572 URL: https://issues.apache.org/jira/browse/SPARK-3572 Project: Spark Issue Type: New Feature Components: SQL Reporter: Xiangrui Meng If a user knows how to map a class to a struct type in Spark SQL, they should be able to register this mapping through sqlContext so that SQL can figure out the schema automatically.
{code}
trait RowSerializer[T] {
  def dataType: StructType
  def serialize(obj: T): Row
  def deserialize(row: Row): T
}
sqlContext.registerUserType[T](clazz: classOf[T], serializer: classOf[RowSerializer[T]])
{code}
In sqlContext, we can maintain a class-to-serializer map and use it for conversion. The serializer class can be embedded into the metadata, so when `select` is called, we know we want to deserialize the result.
{code}
sqlContext.registerUserType(classOf[Vector], classOf[VectorRowSerializer])
val points: RDD[LabeledPoint] = ...
val features: RDD[Vector] = points.select('features).map { case Row(v: Vector) => v }
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3573) Dataset
[ https://issues.apache.org/jira/browse/SPARK-3573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3573: - Shepherd: Michael Armbrust Dataset --- Key: SPARK-3573 URL: https://issues.apache.org/jira/browse/SPARK-3573 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Priority: Critical This JIRA is for discussion of ML dataset, essentially a SchemaRDD with extra ML-specific metadata embedded in its schema. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3574) Shuffle finish time always reported as -1
Kay Ousterhout created SPARK-3574: - Summary: Shuffle finish time always reported as -1 Key: SPARK-3574 URL: https://issues.apache.org/jira/browse/SPARK-3574 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Kay Ousterhout shuffleFinishTime is always reported as -1. I think the way we fix this should be to set the shuffleFinishTime in each ShuffleWriteMetrics as the shuffles finish, but when aggregating the metrics, only report shuffleFinishTime as something other than -1 when *all* of the shuffles have completed. [~sandyr], it looks like this was introduced in your recent patch to incrementally report metrics. Any chance you can fix this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called `RankingMetrics` under `org.apache.spark.mllib.evaluation`, which accepts input (prediction and label pairs) as `RDD[(Array[Double], Array[Double])]`. Methods `meanAveragePrecision`, `topKPrecision` and `ndcg` will be included. Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Priority: Minor Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called `RankingMetrics` under `org.apache.spark.mllib.evaluation`, which accepts input (prediction and label pairs) as `RDD[(Array[Double], Array[Double])]`. Methods `meanAveragePrecision`, `topKPrecision` and `ndcg` will be included. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[(Array[Double], Array[Double])]*. Methods *meanAveragePrecision*, *topKPrecision* and *ndcg* will be included. Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Priority: Minor Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[(Array[Double], Array[Double])]*. Methods *meanAveragePrecision*, *topKPrecision* and *ndcg* will be included. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[(Array[Double], Array[Double])]*. The following methods will be implemented: - *averagePrecision(val position: Int): Double* this is the precision@n - *meanAveragePrecision*: the average of precision@n for all values of n - *ndcg* Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Priority: Minor Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[(Array[Double], Array[Double])]*. The following methods will be implemented: - *averagePrecision(val position: Int): Double* this is the precision@n - *meanAveragePrecision*: the average of precision@n for all values of n - *ndcg* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3568) Add metrics for ranking algorithms
[ https://issues.apache.org/jira/browse/SPARK-3568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shuo Xiang updated SPARK-3568: -- Description: Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[(Array[Double], Array[Double])]*. The following methods will be implemented: - *averagePrecision(val position: Int): Double* this is the precision@position - *meanAveragePrecision*: the average of precision@n for all values of n - *ndcg* Add metrics for ranking algorithms -- Key: SPARK-3568 URL: https://issues.apache.org/jira/browse/SPARK-3568 Project: Spark Issue Type: New Feature Components: ML, MLlib Reporter: Shuo Xiang Assignee: Shuo Xiang Priority: Minor Include widely-used metrics for ranking algorithms, including: - Mean Average Precision - Precision@n: top-n precision - Discounted cumulative gain (DCG) and NDCG This implementation attempts to create a new class called *RankingMetrics* under *org.apache.spark.mllib.evaluation*, which accepts input (prediction and label pairs) as *RDD[(Array[Double], Array[Double])]*. The following methods will be implemented: - *averagePrecision(val position: Int): Double* this is the precision@position - *meanAveragePrecision*: the average of precision@n for all values of n - *ndcg* -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
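[Editor's note] As a concrete reference for the metrics being proposed, a hedged sketch of precision@k and binary-relevance NDCG for a single (predicted ranking, relevant set) pair; RankingMetrics itself would average these across the RDD, and these helper names are illustrative, not the class's final API:
{code}
// precision@k: fraction of the top k predicted items that are relevant.
def precisionAt(pred: Array[Double], relevant: Set[Double], k: Int): Double = {
  val n = math.min(k, pred.length)
  pred.take(n).count(relevant.contains).toDouble / k
}

// NDCG@k with binary relevance: DCG of the prediction divided by the DCG
// of an ideal ranking that lists relevant items first.
def ndcgAt(pred: Array[Double], relevant: Set[Double], k: Int): Double = {
  def gain(i: Int) = 1.0 / (math.log(i + 2) / math.log(2)) // rank i is 0-based
  val n = math.min(k, pred.length)
  val dcg = (0 until n).map(i => if (relevant.contains(pred(i))) gain(i) else 0.0).sum
  val idcg = (0 until math.min(relevant.size, k)).map(gain).sum
  if (idcg == 0.0) 0.0 else dcg / idcg
}
{code}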
[jira] [Updated] (SPARK-3569) Add metadata field to StructField
[ https://issues.apache.org/jira/browse/SPARK-3569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3569: - Description: Want to add a metadata field to StructField that can be used by other applications like ML to embed more information about the column.
{code}
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Map[String, Any] = Map.empty)
{code}
For ML, we can store feature information like categorical/continuous, number of categories, category-to-index map, etc. One question is how to carry over the metadata in query execution. For example:
{code}
val features = schemaRDD.select('features)
val featuresDesc = features.schema("features").metadata
{code}
was: Want to add a metadata field to StructField that can be used by other applications like ML to embed more information about the column.
{code}
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Map[String, Any] = Map.empty)
{code}
For ML, we can store feature information like categorical/continuous, number of categories, category-to-index map, etc. Add metadata field to StructField - Key: SPARK-3569 URL: https://issues.apache.org/jira/browse/SPARK-3569 Project: Spark Issue Type: New Feature Components: ML, MLlib, SQL Reporter: Xiangrui Meng Want to add a metadata field to StructField that can be used by other applications like ML to embed more information about the column.
{code}
case class StructField(name: String, dataType: DataType, nullable: Boolean, metadata: Map[String, Any] = Map.empty)
{code}
For ML, we can store feature information like categorical/continuous, number of categories, category-to-index map, etc. One question is how to carry over the metadata in query execution. For example:
{code}
val features = schemaRDD.select('features)
val featuresDesc = features.schema("features").metadata
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
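A minimal sketch of how the proposed field might be used, assuming the Map-based signature quoted above. DataType/DoubleType refer to Spark SQL's type hierarchy, and the metadata keys here are purely illustrative, not part of the proposal.

{code}
// Sketch based on the proposed signature above; metadata keys are illustrative.
case class StructField(
    name: String,
    dataType: DataType,
    nullable: Boolean,
    metadata: Map[String, Any] = Map.empty)

// ML code could then attach and read column-level information:
val field = StructField("features", DoubleType, nullable = false,
  metadata = Map("continuous" -> true, "numCategories" -> 0))
val isContinuous = field.metadata.get("continuous") == Some(true)
{code}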
[jira] [Commented] (SPARK-3051) Support looking-up named accumulators in a registry
[ https://issues.apache.org/jira/browse/SPARK-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138095#comment-14138095 ] Apache Spark commented on SPARK-3051: - User 'nfergu' has created a pull request for this issue: https://github.com/apache/spark/pull/2438 Support looking-up named accumulators in a registry --- Key: SPARK-3051 URL: https://issues.apache.org/jira/browse/SPARK-3051 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Neil Ferguson This is a proposed enhancement to Spark based on the following mailing list discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/quot-Dynamic-variables-quot-in-Spark-td7450.html. This proposal builds on SPARK-2380 (Support displaying accumulator values in the web UI) to allow named accumulables to be looked up in a registry, as opposed to having to be passed to every method that needs to access them. The use case was described well by [~shivaram], as follows: Let's say you have two functions you use in a map call and want to measure how much time each of them takes. For example, if you have a code block like the one below and you want to measure how much time f1 takes as a fraction of the task.
{noformat}
a.map { l =>
  val f = f1(l)
  ... some work here ...
}
{noformat}
It would be really cool if we could do something like
{noformat}
a.map { l =>
  val start = System.nanoTime
  val f = f1(l)
  TaskMetrics.get("f1-time").add(System.nanoTime - start)
}
{noformat}
SPARK-2380 provides a partial solution to this problem -- however the accumulables would still need to be passed to every function that needs them, which I think would be cumbersome in any application of reasonable complexity. The proposal, as suggested by [~pwendell], is to have a registry of accumulables that can be looked up by name. Regarding the implementation details, I'd propose that we broadcast a serialized version of all named accumulables in the DAGScheduler (similar to what SPARK-2521 does for Tasks). These can then be deserialized in the Executor. Accumulables are already stored in thread-local variables in the Accumulators object, so exposing these in the registry should be simply a matter of wrapping this object, and keying the accumulables by name (they are currently keyed by ID). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
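A rough sketch of what such a name-keyed registry could look like on the executor side. This is hypothetical code, not an existing Spark API; the object and method names are invented.

{code}
import org.apache.spark.Accumulable

// Hypothetical sketch, not an existing Spark API: accumulables deserialized
// on an executor go into a thread-local map keyed by name, so task code can
// look them up without threading them through every function.
object AccumulableRegistry {
  private val local = new ThreadLocal[Map[String, Accumulable[_, _]]] {
    override def initialValue(): Map[String, Accumulable[_, _]] = Map.empty
  }
  def register(name: String, acc: Accumulable[_, _]): Unit =
    local.set(local.get + (name -> acc))
  def get(name: String): Option[Accumulable[_, _]] = local.get.get(name)
}
{code}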
[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training
[ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3161: - Description: Improvement: worker computation When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level). This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level. Proposal: Cache this mapping. * Pro: O(1) lookup instead of O(level). * Con: Extra RDD which must share the same partitioning as the training data. Design: * (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2] where each instance is stored with an array of node indices (1 node per tree). * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, Array\[TreePoint\]\]\]\], where each partition stores an array of maps from node indices to an array of instances. This has more overhead in data structures but could be more efficient since not all nodes are split on each iteration. was: Improvement: worker computation When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level). This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level. Proposal: Cache this mapping. * Pro: O(1) lookup instead of O(level). * Con: Extra RDD which must share the same partitioning as the training data. Cache example-node map for DecisionTree training Key: SPARK-3161 URL: https://issues.apache.org/jira/browse/SPARK-3161 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Improvement: worker computation When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level). This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level. Proposal: Cache this mapping. * Pro: O(1) lookup instead of O(level). * Con: Extra RDD which must share the same partitioning as the training data. Design: * (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2] where each instance is stored with an array of node indices (1 node per tree). * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, Array\[TreePoint\]\]\]\], where each partition stores an array of maps from node indices to an array of instances. This has more overhead in data structures but could be more efficient since not all nodes are split on each iteration. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned
[ https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138100#comment-14138100 ] Patrick Wendell commented on SPARK-3563: Eventually the references should be garbage collected... what happens if you manually trigger GC? Does it work? It's also possible there is a bug somewhere, or that your program is storing other references to created RDDs. Shuffle data not always be cleaned -- Key: SPARK-3563 URL: https://issues.apache.org/jira/browse/SPARK-3563 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.2 Reporter: shenhong In our cluster, when we run a spark streaming job, after running for many hours, not all of the shuffle data seems to be cleaned; here is the shuffle data: -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0 -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1 -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0 -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1 -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0 -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1 -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1 -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0 -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0 -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1 -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1 -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0 -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1 The shuffle data of stage 63, 65, 84, 132... are not cleaned. ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and Broadcast of interest, to be processed when the associated object goes out of scope of the application. Actual cleanup is performed in a separate daemon thread. There must be some lingering reference to the ShuffleDependency, and it's hard to find. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
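Since the ContextCleaner is driven by weak references, shuffle files are only removed after the corresponding ShuffleDependency is actually collected on the driver, so the suggestion above can be tested directly from the driver program:

{code}
// Force a driver-side GC so unreachable RDDs/ShuffleDependencies become
// weakly reachable and the ContextCleaner can schedule their cleanup.
System.gc()
{code}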
[jira] [Updated] (SPARK-3565) make code consistent with document
[ https://issues.apache.org/jira/browse/SPARK-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3565: --- Component/s: (was: core) Spark Core make code consistent with document -- Key: SPARK-3565 URL: https://issues.apache.org/jira/browse/SPARK-3565 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor The configuration item representing the default number of retries in binding to a port is spark.ports.maxRetries in the code but spark.port.maxRetries in the document configuration.md. We need to make them consistent. In org.apache.spark.util.Utils.scala:
/**
 * Default number of retries in binding to a port.
 */
val portMaxRetries: Int = {
  if (sys.props.contains("spark.testing")) {
    // Set a higher number of retries for tests...
    sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(100)
  } else {
    Option(SparkEnv.get)
      .flatMap(_.conf.getOption("spark.port.maxRetries"))
      .map(_.toInt)
      .getOrElse(16)
  }
}
In configuration.md:
<tr>
  <td><code>spark.port.maxRetries</code></td>
  <td>16</td>
  <td>Maximum number of retries when binding to a port before giving up.</td>
</tr>
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3565) make code consistent with document
[ https://issues.apache.org/jira/browse/SPARK-3565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138101#comment-14138101 ] Patrick Wendell commented on SPARK-3565: Please use the "Spark Core" component and not "core". make code consistent with document -- Key: SPARK-3565 URL: https://issues.apache.org/jira/browse/SPARK-3565 Project: Spark Issue Type: Bug Components: Spark Core Reporter: WangTaoTheTonic Priority: Minor The configuration item representing the default number of retries in binding to a port is spark.ports.maxRetries in the code but spark.port.maxRetries in the document configuration.md. We need to make them consistent. In org.apache.spark.util.Utils.scala:
/**
 * Default number of retries in binding to a port.
 */
val portMaxRetries: Int = {
  if (sys.props.contains("spark.testing")) {
    // Set a higher number of retries for tests...
    sys.props.get("spark.port.maxRetries").map(_.toInt).getOrElse(100)
  } else {
    Option(SparkEnv.get)
      .flatMap(_.conf.getOption("spark.port.maxRetries"))
      .map(_.toInt)
      .getOrElse(16)
  }
}
In configuration.md:
<tr>
  <td><code>spark.port.maxRetries</code></td>
  <td>16</td>
  <td>Maximum number of retries when binding to a port before giving up.</td>
</tr>
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
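Whichever spelling wins, reading the value through SparkConf with an explicit default keeps the code and the docs aligned. A sketch assuming the documented key is kept and a SparkConf instance named conf is in scope:

{code}
// Sketch assuming the documented key "spark.port.maxRetries" is the one kept.
val portMaxRetries: Int = conf.getInt("spark.port.maxRetries", 16)
{code}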
[jira] [Updated] (SPARK-3161) Cache example-node map for DecisionTree training
[ https://issues.apache.org/jira/browse/SPARK-3161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3161: - Description: Improvement: worker computation When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level). This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level. Proposal: Cache this mapping. * Pro: O(1) lookup instead of O(level). * Con: Extra RDD which must share the same partitioning as the training data. Design: * (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2] where each instance is stored with an array of node indices (1 node per tree). * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, Array\[TreePoint\]\]\]\], where each partition stores an array of maps from node indices to an array of instances. This has more overhead in data structures but could be more efficient: not all nodes are split on each iteration, and this would allow each executor to ignore instances which are not used for the current node set. was: Improvement: worker computation When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level). This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level. Proposal: Cache this mapping. * Pro: O(1) lookup instead of O(level). * Con: Extra RDD which must share the same partitioning as the training data. Design: * (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2] where each instance is stored with an array of node indices (1 node per tree). * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, Array\[TreePoint\]\]\]\], where each partition stores an array of maps from node indices to an array of instances. This has more overhead in data structures but could be more efficient since not all nodes are split on each iteration. Cache example-node map for DecisionTree training Key: SPARK-3161 URL: https://issues.apache.org/jira/browse/SPARK-3161 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor Improvement: worker computation When training each level of a DecisionTree, each example needs to be mapped to a node in the current level (or to none if it does not reach that level). This is currently done via the function predictNodeIndex(), which traces from the current tree’s root node to the given level. Proposal: Cache this mapping. * Pro: O(1) lookup instead of O(level). * Con: Extra RDD which must share the same partitioning as the training data. Design: * (option 1) This could be done as in [Sequoia Forests | https://github.com/AlpineNow/SparkML2] where each instance is stored with an array of node indices (1 node per tree). * (option 2) This could also be done by storing an RDD\[Array\[Map\[Int, Array\[TreePoint\]\]\]\], where each partition stores an array of maps from node indices to an array of instances. This has more overhead in data structures but could be more efficient: not all nodes are split on each iteration, and this would allow each executor to ignore instances which are not used for the current node set. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
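A sketch of the option-2 layout described above. TreePoint is MLlib's internal training representation; every other name below is invented for illustration and is not part of the proposal.

{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.tree.impl.TreePoint // MLlib-internal representation

object NodeCacheSketch {
  type NodeIndex = Int
  // One map per tree: node index -> instances currently assigned to that node.
  type PartitionNodeMap = Array[Map[NodeIndex, Array[TreePoint]]]

  // Co-partitioned with the training data; rebuilt as nodes are split, so
  // executors can skip instances not assigned to the current node set.
  def assignToNodes(points: RDD[TreePoint]): RDD[PartitionNodeMap] = ???
}
{code}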
[jira] [Resolved] (SPARK-901) UISuite "jetty port increases under contention" fails if startPort is in use
[ https://issues.apache.org/jira/browse/SPARK-901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-901. --- Resolution: Fixed This is fixed by SPARK-3555 since we no longer choose a specific starting port. UISuite "jetty port increases under contention" fails if startPort is in use Key: SPARK-901 URL: https://issues.apache.org/jira/browse/SPARK-901 Project: Spark Issue Type: Bug Components: Build, Spark Core, Web UI Affects Versions: 0.8.0 Reporter: Mark Hamstra Priority: Minor Recent change of startPort to 3030 conflicts with the IANA assignment for arepa-cas. If 3030 is already in use, the UISuite fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1739) Close PR's after 30 days of inactivity
[ https://issues.apache.org/jira/browse/SPARK-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-1739. Resolution: Won't Fix We've introduced a different mechanism for manual closing, so I don't think this is necessary. Close PR's after 30 days of inactivity -- Key: SPARK-1739 URL: https://issues.apache.org/jira/browse/SPARK-1739 Project: Spark Issue Type: Task Components: Project Infra Reporter: Patrick Wendell Assignee: Patrick Wendell Fix For: 1.2.0 Sometimes PRs get abandoned if people aren't responsive to feedback or the work just falls to a lower priority. We should automatically close stale PRs in order to keep the queue from growing infinitely. I think we just want to do this with a friendly message that says "This seems inactive, please re-open this if you are interested in contributing the patch." -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3575: Component/s: SQL Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian This can cause problems when, for example, one of the columns is defined as TINYINT. A class cast exception will be thrown since the Parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
Michael Armbrust created SPARK-3575: --- Summary: Hive Schema is ignored when using convertMetastoreParquet Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Reporter: Michael Armbrust Assignee: Cheng Lian This can cause problems when, for example, one of the columns is defined as TINYINT. A class cast exception will be thrown since the Parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3575) Hive Schema is ignored when using convertMetastoreParquet
[ https://issues.apache.org/jira/browse/SPARK-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-3575: Target Version/s: 1.2.0 Hive Schema is ignored when using convertMetastoreParquet - Key: SPARK-3575 URL: https://issues.apache.org/jira/browse/SPARK-3575 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian This can cause problems when, for example, one of the columns is defined as TINYINT. A class cast exception will be thrown since the Parquet table scan produces INTs while the rest of the execution is expecting bytes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
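An illustrative reproduction of the type mismatch, under the assumption that a metastore table like the one below exists (table and column names are invented for this sketch):

{code}
// Hypothetical reproduction; table/column names are invented. The metastore
// declares "b" as TINYINT (Byte), but with the Parquet conversion path
// enabled the scan yields Int values while the rest of the plan expects
// Byte, which can surface as a ClassCastException.
hiveContext.sql("CREATE TABLE t (b TINYINT) STORED AS PARQUET")
hiveContext.sql("SELECT b FROM t WHERE b < 10").collect()
{code}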
[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138130#comment-14138130 ] Sean McNamara edited comment on SPARK-3561 at 9/17/14 10:45 PM: We have some workload use-cases and this would potentially be very helpful in lieu of SPARK-3174. was (Author: seanmcn): We have some workload use-cases and would potentially be very helpful in lieu of SPARK-3174. Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as "execution-context:foo.bar.MyJobExecutionContext", with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have an option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3561) Native Hadoop/YARN integration for batch/ETL workloads
[ https://issues.apache.org/jira/browse/SPARK-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138130#comment-14138130 ] Sean McNamara edited comment on SPARK-3561 at 9/17/14 10:47 PM: We have some workload use-cases and this would potentially be very helpful, especially in lieu of SPARK-3174. was (Author: seanmcn): We have some workload use-cases and this would potentially be very helpful in lieu of SPARK-3174. Native Hadoop/YARN integration for batch/ETL workloads -- Key: SPARK-3561 URL: https://issues.apache.org/jira/browse/SPARK-3561 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0 Reporter: Oleg Zhurakousky Labels: features Fix For: 1.2.0 Attachments: SPARK-3561.pdf Currently Spark provides integration with external resource-managers such as Apache Hadoop YARN, Mesos, etc. Specifically in the context of YARN, the current architecture of Spark-on-YARN can be enhanced to provide significantly better utilization of cluster resources for large scale, batch and/or ETL applications when run alongside other applications (Spark and others) and services in YARN. Proposal: The proposed approach would introduce a pluggable JobExecutionContext (trait) - a gateway and a delegate to the Hadoop execution environment - as a non-public API (@DeveloperAPI) not exposed to end users of Spark. The trait will define only 4 operations: * hadoopFile * newAPIHadoopFile * broadcast * runJob Each method directly maps to the corresponding method in the current version of SparkContext. The JobExecutionContext implementation will be accessed by SparkContext via a master URL such as "execution-context:foo.bar.MyJobExecutionContext", with the default implementation containing the existing code from SparkContext, thus allowing the current (corresponding) methods of SparkContext to delegate to such an implementation. An integrator will now have an option to provide a custom implementation of DefaultExecutionContext by either implementing it from scratch or extending from DefaultExecutionContext. Please see the attached design doc for more details. A pull request will be posted shortly as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
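From the description, the trait mirrors four SparkContext entry points. A hedged sketch with simplified stand-in signatures follows; the actual signatures are in the attached design doc, not here.

{code}
import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Simplified stand-in signatures, not the actual design from the PDF.
trait JobExecutionContext {
  def hadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def newAPIHadoopFile[K, V](sc: SparkContext, path: String): RDD[(K, V)]
  def broadcast[T](sc: SparkContext, value: T): Broadcast[T]
  def runJob[T, U](sc: SparkContext, rdd: RDD[T], func: Iterator[T] => U): Array[U]
}
{code}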
[jira] [Created] (SPARK-3576) Provide script for creating the Spark AMI from scratch
Patrick Wendell created SPARK-3576: -- Summary: Provide script for creating the Spark AMI from scratch Key: SPARK-3576 URL: https://issues.apache.org/jira/browse/SPARK-3576 Project: Spark Issue Type: Bug Components: EC2 Reporter: Patrick Wendell Assignee: Patrick Wendell -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3577) Shuffle write time incorrect for sort-based shuffle
Kay Ousterhout created SPARK-3577: - Summary: Shuffle write time incorrect for sort-based shuffle Key: SPARK-3577 URL: https://issues.apache.org/jira/browse/SPARK-3577 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kay Ousterhout After this change https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 (cc [~sandyr] [~pwendell]) the ExternalSorter passes its own ShuffleWriteMetrics into ExternalSorter. The write time recorded in those metrics is never used -- meaning that when someone is using sort-based shuffle, the shuffle write time will be recorded as 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3270) Spark API for Application Extensions
[ https://issues.apache.org/jira/browse/SPARK-3270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michal Malohlava updated SPARK-3270: Description: Any application should be able to enrich Spark infrastructure with services which are not available by default. Hence, to support such application extensions (aka extensions/plugins) the Spark platform should provide: - an API to register an extension - an API to register a service (meaning provided functionality) - well-defined points in Spark infrastructure which can be enriched/hooked by an extension - a way of deploying an extension (for example, simply putting the extension on the classpath and using the Java service interface) - a way to access an extension from the application Overall proposal is available here: https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing Note: In this context, I do not mean reinventing OSGi (or another plugin platform) but it can serve as a good starting point. was: At the beginning, let's clarify my motivation - I would like to extend the Spark platform by an embedded application (e.g., monitoring network performance in the context of selected applications) which will be launched on particular nodes in the cluster with their launch. Nevertheless, I do not want to modify Spark code directly and hardcode my code in, but I would prefer to provide a jar which would be registered and launched by Spark itself. Hence, to support such 3rd party applications (aka extensions/plugins) the Spark platform should provide at least: - an API to register an extension - an API to register a service (meaning provided functionality) - well-defined points in Spark infrastructure which can be enriched/hooked by an extension - in master/worker lifecycle - in applications lifecycle - in RDDs lifecycle - monitoring/reporting - ... - a way of deploying an extension (for example, simply putting the extension on the classpath and using the Java service interface) In this context, I do not mean reinventing OSGi (or another plugin platform) but it can serve as a good starting point. Spark API for Application Extensions Key: SPARK-3270 URL: https://issues.apache.org/jira/browse/SPARK-3270 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Michal Malohlava Any application should be able to enrich Spark infrastructure with services which are not available by default. Hence, to support such application extensions (aka extensions/plugins) the Spark platform should provide: - an API to register an extension - an API to register a service (meaning provided functionality) - well-defined points in Spark infrastructure which can be enriched/hooked by an extension - a way of deploying an extension (for example, simply putting the extension on the classpath and using the Java service interface) - a way to access an extension from the application Overall proposal is available here: https://docs.google.com/document/d/1dHF9zi7GzFbYnbV2PwaOQ2eLPoTeiN9IogUe4PAOtrQ/edit?usp=sharing Note: In this context, I do not mean reinventing OSGi (or another plugin platform) but it can serve as a good starting point. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
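The classpath/Java-service deployment mentioned above maps naturally onto java.util.ServiceLoader. A minimal sketch of that discovery mechanism; the SparkExtension trait and ExtensionLoader object are hypothetical names, not part of any existing Spark API.

{code}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Hypothetical extension point; nothing like this exists in Spark yet.
trait SparkExtension { def start(): Unit }

object ExtensionLoader {
  // Implementations are discovered from META-INF/services entries on the
  // classpath, i.e. the "Java service interface" deployment described above.
  def loadAll(): Seq[SparkExtension] =
    ServiceLoader.load(classOf[SparkExtension]).asScala.toSeq
}
{code}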
[jira] [Updated] (SPARK-3577) Shuffle write time incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-3577: -- Description: After this change https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 (cc [~sandyr] [~pwendell]) the ExternalSorter passes its own ShuffleWriteMetrics into ExternalSorter. The write time recorded in those metrics is never used -- meaning that when someone is using sort-based shuffle, the shuffle write time will be recorded as 0. [~sandyr] do you have time to take a look at this? was:After this change https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 (cc [~sandyr] [~pwendell]) the ExternalSorter passes its own ShuffleWriteMetrics into ExternalSorter. The write time recorded in those metrics is never used -- meaning that when someone is using sort-based shuffle, the shuffle write time will be recorded as 0. Priority: Blocker (was: Major) Target Version/s: 1.2.0 Shuffle write time incorrect for sort-based shuffle --- Key: SPARK-3577 URL: https://issues.apache.org/jira/browse/SPARK-3577 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kay Ousterhout Priority: Blocker After this change https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 (cc [~sandyr] [~pwendell]) the ExternalSorter passes its own ShuffleWriteMetrics into ExternalSorter. The write time recorded in those metrics is never used -- meaning that when someone is using sort-based shuffle, the shuffle write time will be recorded as 0. [~sandyr] do you have time to take a look at this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result
Ankur Dave created SPARK-3578: - Summary: GraphGenerators.sampleLogNormal sometimes returns too-large result Key: SPARK-3578 URL: https://issues.apache.org/jira/browse/SPARK-3578 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Priority: Minor GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
{code}
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}
When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will fail the test. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but math.round(X.toFloat) is 5. A solution is to truncate X instead of rounding it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result
[ https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ankur Dave updated SPARK-3578: -- Description: GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
{code}
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}
When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will fail the test. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but math.round(X.toFloat) is 5. A solution is to round X down instead of to the nearest integer. was: GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
{code}
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}
When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will fail the test. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but math.round(X.toFloat) is 5. A solution is to truncate X instead of rounding it. GraphGenerators.sampleLogNormal sometimes returns too-large result -- Key: SPARK-3578 URL: https://issues.apache.org/jira/browse/SPARK-3578 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Priority: Minor GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
{code}
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}
When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will fail the test. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but math.round(X.toFloat) is 5. A solution is to round X down instead of to the nearest integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3578) GraphGenerators.sampleLogNormal sometimes returns too-large result
[ https://issues.apache.org/jira/browse/SPARK-3578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138188#comment-14138188 ] Apache Spark commented on SPARK-3578: - User 'ankurdave' has created a pull request for this issue: https://github.com/apache/spark/pull/2439 GraphGenerators.sampleLogNormal sometimes returns too-large result -- Key: SPARK-3578 URL: https://issues.apache.org/jira/browse/SPARK-3578 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Ankur Dave Assignee: Ankur Dave Priority: Minor GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
{code}
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
{code}
When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will fail the test. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but math.round(X.toFloat) is 5. A solution is to round X down instead of to the nearest integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
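The proposed fix in concrete form, reusing rand, mu, sigma, and maxVal from the snippet above:

{code}
// Rounding down keeps the result strictly below maxVal even when X is just
// under it: floor(4.9) == 4, whereas round(4.9f) == 5.
var X: Double = maxVal
while (X >= maxVal) {
  val Z = rand.nextGaussian()
  X = math.exp(mu + sigma*Z)
}
math.floor(X).toInt
{code}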
[jira] [Resolved] (SPARK-3571) Spark standalone cluster mode doesn't work.
[ https://issues.apache.org/jira/browse/SPARK-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3571. Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 2436 [https://github.com/apache/spark/pull/2436] Spark standalone cluster mode doesn't work. --- Key: SPARK-3571 URL: https://issues.apache.org/jira/browse/SPARK-3571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Blocker Fix For: 1.2.0 Recent changes to Master.scala cause Spark standalone cluster mode to stop working. I think the loop in Master#schedule never assigns a worker to the driver.
{code}
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
  // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
  // start from the last worker that was assigned a driver, and continue onwards until we have
  // explored all alive workers.
  curPos = (curPos + 1) % aliveWorkerNum
  val startPos = curPos
  var launched = false
  while (curPos != startPos && !launched) {
    val worker = shuffledAliveWorkers(curPos)
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % aliveWorkerNum
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3571) Spark standalone cluster mode doesn't work.
[ https://issues.apache.org/jira/browse/SPARK-3571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3571: - Assignee: Kousuke Saruta Spark standalone cluster mode doesn't work. --- Key: SPARK-3571 URL: https://issues.apache.org/jira/browse/SPARK-3571 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Priority: Blocker Fix For: 1.2.0 Recent changes to Master.scala cause Spark standalone cluster mode to stop working. I think the loop in Master#schedule never assigns a worker to the driver.
{code}
for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
  // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
  // start from the last worker that was assigned a driver, and continue onwards until we have
  // explored all alive workers.
  curPos = (curPos + 1) % aliveWorkerNum
  val startPos = curPos
  var launched = false
  while (curPos != startPos && !launched) {
    val worker = shuffledAliveWorkers(curPos)
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % aliveWorkerNum
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
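The root cause is that startPos is captured after curPos has already been advanced, so "curPos != startPos" is false on the very first test and the while body never runs. One way to guarantee every alive worker is visited is to count visits instead of comparing positions; the sketch below illustrates this and is not necessarily identical to the merged fix.

{code}
for (driver <- waitingDrivers.toList) {
  var launched = false
  var numWorkersVisited = 0
  // Visit each alive worker at most once per waiting driver.
  while (numWorkersVisited < aliveWorkerNum && !launched) {
    val worker = shuffledAliveWorkers(curPos)
    numWorkersVisited += 1
    if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
      launchDriver(worker, driver)
      waitingDrivers -= driver
      launched = true
    }
    curPos = (curPos + 1) % aliveWorkerNum
  }
}
{code}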
[jira] [Updated] (SPARK-3564) Display App ID on HistoryPage
[ https://issues.apache.org/jira/browse/SPARK-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3564: - Assignee: Kousuke Saruta Display App ID on HistoryPage - Key: SPARK-3564 URL: https://issues.apache.org/jira/browse/SPARK-3564 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.1.1, 1.2.0 The current HistoryPage doesn't display the App ID, so if there are lots of applications with the same name, it's difficult to find the application whose status we'd like to know. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3564) Display App ID on HistoryPage
[ https://issues.apache.org/jira/browse/SPARK-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-3564. Resolution: Fixed Fix Version/s: 1.1.1 1.2.0 Issue resolved by pull request 2424 [https://github.com/apache/spark/pull/2424] Display App ID on HistoryPage - Key: SPARK-3564 URL: https://issues.apache.org/jira/browse/SPARK-3564 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta Fix For: 1.2.0, 1.1.1 The current HistoryPage doesn't display the App ID, so if there are lots of applications with the same name, it's difficult to find the application whose status we'd like to know. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3567: - Target Version/s: 1.2.0 (was: 1.1.1, 1.2.0) appId field in SparkDeploySchedulerBackend should be volatile - Key: SPARK-3567 URL: https://issues.apache.org/jira/browse/SPARK-3567 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Fix For: 1.2.0 The appId field in SparkDeploySchedulerBackend is set by AppClient.ClientActor#receiveWithLogging and is read via SparkDeploySchedulerBackend#applicationId. The thread running AppClient.ClientActor and a thread invoking SparkDeploySchedulerBackend#applicationId can be different threads, so appId should be volatile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3567: - Fix Version/s: 1.2.0 appId field in SparkDeploySchedulerBackend should be volatile - Key: SPARK-3567 URL: https://issues.apache.org/jira/browse/SPARK-3567 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Fix For: 1.2.0 The appId field in SparkDeploySchedulerBackend is set by AppClient.ClientActor#receiveWithLogging and is read via SparkDeploySchedulerBackend#applicationId. The thread running AppClient.ClientActor and a thread invoking SparkDeploySchedulerBackend#applicationId can be different threads, so appId should be volatile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-3567: - Assignee: Kousuke Saruta appId field in SparkDeploySchedulerBackend should be volatile - Key: SPARK-3567 URL: https://issues.apache.org/jira/browse/SPARK-3567 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.2.0 The appId field in SparkDeploySchedulerBackend is set by AppClient.ClientActor#receiveWithLogging and is read via SparkDeploySchedulerBackend#applicationId. The thread running AppClient.ClientActor and a thread invoking SparkDeploySchedulerBackend#applicationId can be different threads, so appId should be volatile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
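In Scala the fix is a one-line annotation on the field; a sketch assuming appId is declared as a mutable field of SparkDeploySchedulerBackend:

{code}
// Written by the ClientActor thread, read by callers of applicationId();
// @volatile guarantees cross-thread visibility of the write.
@volatile private var appId: String = _
{code}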
[jira] [Created] (SPARK-3579) Jekyll doc generation is different across environments
Patrick Wendell created SPARK-3579: -- Summary: Jekyll doc generation is different across environments Key: SPARK-3579 URL: https://issues.apache.org/jira/browse/SPARK-3579 Project: Spark Issue Type: Bug Components: Documentation Reporter: Patrick Wendell Assignee: Patrick Wendell This can result in a lot of false changes when someone alters something with the docs. It only is relevant to the website subversion repo, but the fix might need to go into the Spark codebase. There are at least two issues here. One is that the HTML character escaping can be different in certain cases. Another is that the highlighting output seems a bit different depending on (I think) what version of pygments is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API
[ https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138242#comment-14138242 ] Josh Rosen commented on SPARK-2321: --- I agree that this should be a pull API. A pull-based API will be easier to expose in Python and Java and might be sufficient for many of the common use-cases. With a push-based API, we might have to worry about things like callback processing speed, rate-limiting, etc. (the Rx frameworks have interesting ways of addressing these concerns, but I'm not sure whether we want to add those dependencies); the pull approach won't face these issues. Design a proper progress reporting event listener API --- Key: SPARK-2321 URL: https://issues.apache.org/jira/browse/SPARK-2321 Project: Spark Issue Type: Improvement Components: Java API, Spark Core Affects Versions: 1.0.0 Reporter: Reynold Xin Assignee: Josh Rosen Priority: Critical This is a ticket to track progress on redesigning the SparkListener and JobProgressListener API. There are multiple problems with the current design, including: 0. I'm not sure if the API is usable in Java (there are at least some enums we used in Scala and a bunch of case classes that might complicate things). 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of attention to it yet. Something as important as progress reporting deserves a more stable API. 2. There is no easy way to connect jobs with stages. Similarly, there is no easy way to connect job groups with jobs / stages. 3. JobProgressListener itself has no encapsulation at all. States can be arbitrarily mutated by external programs. Variable names are sort of randomly decided and inconsistent. We should just revisit these and propose a new, concrete design. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3574) Shuffle finish time always reported as -1
[ https://issues.apache.org/jira/browse/SPARK-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138246#comment-14138246 ] Sandy Ryza commented on SPARK-3574: --- On it Shuffle finish time always reported as -1 - Key: SPARK-3574 URL: https://issues.apache.org/jira/browse/SPARK-3574 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Kay Ousterhout shuffleFinishTime is always reported as -1. I think the way we fix this should be to set the shuffleFinishTime in each ShuffleWriteMetrics as the shuffles finish, but when aggregating the metrics, only report shuffleFinishTime as something other than -1 when *all* of the shuffles have completed. [~sandyr], it looks like this was introduced in your recent patch to incrementally report metrics. Any chance you can fix this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3577) Shuffle write time incorrect for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-3577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138245#comment-14138245 ] Sandy Ryza commented on SPARK-3577: --- On it Shuffle write time incorrect for sort-based shuffle --- Key: SPARK-3577 URL: https://issues.apache.org/jira/browse/SPARK-3577 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Kay Ousterhout Priority: Blocker After this change https://github.com/apache/spark/commit/4e982364426c7d65032e8006c63ca4f9a0d40470 (cc [~sandyr] [~pwendell]) the ExternalSorter passes its own ShuffleWriteMetrics into ExternalSorter. The write time recorded in those metrics is never used -- meaning that when someone is using sort-based shuffle, the shuffle write time will be recorded as 0. [~sandyr] do you have time to take a look at this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3574) Shuffle finish time always reported as -1
[ https://issues.apache.org/jira/browse/SPARK-3574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138247#comment-14138247 ] Kay Ousterhout commented on SPARK-3574: --- Thanks Sandy!! Shuffle finish time always reported as -1 - Key: SPARK-3574 URL: https://issues.apache.org/jira/browse/SPARK-3574 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Kay Ousterhout shuffleFinishTime is always reported as -1. I think the way we fix this should be to set the shuffleFinishTime in each ShuffleWriteMetrics as the shuffles finish, but when aggregating the metrics, only report shuffleFinishTime as something other than -1 when *all* of the shuffles have completed. [~sandyr], it looks like this was introduced in your recent patch to incrementally report metrics. Any chance you can fix this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138281#comment-14138281 ] Matei Zaharia commented on SPARK-3129: -- Great, it will be nice to see how fast this is. I also think the rate per node doesn't need to be enormous for this to be useful, since we can also parallelize receiving over multiple nodes. Prevent data loss in Spark Streaming Key: SPARK-3129 URL: https://issues.apache.org/jira/browse/SPARK-3129 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Hari Shreedharan Assignee: Hari Shreedharan Attachments: SecurityFix.diff, StreamingPreventDataLoss.pdf Spark Streaming can lose small amounts of data when the driver goes down and the sending system cannot re-send the data (or the data has already expired on the sender side). The document attached has more details. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3580) Add Consistent Method For Number of RDD Partitions Across Differnet Languages
Pat McDonough created SPARK-3580: Summary: Add Consistent Method For Number of RDD Partitions Across Differnet Languages Key: SPARK-3580 URL: https://issues.apache.org/jira/browse/SPARK-3580 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Pat McDonough Programmatically retrieving the number of partitions is not consistent between Python and Scala. A consistent method should be defined and made public across both languages. RDD.partitions.size is also used quite frequently throughout the internal code, so that might be worth refactoring as well once the new method is available. What we have today is below. In Scala: {code} scala> someRDD.partitions.size res0: Int = 30 {code} In Python: {code} In [2]: someRDD.getNumPartitions() Out[2]: 30 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Differnet Languages
[ https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat McDonough updated SPARK-3580: - Summary: Add Consistent Method To Get Number of RDD Partitions Across Differnet Languages (was: Add Consistent Method For Number of RDD Partitions Across Differnet Languages) Add Consistent Method To Get Number of RDD Partitions Across Differnet Languages Key: SPARK-3580 URL: https://issues.apache.org/jira/browse/SPARK-3580 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Pat McDonough Programmatically retrieving the number of partitions is not consistent between Python and Scala. A consistent method should be defined and made public across both languages. RDD.partitions.size is also used quite frequently throughout the internal code, so that might be worth refactoring as well once the new method is available. What we have today is below. In Scala: {code} scala> someRDD.partitions.size res0: Int = 30 {code} In Python: {code} In [2]: someRDD.getNumPartitions() Out[2]: 30 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3580) Add Consistent Method To Get Number of RDD Partitions Across Different Languages
[ https://issues.apache.org/jira/browse/SPARK-3580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pat McDonough updated SPARK-3580: - Summary: Add Consistent Method To Get Number of RDD Partitions Across Different Languages (was: Add Consistent Method To Get Number of RDD Partitions Across Differnet Languages) Add Consistent Method To Get Number of RDD Partitions Across Different Languages Key: SPARK-3580 URL: https://issues.apache.org/jira/browse/SPARK-3580 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 1.1.0 Reporter: Pat McDonough Programmatically retrieving the number of partitions is not consistent between Python and Scala. A consistent method should be defined and made public across both languages. RDD.partitions.size is also used quite frequently throughout the internal code, so that might be worth refactoring as well once the new method is available. What we have today is below. In Scala: {code} scala> someRDD.partitions.size res0: Int = 30 {code} In Python: {code} In [2]: someRDD.getNumPartitions() Out[2]: 30 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
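A hedged sketch of what the matching Scala method could look like; the name simply mirrors PySpark's getNumPartitions() and is not confirmed as the final API:

{code}
// Candidate addition to RDD; the name mirrors PySpark's for consistency.
def getNumPartitions: Int = partitions.length
{code}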