[jira] [Commented] (SPARK-8119) HeartbeatReceiver should not adjust application executor resources
[ https://issues.apache.org/jira/browse/SPARK-8119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150055#comment-15150055 ] Zhen Peng commented on SPARK-8119: -- Hi [~srowen], I think this is really a serious bug. Is there any reason for not back-porting it to 1.4.x? > HeartbeatReceiver should not adjust application executor resources > -- > > Key: SPARK-8119 > URL: https://issues.apache.org/jira/browse/SPARK-8119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: SaintBacchus >Assignee: Andrew Or >Priority: Critical > Fix For: 1.5.0 > > > Dynamic allocation will set the total executor count to a small number when it wants to kill some executors. > But even when dynamic allocation is disabled, Spark also sets the total executor count. > This causes the following problem: when an executor fails, no replacement executor will be launched by Spark. > === EDIT by andrewor14 === > The issue is that the AM forgets about the original number of executors it wants after calling sc.killExecutor. Even if dynamic allocation is not enabled, this is still possible because of heartbeat timeouts. > I think the problem is that sc.killExecutor is used incorrectly in HeartbeatReceiver. The intention of the method is to permanently adjust the number of executors the application will get. In HeartbeatReceiver, however, this is used as a best-effort mechanism to ensure that the timed-out executor is dead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150053#comment-15150053 ] Xiao Li commented on SPARK-13333: - Yeah, the same series of random numbers. > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-13333 > URL: https://issues.apache.org/jira/browse/SPARK-13333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails: randn produces the same results on the original DataFrame and the copy before unionAll, but fails to do so after unionAll. Removing the filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.{col, randn} > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code}
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150050#comment-15150050 ] Liang-Chi Hsieh commented on SPARK-13333: - But when you set deterministic to true, each of your data partitions will get the same random values, right?
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150042#comment-15150042 ] Xiao Li commented on SPARK-13333: - Yeah. I realized it when fixing this problem. Thus, in the PR, I just added another parameter `deterministic` for rand and randn. If necessary, users can set `deterministic` to true.
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150008#comment-15150008 ] Liang-Chi Hsieh commented on SPARK-13333: - If you don't attach a partition id, wouldn't each of your data partitions have the same random numbers?
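The seeding discussion above can be illustrated with a minimal plain-Python sketch (not Spark code; the function name and the seed-mixing scheme are hypothetical stand-ins): a generator seeded identically in every partition repeats the same sequence, while mixing the partition id into the seed does not.

```python
import random

def gen_values(seed, num_partitions, rows_per_partition, use_partition_id):
    # Simulate one RNG per partition, loosely modeling how a per-partition
    # rand()/randn() column would be produced.
    partitions = []
    for pid in range(num_partitions):
        effective_seed = seed + pid if use_partition_id else seed
        rng = random.Random(effective_seed)
        partitions.append([rng.random() for _ in range(rows_per_partition)])
    return partitions

# Same seed everywhere: every partition repeats the same sequence.
same = gen_values(12345, 3, 2, use_partition_id=False)
assert same[0] == same[1] == same[2]

# Seed mixed with the partition id: partitions get different values.
mixed = gen_values(12345, 3, 2, use_partition_id=True)
assert mixed[0] != mixed[1]
```

This is the trade-off raised in the comments: a purely deterministic seed makes the column reproducible across re-evaluations, but without the partition id every partition would see the same sequence.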
[jira] [Commented] (SPARK-13249) Filter null keys for inner join
[ https://issues.apache.org/jira/browse/SPARK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149986#comment-15149986 ] Apache Spark commented on SPARK-13249: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/11235 > Filter null keys for inner join > --- > > Key: SPARK-13249 > URL: https://issues.apache.org/jira/browse/SPARK-13249 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu > > For an inner join, join keys that contain null will never match each other, so we could insert a Filter before the inner join (which could be pushed down); then we don't need to check the nullability of keys while joining.
[jira] [Assigned] (SPARK-13249) Filter null keys for inner join
[ https://issues.apache.org/jira/browse/SPARK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13249: Assignee: Apache Spark > Filter null keys for inner join > --- > > Key: SPARK-13249 > URL: https://issues.apache.org/jira/browse/SPARK-13249 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > For an inner join, join keys that contain null will never match each other, so we could insert a Filter before the inner join (which could be pushed down); then we don't need to check the nullability of keys while joining.
[jira] [Assigned] (SPARK-13249) Filter null keys for inner join
[ https://issues.apache.org/jira/browse/SPARK-13249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13249: Assignee: (was: Apache Spark)
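The idea in the SPARK-13249 description can be sketched in plain Python (not Spark code; `inner_join` is a toy stand-in for the join operator): under SQL semantics null keys never compare equal, so filtering them out before an inner join cannot change the result.

```python
def inner_join(left, right):
    # Toy inner join over (key, value) pairs with SQL NULL semantics:
    # a None key never compares equal to anything, so it never matches.
    return [
        (lk, lv, rv)
        for lk, lv in left
        for rk, rv in right
        if lk is not None and rk is not None and lk == rk
    ]

left = [(1, "a"), (None, "b"), (2, "c")]
right = [(1, "x"), (None, "y")]

# Dropping null keys up front (the proposed Filter) gives the same result,
# and the join itself no longer needs any null checks.
filtered = inner_join(
    [r for r in left if r[0] is not None],
    [r for r in right if r[0] is not None],
)
assert inner_join(left, right) == filtered == [(1, "a", "x")]
```

Because the inserted Filter is an ordinary predicate, the optimizer can also push it further down toward the scan, which is the "could be pushed down" remark in the description.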
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-13322: Description: This bug was reported by Stuti Awasthi. https://www.mail-archive.com/user@spark.apache.org/msg45643.html The lossSum can become infinite because we do not standardize the features before fitting the model; we should support feature standardization. Another benefit is that standardization will improve the convergence rate. was: This bug was reported by Stuti Awasthi. https://www.mail-archive.com/user@spark.apache.org/msg45643.html The lossSum can become infinite because we do not standardize the features before fitting the model; we should support feature standardization. > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > This bug was reported by Stuti Awasthi. > https://www.mail-archive.com/user@spark.apache.org/msg45643.html > The lossSum can become infinite because we do not standardize the features before fitting the model; we should support feature standardization. > Another benefit is that standardization will improve the convergence rate.
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-13322: Description: This bug was reported by Stuti Awasthi. https://www.mail-archive.com/user@spark.apache.org/msg45643.html The lossSum can become infinite because we do not standardize the features before fitting the model; we should support feature standardization. was: This bug was reported by Stuti Awasthi. https://www.mail-archive.com/user@spark.apache.org/msg45643.html The lossSum can become infinite because we do not standardize the features before fitting the model; we should handle this. > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > This bug was reported by Stuti Awasthi. > https://www.mail-archive.com/user@spark.apache.org/msg45643.html > The lossSum can become infinite because we do not standardize the features before fitting the model; we should support feature standardization.
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-13322: Description: This bug was reported by Stuti Awasthi. https://www.mail-archive.com/user@spark.apache.org/msg45643.html The lossSum can become infinite because we do not standardize the features before fitting the model; we should handle this. was: This bug was reported by Stuti Awasthi. https://www.mail-archive.com/user@spark.apache.org/msg45643.html The lossSum can become infinite, so we should handle it properly. > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > This bug was reported by Stuti Awasthi. > https://www.mail-archive.com/user@spark.apache.org/msg45643.html > The lossSum can become infinite because we do not standardize the features before fitting the model; we should handle this.
[jira] [Updated] (SPARK-13322) AFTSurvivalRegression should support feature standardization
[ https://issues.apache.org/jira/browse/SPARK-13322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-13322: Summary: AFTSurvivalRegression should support feature standardization (was: AFTSurvivalRegression should handle lossSum infinity) > AFTSurvivalRegression should support feature standardization > > > Key: SPARK-13322 > URL: https://issues.apache.org/jira/browse/SPARK-13322 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > This bug was reported by Stuti Awasthi. > https://www.mail-archive.com/user@spark.apache.org/msg45643.html > The lossSum can become infinite, so we should handle it properly.
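A minimal plain-Python sketch of why the unstandardized fit can blow up (a toy exp-based loss term, not the actual AFT log-likelihood): exp() of large raw feature values overflows to infinity, while the standardized values stay in a small range and the loss stays finite.

```python
import math
import statistics

features = [1e3, 2e3, 3e3]  # large raw feature values

def loss_term(x):
    # Hypothetical exp-based loss term; exp() of a large value overflows.
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf

# Raw features: the summed loss is infinite.
raw_loss = sum(loss_term(x) for x in features)
assert math.isinf(raw_loss)

# Standardized features ((x - mean) / stdev) stay near zero, so the
# same loss term remains finite.
mean = statistics.mean(features)
std = statistics.stdev(features)
std_loss = sum(loss_term((x - mean) / std) for x in features)
assert math.isfinite(std_loss)
```

Standardization also conditions the optimization problem better, which is the convergence-rate benefit mentioned in the updated description.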
[jira] [Created] (SPARK-13359) ArrayType(_, true) should also accept ArrayType(_, false) fix for branch-1.6
Earthson Lu created SPARK-13359: --- Summary: ArrayType(_, true) should also accept ArrayType(_, false) fix for branch-1.6 Key: SPARK-13359 URL: https://issues.apache.org/jira/browse/SPARK-13359 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.6.0 Reporter: Earthson Lu Priority: Minor Fix For: 1.6.1 Backport of the fix for https://issues.apache.org/jira/browse/SPARK-12746
[jira] [Commented] (SPARK-13275) With dynamic allocation, executors appear to be added before job starts
[ https://issues.apache.org/jira/browse/SPARK-13275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149939#comment-15149939 ] Saisai Shao commented on SPARK-13275: - Would you please clarify the specific problem you mentioned: is it a UI problem or a dynamic allocation problem? > With dynamic allocation, executors appear to be added before job starts > --- > > Key: SPARK-13275 > URL: https://issues.apache.org/jira/browse/SPARK-13275 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.5.0 >Reporter: Stephanie Bodoff >Priority: Minor > Attachments: webui.png > > > When I look at the timeline in the Spark Web UI, I see the job starting and then executors being added. The blue lines and dots hitting the timeline show that the executors were added after the job started. But the way the Executor box is rendered makes it look like the executors started before the job.
[jira] [Assigned] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row
[ https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13354: Assignee: Apache Spark (was: Davies Liu) > Push filter throughout outer join when the condition can filter out empty row > -- > > Key: SPARK-13354 > URL: https://issues.apache.org/jira/browse/SPARK-13354 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > For a query > {code} > select * from a left outer join b on a.a = b.a where b.b > 10 > {code} > The condition `b.b > 10` will filter out every row whose b part is empty. > In this case, we can use an inner join instead and push the filter down into b.
[jira] [Commented] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row
[ https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149923#comment-15149923 ] Apache Spark commented on SPARK-13354: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11234 > Push filter throughout outer join when the condition can filter out empty row > -- > > Key: SPARK-13354 > URL: https://issues.apache.org/jira/browse/SPARK-13354 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > For a query > {code} > select * from a left outer join b on a.a = b.a where b.b > 10 > {code} > The condition `b.b > 10` will filter out every row whose b part is empty. > In this case, we can use an inner join instead and push the filter down into b.
[jira] [Assigned] (SPARK-13354) Push filter throughout outer join when the condition can filter out empty row
[ https://issues.apache.org/jira/browse/SPARK-13354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13354: Assignee: Davies Liu (was: Apache Spark)
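The equivalence claimed in the SPARK-13354 description can be sketched in plain Python (toy dict-based tables and a toy join, not Spark internals): a null-rejecting predicate on the right side of a left outer join discards exactly the unmatched rows, so the result is the same as an inner join with the filter applied.

```python
def left_outer_join(left, right, key):
    # Toy left outer join over lists of dicts; unmatched left rows get a
    # null (None) right side, as in SQL.
    out = []
    for l in left:
        matches = [r for r in right if r[key] == l[key]]
        if matches:
            out.extend({**l, **r} for r in matches)
        else:
            out.append({**l, "b": None})
    return out

a = [{"a": 1}, {"a": 2}]
b = [{"a": 1, "b": 20}]

# The predicate b.b > 10 rejects rows where b is NULL, so applying it after
# the left outer join keeps only matched rows -- exactly an inner join.
joined = [
    row for row in left_outer_join(a, b, "a")
    if row["b"] is not None and row["b"] > 10
]
assert joined == [{"a": 1, "b": 20}]
```

That is why the optimizer can rewrite the outer join as an inner join and then push `b.b > 10` below the join into b.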
[jira] [Assigned] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13333: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149850#comment-15149850 ] Apache Spark commented on SPARK-13333: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/11232
[jira] [Assigned] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13333: Assignee: Apache Spark > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-13333 > URL: https://issues.apache.org/jira/browse/SPARK-13333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails: randn produces the same results on the original DataFrame and the copy before unionAll, but fails to do so after unionAll. Removing the filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.{col, randn} > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code}
[jira] [Assigned] (SPARK-13358) Retrieve grep path when doing Benchmark
[ https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13358: Assignee: (was: Apache Spark) > Retrieve grep path when doing Benchmark > --- > > Key: SPARK-13358 > URL: https://issues.apache.org/jira/browse/SPARK-13358 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Priority: Minor
[jira] [Assigned] (SPARK-13358) Retrieve grep path when doing Benchmark
[ https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13358: Assignee: Apache Spark > Retrieve grep path when doing Benchmark > --- > > Key: SPARK-13358 > URL: https://issues.apache.org/jira/browse/SPARK-13358 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark >Priority: Minor
[jira] [Created] (SPARK-13358) Retrieve grep path when doing Benchmark
Liang-Chi Hsieh created SPARK-13358: --- Summary: Retrieve grep path when doing Benchmark Key: SPARK-13358 URL: https://issues.apache.org/jira/browse/SPARK-13358 Project: Spark Issue Type: Bug Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor
[jira] [Commented] (SPARK-13358) Retrieve grep path when doing Benchmark
[ https://issues.apache.org/jira/browse/SPARK-13358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149837#comment-15149837 ] Apache Spark commented on SPARK-13358: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/11231
[jira] [Commented] (SPARK-12316) Stack overflow with endless call of `Delegation token thread` when application end.
[ https://issues.apache.org/jira/browse/SPARK-12316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149810#comment-15149810 ] SaintBacchus commented on SPARK-12316: -- [~tgraves] The function listFilesSorted will not throw the exception; it only logs it. So the renewal will not be scheduled an hour later; it will be scheduled immediately and then go into another loop. > Stack overflow with endless call of `Delegation token thread` when application end. > --- > > Key: SPARK-12316 > URL: https://issues.apache.org/jira/browse/SPARK-12316 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.0 >Reporter: SaintBacchus >Assignee: SaintBacchus > Attachments: 20151210045149.jpg, 20151210045533.jpg > > > When the application ends, the AM will clean the staging dir. > But if the driver triggers a delegation token update, it can't find the right token file and then endlessly calls the method 'updateCredentialsIfRequired'. > This leads to a StackOverflowError. > !https://issues.apache.org/jira/secure/attachment/12779495/20151210045149.jpg! > !https://issues.apache.org/jira/secure/attachment/12779496/20151210045533.jpg!
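The scheduling behavior described in the comment above can be modeled with a small hypothetical Python sketch (function names and structure are illustrative, not the actual Spark YARN code): because the listing error is swallowed rather than thrown, the renewal logic falls into the "retry immediately" branch every time instead of backing off.

```python
def schedule_renewal(list_files, max_iterations=10):
    # Toy model of the delegation-token renewal thread: if listing
    # credential files fails, the error is only logged (swallowed), so
    # control falls through to the "retry immediately" branch instead of
    # backing off for an hour.
    iterations = 0
    delay_hours = 0
    while iterations < max_iterations:
        iterations += 1
        try:
            files = list_files()
        except OSError:
            files = []  # the error is swallowed; nothing is found
        if not files:
            delay_hours = 0   # nothing found: reschedule immediately
            continue          # -> tight loop when the dir is gone
        delay_hours = 1       # found credentials: back off for an hour
        break
    return iterations, delay_hours

def missing_staging_dir():
    raise OSError("staging dir was cleaned up by the AM")

iterations, delay = schedule_renewal(missing_staging_dir)
assert iterations == 10 and delay == 0  # spins until the cap, never backs off
```

The real code recurses instead of looping with a cap, which is why the tight retry eventually produces a StackOverflowError.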
[jira] [Assigned] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode
[ https://issues.apache.org/jira/browse/SPARK-13357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13357: Assignee: Apache Spark > Use generated projection and ordering for TakeOrderedAndProjectNode > --- > > Key: SPARK-13357 > URL: https://issues.apache.org/jira/browse/SPARK-13357 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin >Assignee: Apache Spark > > {{TakeOrderedAndProjectNode}} should use generated projection and ordering like the other {{LocalNode}}s.
[jira] [Commented] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode
[ https://issues.apache.org/jira/browse/SPARK-13357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149807#comment-15149807 ] Apache Spark commented on SPARK-13357: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/11230 > Use generated projection and ordering for TakeOrderedAndProjectNode > --- > > Key: SPARK-13357 > URL: https://issues.apache.org/jira/browse/SPARK-13357 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{TakeOrderedAndProjectNode}} should use generated projection and ordering like the other {{LocalNode}}s.
[jira] [Assigned] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode
[ https://issues.apache.org/jira/browse/SPARK-13357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13357: Assignee: (was: Apache Spark) > Use generated projection and ordering for TakeOrderedAndProjectNode > --- > > Key: SPARK-13357 > URL: https://issues.apache.org/jira/browse/SPARK-13357 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Takuya Ueshin > > {{TakeOrderedAndProjectNode}} should use generated projection and ordering > like other {{LocalNode}} s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13357) Use generated projection and ordering for TakeOrderedAndProjectNode
Takuya Ueshin created SPARK-13357: - Summary: Use generated projection and ordering for TakeOrderedAndProjectNode Key: SPARK-13357 URL: https://issues.apache.org/jira/browse/SPARK-13357 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin {{TakeOrderedAndProjectNode}} should use generated projection and ordering like other {{LocalNode}} s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"
[ https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149795#comment-15149795 ] Apache Spark commented on SPARK-13220: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/11229 > Deprecate "yarn-client" and "yarn-cluster" > -- > > Key: SPARK-13220 > URL: https://issues.apache.org/jira/browse/SPARK-13220 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Andrew Or >Assignee: Saisai Shao > > We currently allow `\-\-master yarn-client`. Instead, the user should do > `\-\-master yarn \-\-deploy-mode client` to be more explicit. This is more > consistent with other cluster managers and obviates the need to do special > parsing of the master string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"
[ https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13220: Assignee: Apache Spark (was: Saisai Shao) > Deprecate "yarn-client" and "yarn-cluster" > -- > > Key: SPARK-13220 > URL: https://issues.apache.org/jira/browse/SPARK-13220 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Andrew Or >Assignee: Apache Spark > > We currently allow `\-\-master yarn-client`. Instead, the user should do > `\-\-master yarn \-\-deploy-mode client` to be more explicit. This is more > consistent with other cluster managers and obviates the need to do special > parsing of the master string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13220) Deprecate "yarn-client" and "yarn-cluster"
[ https://issues.apache.org/jira/browse/SPARK-13220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13220: Assignee: Saisai Shao (was: Apache Spark) > Deprecate "yarn-client" and "yarn-cluster" > -- > > Key: SPARK-13220 > URL: https://issues.apache.org/jira/browse/SPARK-13220 > Project: Spark > Issue Type: Sub-task > Components: YARN >Reporter: Andrew Or >Assignee: Saisai Shao > > We currently allow `\-\-master yarn-client`. Instead, the user should do > `\-\-master yarn \-\-deploy-mode client` to be more explicit. This is more > consistent with other cluster managers and obviates the need to do special > parsing of the master string. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149794#comment-15149794 ] Sateesh Babu G commented on SPARK-9273: --- Hi Alexander, Thank you very much for your help! Can I use any one of the mentioned implementations for CNN regression? Which one is more suitable? I also found that deeplearning4j.org has a CNN implementation in Spark, but with only one convolutional and one pooling layer. Would you suggest deeplearning4j.org's CNN implementation in Spark? Best, Sateesh > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11627) Spark Streaming backpressure mechanism has no initial input rate limit, receivers receive data at the maximum speed, it might cause OOM exception
[ https://issues.apache.org/jira/browse/SPARK-11627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-11627: - Affects Version/s: 1.6.0 > Spark Streaming backpressure mechanism has no initial input rate > limit, receivers receive data at the maximum speed, it might cause OOM > exception > -- > > Key: SPARK-11627 > URL: https://issues.apache.org/jira/browse/SPARK-11627 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.1, 1.6.0 >Reporter: junhaoMg >Assignee: junhaoMg > Fix For: 2.0.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > The Spark Streaming backpressure mechanism has no initial input rate limit, so > receivers receive data at the maximum speed they can reach in the first > batch; the data received might exhaust executor memory and cause an > out-of-memory exception. Eventually the streaming job would fail and the > backpressure mechanism would become ineffective. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11627) Spark Streaming backpressure mechanism has no initial input rate limit, receivers receive data at the maximum speed, it might cause OOM exception
[ https://issues.apache.org/jira/browse/SPARK-11627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-11627. -- Resolution: Fixed Assignee: junhaoMg Fix Version/s: 2.0.0 > Spark Streaming backpressure mechanism has no initial input rate > limit, receivers receive data at the maximum speed, it might cause OOM > exception > -- > > Key: SPARK-11627 > URL: https://issues.apache.org/jira/browse/SPARK-11627 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.5.1, 1.6.0 >Reporter: junhaoMg >Assignee: junhaoMg > Fix For: 2.0.0 > > Original Estimate: 72h > Remaining Estimate: 72h > > The Spark Streaming backpressure mechanism has no initial input rate limit, so > receivers receive data at the maximum speed they can reach in the first > batch; the data received might exhaust executor memory and cause an > out-of-memory exception. Eventually the streaming job would fail and the > backpressure mechanism would become ineffective. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
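The essence of the fix for an unbounded first batch is to fall back to a configured initial rate until the rate estimator has produced its first feedback (Spark exposes this as {{spark.streaming.backpressure.initialRate}}). A minimal sketch of that idea, in plain Python with hypothetical names rather than the actual RateController code:

```python
def records_allowed_in_batch(estimated_rate, initial_rate, batch_secs=4.0):
    """Records a receiver may accept in one batch (illustrative sketch).

    estimated_rate: records/sec from the backpressure estimator, or None
                    before the first batch has produced any feedback.
    initial_rate:   configured cap used until feedback exists, or None.
    """
    rate = estimated_rate if estimated_rate is not None else initial_rate
    if rate is None:
        # No estimate and no initial cap: the first batch is unbounded,
        # which is the OOM scenario this issue describes.
        return float("inf")
    return int(rate * batch_secs)
```

Once the estimator starts reporting, its feedback takes over and the initial cap is ignored.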
[jira] [Assigned] (SPARK-13356) WebUI missing input information when recovering from driver failure
[ https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13356: Assignee: Apache Spark > WebUI missing input information when recovering from driver failure > > > Key: SPARK-13356 > URL: https://issues.apache.org/jira/browse/SPARK-13356 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: jeanlyn >Assignee: Apache Spark > Attachments: DirectKafkaScreenshot.jpg > > > The WebUI is missing some input information when streaming recovers from a checkpoint; > it may make people think data was lost when recovering from a failure. > For example: > !DirectKafkaScreenshot.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13356) WebUI missing input information when recovering from driver failure
[ https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13356: Assignee: (was: Apache Spark) > WebUI missing input information when recovering from driver failure > > > Key: SPARK-13356 > URL: https://issues.apache.org/jira/browse/SPARK-13356 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: jeanlyn > Attachments: DirectKafkaScreenshot.jpg > > > The WebUI is missing some input information when streaming recovers from a checkpoint; > it may make people think data was lost when recovering from a failure. > For example: > !DirectKafkaScreenshot.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13356) WebUI missing input information when recovering from driver failure
[ https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149727#comment-15149727 ] Apache Spark commented on SPARK-13356: -- User 'jeanlyn' has created a pull request for this issue: https://github.com/apache/spark/pull/11228 > WebUI missing input information when recovering from driver failure > > > Key: SPARK-13356 > URL: https://issues.apache.org/jira/browse/SPARK-13356 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: jeanlyn > Attachments: DirectKafkaScreenshot.jpg > > > The WebUI is missing some input information when streaming recovers from a checkpoint; > it may make people think data was lost when recovering from a failure. > For example: > !DirectKafkaScreenshot.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13356) WebUI missing input information when recovering from driver failure
jeanlyn created SPARK-13356: --- Summary: WebUI missing input information when recovering from driver failure Key: SPARK-13356 URL: https://issues.apache.org/jira/browse/SPARK-13356 Project: Spark Issue Type: Bug Affects Versions: 1.6.0, 1.5.2, 1.5.1, 1.5.0 Reporter: jeanlyn The WebUI is missing some input information when streaming recovers from a checkpoint; it may make people think data was lost when recovering from a failure. For example: !DirectKafkaScreenshot.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13356) WebUI missing input information when recovering from driver failure
[ https://issues.apache.org/jira/browse/SPARK-13356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn updated SPARK-13356: Attachment: DirectKafkaScreenshot.jpg > WebUI missing input information when recovering from driver failure > > > Key: SPARK-13356 > URL: https://issues.apache.org/jira/browse/SPARK-13356 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0 >Reporter: jeanlyn > Attachments: DirectKafkaScreenshot.jpg > > > The WebUI is missing some input information when streaming recovers from a checkpoint; > it may make people think data was lost when recovering from a failure. > For example: > !DirectKafkaScreenshot.jpg! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] krishna ramachandran updated SPARK-13349: - Description: We have a streaming application containing approximately 12 jobs every batch, running in streaming mode (4 sec batches). Each job writes output to cassandra; each job can contain several stages. job 1 ---> receive Stream A --> map --> filter -> (union with another stream B) --> map --> groupbykey --> transform --> reducebykey --> map we go through a few more jobs of transforms and save to database. Around stage 5, we union the output of the Dstream from job 1 (in red) with another stream (generated by split during job 2) and save that state. It appears the whole execution thus far is repeated, which is redundant (I can see this in the execution graph & also performance -> processing time). Processing time per batch nearly doubles or triples. This additional & redundant processing causes each batch to run as much as 2.5 times slower compared to runs without the union - the union for most batches does not alter the original DStream (union with an empty set). If I cache the DStream from job 1 (red block output), performance improves substantially but we hit out-of-memory errors within a few hours. What is the recommended way to cache/unpersist in such a scenario? There is no dstream-level "unpersist". Setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help. Still seeing out of memory errors was: We have a streaming application containing approximately 12 stages every batch, running in streaming mode (4 sec batches). Each stage persists output to cassandra the pipeline stages stage 1 ---> receive Stream A --> map --> filter -> (union with another stream B) --> map --> groupbykey --> transform --> reducebykey --> map we go thro' few more stages of transforms and save to database. 
Around stage 5, we union the output of Dstream from stage 1 (in red) with another stream (generated by split during stage 2) and save that state It appears the whole execution thus far is repeated which is redundant (I can see this in execution graph & also performance -> processing time). Processing time per batch nearly doubles or triples. This additional & redundant processing cause each batch to run as much as 2.5 times slower compared to runs without the union - union for most batches does not alter the original DStream (union with an empty set). If I cache the DStream (red block output), performance improves substantially but hit out of memory errors within few hours. What is the recommended way to cache/unpersist in such a scenario? there is no dstream level "unpersist" setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help. Still seeing out of memory errors > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > Fix For: 1.4.2 > > > We have a streaming application containing approximately 12 jobs every batch, > running in streaming mode (4 sec batches). Each job writes output to cassandra; > each job can contain several stages. > job 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go through a few more jobs of transforms and save to database. > Around stage 5, we union the output of the Dstream from job 1 (in red) with > another stream (generated by split during job 2) and save that state. > It appears the whole execution thus far is repeated, which is redundant (I can > see this in the execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. 
> This additional & redundant processing causes each batch to run as much as 2.5 > times slower compared to runs without the union - the union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream from job 1 (red block output), performance improves substantially but we > hit out-of-memory errors within a few hours. > What is the recommended way to cache/unpersist in such a scenario? There is > no dstream-level "unpersist". > Setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail
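The behavior the reporter describes, the whole upstream lineage re-running when the union consumes the same uncached DStream a second time, matches Spark's lazy recomputation model. A toy model in plain Python (hypothetical names; not Spark code) of why persisting the shared branch roughly halves the work per batch:

```python
# Counter for how often the shared upstream work runs.
calls = {"expensive_stage": 0}

def expensive_stage(x):
    # Stands in for the receive/map/filter/groupByKey chain of job 1.
    calls["expensive_stage"] += 1
    return x * 2

def run_batch(data, persist):
    if persist:
        branch = [expensive_stage(x) for x in data]  # computed once, reused
        direct_output = branch
        unioned_output = branch + []                 # union reuses the result
    else:
        # Without caching, each consumer re-runs the producing stages,
        # which models the redundant work visible in the execution graph.
        direct_output = [expensive_stage(x) for x in data]
        unioned_output = [expensive_stage(x) for x in data] + []
    return direct_output, unioned_output
```

With persist=False the stage runs twice per batch instead of once, consistent with the reported roughly 2x-2.5x slowdown; persisting trades that recomputation for memory, which is why a bounded retention/unpersist policy then matters.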
[jira] [Assigned] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph
[ https://issues.apache.org/jira/browse/SPARK-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13355: Assignee: Xiangrui Meng (was: Apache Spark) > Replace GraphImpl.fromExistingRDDs by Graph > --- > > Key: SPARK-13355 > URL: https://issues.apache.org/jira/browse/SPARK-13355 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We > call it in LDA without validating this requirement, so it might introduce > errors. Replacing it with `Graph.apply` would be safer and more appropriate because > it is a public API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph
[ https://issues.apache.org/jira/browse/SPARK-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149685#comment-15149685 ] Apache Spark commented on SPARK-13355: -- User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/11226 > Replace GraphImpl.fromExistingRDDs by Graph > --- > > Key: SPARK-13355 > URL: https://issues.apache.org/jira/browse/SPARK-13355 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We > call it in LDA without validating this requirement, so it might introduce > errors. Replacing it with `Graph.apply` would be safer and more appropriate because > it is a public API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph
[ https://issues.apache.org/jira/browse/SPARK-13355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13355: Assignee: Apache Spark (was: Xiangrui Meng) > Replace GraphImpl.fromExistingRDDs by Graph > --- > > Key: SPARK-13355 > URL: https://issues.apache.org/jira/browse/SPARK-13355 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.3.1, 1.4.1, 1.5.2, 1.6.0, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > `GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We > call it in LDA without validating this requirement, so it might introduce > errors. Replacing it with `Graph.apply` would be safer and more appropriate because > it is a public API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13355) Replace GraphImpl.fromExistingRDDs by Graph
Xiangrui Meng created SPARK-13355: - Summary: Replace GraphImpl.fromExistingRDDs by Graph Key: SPARK-13355 URL: https://issues.apache.org/jira/browse/SPARK-13355 Project: Spark Issue Type: Bug Components: ML, MLlib Affects Versions: 1.6.0, 1.5.2, 1.4.1, 1.3.1, 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng `GraphImpl.fromExistingRDDs` expects a preprocessed vertex RDD as input. We call it in LDA without validating this requirement, so it might introduce errors. Replacing it with `Graph.apply` would be safer and more appropriate because it is a public API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12776) Implement Python API for Datasets
[ https://issues.apache.org/jira/browse/SPARK-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149621#comment-15149621 ] Gustavo Salazar Torres commented on SPARK-12776: I will work on some code following what was done in Dataset.scala > Implement Python API for Datasets > - > > Key: SPARK-12776 > URL: https://issues.apache.org/jira/browse/SPARK-12776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Kevin Cox >Priority: Minor > > Now that the Dataset API is in Scala and Java it would be awesome to see it > show up in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13354) Push filter through outer join when the condition can filter out empty rows
Davies Liu created SPARK-13354: -- Summary: Push filter through outer join when the condition can filter out empty rows Key: SPARK-13354 URL: https://issues.apache.org/jira/browse/SPARK-13354 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu Assignee: Davies Liu For a query {code} select * from a left outer join b on a.a = b.a where b.b > 10 {code} The condition `b.b > 10` filters out every row whose b side is empty (all null). In this case, we should use an inner join and push the filter down into b. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
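The rewrite is sound because a predicate like {{b.b > 10}} is null-rejecting: it evaluates to false on exactly the null-padded rows that a left outer join adds over an inner join, so after the filter the two plans return the same rows. A small model of that equivalence (plain Python, illustrative only, not the optimizer rule itself):

```python
# Toy relational model: rows are dicts, joined pairs are (a_row, b_row)
# tuples, and a missing right-side match is padded with None (SQL NULL).
def left_outer_join(a_rows, b_rows, key):
    out = []
    for a in a_rows:
        matches = [b for b in b_rows if b[key] == a[key]]
        if matches:
            out.extend((a, b) for b in matches)
        else:
            out.append((a, None))  # unmatched left row, null-padded right side
    return out

def inner_join(a_rows, b_rows, key):
    return [(a, b) for a in a_rows for b in b_rows if b[key] == a[key]]

def null_rejecting_filter(row):
    a, b = row
    # "b.b > 10" is false whenever the b side is null, so it rejects
    # precisely the rows the outer join added.
    return b is not None and b["b"] > 10
```

Because the filtered results coincide, the optimizer may replace the outer join with an inner join and then push the predicate below the join into b's scan.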
[jira] [Created] (SPARK-13353) Use UnsafeRowSerializer to collect DataFrame
Davies Liu created SPARK-13353: -- Summary: Use UnsafeRowSerializer to collect DataFrame Key: SPARK-13353 URL: https://issues.apache.org/jira/browse/SPARK-13353 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu UnsafeRowSerializer should be more efficient than JavaSerializer or KryoSerializer for DataFrame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13351) Column pruning fails on expand
[ https://issues.apache.org/jira/browse/SPARK-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13351: Assignee: Apache Spark (was: Davies Liu) > Column pruning fails on expand > -- > > Key: SPARK-13351 > URL: https://issues.apache.org/jira/browse/SPARK-13351 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > The optimizer can't prune the columns in Expand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13351) Column pruning fails on expand
[ https://issues.apache.org/jira/browse/SPARK-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149571#comment-15149571 ] Apache Spark commented on SPARK-13351: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/11225 > Column pruning fails on expand > -- > > Key: SPARK-13351 > URL: https://issues.apache.org/jira/browse/SPARK-13351 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > The optimizer can't prune the columns in Expand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13351) Column pruning fails on expand
[ https://issues.apache.org/jira/browse/SPARK-13351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13351: Assignee: Davies Liu (was: Apache Spark) > Column pruning fails on expand > -- > > Key: SPARK-13351 > URL: https://issues.apache.org/jira/browse/SPARK-13351 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Davies Liu >Assignee: Davies Liu > > The optimizer can't prune the columns in Expand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13352) BlockFetch does not scale well on large block
Davies Liu created SPARK-13352: -- Summary: BlockFetch does not scale well on large block Key: SPARK-13352 URL: https://issues.apache.org/jira/browse/SPARK-13352 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu BlockManager.getRemoteBytes() performs poorly on large blocks: {code} test("block manager") { val N = 500 << 20 val bm = sc.env.blockManager val blockId = TaskResultBlockId(0) val buffer = ByteBuffer.allocate(N) buffer.limit(N) bm.putBytes(blockId, buffer, StorageLevel.MEMORY_AND_DISK_SER) val result = bm.getRemoteBytes(blockId) assert(result.isDefined) assert(result.get.limit() === (N)) } {code} Here are runtimes for different block sizes: {code} 50M 3 seconds 100M 7 seconds 250M 33 seconds 500M 2 min {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13351) Column pruning fails on expand
Davies Liu created SPARK-13351: -- Summary: Column pruning fails on expand Key: SPARK-13351 URL: https://issues.apache.org/jira/browse/SPARK-13351 Project: Spark Issue Type: Bug Components: SQL Reporter: Davies Liu Assignee: Davies Liu The optimizer can't prune the columns in Expand. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12776) Implement Python API for Datasets
[ https://issues.apache.org/jira/browse/SPARK-12776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149525#comment-15149525 ] Gustavo Salazar Torres commented on SPARK-12776: I can work on this, any pointers? > Implement Python API for Datasets > - > > Key: SPARK-12776 > URL: https://issues.apache.org/jira/browse/SPARK-12776 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Kevin Cox >Priority: Minor > > Now that the Dataset API is in Scala and Java it would be awesome to see it > show up in PySpark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application cause big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] krishna ramachandran updated SPARK-13349: - Summary: adding a split and union to a streaming application cause big performance hit (was: adding a split and union to a streaming application causes big performance hit) > adding a split and union to a streaming application cause big performance hit > - > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > Fix For: 1.4.2 > > > We have a streaming application containing approximately 12 stages every > batch, running in streaming mode (4 sec batches). Each stage persists output > to cassandra > the pipeline stages > stage 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go through a few more stages of transforms and save to database. > Around stage 5, we union the output of the Dstream from stage 1 (in red) with > another stream (generated by split during stage 2) and save that state. > It appears the whole execution thus far is repeated, which is redundant (I can > see this in the execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing causes each batch to run as much as 2.5 > times slower compared to runs without the union - the union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream (red block output), performance improves substantially but we hit out-of- > memory errors within a few hours. > What is the recommended way to cache/unpersist in such a scenario? There is > no dstream-level "unpersist". > Setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help. 
Still seeing out of > memory errors -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13349) adding a split and union to a streaming application causes big performance hit
[ https://issues.apache.org/jira/browse/SPARK-13349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] krishna ramachandran updated SPARK-13349: - Summary: adding a split and union to a streaming application causes big performance hit (was: enabling cache causes out of memory error. Caching DStream helps reduce processing time in a streaming application but get out of memory errors) > adding a split and union to a streaming application causes big performance hit > -- > > Key: SPARK-13349 > URL: https://issues.apache.org/jira/browse/SPARK-13349 > Project: Spark > Issue Type: Improvement >Affects Versions: 1.4.1 >Reporter: krishna ramachandran >Priority: Critical > Fix For: 1.4.2 > > > We have a streaming application containing approximately 12 stages every > batch, running in streaming mode (4 sec batches). Each stage persists output > to cassandra > the pipeline stages > stage 1 > ---> receive Stream A --> map --> filter -> (union with another stream B) --> > map --> groupbykey --> transform --> reducebykey --> map > we go thro' few more stages of transforms and save to database. > Around stage 5, we union the output of Dstream from stage 1 (in red) with > another stream (generated by split during stage 2) and save that state > It appears the whole execution thus far is repeated which is redundant (I can > see this in execution graph & also performance -> processing time). > Processing time per batch nearly doubles or triples. > This additional & redundant processing cause each batch to run as much as 2.5 > times slower compared to runs without the union - union for most batches does > not alter the original DStream (union with an empty set). If I cache the > DStream (red block output), performance improves substantially but hit out of > memory errors within few hours. > What is the recommended way to cache/unpersist in such a scenario? 
> There is no DStream-level "unpersist". Setting "spark.streaming.unpersist" to true and > streamingContext.remember("duration") did not help; still seeing out-of-memory errors.
[jira] [Created] (SPARK-13350) Configuration documentation incorrectly states that PYSPARK_PYTHON's default is "python"
Christopher Aycock created SPARK-13350: -- Summary: Configuration documentation incorrectly states that PYSPARK_PYTHON's default is "python" Key: SPARK-13350 URL: https://issues.apache.org/jira/browse/SPARK-13350 Project: Spark Issue Type: Documentation Components: Documentation Reporter: Christopher Aycock Priority: Trivial The configuration documentation states that the environment variable PYSPARK_PYTHON has a default value of {{python}}: http://spark.apache.org/docs/latest/configuration.html In fact, the default is {{python2.7}}: https://github.com/apache/spark/blob/4f60651cbec1b4c9cc2e6d832ace77e89a233f3a/bin/pyspark#L39-L45 The change that introduced this was discussed here: https://github.com/apache/spark/pull/2651 Would it be possible to highlight this in the documentation?
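The fallback the reporter points to can be modeled outside the launcher script. The sketch below is an illustrative Python model of the bin/pyspark behavior described above, not the script itself (the real shell script also special-cases IPython options):

```python
def resolve_pyspark_python(env):
    # Sketch of the documented-vs-actual behavior: when PYSPARK_PYTHON is
    # unset, the launcher falls back to "python2.7", not the "python" the
    # configuration docs claim.
    return env.get("PYSPARK_PYTHON", "python2.7")

print(resolve_pyspark_python({}))                             # python2.7
print(resolve_pyspark_python({"PYSPARK_PYTHON": "python3"}))  # python3
```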
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-13333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149507#comment-15149507 ] Xiao Li commented on SPARK-13333: - Will try to submit a PR tonight. When users specify a seed, I am unable to find a reason why we need to add partition id into the seed value. > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-13333 > URL: https://issues.apache.org/jira/browse/SPARK-13333 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails, where randn produces the same results on the original DataFrame > and the copy before unionAll but fails to do so after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.{col, randn} > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code}
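The behavior Xiao Li describes can be modeled outside Spark. If each partition seeds its generator with (user seed + partition index), the same logical rows yield different random values whenever a plan rewrite lands them in a different partition. This is an illustrative sketch only; Spark's actual RNG and seeding scheme differ:

```python
import random

def randn_for_partition(seed, partition_index, n):
    # Per-partition stream derived from (seed + partition index), as many
    # distributed RNG schemes do. Moving the same rows to another partition
    # changes the values even though the user-specified seed is fixed.
    rng = random.Random(seed + partition_index)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

rows_in_partition0 = randn_for_partition(12345, 0, 2)
rows_in_partition1 = randn_for_partition(12345, 1, 2)
print(rows_in_partition0 == rows_in_partition1)  # False: same seed, different partition id
```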
[jira] [Created] (SPARK-13349) enabling cache causes out of memory error. Caching DStream helps reduce processing time in a streaming application but get out of memory errors
krishna ramachandran created SPARK-13349: Summary: enabling cache causes out of memory error. Caching DStream helps reduce processing time in a streaming application but get out of memory errors Key: SPARK-13349 URL: https://issues.apache.org/jira/browse/SPARK-13349 Project: Spark Issue Type: Improvement Affects Versions: 1.4.1 Reporter: krishna ramachandran Priority: Critical Fix For: 1.4.2 We have a streaming application containing approximately 12 stages every batch, running in streaming mode (4 sec batches). Each stage persists output to Cassandra. The pipeline stages: stage 1 ---> receive Stream A --> map --> filter -> (union with another stream B) --> map --> groupByKey --> transform --> reduceByKey --> map. We go through a few more stages of transforms and save to the database. Around stage 5, we union the output of the DStream from stage 1 (in red) with another stream (generated by a split during stage 2) and save that state. It appears the whole execution thus far is repeated, which is redundant (I can see this in the execution graph & also in performance -> processing time). Processing time per batch nearly doubles or triples. This additional & redundant processing causes each batch to run as much as 2.5 times slower compared to runs without the union - the union for most batches does not alter the original DStream (union with an empty set). If I cache the DStream (red block output), performance improves substantially but we hit out-of-memory errors within a few hours. What is the recommended way to cache/unpersist in such a scenario? There is no DStream-level "unpersist". Setting "spark.streaming.unpersist" to true and streamingContext.remember("duration") did not help; still seeing out-of-memory errors.
[jira] [Commented] (SPARK-13298) DAG visualization does not render correctly for jobs
[ https://issues.apache.org/jira/browse/SPARK-13298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149457#comment-15149457 ] Shixiong Zhu commented on SPARK-13298: -- Do you have a reproducer? > DAG visualization does not render correctly for jobs > > > Key: SPARK-13298 > URL: https://issues.apache.org/jira/browse/SPARK-13298 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Lucas Woltmann > Attachments: dag_full.png, dag_viz.png > > > Whenever I try to open the DAG for a job, I get something like this: > !dag_viz.png! > Obviously the svg doesn't get resized, but if I resize it manually, only the > first of four stages in the DAG is shown. > The js console says (variable v is null in peg$c34): > {code:javascript} > Uncaught TypeError: Cannot read property '3' of null > peg$c34 @ graphlib-dot.min.js:1 > peg$parseidDef @ graphlib-dot.min.js:1 > peg$parseaList @ graphlib-dot.min.js:1 > peg$parseattrListBlock @ graphlib-dot.min.js:1 > peg$parseattrList @ graphlib-dot.min.js:1 > peg$parsenodeStmt @ graphlib-dot.min.js:1 > peg$parsestmt @ graphlib-dot.min.js:1 > peg$parsestmtList @ graphlib-dot.min.js:1 > peg$parsesubgraphStmt @ graphlib-dot.min.js:1 > peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1 > peg$parseedgeStmt @ graphlib-dot.min.js:1 > peg$parsestmt @ graphlib-dot.min.js:1 > peg$parsestmtList @ graphlib-dot.min.js:1 > peg$parsesubgraphStmt @ graphlib-dot.min.js:1 > peg$parsenodeIdOrSubgraph @ graphlib-dot.min.js:1 > peg$parseedgeStmt @ graphlib-dot.min.js:1 > peg$parsestmt @ graphlib-dot.min.js:1 > peg$parsestmtList @ graphlib-dot.min.js:1 > peg$parsegraphStmt @ graphlib-dot.min.js:1 > parse @ graphlib-dot.min.js:2 > readOne @ graphlib-dot.min.js:2 > renderDot @ spark-dag-viz.js:281 > (anonymous function) @ spark-dag-viz.js:248 > (anonymous function) @ d3.min.js: > 3Y @ d3.min.js:1 > _a.each @ d3.min.js:3 > renderDagVizForJob @ spark-dag-viz.js:207 > renderDagViz @ 
spark-dag-viz.js:163 > toggleDagViz @ spark-dag-viz.js:100 > onclick @ ?id=2:153 > {code} > (tested in Firefox 44.0.1 and Chromium 48.0.2564.103)
[jira] [Issue Comment Deleted] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeremy updated SPARK-10759: --- Comment: was deleted (was: Cannot add example for code that doesn't exist.) > Missing Python code example in ML Programming guide > --- > > Key: SPARK-10759 > URL: https://issues.apache.org/jira/browse/SPARK-10759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Raela Wang >Assignee: Apache Spark >Priority: Minor > Labels: starter > > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split
[jira] [Updated] (SPARK-13346) Using DataFrames iteratively leads to massive query plans, which slows execution
[ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-13346: -- Summary: Using DataFrames iteratively leads to massive query plans, which slows execution (was: DataFrame caching is not handled well during planning or execution) > Using DataFrames iteratively leads to massive query plans, which slows > execution > > > Key: SPARK-13346 > URL: https://issues.apache.org/jira/browse/SPARK-13346 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley > > I have an iterative algorithm based on DataFrames, and the query plan grows > very quickly with each iteration. Caching the current DataFrame at the end > of an iteration does not fix the problem. However, converting the DataFrame > to an RDD and back at the end of each iteration does fix the problem. > Printing the query plans shows that the plan explodes quickly (10 lines, to > several hundred lines, to several thousand lines, ...) with successive > iterations. > The desired behavior is for the analyzer to recognize that a big chunk of the > query plan does not need to be computed since it is already cached. The > computation on each iteration should be the same. > If useful, I can push (complex) code to reproduce the issue. But it should > be simple to see if you create an iterative algorithm which produces a new > DataFrame from an old one on each iteration.
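The explosion the report describes (10 lines, to several hundred, to several thousand) is what happens when each iteration's plan embeds the previous plan more than once and the planner sees a tree rather than a DAG. A toy model, not Spark code:

```python
def iterate(plan, iterations):
    # Each iteration builds a new plan that references the previous one twice
    # (e.g. a self-join or a union with a derived copy), so without cutting
    # the lineage the node count roughly doubles per iteration.
    for _ in range(iterations):
        plan = ("union", plan, plan)
    return plan

def node_count(plan):
    if isinstance(plan, str):
        return 1
    return 1 + sum(node_count(child) for child in plan[1:])

print([node_count(iterate("scan", n)) for n in (1, 5, 10)])  # [3, 63, 2047]
```

The RDD round-trip the reporter uses as a workaround effectively replaces the whole accumulated subtree with a fresh leaf, which is why it keeps the plan small.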
[jira] [Assigned] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13283: Assignee: Apache Spark > Spark doesn't escape column names when creating table on JDBC > - > > Key: SPARK-13283 > URL: https://issues.apache.org/jira/browse/SPARK-13283 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński >Assignee: Apache Spark > > Hi, > I have following problem. > I have DF where one of the columns has 'from' name. > {code} > root > |-- from: decimal(20,0) (nullable = true) > {code} > When I'm saving it to MySQL database I'm getting error: > {code} > Py4JJavaError: An error occurred while calling o183.jdbc. > : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an > error in your SQL syntax; check the manual that corresponds to your MySQL > server version for the right syntax to use near 'from DECIMAL(20,0) , ' at > line 1 > {code} > I think the problem is that Spark doesn't escape column names with ` sign on > creating table. > {code} > `from` > {code}
[jira] [Assigned] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13283: Assignee: (was: Apache Spark) > Spark doesn't escape column names when creating table on JDBC > - > > Key: SPARK-13283 > URL: https://issues.apache.org/jira/browse/SPARK-13283 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > Hi, > I have following problem. > I have DF where one of the columns has 'from' name. > {code} > root > |-- from: decimal(20,0) (nullable = true) > {code} > When I'm saving it to MySQL database I'm getting error: > {code} > Py4JJavaError: An error occurred while calling o183.jdbc. > : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an > error in your SQL syntax; check the manual that corresponds to your MySQL > server version for the right syntax to use near 'from DECIMAL(20,0) , ' at > line 1 > {code} > I think the problem is that Spark doesn't escape column names with ` sign on > creating table. > {code} > `from` > {code}
[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149396#comment-15149396 ] Apache Spark commented on SPARK-13283: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/11224 > Spark doesn't escape column names when creating table on JDBC > - > > Key: SPARK-13283 > URL: https://issues.apache.org/jira/browse/SPARK-13283 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > Hi, > I have following problem. > I have DF where one of the columns has 'from' name. > {code} > root > |-- from: decimal(20,0) (nullable = true) > {code} > When I'm saving it to MySQL database I'm getting error: > {code} > Py4JJavaError: An error occurred while calling o183.jdbc. > : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an > error in your SQL syntax; check the manual that corresponds to your MySQL > server version for the right syntax to use near 'from DECIMAL(20,0) , ' at > line 1 > {code} > I think the problem is that Spark doesn't escape column names with ` sign on > creating table. > {code} > `from` > {code}
[jira] [Commented] (SPARK-13283) Spark doesn't escape column names when creating table on JDBC
[ https://issues.apache.org/jira/browse/SPARK-13283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149387#comment-15149387 ] Xiu (Joe) Guo commented on SPARK-13283: --- Yes, it is a different problem from [SPARK-13297|https://issues.apache.org/jira/browse/SPARK-13297]. We should escape the column name based on JdbcDialect. > Spark doesn't escape column names when creating table on JDBC > - > > Key: SPARK-13283 > URL: https://issues.apache.org/jira/browse/SPARK-13283 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Maciej Bryński > > Hi, > I have following problem. > I have DF where one of the columns has 'from' name. > {code} > root > |-- from: decimal(20,0) (nullable = true) > {code} > When I'm saving it to MySQL database I'm getting error: > {code} > Py4JJavaError: An error occurred while calling o183.jdbc. > : com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: You have an > error in your SQL syntax; check the manual that corresponds to your MySQL > server version for the right syntax to use near 'from DECIMAL(20,0) , ' at > line 1 > {code} > I think the problem is that Spark doesn't escape column names with ` sign on > creating table. > {code} > `from` > {code}
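Dialect-aware quoting of the kind suggested in the comment above can be sketched as follows. This is a hypothetical helper for illustration; it does not reproduce Spark's JdbcDialect API, and MySQL's backtick rules are simplified:

```python
def quote_identifier(name, quote="`"):
    # Wrap the column name in the dialect's quote character (backtick for
    # MySQL), doubling any embedded quote characters, so reserved words like
    # `from` are legal in the generated DDL.
    return quote + name.replace(quote, quote * 2) + quote

columns = [("from", "DECIMAL(20,0)"), ("id", "BIGINT")]
ddl = ", ".join(f"{quote_identifier(c)} {t}" for c, t in columns)
print(f"CREATE TABLE t ({ddl})")
# CREATE TABLE t (`from` DECIMAL(20,0), `id` BIGINT)
```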
[jira] [Commented] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code
[ https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149349#comment-15149349 ] Felix Cheung commented on SPARK-12846: -- changes to fix Jenkins were in PR https://github.com/apache/spark/pull/10792 > Follow up SPARK-12707, Update documentation and other related code > -- > > Key: SPARK-12846 > URL: https://issues.apache.org/jira/browse/SPARK-12846 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Jeff Zhang > > Add the background context mail thread > http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html
[jira] [Commented] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code
[ https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149343#comment-15149343 ] Felix Cheung commented on SPARK-12846: -- actually I was referring to how Jenkins/tests were broken by https://github.com/apache/spark/pull/10658 not the documentation... > Follow up SPARK-12707, Update documentation and other related code > -- > > Key: SPARK-12846 > URL: https://issues.apache.org/jira/browse/SPARK-12846 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Reporter: Jeff Zhang > > Add the background context mail thread > http://apache-spark-developers-list.1001551.n3.nabble.com/Are-we-running-SparkR-tests-in-Jenkins-td16034.html
[jira] [Created] (SPARK-13348) Avoid duplicated broadcasts
Davies Liu created SPARK-13348: -- Summary: Avoid duplicated broadcasts Key: SPARK-13348 URL: https://issues.apache.org/jira/browse/SPARK-13348 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu A broadcast table can be used multiple times in a query; we should cache such broadcasts.
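One way to read the proposal: key each broadcast by the table it materializes and reuse the result on a cache hit. A minimal sketch of that idea (hypothetical cache and key scheme, not Spark's implementation):

```python
class BroadcastCache:
    # Memoize broadcast builds by a plan key so a table referenced twice in
    # one query is materialized and shipped only once.
    def __init__(self):
        self._cache = {}
        self.builds = 0

    def get_or_build(self, plan_key, build):
        if plan_key not in self._cache:
            self.builds += 1
            self._cache[plan_key] = build()
        return self._cache[plan_key]

cache = BroadcastCache()
cache.get_or_build("dim_date", lambda: list(range(3)))
cache.get_or_build("dim_date", lambda: list(range(3)))
print(cache.builds)  # 1: the second use hits the cache
```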
[jira] [Created] (SPARK-13347) Reuse the shuffle for duplicated exchange
Davies Liu created SPARK-13347: -- Summary: Reuse the shuffle for duplicated exchange Key: SPARK-13347 URL: https://issues.apache.org/jira/browse/SPARK-13347 Project: Spark Issue Type: Improvement Components: SQL Reporter: Davies Liu In TPCDS query 47, the same exchange is used three times, we should re-use the ShuffleRowRDD to skip the duplicated stages. {code} with v1 as( select i_category, i_brand, s_store_name, s_company_name, d_year, d_moy, sum(ss_sales_price) sum_sales, avg(sum(ss_sales_price)) over (partition by i_category, i_brand, s_store_name, s_company_name, d_year) avg_monthly_sales, rank() over (partition by i_category, i_brand, s_store_name, s_company_name order by d_year, d_moy) rn from item, store_sales, date_dim, store where ss_item_sk = i_item_sk and ss_sold_date_sk = d_date_sk and ss_store_sk = s_store_sk and ( d_year = 1999 or ( d_year = 1999-1 and d_moy =12) or ( d_year = 1999+1 and d_moy =1) ) group by i_category, i_brand, s_store_name, s_company_name, d_year, d_moy), v2 as( select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, v1.d_year, v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, v1_lag.sum_sales psum, v1_lead.sum_sales nsum from v1, v1 v1_lag, v1 v1_lead where v1.i_category = v1_lag.i_category and v1.i_category = v1_lead.i_category and v1.i_brand = v1_lag.i_brand and v1.i_brand = v1_lead.i_brand and v1.s_store_name = v1_lag.s_store_name and v1.s_store_name = v1_lead.s_store_name and v1.s_company_name = v1_lag.s_company_name and v1.s_company_name = v1_lead.s_company_name and v1.rn = v1_lag.rn + 1 and v1.rn = v1_lead.rn - 1) select * from v2 where d_year = 1999 and avg_monthly_sales > 0 and case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1 order by sum_sales - avg_monthly_sales, 3 limit 100 {code} Since the SparkPlan is just a tree (not DAG), we can only do this in SparkPlan.execute() or final rule. 
We should also have a way to compare two SparkPlans to determine whether they produce the same result (they may have different exprIds, so we should compare them after binding). A quick experiment showed that we could get a 2X improvement on this query.
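The "compare after binding" step can be illustrated with a toy canonicalizer that rewrites per-query expression ids to positional ones before comparing plan strings. This is an illustration only; Spark compares plan trees, not strings:

```python
import re

def canonicalize(plan: str) -> str:
    # Replace each distinct exprId (the "#12" in "i_brand#12") with a
    # positional id, so two plans that differ only in exprIds compare equal.
    mapping = {}
    def repl(match):
        expr_id = match.group(2)
        if expr_id not in mapping:
            mapping[expr_id] = str(len(mapping))
        return match.group(1) + "#" + mapping[expr_id]
    return re.sub(r"(\w+)#(\d+)", repl, plan)

p1 = "Exchange(hashpartitioning(i_brand#12), Scan store_sales#7)"
p2 = "Exchange(hashpartitioning(i_brand#45), Scan store_sales#33)"
print(canonicalize(p1) == canonicalize(p2))  # True: same plan up to exprIds
```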
[jira] [Comment Edited] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149309#comment-15149309 ] Henry Saputra edited comment on SPARK-5158 at 2/16/16 9:14 PM: --- Hi all, it seems all PRs for this issue are closed. This PR: https://github.com/apache/spark/pull/265 was closed claiming a more recent PR was being worked on, which I assume is this one: https://github.com/apache/spark/pull/4106, but that one was also closed due to inactivity. Looking at the issues filed that are closed as duplicates of this one, there is a need and interest in getting standalone mode to access secured HDFS, given that the active user's keytab is already available to the machines that run Spark. was (Author: hsaputra): All, the PR for this issues are closed. This PR: https://github.com/apache/spark/pull/265 is closed claiming there is a more recent PR is being work on, which I assume is this one: https://github.com/apache/spark/pull/4106 but this one is also closed due to inactivity. Looking at the issues filed that are closed as duplicate for this one, there is a need and interest to get standalone mode to access secured HDFS given the active users keytab already available to the machines that run Spark. > Allow for keytab-based HDFS security in Standalone mode > --- > > Key: SPARK-5158 > URL: https://issues.apache.org/jira/browse/SPARK-5158 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Matthew Cheah >Priority: Critical > > There have been a handful of patches for allowing access to Kerberized HDFS > clusters in standalone mode. The main reason we haven't accepted these > patches has been that they rely on insecure distribution of token files from > the driver to the other components. > As a simpler solution, I wonder if we should just provide a way to have the > Spark driver and executors independently log in and acquire credentials using > a keytab. 
This would work for users who have dedicated, single-tenant > Spark clusters (i.e. they are willing to have a keytab on every machine > running Spark for their application). It wouldn't address all possible > deployment scenarios, but if it's simple I think it's worth considering. > This would also work for Spark streaming jobs, which often run on dedicated > hardware since they are long-running services.
[jira] [Commented] (SPARK-5158) Allow for keytab-based HDFS security in Standalone mode
[ https://issues.apache.org/jira/browse/SPARK-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149309#comment-15149309 ] Henry Saputra commented on SPARK-5158: -- All, the PRs for this issue are closed. This PR: https://github.com/apache/spark/pull/265 was closed claiming a more recent PR was being worked on, which I assume is this one: https://github.com/apache/spark/pull/4106, but that one was also closed due to inactivity. Looking at the issues filed that are closed as duplicates of this one, there is a need and interest in getting standalone mode to access secured HDFS, given that the active user's keytab is already available to the machines that run Spark. > Allow for keytab-based HDFS security in Standalone mode > --- > > Key: SPARK-5158 > URL: https://issues.apache.org/jira/browse/SPARK-5158 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Matthew Cheah >Priority: Critical > > There have been a handful of patches for allowing access to Kerberized HDFS > clusters in standalone mode. The main reason we haven't accepted these > patches has been that they rely on insecure distribution of token files from > the driver to the other components. > As a simpler solution, I wonder if we should just provide a way to have the > Spark driver and executors independently log in and acquire credentials using > a keytab. This would work for users who have dedicated, single-tenant > Spark clusters (i.e. they are willing to have a keytab on every machine > running Spark for their application). It wouldn't address all possible > deployment scenarios, but if it's simple I think it's worth considering. > This would also work for Spark streaming jobs, which often run on dedicated > hardware since they are long-running services. 
[jira] [Resolved] (SPARK-13308) ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error cases
[ https://issues.apache.org/jira/browse/SPARK-13308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-13308. -- Resolution: Fixed Fix Version/s: 2.0.0 > ManagedBuffers passed to OneToOneStreamManager need to be freed in non-error > cases > -- > > Key: SPARK-13308 > URL: https://issues.apache.org/jira/browse/SPARK-13308 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.0 > > > Spark's OneToOneStreamManager does not free ManagedBuffers that are passed to > it except in certain error cases. Instead, ManagedBuffers should be freed > once messages created from them are consumed and destroyed by lower layers of > the Netty networking code.
[jira] [Commented] (SPARK-10759) Missing Python code example in ML Programming guide
[ https://issues.apache.org/jira/browse/SPARK-10759?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149266#comment-15149266 ] Jeremy commented on SPARK-10759: Cannot add example for code that doesn't exist. > Missing Python code example in ML Programming guide > --- > > Key: SPARK-10759 > URL: https://issues.apache.org/jira/browse/SPARK-10759 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.5.0 >Reporter: Raela Wang >Assignee: Apache Spark >Priority: Minor > Labels: starter > > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation > http://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-train-validation-split
[jira] [Created] (SPARK-13346) DataFrame caching is not handled well during planning or execution
Joseph K. Bradley created SPARK-13346: - Summary: DataFrame caching is not handled well during planning or execution Key: SPARK-13346 URL: https://issues.apache.org/jira/browse/SPARK-13346 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.0.0 Reporter: Joseph K. Bradley I have an iterative algorithm based on DataFrames, and the query plan grows very quickly with each iteration. Caching the current DataFrame at the end of an iteration does not fix the problem. However, converting the DataFrame to an RDD and back at the end of each iteration does fix the problem. Printing the query plans shows that the plan explodes quickly (10 lines, to several hundred lines, to several thousand lines, ...) with successive iterations. The desired behavior is for the analyzer to recognize that a big chunk of the query plan does not need to be computed since it is already cached. The computation on each iteration should be the same. If useful, I can push (complex) code to reproduce the issue. But it should be simple to see if you create an iterative algorithm which produces a new DataFrame from an old one on each iteration.
[jira] [Commented] (SPARK-13346) DataFrame caching is not handled well during planning or execution
[ https://issues.apache.org/jira/browse/SPARK-13346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149227#comment-15149227 ] Joseph K. Bradley commented on SPARK-13346: --- CC: [~andrewor14] [~joshrosen] whom I spoke with about this issue > DataFrame caching is not handled well during planning or execution > -- > > Key: SPARK-13346 > URL: https://issues.apache.org/jira/browse/SPARK-13346 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Joseph K. Bradley > > I have an iterative algorithm based on DataFrames, and the query plan grows > very quickly with each iteration. Caching the current DataFrame at the end > of an iteration does not fix the problem. However, converting the DataFrame > to an RDD and back at the end of each iteration does fix the problem. > Printing the query plans shows that the plan explodes quickly (10 lines, to > several hundred lines, to several thousand lines, ...) with successive > iterations. > The desired behavior is for the analyzer to recognize that a big chunk of the > query plan does not need to be computed since it is already cached. The > computation on each iteration should be the same. > If useful, I can push (complex) code to reproduce the issue. But it should > be simple to see if you create an iterative algorithm which produces a new > DataFrame from an old one on each iteration.
[jira] [Created] (SPARK-13345) Adding one way ANOVA to Spark ML stat
yuhao yang created SPARK-13345: -- Summary: Adding one way ANOVA to Spark ML stat Key: SPARK-13345 URL: https://issues.apache.org/jira/browse/SPARK-13345 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Priority: Minor One way ANOVA (https://en.wikipedia.org/wiki/One-way_analysis_of_variance) is used to determine whether there are any significant differences between the means of three or more independent (unrelated) groups. One prototype is in https://github.com/hhbyyh/StatisticsOnSpark/blob/master/src/main/ANOVA/OneWayANOVA.scala I'll send a PR if this is a feature of interest. This can be further enriched with Post-Hoc and Factorial ANOVA.
[jira] [Assigned] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12154: Assignee: Apache Spark > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah >Assignee: Apache Spark > > Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to work around - see the discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a breaking change for users who were using Jersey 1 in > their Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149176#comment-15149176 ] Apache Spark commented on SPARK-12154: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/11223 > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah > > Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to work around - see the discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a breaking change for users who were using Jersey 1 in > their Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12154: Assignee: (was: Apache Spark) > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah > > Fairly self-explanatory: Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to work around - see the discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a breaking change for users who were using Jersey 1 in > their Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149155#comment-15149155 ] Xiao Li commented on SPARK-1: - [~josephkb] I found the root cause. : ) In the genCode of Randn and Rand, the seed is user-provided. However, the partitionID could be different. {code} s"$rngTerm = new $className(${seed}L + org.apache.spark.TaskContext.getPartitionId());") {code} If you remove that, you will get the right answer. {code} s"$rngTerm = new $className(${seed}L);") {code} > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails, where randn produces the same results on the original DataFrame > and the copy before unionAll but fails to do so after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
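The effect of the `+ getPartitionId()` term can be demonstrated outside Spark. In this hypothetical sketch, `randn_column` (an illustrative name) mimics the generated code: one RNG per partition, seeded with the user seed plus the partition id. The same rows laid out across different partitions, as happens when a filter or union changes the physical plan, can get different random values:

```python
import random

def randn_column(partitions, seed):
    # One RNG per partition, seeded like the generated code above:
    # new Random(seed + partitionId). A row's value therefore depends
    # on which partition it lands in, not just on the seed.
    out = []
    for pid, rows in enumerate(partitions):
        rng = random.Random(seed + pid)
        out.extend(rng.gauss(0, 1) for _ in rows)
    return out

# The same two rows under two different physical layouts:
a = randn_column([[0], [1]], 12345)  # two partitions of one row each
b = randn_column([[0, 1]], 12345)    # one partition of two rows
```

Row 0 agrees in both layouts (it is the first draw from `Random(12345 + 0)` either way), but row 1 differs: the first draw from `Random(12346)` versus the second draw from `Random(12345)`. That is exactly the symptom reported here - the "same" randn column evaluated under two physical plans disagrees.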
[jira] [Resolved] (SPARK-13280) FileBasedWriteAheadLog logger name should be under o.a.s namespace
[ https://issues.apache.org/jira/browse/SPARK-13280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-13280. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.0.0 > FileBasedWriteAheadLog logger name should be under o.a.s namespace > -- > > Key: SPARK-13280 > URL: https://issues.apache.org/jira/browse/SPARK-13280 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.0.0 > > > The logger name in FileBasedWriteAheadLog is currently defined as: > {code} > override protected val logName = s"WriteAheadLogManager $callerNameTag" > {code} > That has two problems: > - It's not under the usual "org.apache.spark" namespace so changing the > logging configuration for that package does not affect it > - we've seen cases where {{$callerNameTag}} was empty, in which case the > logger name would have a trailing space, making it impossible to disable it > using a properties file. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149139#comment-15149139 ] Alexander Ulanov commented on SPARK-9273: - Hi [~gsateesh110], Besides the one mentioned by Yuhao, there is SparkNet, which allows using Caffe. In the future, I plan to switch the present neural network implementation in Spark to tensors and probably implement CNN, which is easier with tensors: https://github.com/avulanov/spark/tree/mlp-tensor Best regards, Alexander > Add Convolutional Neural network to Spark MLlib > --- > > Key: SPARK-9273 > URL: https://issues.apache.org/jira/browse/SPARK-9273 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: yuhao yang >Assignee: yuhao yang > > Add Convolutional Neural network to Spark MLlib -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong
[ https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-11701. --- Resolution: Duplicate > YARN - dynamic allocation and speculation active task accounting wrong > -- > > Key: SPARK-11701 > URL: https://issues.apache.org/jira/browse/SPARK-11701 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > I am using dynamic container allocation and speculation and am seeing issues > with the active task accounting. The Executor UI still shows active tasks on > an executor but the job/stage is all completed. I think it's also > affecting dynamic allocation's ability to release containers because it > thinks there are still tasks. > It's easily reproduced by using spark-shell: turn on dynamic allocation, then > run just a wordcount on a decent-sized file, save back to HDFS, and set the > speculation parameters low: > spark.dynamicAllocation.enabled true > spark.shuffle.service.enabled true > spark.dynamicAllocation.maxExecutors 10 > spark.dynamicAllocation.minExecutors 2 > spark.dynamicAllocation.initialExecutors 10 > spark.dynamicAllocation.executorIdleTimeout 40s > $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf > spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 > --master yarn --deploy-mode client --executor-memory 4g --driver-memory 4g -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13027) Add API for updateStateByKey to provide batch time as input
[ https://issues.apache.org/jira/browse/SPARK-13027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149126#comment-15149126 ] Aaditya Ramesh commented on SPARK-13027: Hi [~zsxwing] sorry to bump this again, I've submitted a new patch. Could you take a look when you get a chance? > Add API for updateStateByKey to provide batch time as input > --- > > Key: SPARK-13027 > URL: https://issues.apache.org/jira/browse/SPARK-13027 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Aaditya Ramesh > > The StateDStream currently does not provide the batch time as input to the > state update function. This is required in cases where the behavior depends > on the batch start time. > We (Conviva) have been patching it manually for the past several Spark > versions but we thought it might be useful for others as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11701) YARN - dynamic allocation and speculation active task accounting wrong
[ https://issues.apache.org/jira/browse/SPARK-11701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-11701: -- Description: I am using dynamic container allocation and speculation and am seeing issues with the active task accounting. The Executor UI still shows active tasks on an executor but the job/stage is all completed. I think it's also affecting dynamic allocation's ability to release containers because it thinks there are still tasks. It's easily reproduced by using spark-shell: turn on dynamic allocation, then run just a wordcount on a decent-sized file, save back to HDFS, and set the speculation parameters low: spark.dynamicAllocation.enabled true spark.shuffle.service.enabled true spark.dynamicAllocation.maxExecutors 10 spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 10 spark.dynamicAllocation.executorIdleTimeout 40s $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 --master yarn --deploy-mode client --executor-memory 4g --driver-memory 4g was: I am using dynamic container allocation and speculation and am seeing issues with the active task accounting. The Executor UI still shows active tasks on an executor but the job/stage is all completed. I think it's also affecting dynamic allocation's ability to release containers because it thinks there are still tasks.
It's easily reproduced by using spark-shell: turn on dynamic allocation, then run just a wordcount on a decent-sized file and set the speculation parameters low: spark.dynamicAllocation.enabled true spark.shuffle.service.enabled true spark.dynamicAllocation.maxExecutors 10 spark.dynamicAllocation.minExecutors 2 spark.dynamicAllocation.initialExecutors 10 spark.dynamicAllocation.executorIdleTimeout 40s $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 --master yarn --deploy-mode client --executor-memory 4g --driver-memory 4g > YARN - dynamic allocation and speculation active task accounting wrong > -- > > Key: SPARK-11701 > URL: https://issues.apache.org/jira/browse/SPARK-11701 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Critical > > I am using dynamic container allocation and speculation and am seeing issues > with the active task accounting. The Executor UI still shows active tasks on > an executor but the job/stage is all completed. I think it's also > affecting dynamic allocation's ability to release containers because it > thinks there are still tasks.
> It's easily reproduced by using spark-shell: turn on dynamic allocation, then > run just a wordcount on a decent-sized file, save back to HDFS, and set the > speculation parameters low: > spark.dynamicAllocation.enabled true > spark.shuffle.service.enabled true > spark.dynamicAllocation.maxExecutors 10 > spark.dynamicAllocation.minExecutors 2 > spark.dynamicAllocation.initialExecutors 10 > spark.dynamicAllocation.executorIdleTimeout 40s > $SPARK_HOME/bin/spark-shell --conf spark.speculation=true --conf > spark.speculation.multiplier=0.2 --conf spark.speculation.quantile=0.1 > --master yarn --deploy-mode client --executor-memory 4g --driver-memory 4g -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13344) SaveLoadSuite has many accumulator exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149106#comment-15149106 ] Apache Spark commented on SPARK-13344: -- User 'andrewor14' has created a pull request for this issue: https://github.com/apache/spark/pull/11222 > SaveLoadSuite has many accumulator exceptions > - > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13344) SaveLoadSuite has many accumulator exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13344: Assignee: Andrew Or (was: Apache Spark) > SaveLoadSuite has many accumulator exceptions > - > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13344) SaveLoadSuite has many accumulator exceptions
[ https://issues.apache.org/jira/browse/SPARK-13344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13344: Assignee: Apache Spark (was: Andrew Or) > SaveLoadSuite has many accumulator exceptions > - > > Key: SPARK-13344 > URL: https://issues.apache.org/jira/browse/SPARK-13344 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Apache Spark > > This is because SparkFunSuite clears all accumulators after every single > test. This suite reuses a DF and all of its associated internal accumulators > across many tests. > This is likely caused by SPARK-10620. > {code} > 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered > unregistered accumulator 253 when reconstructing task metrics. > 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update > accumulators for task 0 > org.apache.spark.SparkException: attempted to access non-existent accumulator > 253 > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13344) SaveLoadSuite has many accumulator exceptions
Andrew Or created SPARK-13344: - Summary: SaveLoadSuite has many accumulator exceptions Key: SPARK-13344 URL: https://issues.apache.org/jira/browse/SPARK-13344 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.0.0 Reporter: Andrew Or Assignee: Andrew Or This is because SparkFunSuite clears all accumulators after every single test. This suite reuses a DF and all of its associated internal accumulators across many tests. This is likely caused by SPARK-10620. {code} 10:52:38.967 WARN org.apache.spark.executor.TaskMetrics: encountered unregistered accumulator 253 when reconstructing task metrics. 10:52:38.967 ERROR org.apache.spark.scheduler.DAGScheduler: Failed to update accumulators for task 0 org.apache.spark.SparkException: attempted to access non-existent accumulator 253 at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1099) at org.apache.spark.scheduler.DAGScheduler$$anonfun$updateAccumulators$1.apply(DAGScheduler.scala:1091) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12976) Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange.
[ https://issues.apache.org/jira/browse/SPARK-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12976. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10894 [https://github.com/apache/spark/pull/10894] > Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange. > --- > > Key: SPARK-12976 > URL: https://issues.apache.org/jira/browse/SPARK-12976 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > Fix For: 2.0.0 > > > Add LazilyGenerateOrdering to support generated ordering for RangePartitioner > of Exchange instead of InterpretedOrdering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12976) Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange.
[ https://issues.apache.org/jira/browse/SPARK-12976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-12976: --- Assignee: Takuya Ueshin > Add LazilyGenerateOrdering and use it for RangePartitioner of Exchange. > --- > > Key: SPARK-12976 > URL: https://issues.apache.org/jira/browse/SPARK-12976 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin > > Add LazilyGenerateOrdering to support generated ordering for RangePartitioner > of Exchange instead of InterpretedOrdering. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-13343) speculative tasks that didn't commit shouldn't be marked as success
Thomas Graves created SPARK-13343: - Summary: speculative tasks that didn't commit shouldn't be marked as success Key: SPARK-13343 URL: https://issues.apache.org/jira/browse/SPARK-13343 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.6.0 Reporter: Thomas Graves Currently, speculative tasks that didn't commit can show up as successes or failures (depending on the timing of the commit). This is a bit confusing because such a task didn't really succeed in the sense that it didn't write anything. I think these tasks should be marked as KILLED or something else that makes it more obvious to the user exactly what happened. If a task happens to hit the timing window where it gets a commit-denied exception, it shows up as failed and counts against your task failures. It shouldn't count against task failures since that failure really doesn't matter. MapReduce handles these situations, so perhaps we can look there for a model. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
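The first-committer-wins accounting at issue can be sketched in miniature. This is a hypothetical, drastic simplification (the names `CommitCoordinator` and `resolve_status` are illustrative, not Spark's actual API): only one attempt per partition is allowed to commit, and the suggestion above is that a denied speculative attempt be reported as KILLED rather than SUCCESS or FAILED.

```python
class CommitCoordinator:
    """Toy first-committer-wins coordinator for task output commits."""

    def __init__(self):
        self._winner = {}  # (stage, partition) -> winning attempt id

    def can_commit(self, stage, partition, attempt):
        # First attempt to ask wins; later attempts are denied.
        key = (stage, partition)
        if key not in self._winner:
            self._winner[key] = attempt
        return self._winner[key] == attempt


def resolve_status(commit_granted):
    # A denied speculative attempt wrote nothing, so report it as
    # KILLED (per the suggestion above) rather than SUCCESS or FAILED,
    # and do not count it toward task-failure limits.
    return "SUCCESS" if commit_granted else "KILLED"
```

Under this scheme, the losing speculative attempt neither inflates the success count nor trips failure thresholds.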
[jira] [Commented] (SPARK-13333) DataFrame filter + randn + unionAll has bad interaction
[ https://issues.apache.org/jira/browse/SPARK-1?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149053#comment-15149053 ] Joseph K. Bradley commented on SPARK-1: --- I now have a much more complex example which does not use unionAll. But it's still an issue with randn, so I suspect it's the same bug. If needed, I can push a branch, but it's a mess of code. [~smilegator] Thanks for taking a look. I'll keep watching the JIRA! > DataFrame filter + randn + unionAll has bad interaction > --- > > Key: SPARK-1 > URL: https://issues.apache.org/jira/browse/SPARK-1 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.2, 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley > > Buggy workflow > * Create a DataFrame df0 > * Filter df0 > * Add a randn column > * Create a copy of the DataFrame > * unionAll the two DataFrames > This fails, where randn produces the same results on the original DataFrame > and the copy before unionAll but fails to do so after unionAll. Removing the > filter fixes the problem. > The bug can be reproduced on master: > {code} > import org.apache.spark.sql.functions.randn > val df0 = sqlContext.createDataFrame(Seq(0, 1).map(Tuple1(_))).toDF("id") > // Removing the following filter() call makes this give the expected result. > val df1 = df0.filter(col("id") === 0).withColumn("b", randn(12345)) > println("DF1") > df1.show() > val df2 = df1.select("id", "b") > println("DF2") > df2.show() // same as df1.show(), as expected > val df3 = df1.unionAll(df2) > println("DF3") > df3.show() // NOT two copies of df1, which is unexpected > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13242) Moderately complex `when` expression causes code generation failure
[ https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13242: Assignee: Apache Spark > Moderately complex `when` expression causes code generation failure > --- > > Key: SPARK-13242 > URL: https://issues.apache.org/jira/browse/SPARK-13242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Joe Halliwell >Assignee: Apache Spark > > Moderately complex `when` expressions produce generated code that busts the > 64KB method limit. This causes code generation to fail. > Here's a test case exhibiting the problem: > https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d > I'm interested in working on a fix. I'm thinking it may be possible to split > the expressions along the lines of SPARK-8443, but any pointers would be > welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13242) Moderately complex `when` expression causes code generation failure
[ https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149036#comment-15149036 ] Apache Spark commented on SPARK-13242: -- User 'joehalliwell' has created a pull request for this issue: https://github.com/apache/spark/pull/11221 > Moderately complex `when` expression causes code generation failure > --- > > Key: SPARK-13242 > URL: https://issues.apache.org/jira/browse/SPARK-13242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Joe Halliwell > > Moderately complex `when` expressions produce generated code that busts the > 64KB method limit. This causes code generation to fail. > Here's a test case exhibiting the problem: > https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d > I'm interested in working on a fix. I'm thinking it may be possible to split > the expressions along the lines of SPARK-8443, but any pointers would be > welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13242) Moderately complex `when` expression causes code generation failure
[ https://issues.apache.org/jira/browse/SPARK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13242: Assignee: (was: Apache Spark) > Moderately complex `when` expression causes code generation failure > --- > > Key: SPARK-13242 > URL: https://issues.apache.org/jira/browse/SPARK-13242 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Joe Halliwell > > Moderately complex `when` expressions produce generated code that busts the > 64KB method limit. This causes code generation to fail. > Here's a test case exhibiting the problem: > https://github.com/joehalliwell/spark/commit/4dbdf6e15d1116b8e1eb44822fd29ead9b7d817d > I'm interested in working on a fix. I'm thinking it may be possible to split > the expressions along the lines of SPARK-8443, but any pointers would be > welcome! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
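The SPARK-8443-style splitting suggested above amounts to chunking the branch list so that each chunk's generated code lands in its own helper method, keeping every method under the JVM's 64KB bytecode limit. A hypothetical sketch of just the chunking step, not the actual codegen (`split_when_branches` is an illustrative name):

```python
def split_when_branches(branches, chunk_size=50):
    # Partition a long list of (condition, value) branches into chunks;
    # each chunk would be emitted as a separate generated helper method
    # so no single method's body grows past the JVM's per-method limit.
    return [branches[i:i + chunk_size]
            for i in range(0, len(branches), chunk_size)]

chunks = split_when_branches(list(range(120)), chunk_size=50)
# 120 branches -> three helper methods of 50, 50, and 20 branches
```

The chosen chunk size would be tuned to the per-branch code size, since the 64KB limit applies to compiled bytecode, not branch count.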
[jira] [Updated] (SPARK-13342) Cannot run INSERT statements in Spark
[ https://issues.apache.org/jira/browse/SPARK-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] neo updated SPARK-13342: Priority: Critical (was: Major) > Cannot run INSERT statements in Spark > - > > Key: SPARK-13342 > URL: https://issues.apache.org/jira/browse/SPARK-13342 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1, 1.6.0 >Reporter: neo >Priority: Critical > > I cannot run an INSERT statement using spark-sql. I tried with both versions > 1.5.1 and 1.6.0 without any luck, but it runs fine in Hive. > These are the steps I took. > 1) Launch hive and create the table / insert a record. > create database test > use test > CREATE TABLE stgTable > ( > sno string, > total bigint > ); > INSERT INTO TABLE stgTable VALUES ('12',12) > 2) Launch spark-sql (1.5.1 or 1.6.0) > 3) Try inserting a record from the shell > INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1 > I got this error message > "Invalid method name: 'alter_table_with_cascade'" > I tried changing the Hive version inside the spark-sql shell using the SET > command. > I changed the Hive version > from > SET spark.sql.hive.version=1.2.1 (this is the default setting for my Spark > installation) > to > SET spark.sql.hive.version=0.14.0 > but that did not help either -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13327) colnames()<- allows invalid column names
[ https://issues.apache.org/jira/browse/SPARK-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149017#comment-15149017 ] Apache Spark commented on SPARK-13327: -- User 'olarayej' has created a pull request for this issue: https://github.com/apache/spark/pull/11220 > colnames()<- allows invalid column names > > > Key: SPARK-13327 > URL: https://issues.apache.org/jira/browse/SPARK-13327 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Oscar D. Lara Yejas > > colnames<- fails if: > 1) Given colnames contain . > 2) Given colnames contain NA > 3) Given colnames are not character > 4) Given colnames have a different length than the dataset's (a SparkSQL error is > thrown but is not user friendly) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13327) colnames()<- allows invalid column names
[ https://issues.apache.org/jira/browse/SPARK-13327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13327: Assignee: Apache Spark > colnames()<- allows invalid column names > > > Key: SPARK-13327 > URL: https://issues.apache.org/jira/browse/SPARK-13327 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Oscar D. Lara Yejas >Assignee: Apache Spark > > colnames<- fails if: > 1) Given colnames contain . > 2) Given colnames contain NA > 3) Given colnames are not character > 4) Given colnames have a different length than the dataset's (a SparkSQL error is > thrown but is not user friendly) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org