[jira] [Updated] (SPARK-34763) col(), $"" and df("name") should handle quoted column names properly.
[ https://issues.apache.org/jira/browse/SPARK-34763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34763: Affects Version/s: (was: 2.4.7) (was: 3.0.2) > col(), $"" and df("name") should handle quoted column names properly. > --- > > Key: SPARK-34763 > URL: https://issues.apache.org/jira/browse/SPARK-34763 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > Quoted column names like `a``b.c` cannot be represented with col(), $"" > and df("") because they don't handle such column names properly. > For example, suppose we have the following DataFrame: > {code} > val df1 = spark.sql("SELECT 'col1' AS `a``b.c`") > {code} > For this DataFrame, the following query executes successfully: > {code} > scala> df1.selectExpr("`a``b.c`").show > +-----+ > |a`b.c| > +-----+ > | col1| > +-----+ > {code} > But the following query will fail because df1("`a``b.c`") throws an exception: > {code} > scala> df1.select(df1("`a``b.c`")).show > org.apache.spark.sql.AnalysisException: syntax error in attribute name: > `a``b.c`; > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1274) > at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241) > ... 49 elided > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
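The escaping rule at issue — inside a backtick-quoted name, a doubled backtick denotes a literal backtick and dots are not part separators — can be sketched as follows. This is an illustrative Python sketch of the parsing behavior the ticket asks for, not Spark's actual parseAttributeName implementation; the function name and error message are invented for the example.

```python
def parse_attribute_name(name: str) -> list:
    """Split a column reference into name parts, honoring backtick quoting.

    Inside backticks, a doubled backtick (``) is an escaped literal backtick,
    and dots are not treated as separators. Illustrative sketch only, not
    Spark's UnresolvedAttribute.parseAttributeName.
    """
    parts, current, i, in_quotes = [], [], 0, False
    while i < len(name):
        ch = name[i]
        if in_quotes:
            if ch == '`':
                if i + 1 < len(name) and name[i + 1] == '`':
                    current.append('`')   # doubled backtick: escaped literal
                    i += 1
                else:
                    in_quotes = False     # closing quote
            else:
                current.append(ch)        # dots inside quotes are literal
        else:
            if ch == '`':
                in_quotes = True
            elif ch == '.':
                parts.append(''.join(current))   # dot separates name parts
                current = []
            else:
                current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError(f"syntax error in attribute name: {name}")
    parts.append(''.join(current))
    return parts
```

With this rule, "`a``b.c`" parses to the single part "a`b.c", which is what df1.selectExpr accepts but df1(...) rejects in the report above.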
[jira] [Updated] (SPARK-34322) When refreshing a view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34322: Summary: When refreshing a view, also refresh its underlying tables (was: When refreshing a non-temporary view, also refresh its underlying tables) > When refreshing a view, also refresh its underlying tables > -- > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > > A view may have several underlying tables. > In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, if a table is updated, we need to refresh it in the current long-running Spark session. > But if the table is a view, we need to refresh its underlying tables one by one.
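The "refresh the underlying tables one by one" idea can be sketched with a toy catalog. This is a hypothetical Python illustration — the catalog structure and function names are invented — not Spark's plan-walking code; a real implementation would resolve the view's plan and call spark.catalog.refreshTable for each collected table.

```python
# Toy catalog: a view maps to the relations it references; a base table maps to None.
catalog = {
    "v_sales": ["v_orders", "dim_customer"],   # view over a view and a table
    "v_orders": ["fact_order"],                # view over a table
    "dim_customer": None,
    "fact_order": None,
}

def underlying_tables(name, catalog):
    """Recursively collect the base tables beneath a view (or the table itself)."""
    deps = catalog[name]
    if deps is None:                 # base table: it is its own dependency
        return {name}
    tables = set()
    for dep in deps:
        tables |= underlying_tables(dep, catalog)
    return tables

def refresh(name, catalog, refreshed):
    """Refresh a relation; for a view, refresh every underlying table one by one."""
    for table in sorted(underlying_tables(name, catalog)):
        refreshed.append(table)      # stand-in for spark.catalog.refreshTable(table)
```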
[jira] [Updated] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34322: Description: A view may have several underlying tables. In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, if a table is updated, we need to refresh it in the current long-running Spark session. But if the table is a view, we need to refresh its underlying tables one by one. > When refreshing a non-temporary view, also refresh its underlying tables > > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > > A view may have several underlying tables. > In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, if a table is updated, we need to refresh it in the current long-running Spark session. > But if the table is a view, we need to refresh its underlying tables one by one.
[jira] [Created] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
feiwang created SPARK-34322: --- Summary: When refreshing a non-temporary view, also refresh its underlying tables Key: SPARK-34322 URL: https://issues.apache.org/jira/browse/SPARK-34322 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: feiwang
[jira] [Updated] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
[ https://issues.apache.org/jira/browse/SPARK-34040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34040: Description: Here is a test case (the query contains '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544 > function runCliWithin of CliSuite can not cover some test cases > --- > > Key:
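The sequential expectation matching that fails here can be sketched as follows. This is a simplified Python illustration of how a harness like runCliWithin scans CLI output for expected fragments in order; the real suite matches against a live stream with a timeout rather than a finished list, and the function name is invented.

```python
def match_expected(output_lines, expected):
    """Check that every expected fragment appears, in order, in the output.

    Returns the index of the first expectation that never matched, or None
    if all expectations were satisfied. Each output line is consumed at most
    once, mirroring a forward-only scan of a stream.
    """
    it = iter(output_lines)
    for idx, fragment in enumerate(expected):
        for line in it:
            if fragment in line:
                break                 # fragment found; move to the next one
        else:
            return idx                # output exhausted before this fragment
    return None
```

Against the log above, the harness waits for the continuation-prompt echo " > select 'test2';", but the CLI actually printed "spark-sql> select 'test2';" because the embedded '\n' started a fresh statement, so the expectation is never satisfied and the scan times out.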
[jira] [Updated] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
[ https://issues.apache.org/jira/browse/SPARK-34040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34040: Description: Here is a test case (the query contains two statements split by '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544 > function runCliWithin of CliSuite can not cover some test cases >
[jira] [Updated] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
[ https://issues.apache.org/jira/browse/SPARK-34040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34040: Description: Here is a test case (the query contains '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544 > function runCliWithin of CliSuite can not cover some test cases > --- > > Key: SPARK-34040 >
[jira] [Created] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
feiwang created SPARK-34040: --- Summary: function runCliWithin of CliSuite can not cover some test cases Key: SPARK-34040 URL: https://issues.apache.org/jira/browse/SPARK-34040 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.1 Reporter: feiwang Here is a test case (the query contains '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} log: {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544
[jira] [Updated] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-33100: Description: Currently spark-sql does not support parsing SQL statements that contain C-style comments. For the SQL statements: {code:java} /* SELECT 'test'; */ SELECT 'test'; {code} the input would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal. > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Assignee: Apache Spark >Priority: Minor > > Currently spark-sql does not support parsing SQL statements that contain C-style > comments. > For the SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > the input would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > Then an exception would be thrown because the first one is illegal.
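A statement splitter that ignores semicolons inside /* ... */ comments — the behavior this ticket asks for — can be sketched as follows. This is a hypothetical Python illustration; Spark's actual splitter also has to handle string literals, '--' line comments, and bracketed comments spanning statements.

```python
def split_statements(sql: str):
    """Split a SQL script on semicolons, treating /* ... */ bodies as opaque.

    Illustrative sketch only: a semicolon inside a C-style comment does not
    terminate a statement, which is exactly what the naive split gets wrong.
    """
    statements, current, i, in_comment = [], [], 0, False
    while i < len(sql):
        two = sql[i:i + 2]
        if in_comment:
            if two == '*/':
                in_comment = False
                current.append('*/')
                i += 2
            else:
                current.append(sql[i])   # semicolons here are literal text
                i += 1
        elif two == '/*':
            in_comment = True
            current.append('/*')
            i += 2
        elif sql[i] == ';':
            statements.append(''.join(current).strip())   # statement boundary
            current = []
            i += 1
        else:
            current.append(sql[i])
            i += 1
    tail = ''.join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

A naive split on ';' breaks "/* SELECT 'test'; */ SELECT 'test';" into the illegal fragment "/* SELECT 'test'" plus "*/ SELECT 'test'", while the comment-aware splitter keeps it as one statement.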
[jira] [Updated] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-33100: Description: Currently spark-sql does not support parsing SQL statements that contain C-style comments. For example, the SQL statements: {code:java} /* SELECT 'test'; */ SELECT 'test'; {code} would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal. was: Currently spark-sql does not support parsing SQL statements that contain C-style comments. For example, the SQL statements: {code:java} // Some comments here /* SELECT 'test'; */ SELECT 'test'; {code} would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal. > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Minor > > Currently spark-sql does not support parsing SQL statements that contain C-style > comments. > For example, the SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > Then an exception would be thrown because the first one is illegal.
[jira] [Created] (SPARK-33100) Support parse the sql statements with c-style comments
feiwang created SPARK-33100: --- Summary: Support parse the sql statements with c-style comments Key: SPARK-33100 URL: https://issues.apache.org/jira/browse/SPARK-33100 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1 Reporter: feiwang Currently spark-sql does not support parsing SQL statements that contain C-style comments. For example, the SQL statements: {code:java} // Some comments here /* SELECT 'test'; */ SELECT 'test'; {code} would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal.
[jira] [Updated] (SPARK-31467) Fix test issue with table named `test` in hive/SQLQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-31467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31467: Description: If we add a unit test in hive/SQLQuerySuite that uses a table named `test`, we may hit these exceptions. {code:java} org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is not allowed.;; [info] 'InsertIntoTable Project [_1#1403 AS key#1406, _2#1404 AS value#1407], Map(name -> Some(n1)), true, false [info] +- Project [col1#3850] [info]+- LocalRelation [col1#3850] {code} {code:java} org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default'; [info] at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:226) [info] at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) [info] at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) [info] at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) [info] at org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216) {code} > Fix test issue with table named `test` in hive/SQLQuerySuite > > > Key: SPARK-31467 > URL: https://issues.apache.org/jira/browse/SPARK-31467 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.1.0 >Reporter: feiwang >Priority: Major > > If we add a unit test in hive/SQLQuerySuite that uses a table named `test`, we may hit > these exceptions. > {code:java} > org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is > not allowed.;; > [info] 'InsertIntoTable Project [_1#1403 AS key#1406, _2#1404 AS value#1407], > Map(name -> Some(n1)), true, false > [info] +- Project [col1#3850] > [info]+- LocalRelation [col1#3850] > {code} > {code:java} > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'test' already exists in database 'default'; > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:226) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216) > {code}
[jira] [Created] (SPARK-31467) Fix test issue with table named `test` in hive/SQLQuerySuite
feiwang created SPARK-31467: --- Summary: Fix test issue with table named `test` in hive/SQLQuerySuite Key: SPARK-31467 URL: https://issues.apache.org/jira/browse/SPARK-31467 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Resolved] (SPARK-31263) Enable yarn shuffle service close the idle connections
[ https://issues.apache.org/jira/browse/SPARK-31263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang resolved SPARK-31263. - Resolution: Duplicate > Enable yarn shuffle service close the idle connections > -- > > Key: SPARK-31263 > URL: https://issues.apache.org/jira/browse/SPARK-31263 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: feiwang >Priority: Major >
[jira] [Created] (SPARK-31263) Enable yarn shuffle service close the idle connections
feiwang created SPARK-31263: --- Summary: Enable yarn shuffle service close the idle connections Key: SPARK-31263 URL: https://issues.apache.org/jira/browse/SPARK-31263 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Updated] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait
[ https://issues.apache.org/jira/browse/SPARK-31179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31179: Description: When reading shuffle data, several fetch requests may be sent to the same shuffle server. There is a client pool, and these requests may share the same client. When the shuffle server is busy, the request connections may time out. For example, suppose there are two request connections, rc1 and rc2, io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes. 1: rc1 holds the client lock and times out after 2 minutes. 2: rc2 holds the client lock and times out after 2 minutes. 3: rc1 starts its second retry, holds the lock and times out after 2 minutes. 4: rc2 starts its second retry, holds the lock and times out after 2 minutes. 5: rc1 starts its third retry, holds the lock and times out after 2 minutes. 6: rc2 starts its third retry, holds the lock and times out after 2 minutes. This wastes a lot of time.
> Fast fail the connection while last shuffle connection failed in the last retry IO wait > > > Key: SPARK-31179 > URL: https://issues.apache.org/jira/browse/SPARK-31179 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 3.1.0 > Reporter: feiwang > Priority: Major > > When reading shuffle data, several fetch requests may be sent to the same shuffle > server. > There is a client pool, and these requests may share the same client. > When the shuffle server is busy, the request connections may time out. > For example, suppose there are two request connections, rc1 and rc2, io.numConnectionsPerPeer is 1, and the connection timeout is 2 > minutes. > 1: rc1 holds the client lock and times out after 2 minutes. > 2: rc2 holds the client lock and times out after 2 minutes. > 3: rc1 starts its second retry, holds the lock and times out after 2 minutes. > 4: rc2 starts its second retry, holds the lock and times out after 2 minutes. > 5: rc1 starts its third retry, holds the lock and times out after 2 minutes. > 6: rc2 starts its third retry, holds the lock and times out after 2 minutes. > This wastes a lot of time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
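The fast-fail idea can be sketched in miniature. This is an illustration only, not Spark's actual TransportClientFactory logic; the class and method names are hypothetical. The point is that once a connection attempt to a host has just failed, a second request waiting on the same client lock should fail immediately rather than burn another full connection timeout:

```python
class FastFailClientPool:
    """Hypothetical sketch: remember when the last connection attempt to a
    host failed, and if a new attempt arrives while still inside the retry
    wait window of that failure, fail it immediately instead of blocking
    on the client lock for another full connection timeout."""

    def __init__(self, retry_wait_s: float):
        self.retry_wait_s = retry_wait_s
        self.last_failure = {}  # host -> monotonic timestamp of last failed attempt

    def record_failure(self, host: str, now: float) -> None:
        self.last_failure[host] = now

    def should_fast_fail(self, host: str, now: float) -> bool:
        last = self.last_failure.get(host)
        return last is not None and (now - last) < self.retry_wait_s


pool = FastFailClientPool(retry_wait_s=5.0)
pool.record_failure("shuffle-server-1", now=100.0)
# rc2 arrives 3 s after rc1's failure: fail fast instead of waiting 2 min.
print(pool.should_fast_fail("shuffle-server-1", now=103.0))  # True
# After the retry wait has elapsed, a fresh connection attempt is allowed.
print(pool.should_fast_fail("shuffle-server-1", now=106.0))  # False
```

In the six-step timeline above, this would let rc2 (and each later retry that follows a fresh failure) return immediately instead of serializing six 2-minute timeouts.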
[jira] [Created] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait
feiwang created SPARK-31179: --- Summary: Fast fail the connection while last shuffle connection failed in the last retry IO wait Key: SPARK-31179 URL: https://issues.apache.org/jira/browse/SPARK-31179 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Updated] (SPARK-31093) Fast fail while fetching shuffle data unsuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31093: Description: When fetching shuffle data fails, we put a FailureFetchResult into results (a LinkedBlockingQueue) and wait for it to be taken. Then a FetchFailedException is thrown. In fact, we could fail the task fast as soon as fetching shuffle data fails.
> Fast fail while fetching shuffle data unsuccessfully > > > Key: SPARK-31093 > URL: https://issues.apache.org/jira/browse/SPARK-31093 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 3.1.0 > Reporter: feiwang > Priority: Minor > > When fetching shuffle data fails, we put a FailureFetchResult into > results (a LinkedBlockingQueue) and wait for it to be taken. > Then a FetchFailedException is thrown. > In fact, we could fail the task fast as soon as fetching shuffle data fails.
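The proposed change can be sketched like this. It is illustrative only: the real code lives in Spark's shuffle fetch path (ShuffleBlockFetcherIterator), and the names here are hypothetical stand-ins. Without fast fail, a failure result is enqueued and surfaces only when the consumer eventually takes it from the blocking results queue; with fast fail, the failure is raised immediately:

```python
import queue


class FetchFailedError(Exception):
    """Stands in for Spark's FetchFailedException."""


def on_fetch_result(results: "queue.Queue", result, fast_fail: bool = True):
    """Hypothetical result handler: successes are enqueued as before, but
    with fast_fail a failed fetch raises right away instead of being
    enqueued and discovered later by the blocked consumer."""
    if isinstance(result, Exception) and fast_fail:
        raise FetchFailedError(str(result))
    results.put(result)


q = queue.Queue()
on_fetch_result(q, b"block-data")  # success path: enqueued as before
try:
    on_fetch_result(q, IOError("connection reset"))
except FetchFailedError as e:
    print("failed fast:", e)
```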
[jira] [Updated] (SPARK-31093) Fast fail while fetching shuffle data unsuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31093: Summary: Fast fail while fetching shuffle data unsuccessfully (was: Fast fail while fetching shuffle data from a remote block unsuccessfully) > Fast fail while fetching shuffle data unsuccessfully > > > Key: SPARK-31093 > URL: https://issues.apache.org/jira/browse/SPARK-31093 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 3.1.0 > Reporter: feiwang > Priority: Minor >
[jira] [Created] (SPARK-31093) Fast fail while fetching shuffle data from a remote block unsuccessfully
feiwang created SPARK-31093: --- Summary: Fast fail while fetching shuffle data from a remote block unsuccessfully Key: SPARK-31093 URL: https://issues.apache.org/jira/browse/SPARK-31093 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Updated] (SPARK-31016) [DEPLOY] Pack the user jars when submitting Spark Application
[ https://issues.apache.org/jira/browse/SPARK-31016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31016: Description: Nowadays, Spark only packs the jars under $SPARK_HOME/jars. How about also packing the user jars when submitting a Spark application? Sometimes a user application involves many jars besides the Spark libraries. I think this can reduce the pressure on HDFS and the NodeManager (localizer). was: Now, Spark only packs the jars under $SPARK_HOME/jars. How about packing the user jars when submitting a Spark application? I think it can reduce the pressure on HDFS and the NodeManager (localizer).
> [DEPLOY] Pack the user jars when submitting Spark Application > - > > Key: SPARK-31016 > URL: https://issues.apache.org/jira/browse/SPARK-31016 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Minor > > Nowadays, Spark only packs the jars under $SPARK_HOME/jars. > How about also packing the user jars when submitting a Spark application? > Sometimes a user application involves many jars besides the Spark libraries. > I think this can reduce the pressure on HDFS and the NodeManager (localizer).
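The idea amounts to a small packing step at submit time. The sketch below is an illustration under assumed names and paths, not Spark's actual deploy code: bundling many user jars into one archive means HDFS and the YARN NodeManager localizer handle a single file instead of hundreds.

```python
import pathlib
import tempfile
import zipfile


def pack_user_jars(jar_paths, archive_path):
    """Bundle many user jars into a single archive so the distributed
    cache uploads/localizes one file instead of one file per jar."""
    with zipfile.ZipFile(archive_path, "w") as zf:
        for jar in jar_paths:
            zf.write(jar, arcname=pathlib.Path(jar).name)
    return archive_path


# Demo with two hypothetical user jars in a temp directory.
tmp = pathlib.Path(tempfile.mkdtemp())
jars = []
for name in ("udfs.jar", "deps.jar"):
    p = tmp / name
    p.write_bytes(b"\x50\x4b")  # placeholder bytes, not a real jar
    jars.append(str(p))

packed = pack_user_jars(jars, str(tmp / "user-jars.zip"))
print(sorted(zipfile.ZipFile(packed).namelist()))  # ['deps.jar', 'udfs.jar']
```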
[jira] [Created] (SPARK-31016) [DEPLOY] Pack the user jars when submitting Spark Application
feiwang created SPARK-31016: --- Summary: [DEPLOY] Pack the user jars when submitting Spark Application Key: SPARK-31016 URL: https://issues.apache.org/jira/browse/SPARK-31016 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 3.0.0 Reporter: feiwang Now, Spark only packs the jars under $SPARK_HOME/jars. How about packing the user jars when submitting a Spark application? I think it can reduce the pressure on HDFS and the NodeManager (localizer).
[jira] [Updated] (SPARK-30472) [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType.
[ https://issues.apache.org/jira/browse/SPARK-30472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30472: Summary: [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType. (was: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.) > [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType. > -- > > Key: SPARK-30472 > URL: https://issues.apache.org/jira/browse/SPARK-30472 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Minor >
[jira] [Created] (SPARK-30472) ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.
feiwang created SPARK-30472: --- Summary: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow. Key: SPARK-30472 URL: https://issues.apache.org/jira/browse/SPARK-30472 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: feiwang
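The requested ANSI behavior can be illustrated in miniature. This is not Spark code; it is a toy strict cast showing the contrast with pre-ANSI Spark, which silently returns NULL for an invalid or overflowing string-to-int cast:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def ansi_cast_string_to_int(s: str) -> int:
    """Strict, ANSI-style cast: an invalid format raises ValueError and an
    out-of-range value raises OverflowError, instead of yielding NULL."""
    v = int(s.strip())  # raises ValueError on invalid format, e.g. "abc"
    if not (INT_MIN <= v <= INT_MAX):
        raise OverflowError(f"{s!r} is out of range for a 32-bit integer")
    return v


print(ansi_cast_string_to_int("123"))  # 123
```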
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When we compare a StringType and an IntegerType, e.g. '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; // result is null {code} was: When we compare a StringType and an IntegerType, e.g. '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code}
> Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Major > > When we compare a StringType and an IntegerType, e.g. > '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): > currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of the string may exceed Int.MaxValue, and then the result is > corrupted.
> For example: > {code:java} > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; // result is null > {code}
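The corruption can be reproduced in miniature: with IntegerType as the common comparison type, '2147483648' overflows on the cast and the predicate evaluates to NULL, whereas widening to a 64-bit (or decimal) common type gives the expected answer. The helper names below are illustrative, not Spark's:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def compare_with_int_common_type(s: str, n: int):
    """Mimics findCommonTypeForBinaryComparison returning IntegerType: the
    string is cast to a 32-bit int first; overflow yields NULL (None here),
    so the whole comparison is NULL rather than True."""
    v = int(s)
    if not (INT_MIN <= v <= INT_MAX):
        return None  # the cast overflows -> NULL under SQL semantics
    return v > n


def compare_with_long_common_type(s: str, n: int):
    """Safer: widen both sides to a larger common type before comparing."""
    return int(s) > n


print(compare_with_int_common_type("2147483648", 0))   # None (the bug)
print(compare_with_long_common_type("2147483648", 0))  # True
```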
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When we compare a StringType and an IntegerType, e.g. '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code}
> Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Major > > When we compare a StringType and an IntegerType, e.g. > '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): > currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of the string may exceed Int.MaxValue, and then the result is > corrupted. > For example: > {code:java} > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; > {code}
[jira] [Created] (SPARK-30471) Fix issue when compare string and IntegerType
feiwang created SPARK-30471: --- Summary: Fix issue when compare string and IntegerType Key: SPARK-30471 URL: https://issues.apache.org/jira/browse/SPARK-30471 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: feiwang
[jira] [Updated] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
[ https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29857: Description: When there are many applications in the Spark history server, rendering the history summary page is heavy; we can enable deferRender to tune it. See details at https://datatables.net/examples/ajax/defer_render.html
> [WEB UI] Support defer render the spark history summary page. > -- > > Key: SPARK-29857 > URL: https://issues.apache.org/jira/browse/SPARK-29857 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > > When there are many applications in the Spark history server, rendering the > history summary page is heavy; we can enable deferRender to tune it. > See details at https://datatables.net/examples/ajax/defer_render.html
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Exception information cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code} was: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)].
Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code}
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Exception information > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) > : +- Project [id#220] > : +- SubqueryAlias `default`.`tb` > : +- Relation[id#220] parquet > +- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code}
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)].
> Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) > : +- Project [id#220] > : +- SubqueryAlias `default`.`tb` > : +- Relation[id#220] parquet > +- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here public String getFoo() { return foo; } {code} was: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception.
> {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Some comments here > public String getFoo() > { > return foo; > } > {code}
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code} was: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here public String getFoo() { return foo; } {code}
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception.
> {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Some comments here > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) > : +- Project [id#220] > : +- SubqueryAlias `default`.`tb` > : +- Relation[id#220] parquet > +- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code}
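The direction of a fix can be sketched as choosing a common decimal type wide enough for both sides before type-checking the IN comparison. The rule below is a deliberate simplification of decimal widening (it only caps precision at Spark's maximum of 38), and the function name is illustrative:

```python
def widen_decimals(p1: int, s1: int, p2: int, s2: int):
    """Return a (precision, scale) wide enough to hold both
    decimal(p1, s1) and decimal(p2, s2)."""
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)  # digits left of the decimal point
    return (min(integral + scale, 38), scale)  # 38 = Spark's max precision


# decimal(18,0) vs decimal(19,0): both fit in decimal(19,0), so the
# IN-subquery comparison could be resolved instead of rejected.
print(widen_decimals(18, 0, 19, 0))  # (19, 0)
```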
[jira] [Created] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
feiwang created SPARK-29860: --- Summary: [SQL] Fix data type mismatch issue for inSubQuery Key: SPARK-29860 URL: https://issues.apache.org/jira/browse/SPARK-29860 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Created] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
feiwang created SPARK-29857: --- Summary: [WEB UI] Support defer render the spark history summary page. Key: SPARK-29857 URL: https://issues.apache.org/jira/browse/SPARK-29857 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: As shown in the attachment, if a task fails during reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important to users; it can help detect data skew. was: If a task fails during reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important to users; it can help detect data skew. !screenshot-1.png!
> [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > Attachments: screenshot-1.png > > > As shown in the attachment, if a task fails during reading shuffle data or > because of executor loss, its shuffle read size is shown as 0. > But this size is important to users; it can help detect data skew.
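The UI change amounts to including failed tasks when aggregating shuffle-read bytes. A toy aggregation with made-up field names, purely to illustrate why counting failed tasks' partial reads matters for spotting skew:

```python
def total_shuffle_read_bytes(tasks, include_failed=True):
    """Sum shuffle-read bytes across tasks. Counting failed tasks'
    partial reads keeps skewed partitions visible even when the
    reading task died mid-fetch."""
    return sum(
        t["shuffle_read_bytes"]
        for t in tasks
        if include_failed or t["status"] == "SUCCESS"
    )


tasks = [
    {"status": "SUCCESS", "shuffle_read_bytes": 1_000},
    {"status": "FAILED", "shuffle_read_bytes": 9_000_000},  # the skewed partition
]
print(total_shuffle_read_bytes(tasks, include_failed=False))  # 1000
print(total_shuffle_read_bytes(tasks))                        # 9001000
```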
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: If a task fails during reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important to users; it can help detect data skew. !screenshot-1.png!
> [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > Attachments: screenshot-1.png > > > If a task fails during reading shuffle data or because of executor loss, its > shuffle read size is shown as 0. > But this size is important to users; it can help detect data skew. > !screenshot-1.png!
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Attachment: screenshot-1.png > [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > Attachments: screenshot-1.png > >
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Summary: [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size (was: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size) > [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor >
[jira] [Created] (SPARK-29689) [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
feiwang created SPARK-29689: --- Summary: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size Key: SPARK-29689 URL: https://issues.apache.org/jira/browse/SPARK-29689 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Comment Edited] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations
[ https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960244#comment-16960244 ] feiwang edited comment on SPARK-27736 at 10/26/19 2:30 AM: --- Hi, we met this issue recently. [~joshrosen] [~tgraves] How about implementing a simple solution: * Let ExternalShuffleClient query whether an executor is registered in the ESS * when a FetchFailedException is thrown, check whether this executor is registered in the ESS * if not, we should remove all outputs of executors that are not registered on this host. If that is OK, I can implement it. was (Author: hzfeiwang): Hi, we met this issue recently. [~joshrosen] [~tgraves] How about implementing a simple solution: * Let ExternalShuffleClient query whether an executor is registered in the ESS * when removing an executor, check whether this executor is registered in the ESS * if not, we should remove all outputs of executors that are not registered on this host. If that is OK, I can implement it.
> Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations > - > > Key: SPARK-27736 > URL: https://issues.apache.org/jira/browse/SPARK-27736 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 2.4.0 > Reporter: Josh Rosen > Priority: Minor > > This ticket describes a fault-tolerance edge-case which can cause Spark jobs > to fail if a single external shuffle service process reboots and fails to > recover the list of registered executors (something which can happen when > using YARN if NodeManager recovery is disabled) _and_ the Spark job has a > large number of executors per host. > I believe this problem can be worked around today via a change of > configurations, but I'm filing this issue to (a) better document this > problem, and (b) propose either a change of default configurations or > additional DAGScheduler logic to better handle this failure mode. > h2.
Problem description > The external shuffle service process is _mostly_ stateless except for a map > tracking the set of registered applications and executors. > When processing a shuffle fetch request, the shuffle services first checks > whether the requested block ID's executor is registered; if it's not > registered then the shuffle service throws an exception like > {code:java} > java.lang.RuntimeException: Executor is not registered > (appId=application_1557557221330_6891, execId=428){code} > and this exception becomes a {{FetchFailed}} error in the executor requesting > the shuffle block. > In normal operation this error should not occur because executors shouldn't > be mis-routing shuffle fetch requests. However, this _can_ happen if the > shuffle service crashes and restarts, causing it to lose its in-memory > executor registration state. With YARN this state can be recovered from disk > if YARN NodeManager recovery is enabled (using the mechanism added in > SPARK-9439), but I don't believe that we perform state recovery in Standalone > and Mesos modes (see SPARK-24223). > If state cannot be recovered then map outputs cannot be served (even though > the files probably still exist on disk). In theory, this shouldn't cause > Spark jobs to fail because we can always redundantly recompute lost / > unfetchable map outputs. > However, in practice this can cause total job failures in deployments where > the node with the failed shuffle service was running a large number of > executors: by default, the DAGScheduler unregisters map outputs _only from > individual executor whose shuffle blocks could not be fetched_ (see > [code|https://github.com/apache/spark/blame/bfb3ffe9b33a403a1f3b6f5407d34a477ce62c85/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1643]), > so it can take several rounds of failed stage attempts to fail and clear > output from all executors on the faulty host. 
If the number of executors on a > host is greater than the stage retry limit then this can exhaust stage retry > attempts and cause job failures. > This "multiple rounds of recomputation to discover all failed executors on a > host" problem was addressed by SPARK-19753, which added a > {{spark.files.fetchFailure.unRegisterOutputOnHost}} configuration which > promotes executor fetch failures into host-wide fetch failures (clearing > output from all neighboring executors upon a single failure). However, that > configuration is {{false}} by default. > h2. Potential solutions > I have a few ideas about how we can improve this situation: > - Update the [YARN external shuffle service > documentation|https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service] > to recommend
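The retry arithmetic described above can be sketched outside Spark. This is a hypothetical, simplified model (none of it is Spark's actual DAGScheduler code): per-executor unregistration needs one failed stage attempt per executor on the bad host, while the host-wide behaviour added by SPARK-19753 clears everything in one round.

```python
def rounds_to_clear(executors_on_host, unregister_whole_host):
    """Return how many failed stage attempts it takes until no stale map
    output from the bad host remains registered (toy model, not Spark)."""
    registered = set(executors_on_host)
    rounds = 0
    while registered:
        rounds += 1
        if unregister_whole_host:
            # spark.files.fetchFailure.unRegisterOutputOnHost=true behaviour:
            # one FetchFailed clears every executor on the host.
            registered.clear()
        else:
            # Default behaviour: each FetchFailed clears only one executor.
            registered.pop()
    return rounds

execs = [f"exec-{i}" for i in range(8)]
print(rounds_to_clear(execs, unregister_whole_host=False))  # 8 rounds
print(rounds_to_clear(execs, unregister_whole_host=True))   # 1 round
```

With eight executors on the host and a default stage retry limit of four, the per-executor mode exhausts retries before the host is fully cleared, which is the job-failure mode the ticket describes.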
[jira] [Commented] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations
[ https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960244#comment-16960244 ] feiwang commented on SPARK-27736: - Hi, we met this issue recently. [~joshrosen] [~tgraves] How about implementing a simple solution: * Let externalShuffleClient can query whether a executor is registered in ESS * when remove executor, check whether this executor is registered in ESS * if not, we should remove all outputs of executors that are not registered on this host. If it is Ok, I can implement it. > Improve handling of FetchFailures caused by ExternalShuffleService losing > track of executor registrations > - > > Key: SPARK-27736 > URL: https://issues.apache.org/jira/browse/SPARK-27736 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Minor >
[jira] [Updated] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It suggests that each partition processes at most that many bytes in Spark SQL. As shown in the attachment, the value of spark.sql.files.maxPartitionBytes is 128MB. For stage 1, its input is 16.3TB, but there are only 6400 tasks. I checked the code; it is only effective for data source tables. So, its description is confusing. The same applies to all the `spark.sql.files.*` descriptions. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. As shown in the attachment, the value of spark.sql.files.maxPartitionBytes is 128MB. For stage 1, its input is 16.3TB, but there are only 6400 tasks. I checked the code, it is only effective for data source table. So, its description is confused. > [SQL][DOC] The descriptions of `spark.sql.files.*` are confused. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It suggests that each partition processes at most that many bytes in Spark > SQL. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code; it is only effective for data source tables. > So, its description is confusing. > The same applies to all the `spark.sql.files.*` descriptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
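A rough, hypothetical sketch of the packing behaviour the option name implies, loosely modelled on Spark's FilePartition splitting. Here `open_cost` stands in for spark.sql.files.openCostInBytes; the real implementation also factors in parallelism and, as the ticket points out, only applies to data source (file-based) tables:

```python
def pack_partitions(file_sizes, max_partition_bytes, open_cost=4 * 1024 * 1024):
    """Greedily pack file splits into read partitions (simplified sketch,
    not Spark's actual FilePartition.getFilePartitions code)."""
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Each file contributes its size plus a per-file "open cost".
        padded = size + open_cost
        if current and current_bytes + padded > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += padded
    if current:
        partitions.append(current)
    return partitions

# 10 files of 100 MB with a 128 MB cap -> one file per read partition.
parts = pack_partitions([100 * 1024 * 1024] * 10, 128 * 1024 * 1024)
print(len(parts))  # 10
```

For a Hive table read through the non-file-source path, no such packing happens at all, which is why a 16.3 TB input can end up in only 6400 tasks despite a 128 MB setting.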
[jira] [Updated] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Summary: [SQL][DOC] The descriptions of `spark.sql.files.*` are confused. (was: [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.) > [SQL][DOC] The descriptions of `spark.sql.files.*` are confused. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. As shown in the attachment, the value of spark.sql.files.maxPartitionBytes is 128MB. For stage 1, its input is 16.3TB, but there are only 6400 tasks. I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, for stage 1, there are 6.1 I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code, it is only effective for data source table. > So, its description is confused. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, for stage 1, there are 6.1 I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > But as shown in the attachment, for stage 1, there are 6.1 > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > But as shown in the attachment, it can not. > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > But as shown in the attachment, > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value. > But as shown in the attachment, it can not. > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Attachment: screenshot-1.png > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
feiwang created SPARK-29542: --- Summary: [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. Key: SPARK-29542 URL: https://issues.apache.org/jira/browse/SPARK-29542 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.4.4 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954424#comment-16954424 ] feiwang commented on SPARK-29262: - I'll try to implement it. > DataFrameWriter insertIntoPartition function > > > Key: SPARK-29262 > URL: https://issues.apache.org/jira/browse/SPARK-29262 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Minor > > InsertIntoPartition is a useful function. > For SQL statements, the relevant syntax is: > {code:java} > insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ... > {code} > In the example above, I specify all the partition key values, so it must be a > static partition overwrite, regardless of whether dynamic partition overwrite > is enabled. > If dynamic partition overwrite is enabled, the SQL below will only overwrite > the relevant partitions, not the whole table. > If it is disabled, the whole table will be overwritten. > {code:java} > insert overwrite table tbl_a partition(p1,p2,...,pn) select ... > {code} > As of now, DataFrame does not support overwriting a specific partition. > It means that, for a partitioned table, an insert overwrite through a > DataFrame with dynamic partition overwrite disabled will always overwrite the > whole table. > So, we should support insertIntoPartition in DataFrameWriter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
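A toy model (not the Spark API) of the dynamic-vs-static overwrite behaviour the ticket describes, with a plain dict standing in for a partitioned table:

```python
def insert_overwrite(table, new_data, dynamic):
    """Toy model of 'insert overwrite' into a partitioned table
    (partition-key -> rows). Not Spark code."""
    if dynamic:
        # Dynamic partition overwrite: only partitions present in the
        # new data are replaced; other partitions are untouched.
        for part, rows in new_data.items():
            table[part] = rows
    else:
        # Static mode with no partition spec: the whole table is replaced.
        table.clear()
        table.update(new_data)
    return table

t = {"p=1": ["a"], "p=2": ["b"]}
print(insert_overwrite(dict(t), {"p=2": ["c"]}, dynamic=True))   # {'p=1': ['a'], 'p=2': ['c']}
print(insert_overwrite(dict(t), {"p=2": ["c"]}, dynamic=False))  # {'p=2': ['c']}
```

The proposed insertIntoPartition would let a DataFrame writer express the first behaviour for a chosen partition even when dynamic partition overwrite is disabled.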
[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Description: For InsertIntoHadoopFsRelation operations. Case A: Application appA insert-overwrites table table_a with static partition overwrite. But it was killed while committing tasks, because one task hung. Parts of its committed task output were kept under /path/table_a/_temporary/0/. Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/. It executes successfully, but it also commits the data left behind by the killed application to the destination dir. Case B: Application appA insert-overwrites table table_a. Application appB insert-overwrites table table_a, too. They execute concurrently, and they may both use /path/table_a/_temporary/0/ as their work path. Their results may be corrupted. was: When we insert overwrite a partition of table. For a stage, whose tasks commit output, a task saves output to a staging dir firstly, when this task complete, it will save output to committedTaskPath, when all tasks of this stage success, all task output under committedTaskPath will be moved to destination dir. However, when we kill an application, which is committing tasks' output, parts of tasks' results will be kept in committedTaskPath, which would not be cleared gracefully. Then we rerun this application and the new application will reuse this committedTaskPath dir. And when the task commit stage of new application success, all task output under this committedTaskPath, which contains parts of old application's task output , would be moved to destination dir and the result is duplicated. 
> [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.3 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > For InsertIntoHadoopFsRelation operations. > Case A: > Application appA insert-overwrites table table_a with static partition > overwrite. > But it was killed while committing tasks, because one task hung. > Parts of its committed task output were kept under > /path/table_a/_temporary/0/. > Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/. > It executes successfully, but it also commits the data left behind by the > killed application to the destination dir. > Case B: > Application appA insert-overwrites table table_a. > Application appB insert-overwrites table table_a, too. > They execute concurrently, and they may both use /path/table_a/_temporary/0/ > as their work path. > Their results may be corrupted. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
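Case A above can be sketched with a toy model of the shared staging directory (the layout is simplified; this is not Spark's real FileOutputCommitter code):

```python
# A dict stands in for the shared staging dir /path/table_a/_temporary/0/.
staging = {}  # path -> rows

def commit_task(task_id, rows):
    """Task commit: move task output under the shared staging dir."""
    staging[f"_temporary/0/task_{task_id}"] = rows

def commit_job():
    """Job commit: move *everything* under the staging dir to the
    destination, including leftovers from a killed earlier run."""
    out = [r for rows in staging.values() for r in rows]
    staging.clear()
    return out

# First run commits one task, then is killed before job commit / cleanup.
commit_task("attempt1_t0", ["row1"])
# The rerun reuses the same staging dir and writes the full result again.
commit_task("attempt2_t0", ["row1"])
final = commit_job()
print(sorted(final))  # ['row1', 'row1'] -> the row is duplicated
```

Making the staging directory unique per application attempt (rather than the fixed `_temporary/0`) would avoid both the duplicate-on-rerun case and the concurrent-writers case.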
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949229#comment-16949229 ] feiwang edited comment on SPARK-29302 at 10/11/19 7:46 AM: --- [~dangdangdang] Hi, I have thought a simple solution. We just need make the file name of a task be unique. And the OutputCommitCoordinator would decide which task file can be committed. But I don't have an appropriate unit test. was (Author: hzfeiwang): [~dangdangdang] Hi, I have thought a simple solution. We just need make the file name of a task be unique. And the OutputCommitCoordinator would decide which task file can be committed. But I don't have a appropriate unit test. > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949229#comment-16949229 ] feiwang commented on SPARK-29302: - [~dangdangdang] Hi, I have thought a simple solution. We just need make the file name of a task be unique. And the OutputCommitCoordinator would decide which task file can be committed. But I don't have a appropriate unit test. > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
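The proposal in the comment above can be sketched as follows. This is hypothetical code, not the actual Spark patch: each task attempt writes to a uniquely named file, and a coordinator lets exactly one attempt per partition commit.

```python
class OutputCommitCoordinator:
    """Toy commit coordinator: the first attempt to ask for a
    partition wins; any later (speculative) attempt is refused."""

    def __init__(self):
        self._winner = {}  # partition -> winning attempt id

    def can_commit(self, partition, attempt):
        return self._winner.setdefault(partition, attempt) == attempt

coord = OutputCommitCoordinator()
committed = []
for attempt in ("attempt-0", "attempt-1"):      # a task and its speculative copy
    filename = f"part-00000-{attempt}.parquet"  # unique name per attempt
    if coord.can_commit(partition=0, attempt=attempt):
        committed.append(filename)

print(committed)  # only attempt-0's file is committed
```

Because the two attempts never share a file name, they cannot concurrently write the same path, and the coordinator guarantees only one of them is promoted to the final output.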
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947451#comment-16947451 ] feiwang commented on SPARK-29302: - cc [~cloud_fan] > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947446#comment-16947446 ] feiwang edited comment on SPARK-29302 at 10/9/19 8:31 AM: -- Sorry for the late reply, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT above. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! was (Author: hzfeiwang): Sorry, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT above. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947446#comment-16947446 ] feiwang edited comment on SPARK-29302 at 10/9/19 8:30 AM: -- Sorry, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT above. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! was (Author: hzfeiwang): Sorry, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT below. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang reopened SPARK-29302: -
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947446#comment-16947446 ] feiwang commented on SPARK-29302: - Sorry, I was on my National Day holiday for the past eight days. I just used a simple jobId in the UT below. In fact, it is created from a jobIdInstant, and the tasks of the same job all share it. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png!
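The point about the jobIdInstant can be sketched in isolation (a hypothetical, simplified re-creation; the real helper is SparkHadoopWriterUtils.createJobID and its exact format may differ): the job ID is derived from a timestamp captured once per job plus a stage id, and the attempt number never enters into it, so a task and its speculative copy observe the same value.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

object JobIdSketch {
  // Hypothetical re-creation of a timestamp-based job ID ("jobIdInstant").
  // The attempt number is deliberately absent, so an original task and its
  // speculative copy derive identical IDs.
  def createJobId(jobTime: Date, stageId: Int): String = {
    val fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
    f"job_${fmt.format(jobTime)}_$stageId%04d"
  }

  def main(args: Array[String]): Unit = {
    val instant = new Date(0L) // captured once, when the job starts
    val idSeenByAttempt0 = createJobId(instant, 1)
    val idSeenByAttempt1 = createJobId(instant, 1)
    // Both attempts of the same task see the same job ID.
    assert(idSeenByAttempt0 == idSeenByAttempt1)
  }
}
```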
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: screenshot-1.png
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: (was: screenshot-1.png)
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: screenshot-2.png
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: screenshot-1.png
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941485#comment-16941485 ] feiwang commented on SPARK-29302: - Yes, they are in the same stage, so they have the same jobId.
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941475#comment-16941475 ] feiwang commented on SPARK-29295: - Yes, how can we resolve this issue? Only by upgrading the Hive version?
> Duplicate result when dropping partition of an external table and then overwriting
> --
>
> Key: SPARK-29295
> URL: https://issues.apache.org/jira/browse/SPARK-29295
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Major
>
> When we drop a partition of an external table and then overwrite it, if we set
> CONVERT_METASTORE_PARQUET=true (the default value), it will overwrite this partition.
> But when we set CONVERT_METASTORE_PARQUET=false, it will give a duplicate result.
> Here is reproduction code (you can add it into SQLQuerySuite in the hive module):
> {code:java}
> test("spark gives duplicate result when dropping a partition of an external partitioned table" +
>   " first and then overwriting it") {
>   withTable("test") {
>     withTempDir { f =>
>       sql("create external table test(id int) partitioned by (name string) stored as " +
>         s"parquet location '${f.getAbsolutePath}'")
>       withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
>         sql("insert overwrite table test partition(name='n1') select 1")
>         sql("ALTER TABLE test DROP PARTITION(name='n1')")
>         sql("insert overwrite table test partition(name='n1') select 2")
>         // Duplicate: the row from the first insert survives the overwrite.
>         checkAnswer(sql("select id from test where name = 'n1' order by id"),
>           Array(Row(1), Row(2)))
>       }
>       withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
>         sql("insert overwrite table test partition(name='n1') select 1")
>         sql("ALTER TABLE test DROP PARTITION(name='n1')")
>         sql("insert overwrite table test partition(name='n1') select 2")
>         checkAnswer(sql("select id from test where name = 'n1' order by id"),
>           Array(Row(2)))
>       }
>     }
>   }
> }
> {code}
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941472#comment-16941472 ] feiwang edited comment on SPARK-29302 at 10/1/19 2:24 AM: -- For dynamic partition overwrite, a determinable path is specified when a task executes. In the reproduction suite above, I create two task attempt contexts with the same task id but different attempt ids, and specify an output dir for the newTaskTempFile method. was (Author: hzfeiwang): For dynamic partition overwrite, when execute a task, a determinable path would be specified. In the reproduce suite above, I create two task attempt context with same task id and different attempt id. And specify a output dir for newTaskTempFile method.
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941472#comment-16941472 ] feiwang commented on SPARK-29302: - For dynamic partition overwrite, a determinable path is specified when a task executes. In the reproduction suite above, I create two task attempt contexts with the same task id but different attempt ids, and specify an output dir for the newTaskTempFile method.
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941467#comment-16941467 ] feiwang commented on SPARK-29302: - You can add the code below into FileFormatWriterSuite.
{code:java}
import java.util.Date

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

import org.apache.spark.internal.io.{HadoopMapReduceCommitProtocol, SparkHadoopWriterUtils}

test("SPARK-29302: for dynamic partition overwrite, a task will concurrently write the same" +
  " file as its corresponding speculation task") {
  withTempDir { f =>
    val jobId = SparkHadoopWriterUtils.createJobID(new Date(), 1)
    val taskId = new TaskID(jobId, TaskType.MAP, 1)
    val taskAttemptId0 = new TaskAttemptID(taskId, 0)
    val taskAttemptId1 = new TaskAttemptID(taskId, 1)

    // Build a task attempt context for the given attempt of the same task.
    def newContext(attemptId: TaskAttemptID): TaskAttemptContext = {
      val hadoopConf = new Configuration()
      hadoopConf.set("mapreduce.job.id", jobId.toString)
      hadoopConf.set("mapreduce.task.id", attemptId.getTaskID.toString)
      hadoopConf.set("mapreduce.task.attempt.id", attemptId.toString)
      hadoopConf.setBoolean("mapreduce.task.ismap", true)
      hadoopConf.setInt("mapreduce.task.partition", 0)
      new TaskAttemptContextImpl(hadoopConf, attemptId)
    }

    val taskAttemptContext0 = newContext(taskAttemptId0)
    val taskAttemptContext1 = newContext(taskAttemptId1)

    val committer = new HadoopMapReduceCommitProtocol(jobId.toString, f.getAbsolutePath)
    val tf0 = committer.newTaskTempFile(taskAttemptContext0, Some(f.getAbsolutePath), "ext")
    val tf1 = committer.newTaskTempFile(taskAttemptContext1, Some(f.getAbsolutePath), "ext")
    // Both attempts resolve to the same output path.
    assert(tf0 == tf1)
  }
}
{code}
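The assert above passes because the committer derives the temp-file name only from the split (task) number and the job ID, never from the attempt number. A self-contained sketch of such a naming scheme (simplified and hypothetical, loosely modeled on the committer's filename generation):

```scala
object TaskFileNameSketch {
  // Simplified model of the committer's file naming: the name depends on the
  // split (task) number and the job ID only. Two attempts of the same task
  // therefore map to the same path, which is the collision reported above.
  def newTaskTempFile(jobId: String, split: Int, ext: String): String =
    f"part-$split%05d-$jobId$ext"

  def main(args: Array[String]): Unit = {
    val attempt0 = newTaskTempFile("job_20191001_0001", 1, ".ext")
    val attempt1 = newTaskTempFile("job_20191001_0001", 1, ".ext") // speculative copy
    assert(attempt0 == attempt1)
    println(attempt0) // part-00001-job_20191001_0001.ext
  }
}
```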
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Description: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if speculation is enabled, would a task conflict with its corresponding speculation task? Would the two tasks concurrently write the same file? was: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if speculation is enabled, would a task conflict with its relative speculation task?
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Description: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if speculation is enabled, would a task conflict with its relative speculation task? was: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if we enable
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Description: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if we enable
[jira] [Created] (SPARK-29302) dynamic partition overwrite with speculation enabled
feiwang created SPARK-29302: --- Summary: dynamic partition overwrite with speculation enabled Key: SPARK-29302 URL: https://issues.apache.org/jira/browse/SPARK-29302 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940754#comment-16940754 ] feiwang commented on SPARK-29295: - Related Hive issue: https://issues.apache.org/jira/browse/HIVE-17063
[jira] [Issue Comment Deleted] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29295: Comment: was deleted (was: I have tested in 2.3.1 branch, without SPARK-25271, it will always give duplicate result. Thanks for SPARK-25271, it enable this statement use data source command if it is convertible.)
[jira] [Comment Edited] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940600#comment-16940600 ] feiwang edited comment on SPARK-29295 at 9/30/19 1:52 AM: -- cc [~cloud_fan] [~viirya] was (Author: hzfeiwang): cc [~cloud_fan]
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940600#comment-16940600 ] feiwang commented on SPARK-29295: - cc [~cloud_fan]
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940599#comment-16940599 ] feiwang commented on SPARK-29295: - I have tested on the 2.3.1 branch: without SPARK-25271, it always gives a duplicate result. Thanks to SPARK-25271, which enables this statement to use the data source command when the table is convertible.
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940598#comment-16940598 ] feiwang commented on SPARK-29295: - When we set CONVERT_METASTORE_PARQUET=true, it will use InsertIntoHadoopFsRelationCommand to process this statement. When we set CONVERT_METASTORE_PARQUET=false, it will use InsertIntoHiveTable.
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29295:
Description:
When we drop a partition of an external table and then overwrite it with CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives duplicate results.
Here is reproduction code (it can be added to SQLQuerySuite in the hive module):
{code:java}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTable("test") {
    withTempDir { f =>
      sql("create external table test(id int) partitioned by (name string) stored as " +
        s"parquet location '${f.getAbsolutePath}'")
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(2)))
      }
    }
  }
}
{code}

was:
When we drop a partition of an external table and then overwrite it with CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives duplicate results.
Here is reproduction code:
{code:java}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTempView("ta", "tb") {
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
      withTempDir { f =>
        sql("create external table ta(id int) partitioned by (name string) stored as " +
          s"parquet location '${f.getAbsolutePath}'")
        sql("insert overwrite table ta partition(name='n1') select 1")
        sql("ALTER TABLE ta DROP PARTITION(name='n1')")
        sql("insert overwrite table ta partition(name='n1') select 2")
        checkAnswer(sql("select id from ta where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
    }
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
      withTempDir { fa =>
        sql("create external table tb(id int) partitioned by (name string) stored as " +
          s"parquet location '${fa.getAbsolutePath}'")
        sql("insert overwrite table tb partition(name='n1') select 1")
        sql("ALTER TABLE tb DROP PARTITION(name='n1')")
        sql("insert overwrite table tb partition(name='n1') select 2")
        checkAnswer(sql("select id from tb where name = 'n1' order by id"),
          Row(2))
      }
    }
  }
}
{code}

> Duplicate result when dropping partition of an external table and then
> overwriting
> --
>
> Key: SPARK-29295
> URL: https://issues.apache.org/jira/browse/SPARK-29295
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Minor
>
> When we drop a partition of an external table and then overwrite it with
> CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten
> as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives
> duplicate results.
> Here is reproduction code (it can be added to SQLQuerySuite in the hive
> module):
> {code:java}
> test("spark gives duplicate result when dropping a partition of an external partitioned table" +
>   " first and then overwriting it") {
>   withTable("test") {
>     withTempDir { f =>
>       sql("create external table test(id int) partitioned by (name string) stored as " +
>         s"parquet location '${f.getAbsolutePath}'")
>       withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
>         sql("insert overwrite table test partition(name='n1') select 1")
>         sql("ALTER TABLE test DROP PARTITION(name='n1')")
>         sql("insert overwrite table test partition(name='n1') select 2")
>         checkAnswer(sql("select id from test where name = 'n1' order by id"),
>           Array(Row(1), Row(2)))
>       }
[jira] [Created] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
feiwang created SPARK-29295:
---
Summary: Duplicate result when dropping partition of an external table and then overwriting
Key: SPARK-29295
URL: https://issues.apache.org/jira/browse/SPARK-29295
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4
Reporter: feiwang

When we drop a partition of an external table and then overwrite it with CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives duplicate results.
Here is reproduction code:
{code:java}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTempView("ta", "tb") {
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
      withTempDir { f =>
        sql("create external table ta(id int) partitioned by (name string) stored as " +
          s"parquet location '${f.getAbsolutePath}'")
        sql("insert overwrite table ta partition(name='n1') select 1")
        sql("ALTER TABLE ta DROP PARTITION(name='n1')")
        sql("insert overwrite table ta partition(name='n1') select 2")
        checkAnswer(sql("select id from ta where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
    }
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
      withTempDir { fa =>
        sql("create external table tb(id int) partitioned by (name string) stored as " +
          s"parquet location '${fa.getAbsolutePath}'")
        sql("insert overwrite table tb partition(name='n1') select 1")
        sql("ALTER TABLE tb DROP PARTITION(name='n1')")
        sql("insert overwrite table tb partition(name='n1') select 2")
        checkAnswer(sql("select id from tb where name = 'n1' order by id"),
          Row(2))
      }
    }
  }
}
{code}
[jira] [Updated] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29262:
Description:
InsertIntoPartition is a useful function. For SQL statements, the relevant syntax is:
{code:java}
insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
{code}
In the example above, all the partition key values are specified, so it must be a static partition overwrite, regardless of whether dynamic partition overwrite is enabled.
If dynamic partition overwrite is enabled, the SQL below overwrites only the relevant partitions, not the whole table. If it is disabled, the whole table is overwritten.
{code:java}
insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
{code}
As of now, DataFrame does not support overwriting a specific partition. This means that, for a partitioned table, an insert overwrite through a DataFrame with dynamic partition overwrite disabled always overwrites the whole table.
So, we should support insertIntoPartition for DataFrameWriter.

was:
InsertIntoPartition is a useful function. For SQL statements, the relevant syntax is:
{code:java}
insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
{code}
In the example above, all the partition key values are specified, so it must be a static partition overwrite, regardless of whether dynamic partition overwrite is enabled.
If dynamic partition overwrite is enabled, the SQL below overwrites only the relevant partitions, not the whole table.
{code:java}
insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
{code}
As of now, DataFrame does not support insertIntoPartition. This means that, for a partitioned table, an insert overwrite through a DataFrame with dynamic partition overwrite disabled always overwrites the whole table.
So, we should support insertIntoPartition for DataFrameWriter.
> DataFrameWriter insertIntoPartition function
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Minor
>
> InsertIntoPartition is a useful function.
> For SQL statements, the relevant syntax is:
> {code:java}
> insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
> {code}
> In the example above, all the partition key values are specified, so it must
> be a static partition overwrite, regardless of whether dynamic partition
> overwrite is enabled.
> If dynamic partition overwrite is enabled, the SQL below overwrites only the
> relevant partitions, not the whole table. If it is disabled, the whole table
> is overwritten.
> {code:java}
> insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
> {code}
> As of now, DataFrame does not support overwriting a specific partition.
> This means that, for a partitioned table, an insert overwrite through a
> DataFrame with dynamic partition overwrite disabled always overwrites the
> whole table.
> So, we should support insertIntoPartition for DataFrameWriter.
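Until DataFrameWriter gains a partition spec, the behavior contrast described above can be exercised today through the partitionOverwriteMode session config. A minimal sketch, assuming an active SparkSession `spark`, a DataFrame `df` whose columns match `tbl_a`, and that `tbl_a` is partitioned (names hypothetical):

{code:java}
// Dynamic mode: insertInto rewrites only the partitions present in df.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("tbl_a")

// Static mode (the default): the same call overwrites the whole table,
// which is exactly the limitation this ticket describes.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
df.write.mode("overwrite").insertInto("tbl_a")
{code}

Note that neither mode lets the caller name a specific static partition the way the SQL partition clause does.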
[jira] [Updated] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29262:
Description:
InsertIntoPartition is a useful function. For SQL statements, the relevant syntax is:
{code:java}
insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
{code}
In the example above, all the partition key values are specified, so it must be a static partition overwrite, regardless of whether dynamic partition overwrite is enabled.
If dynamic partition overwrite is enabled, the SQL below overwrites only the relevant partitions, not the whole table.
{code:java}
insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
{code}
As of now, DataFrame does not support insertIntoPartition. This means that, for a partitioned table, an insert overwrite through a DataFrame with dynamic partition overwrite disabled always overwrites the whole table.
So, we should support insertIntoPartition for DataFrameWriter.

was:
Do we have a plan to support an insertIntoPartition function for DataFrameWriter? [~cloud_fan]

> DataFrameWriter insertIntoPartition function
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Minor
>
> InsertIntoPartition is a useful function.
> For SQL statements, the relevant syntax is:
> {code:java}
> insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
> {code}
> In the example above, all the partition key values are specified, so it must
> be a static partition overwrite, regardless of whether dynamic partition
> overwrite is enabled.
> If dynamic partition overwrite is enabled, the SQL below overwrites only the
> relevant partitions, not the whole table.
> {code:java}
> insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
> {code}
> As of now, DataFrame does not support insertIntoPartition.
> It means that, for a partitioned table, an insert overwrite through a
> DataFrame with dynamic partition overwrite disabled always overwrites the
> whole table.
> So, we should support insertIntoPartition for DataFrameWriter.
[jira] [Created] (SPARK-29262) DataFrameWriter insertIntoPartition function
feiwang created SPARK-29262:
---
Summary: DataFrameWriter insertIntoPartition function
Key: SPARK-29262
URL: https://issues.apache.org/jira/browse/SPARK-29262
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.4.4
Reporter: feiwang

Do we have a plan to support an insertIntoPartition function for DataFrameWriter? [~cloud_fan]
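To make the request concrete, one possible shape for such an API (purely hypothetical; no such method exists on DataFrameWriter today) might be:

{code:java}
// Hypothetical sketch only. A partition spec could map each partition column
// to Some(staticValue), or to None when the value should be resolved
// dynamically from the data, mirroring the SQL partition clause.
def insertIntoPartition(
    tableName: String,
    partitionSpec: Map[String, Option[String]]): Unit = ???

// Mirroring: insert overwrite table tbl_a partition(p1=v1, p2) select ...
// df.write.mode("overwrite").insertIntoPartition("tbl_a",
//   Map("p1" -> Some("v1"), "p2" -> None))
{code}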
[jira] [Updated] (SPARK-29113) Some annotation errors in ApplicationCache.scala
[ https://issues.apache.org/jira/browse/SPARK-29113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29113: Issue Type: Documentation (was: Bug) > Some annotation errors in ApplicationCache.scala > > > Key: SPARK-29113 > URL: https://issues.apache.org/jira/browse/SPARK-29113 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29113) Some annotation errors in ApplicationCache.scala
feiwang created SPARK-29113: --- Summary: Some annotation errors in ApplicationCache.scala Key: SPARK-29113 URL: https://issues.apache.org/jira/browse/SPARK-29113 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Comment Edited] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930221#comment-16930221 ] feiwang edited comment on SPARK-28945 at 9/16/19 4:22 AM: -- [~advancedxy] Thanks. Hope SPARK-28945 can be merged soon. It is important for data quality. was (Author: hzfeiwang): [~advancedxy] Thanks. Hope SPARK-28945 can be merged soon. It is critical for data quality. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930221#comment-16930221 ] feiwang commented on SPARK-28945: - [~advancedxy] Thanks. Hope SPARK-28945 can be merged soon. It is critical for data quality. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930010#comment-16930010 ] feiwang edited comment on SPARK-28945 at 9/15/19 5:17 PM: -- [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email(https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) is related with this PR(https://github.com/apache/spark/pull/25795). When dynamicPartitionOverwrite is true, we should skip commitJob. was (Author: hzfeiwang): [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email(https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) can be resolved by this PR: (https://github.com/apache/spark/pull/25795). When dynamicPartitionOverwrite is true, we should skip commitJob. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. 
[jira] [Comment Edited] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930010#comment-16930010 ] feiwang edited comment on SPARK-28945 at 9/15/19 5:12 PM: -- [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email(https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) can be resolved by this PR: (https://github.com/apache/spark/pull/25795). When dynamicPartitionOverwrite is true, we should skip commitJob. was (Author: hzfeiwang): [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E can be resolved by this PR: https://github.com/apache/spark/pull/25795. When dynamicPartitionOverwrite is true, we should skip commitJob. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. 
[jira] [Commented] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930010#comment-16930010 ] feiwang commented on SPARK-28945: - [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E can be resolved by this PR: https://github.com/apache/spark/pull/25795. When dynamicPartitionOverwrite is true, we should skip commitJob. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929939#comment-16929939 ] feiwang commented on SPARK-29043:
-
Thanks, [~kabhwan].

> [History Server]Only one replay thread of FsHistoryProvider work because of
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Major
> Attachments: image-2019-09-11-15-09-22-912.png,
> image-2019-09-11-15-10-25-326.png, screenshot-1.png
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for
> the Spark history server.
> However, only one replay thread does any work, because of a straggler.
> Let's check the code:
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation across all replay tasks.
[jira] [Commented] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929916#comment-16929916 ] feiwang commented on SPARK-29037:
-
[~advancedxy] Hi, I found that even with dynamicPartitionOverwrite, Spark may still give duplicate results for the case below, and I have created a pull request: https://github.com/apache/spark/pull/25795
Case: Application appA runs insert overwrite on table_a with static partition overwrite, but it is killed while committing tasks because one task hangs. Part of its committed task output is left under /path/table_a/_temporary/0/.
Then application appB runs insert overwrite on table_a with dynamic partition overwrite. It executes successfully, but it also commits the data under /path/table_a/_temporary/0/ to the destination dir.

> [Core] Spark gives duplicate result when an application was killed and rerun
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.3
> Reporter: feiwang
> Priority: Major
> Attachments: screenshot-1.png
>
> When we insert overwrite a partition of a table:
> For a stage whose tasks commit output, a task first saves its output to a
> staging dir; when the task completes, it saves its output to
> committedTaskPath; when all tasks of the stage succeed, all task output
> under committedTaskPath is moved to the destination dir.
> However, when we kill an application that is committing tasks' output, parts
> of the tasks' results are kept in committedTaskPath and are not cleared
> gracefully.
> Then we rerun this application, and the new application reuses this
> committedTaskPath dir.
> And when the task commit stage of the new application succeeds, all task
> output under this committedTaskPath, which contains part of the old
> application's task output, is moved to the destination dir, and the result
> is duplicated.
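The commit layout involved in the case above can be sketched as follows. Paths are illustrative only, following the usual FileOutputCommitter convention of a shared _temporary/0 job-attempt dir; the truncated segments are placeholders, not real names:

{code:java}
// /path/table_a/_temporary/0/                        job-attempt dir, reused by a rerun
// /path/table_a/_temporary/0/task_.../part-...       output committed by the killed appA
// /path/table_a/_temporary/0/_temporary/attempt_...  in-flight task attempts
//
// commitJob of the rerun (appB) promotes everything under _temporary/0/
// to /path/table_a/, so appA's leftover committed files are promoted too,
// producing the duplicate rows.
{code}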
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/15/19 4:41 AM: -- [~advancedxy] Thanks for your reply. I will learn more about dynamic partition. Thanks for your suggestion. was (Author: hzfeiwang): [~advancedxy] Thanks for your reply. I just checked the code, as shown below. https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106 {code:java} val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite && staticPartitions.size < partitionColumns.length {code} When partitionColumns==1, for the operation of inserting overwrite table partition, dynamicPartitionOverwrite is always false even DynamicOverwrite is enabled. I will learn more about dynamic partition. Thanks for your suggestion. > [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.3 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > When we insert overwrite a partition of table. > For a stage, whose tasks commit output, a task saves output to a staging dir > firstly, when this task complete, it will save output to committedTaskPath, > when all tasks of this stage success, all task output under committedTaskPath > will be moved to destination dir. > However, when we kill an application, which is committing tasks' output, > parts of tasks' results will be kept in committedTaskPath, which would not be > cleared gracefully. > Then we rerun this application and the new application will reuse this > committedTaskPath dir. 
> And when the task commit stage of new application success, all task output > under this committedTaskPath, which contains parts of old application's task > output , would be moved to destination dir and the result is duplicated. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928596#comment-16928596 ] feiwang edited comment on SPARK-29037 at 9/12/19 4:17 PM:
-
[~advancedxy]
1. We re-submit the same application again. We hit this issue on insert overwrite table, so it is not feasible to resolve it on the user side.
Yes, this issue can be resolved on the Hadoop side, but that involves a new release of Hadoop. We can do it on the Spark side.
About an output check, I think it is not appropriate, because when several applications (each inserting overwrite into a partition of the same table) run at the same time, they may use the same committedTaskPath.
So, I think we could implement a Spark FileCommitProtocol referencing the implementation of `InsertIntoHiveTable` (org.apache.spark.sql.hive.execution.InsertIntoHiveTable).
For InsertIntoHiveTable, it first calls saveAsHiveFile (committing all tasks' output) to a hive-staging dir, as shown in the log below.
{code:java}
19/09/12 02:47:46 INFO FileOutputCommitter: Saved output of task 'attempt_20190912024744_0004_m_00_0' to hdfs://hercules-sub/user/b_hive_dba/fwang12_test/test_merge/.hive-staging_hive_2019-09-12_02-47-44_798_6385324183561649436-1/-ext-1/_temporary/0/task_20190912024744_0004_m_00
{code}
Then it loads this output into the Hive table.

was (Author: hzfeiwang):
[~advancedxy]
1. We re-submit the same application again. We hit this issue on insert overwrite table, so it is feasible to resolve it on the user side.
Yes, this issue can be resolved on the Hadoop side, but that involves a new release of Hadoop. We can do it on the Spark side.
About an output check, I think it is not appropriate, because when several applications (each inserting overwrite into a partition of the same table) run at the same time, they may use the same committedTaskPath.
So, I think we could implement a Spark FileCommitProtocol referencing the implementation of `InsertIntoHiveTable` (org.apache.spark.sql.hive.execution.InsertIntoHiveTable).
For InsertIntoHiveTable, it first calls saveAsHiveFile (committing all tasks' output) to a hive-staging dir, as shown in the log below.
{code:java}
19/09/12 02:47:46 INFO FileOutputCommitter: Saved output of task 'attempt_20190912024744_0004_m_00_0' to hdfs://hercules-sub/user/b_hive_dba/fwang12_test/test_merge/.hive-staging_hive_2019-09-12_02-47-44_798_6385324183561649436-1/-ext-1/_temporary/0/task_20190912024744_0004_m_00
{code}
Then it loads this output into the Hive table.

> [Core] Spark gives duplicate result when an application was killed and rerun
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.3
> Reporter: feiwang
> Priority: Major
> Attachments: screenshot-1.png
>
> When we insert overwrite a partition of a table:
> For a stage whose tasks commit output, a task first saves its output to a
> staging dir; when the task completes, it saves its output to
> committedTaskPath; when all tasks of the stage succeed, all task output
> under committedTaskPath is moved to the destination dir.
> However, when we kill an application that is committing tasks' output, parts
> of the tasks' results are kept in committedTaskPath and are not cleared
> gracefully.
> Then we rerun this application, and the new application reuses this
> committedTaskPath dir.
> And when the task commit stage of the new application succeeds, all task
> output under this committedTaskPath, which contains part of the old
> application's task output, is moved to the destination dir, and the result
> is duplicated.
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:58 PM:
--
[~advancedxy] Thanks for your reply. I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, staticPartitions.size equals partitionColumns.length, so dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
I will learn more about dynamic partitions. Thanks for your suggestion.
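The condition quoted from InsertIntoHadoopFsRelationCommand can be checked in isolation. The sketch below re-states the predicate over plain Scala values; the names mirror Spark's, but this is an illustration, not Spark code, and the SaveMode type here is a stand-in for org.apache.spark.sql.SaveMode.

```scala
// Illustration only: the dynamicPartitionOverwrite predicate from
// InsertIntoHadoopFsRelationCommand, re-stated over plain values.
sealed trait SaveMode
case object Overwrite extends SaveMode

def dynamicPartitionOverwrite(enableDynamicOverwrite: Boolean,
                              mode: SaveMode,
                              staticPartitions: Map[String, String],
                              partitionColumns: Seq[String]): Boolean =
  enableDynamicOverwrite && mode == Overwrite &&
    staticPartitions.size < partitionColumns.length

// INSERT OVERWRITE TABLE t PARTITION (dt='2019-09-12'): the single partition
// column is fully static, so the overwrite is not dynamic even when enabled.
val singleStatic = dynamicPartitionOverwrite(
  enableDynamicOverwrite = true,
  mode = Overwrite,
  staticPartitions = Map("dt" -> "2019-09-12"),
  partitionColumns = Seq("dt"))

// INSERT OVERWRITE TABLE t PARTITION (dt='2019-09-12', hr): one of the two
// partition columns is dynamic, so dynamic partition overwrite applies.
val mixed = dynamicPartitionOverwrite(
  enableDynamicOverwrite = true,
  mode = Overwrite,
  staticPartitions = Map("dt" -> "2019-09-12"),
  partitionColumns = Seq("dt", "hr"))
```

This confirms the observation in the comment: with a single, statically specified partition column, staticPartitions.size < partitionColumns.length can never hold.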
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:49 PM:
--
[~advancedxy] Thanks for your reply. I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:48 PM:
--
[~advancedxy] I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
So I think this issue is common whenever a table has only one partition column.
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:46 PM:
--
[~advancedxy] I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
[jira] [Commented] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang commented on SPARK-29037:
-
[~advancedxy] I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
[jira] [Issue Comment Deleted] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037:
Comment: was deleted
(was: In detail, I think we need to change the logic of {color:red}InsertIntoHadoopFsRelationCommand{color} by referencing the implementation of {color:red}InsertIntoHiveTable{color}.)