[jira] [Updated] (SPARK-34763) col(), $"" and df("name") should handle quoted column names properly.
[ https://issues.apache.org/jira/browse/SPARK-34763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34763: Affects Version/s: (was: 2.4.7) (was: 3.0.2) > col(), $"" and df("name") should handle quoted column names properly. > --- > > Key: SPARK-34763 > URL: https://issues.apache.org/jira/browse/SPARK-34763 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > Quoted column names like `a``b.c` cannot be represented with col(), $"" > and df("") because they don't handle such column names properly. > For example, suppose we have the following DataFrame: > {code} > val df1 = spark.sql("SELECT 'col1' AS `a``b.c`") > {code} > For this DataFrame, the following query executes successfully: > {code} > scala> df1.selectExpr("`a``b.c`").show > +-----+ > |a`b.c| > +-----+ > | col1| > +-----+ > {code} > But the following query will fail because df1("`a``b.c`") throws an exception: > {code} > scala> df1.select(df1("`a``b.c`")).show > org.apache.spark.sql.AnalysisException: syntax error in attribute name: > `a``b.c`; > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:152) > at > org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:162) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121) > at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221) > at org.apache.spark.sql.Dataset.col(Dataset.scala:1274) > at org.apache.spark.sql.Dataset.apply(Dataset.scala:1241) > ... 49 elided > {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
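The escaping rule at issue — inside a backtick-quoted name, a doubled backtick denotes a literal backtick and dots are not part separators — can be sketched as follows. This is an illustrative Python sketch of the parsing behavior the ticket asks for, not Spark's actual parseAttributeName implementation; the function name and error message are invented for the example.

```python
def parse_attribute_name(name: str) -> list:
    """Split a column reference into name parts, honoring backtick quoting.

    Inside backticks, a doubled backtick (``) is an escaped literal backtick,
    and dots are not treated as separators. Illustrative sketch only, not
    Spark's UnresolvedAttribute.parseAttributeName.
    """
    parts, current, i, in_quotes = [], [], 0, False
    while i < len(name):
        ch = name[i]
        if in_quotes:
            if ch == '`':
                if i + 1 < len(name) and name[i + 1] == '`':
                    current.append('`')   # doubled backtick: escaped literal
                    i += 1
                else:
                    in_quotes = False     # closing quote
            else:
                current.append(ch)        # dots inside quotes are literal
        else:
            if ch == '`':
                in_quotes = True
            elif ch == '.':
                parts.append(''.join(current))   # dot separates name parts
                current = []
            else:
                current.append(ch)
        i += 1
    if in_quotes:
        raise ValueError(f"syntax error in attribute name: {name}")
    parts.append(''.join(current))
    return parts
```

With this rule, "`a``b.c`" parses to the single part "a`b.c", which is what df1.selectExpr accepts but df1(...) rejects in the report above.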
[jira] [Updated] (SPARK-34322) When refreshing a view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34322: Summary: When refreshing a view, also refresh its underlying tables (was: When refreshing a non-temporary view, also refresh its underlying tables) > When refreshing a view, also refresh its underlying tables > -- > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > > A view may have several underlying tables. > In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, if a table is updated, we need to refresh it in the current long-running Spark session. > But if the table is a view, we need to refresh its underlying tables one by one.
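The "refresh the underlying tables one by one" idea can be sketched with a toy catalog. This is a hypothetical Python illustration — the catalog structure and function names are invented — not Spark's plan-walking code; a real implementation would resolve the view's plan and call spark.catalog.refreshTable for each collected table.

```python
# Toy catalog: a view maps to the relations it references; a base table maps to None.
catalog = {
    "v_sales": ["v_orders", "dim_customer"],   # view over a view and a table
    "v_orders": ["fact_order"],                # view over a table
    "dim_customer": None,
    "fact_order": None,
}

def underlying_tables(name, catalog):
    """Recursively collect the base tables beneath a view (or the table itself)."""
    deps = catalog[name]
    if deps is None:                 # base table: it is its own dependency
        return {name}
    tables = set()
    for dep in deps:
        tables |= underlying_tables(dep, catalog)
    return tables

def refresh(name, catalog, refreshed):
    """Refresh a relation; for a view, refresh every underlying table one by one."""
    for table in sorted(underlying_tables(name, catalog)):
        refreshed.append(table)      # stand-in for spark.catalog.refreshTable(table)
```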
[jira] [Updated] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
[ https://issues.apache.org/jira/browse/SPARK-34322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34322: Description: A view may have several underlying tables. In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, if a table is updated, we need to refresh it in the current long-running Spark session. But if the table is a view, we need to refresh its underlying tables one by one. > When refreshing a non-temporary view, also refresh its underlying tables > > > Key: SPARK-34322 > URL: https://issues.apache.org/jira/browse/SPARK-34322 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Major > > A view may have several underlying tables. > In long-running Spark server use cases, such as Zeppelin, Kyuubi, and Livy, if a table is updated, we need to refresh it in the current long-running Spark session. > But if the table is a view, we need to refresh its underlying tables one by one.
[jira] [Created] (SPARK-34322) When refreshing a non-temporary view, also refresh its underlying tables
feiwang created SPARK-34322: --- Summary: When refreshing a non-temporary view, also refresh its underlying tables Key: SPARK-34322 URL: https://issues.apache.org/jira/browse/SPARK-34322 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: feiwang
[jira] [Updated] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
[ https://issues.apache.org/jira/browse/SPARK-34040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34040: Description: Here is a test case (the query contains '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544 > function runCliWithin of CliSuite can not cover some test cases > --- > > Key:
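The sequential expectation matching that fails here can be sketched as follows. This is a simplified Python illustration of how a harness like runCliWithin scans CLI output for expected fragments in order; the real suite matches against a live stream with a timeout rather than a finished list, and the function name is invented.

```python
def match_expected(output_lines, expected):
    """Check that every expected fragment appears, in order, in the output.

    Returns the index of the first expectation that never matched, or None
    if all expectations were satisfied. Each output line is consumed at most
    once, mirroring a forward-only scan of a stream.
    """
    it = iter(output_lines)
    for idx, fragment in enumerate(expected):
        for line in it:
            if fragment in line:
                break                 # fragment found; move to the next one
        else:
            return idx                # output exhausted before this fragment
    return None
```

Against the log above, the harness waits for the continuation-prompt echo " > select 'test2';", but the CLI actually printed "spark-sql> select 'test2';" because the embedded '\n' started a fresh statement, so the expectation is never satisfied and the scan times out.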
[jira] [Updated] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
[ https://issues.apache.org/jira/browse/SPARK-34040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34040: Description: Here is a test case (the query contains two statements split by '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544 > function runCliWithin of CliSuite can not cover some test cases >
[jira] [Updated] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
[ https://issues.apache.org/jira/browse/SPARK-34040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-34040: Description: Here is a test case (the query contains '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544 > function runCliWithin of CliSuite can not cover some test cases > --- > > Key: SPARK-34040 >
[jira] [Created] (SPARK-34040) function runCliWithin of CliSuite can not cover some test cases
feiwang created SPARK-34040: --- Summary: function runCliWithin of CliSuite can not cover some test cases Key: SPARK-34040 URL: https://issues.apache.org/jira/browse/SPARK-34040 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.1 Reporter: feiwang Here is a test case (the query contains '\n') which cannot be covered by runCliWithin: {code:java} runCliWithin(1.minute)("select 'test1';\n select 'test2';" -> "test2") {code} log: {code:java} 11:35:00.383 pool-1-thread-1-ScalaTest-running-CliSuite INFO CliSuite: Cli driver is booted. Waiting for expected answers. 11:35:01.104 Thread-6 INFO CliSuite: 2021-01-06 19:35:01.104 - stdout> spark-sql> select 'test1'; 11:35:01.104 Thread-6 INFO CliSuite: stdout> found expected output line 0: 'spark-sql> select 'test1';' 11:35:10.120 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.12 - stdout> test1 11:35:10.121 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.121 - stderr> Time taken: 8.987 seconds, Fetched 1 row(s) 11:35:10.151 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.151 - stdout> spark-sql> select 'test2'; 11:35:10.220 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.22 - stdout> test2 11:35:10.220 Thread-7 INFO CliSuite: 2021-01-06 19:35:10.22 - stderr> Time taken: 0.068 seconds, Fetched 1 row(s) 11:35:10.443 Thread-6 INFO CliSuite: 2021-01-06 19:35:10.443 - stdout> spark-sql> 11:36:00.390 pool-1-thread-1-ScalaTest-running-CliSuite ERROR CliSuite: === CliSuite failure output === Spark SQL CLI command line: ../../bin/spark-sql --master local --driver-java-options -Dderby.system.durability=test --conf spark.ui.enabled=false --hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=/Users/fwang12/ebay/apache-spark/target/tmp/spark-7c275c0c-fc8e-49c7-b643-18f20fe8ba51;create=true --hiveconf hive.exec.scratchdir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-bb0fded2-1a25-45d2-9f78-16898f32aefc --hiveconf conf1=conftest --hiveconf conf2=1 --hiveconf hive.metastore.warehouse.dir=/Users/fwang12/ebay/apache-spark/target/tmp/spark-4901396f-9a7a-4299-b7fc-9cb3b24c46f4 Exception: java.util.concurrent.TimeoutException: Futures timed out after [1 minute] Failed to capture next expected output " > select 'test2';" within 1 minute. {code} It seems that passing multiple queries at one time is not recommended, but there are existing tests that do so: https://github.com/apache/spark/blob/f9daf035f473fea12a2ee67428db8d78f29973d5/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/thriftserver/CliSuite.scala#L542-L544
[jira] [Updated] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-33100: Description: Currently spark-sql does not support parsing SQL statements that contain C-style comments. For the SQL statements: {code:java} /* SELECT 'test'; */ SELECT 'test'; {code} the input would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal. > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Assignee: Apache Spark >Priority: Minor > > Currently spark-sql does not support parsing SQL statements that contain C-style > comments. > For the SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > the input would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > Then an exception would be thrown because the first one is illegal.
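A statement splitter that ignores semicolons inside /* ... */ comments — the behavior this ticket asks for — can be sketched as follows. This is a hypothetical Python illustration; Spark's actual splitter also has to handle string literals, '--' line comments, and bracketed comments spanning statements.

```python
def split_statements(sql: str):
    """Split a SQL script on semicolons, treating /* ... */ bodies as opaque.

    Illustrative sketch only: a semicolon inside a C-style comment does not
    terminate a statement, which is exactly what the naive split gets wrong.
    """
    statements, current, i, in_comment = [], [], 0, False
    while i < len(sql):
        two = sql[i:i + 2]
        if in_comment:
            if two == '*/':
                in_comment = False
                current.append('*/')
                i += 2
            else:
                current.append(sql[i])   # semicolons here are literal text
                i += 1
        elif two == '/*':
            in_comment = True
            current.append('/*')
            i += 2
        elif sql[i] == ';':
            statements.append(''.join(current).strip())   # statement boundary
            current = []
            i += 1
        else:
            current.append(sql[i])
            i += 1
    tail = ''.join(current).strip()
    if tail:
        statements.append(tail)
    return statements
```

A naive split on ';' breaks "/* SELECT 'test'; */ SELECT 'test';" into the illegal fragment "/* SELECT 'test'" plus "*/ SELECT 'test'", while the comment-aware splitter keeps it as one statement.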
[jira] [Updated] (SPARK-33100) Support parse the sql statements with c-style comments
[ https://issues.apache.org/jira/browse/SPARK-33100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-33100: Description: Currently spark-sql does not support parsing SQL statements that contain C-style comments. For example, the SQL statements: {code:java} /* SELECT 'test'; */ SELECT 'test'; {code} would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal. was: Currently spark-sql does not support parsing SQL statements that contain C-style comments. For example, the SQL statements: {code:java} // Some comments here /* SELECT 'test'; */ SELECT 'test'; {code} would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal. > Support parse the sql statements with c-style comments > -- > > Key: SPARK-33100 > URL: https://issues.apache.org/jira/browse/SPARK-33100 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: feiwang >Priority: Minor > > Currently spark-sql does not support parsing SQL statements that contain C-style > comments. > For example, the SQL statements: > {code:java} > /* SELECT 'test'; */ > SELECT 'test'; > {code} > would be split into two statements: > The first: "/* SELECT 'test'" > The second: "*/ SELECT 'test'" > Then an exception would be thrown because the first one is illegal.
[jira] [Created] (SPARK-33100) Support parse the sql statements with c-style comments
feiwang created SPARK-33100: --- Summary: Support parse the sql statements with c-style comments Key: SPARK-33100 URL: https://issues.apache.org/jira/browse/SPARK-33100 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1 Reporter: feiwang Currently spark-sql does not support parsing SQL statements that contain C-style comments. For example, the SQL statements: {code:java} // Some comments here /* SELECT 'test'; */ SELECT 'test'; {code} would be split into two statements: The first: "/* SELECT 'test'" The second: "*/ SELECT 'test'" Then an exception would be thrown because the first one is illegal.
[jira] [Updated] (SPARK-31467) Fix test issue with table named `test` in hive/SQLQuerySuite
[ https://issues.apache.org/jira/browse/SPARK-31467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31467: Description: If we add a unit test in hive/SQLQuerySuite that uses a table named `test`, we may hit these exceptions. {code:java} org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is not allowed.;; [info] 'InsertIntoTable Project [_1#1403 AS key#1406, _2#1404 AS value#1407], Map(name -> Some(n1)), true, false [info] +- Project [col1#3850] [info]+- LocalRelation [col1#3850] {code} {code:java} org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or view 'test' already exists in database 'default'; [info] at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:226) [info] at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) [info] at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) [info] at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) [info] at org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216) {code} > Fix test issue with table named `test` in hive/SQLQuerySuite > > > Key: SPARK-31467 > URL: https://issues.apache.org/jira/browse/SPARK-31467 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.1.0 >Reporter: feiwang >Priority: Major > > If we add a unit test in hive/SQLQuerySuite that uses a table named `test`, we may hit > these exceptions. > {code:java} > org.apache.spark.sql.AnalysisException: Inserting into an RDD-based table is > not allowed.;; > [info] 'InsertIntoTable Project [_1#1403 AS key#1406, _2#1404 AS value#1407], > Map(name -> Some(n1)), true, false > [info] +- Project [col1#3850] > [info]+- LocalRelation [col1#3850] > {code} > {code:java} > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'test' already exists in database 'default'; > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply$mcV$sp(HiveExternalCatalog.scala:226) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$doCreateTable$1.apply(HiveExternalCatalog.scala:216) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > [info] at > org.apache.spark.sql.hive.HiveExternalCatalog.doCreateTable(HiveExternalCatalog.scala:216) > {code}
[jira] [Created] (SPARK-31467) Fix test issue with table named `test` in hive/SQLQuerySuite
feiwang created SPARK-31467: --- Summary: Fix test issue with table named `test` in hive/SQLQuerySuite Key: SPARK-31467 URL: https://issues.apache.org/jira/browse/SPARK-31467 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Resolved] (SPARK-31263) Enable yarn shuffle service close the idle connections
[ https://issues.apache.org/jira/browse/SPARK-31263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang resolved SPARK-31263. - Resolution: Duplicate > Enable yarn shuffle service close the idle connections > -- > > Key: SPARK-31263 > URL: https://issues.apache.org/jira/browse/SPARK-31263 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: feiwang >Priority: Major >
[jira] [Created] (SPARK-31263) Enable yarn shuffle service close the idle connections
feiwang created SPARK-31263: --- Summary: Enable yarn shuffle service close the idle connections Key: SPARK-31263 URL: https://issues.apache.org/jira/browse/SPARK-31263 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Updated] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait
[ https://issues.apache.org/jira/browse/SPARK-31179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31179: Description: When reading shuffle data, several fetch requests may be sent to the same shuffle server. There is a client pool, and these requests may share the same client. When the shuffle server is busy, the request connections may time out. For example, suppose there are two request connections, rc1 and rc2, io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes. 1: rc1 holds the client lock and times out after 2 minutes. 2: rc2 holds the client lock and times out after 2 minutes. 3: rc1 starts its second retry, holds the lock and times out after 2 minutes. 4: rc2 starts its second retry, holds the lock and times out after 2 minutes. 5: rc1 starts its third retry, holds the lock and times out after 2 minutes. 6: rc2 starts its third retry, holds the lock and times out after 2 minutes. This wastes a lot of time.
> Fast fail the connection while last shuffle connection failed in the last retry IO wait > > > Key: SPARK-31179 > URL: https://issues.apache.org/jira/browse/SPARK-31179 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 3.1.0 > Reporter: feiwang > Priority: Major > > When reading shuffle data, several fetch requests may be sent to the same shuffle > server. > There is a client pool, and these requests may share the same client. > When the shuffle server is busy, the request connections may time out. > For example, suppose there are two request connections, rc1 and rc2, io.numConnectionsPerPeer is 1, and the connection timeout is 2 > minutes. > 1: rc1 holds the client lock and times out after 2 minutes. > 2: rc2 holds the client lock and times out after 2 minutes. > 3: rc1 starts its second retry, holds the lock and times out after 2 minutes. > 4: rc2 starts its second retry, holds the lock and times out after 2 minutes. > 5: rc1 starts its third retry, holds the lock and times out after 2 minutes. > 6: rc2 starts its third retry, holds the lock and times out after 2 minutes. > This wastes a lot of time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
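The fast-fail idea can be sketched in miniature. This is an illustration only, not Spark's actual TransportClientFactory logic; the class and method names are hypothetical. The point is that once a connection attempt to a host has just failed, a second request waiting on the same client lock should fail immediately rather than burn another full connection timeout:

```python
class FastFailClientPool:
    """Hypothetical sketch: remember when the last connection attempt to a
    host failed, and if a new attempt arrives while still inside the retry
    wait window of that failure, fail it immediately instead of blocking
    on the client lock for another full connection timeout."""

    def __init__(self, retry_wait_s: float):
        self.retry_wait_s = retry_wait_s
        self.last_failure = {}  # host -> monotonic timestamp of last failed attempt

    def record_failure(self, host: str, now: float) -> None:
        self.last_failure[host] = now

    def should_fast_fail(self, host: str, now: float) -> bool:
        last = self.last_failure.get(host)
        return last is not None and (now - last) < self.retry_wait_s


pool = FastFailClientPool(retry_wait_s=5.0)
pool.record_failure("shuffle-server-1", now=100.0)
# rc2 arrives 3 s after rc1's failure: fail fast instead of waiting 2 min.
print(pool.should_fast_fail("shuffle-server-1", now=103.0))  # True
# After the retry wait has elapsed, a fresh connection attempt is allowed.
print(pool.should_fast_fail("shuffle-server-1", now=106.0))  # False
```

In the six-step timeline above, this would let rc2 (and each later retry that follows a fresh failure) return immediately instead of serializing six 2-minute timeouts.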
[jira] [Created] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait
feiwang created SPARK-31179: --- Summary: Fast fail the connection while last shuffle connection failed in the last retry IO wait Key: SPARK-31179 URL: https://issues.apache.org/jira/browse/SPARK-31179 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Updated] (SPARK-31093) Fast fail while fetching shuffle data unsuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31093: Description: When fetching shuffle data fails, we put a FailureFetchResult into results (a LinkedBlockingQueue) and wait for it to be taken. Then a FetchFailedException is thrown. In fact, we could fail the task fast as soon as fetching shuffle data fails.
> Fast fail while fetching shuffle data unsuccessfully > > > Key: SPARK-31093 > URL: https://issues.apache.org/jira/browse/SPARK-31093 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 3.1.0 > Reporter: feiwang > Priority: Minor > > When fetching shuffle data fails, we put a FailureFetchResult into > results (a LinkedBlockingQueue) and wait for it to be taken. > Then a FetchFailedException is thrown. > In fact, we could fail the task fast as soon as fetching shuffle data fails.
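The proposed change can be sketched like this. It is illustrative only: the real code lives in Spark's shuffle fetch path (ShuffleBlockFetcherIterator), and the names here are hypothetical stand-ins. Without fast fail, a failure result is enqueued and surfaces only when the consumer eventually takes it from the blocking results queue; with fast fail, the failure is raised immediately:

```python
import queue


class FetchFailedError(Exception):
    """Stands in for Spark's FetchFailedException."""


def on_fetch_result(results: "queue.Queue", result, fast_fail: bool = True):
    """Hypothetical result handler: successes are enqueued as before, but
    with fast_fail a failed fetch raises right away instead of being
    enqueued and discovered later by the blocked consumer."""
    if isinstance(result, Exception) and fast_fail:
        raise FetchFailedError(str(result))
    results.put(result)


q = queue.Queue()
on_fetch_result(q, b"block-data")  # success path: enqueued as before
try:
    on_fetch_result(q, IOError("connection reset"))
except FetchFailedError as e:
    print("failed fast:", e)
```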
[jira] [Updated] (SPARK-31093) Fast fail while fetching shuffle data unsuccessfully
[ https://issues.apache.org/jira/browse/SPARK-31093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31093: Summary: Fast fail while fetching shuffle data unsuccessfully (was: Fast fail while fetching shuffle data from a remote block unsuccessfully) > Fast fail while fetching shuffle data unsuccessfully > > > Key: SPARK-31093 > URL: https://issues.apache.org/jira/browse/SPARK-31093 > Project: Spark > Issue Type: Improvement > Components: Shuffle > Affects Versions: 3.1.0 > Reporter: feiwang > Priority: Minor >
[jira] [Created] (SPARK-31093) Fast fail while fetching shuffle data from a remote block unsuccessfully
feiwang created SPARK-31093: --- Summary: Fast fail while fetching shuffle data from a remote block unsuccessfully Key: SPARK-31093 URL: https://issues.apache.org/jira/browse/SPARK-31093 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.1.0 Reporter: feiwang
[jira] [Updated] (SPARK-31016) [DEPLOY] Pack the user jars when submitting Spark Application
[ https://issues.apache.org/jira/browse/SPARK-31016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-31016: Description: Nowadays, Spark only packs the jars under $SPARK_HOME/jars. How about also packing the user jars when submitting a Spark application? Sometimes a user application involves many jars besides the Spark libraries. I think this can reduce the pressure on HDFS and the NodeManager (localizer). was: Now, Spark only packs the jars under $SPARK_HOME/jars. How about packing the user jars when submitting a Spark application? I think it can reduce the pressure on HDFS and the NodeManager (localizer).
> [DEPLOY] Pack the user jars when submitting Spark Application > - > > Key: SPARK-31016 > URL: https://issues.apache.org/jira/browse/SPARK-31016 > Project: Spark > Issue Type: Improvement > Components: Deploy > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Minor > > Nowadays, Spark only packs the jars under $SPARK_HOME/jars. > How about also packing the user jars when submitting a Spark application? > Sometimes a user application involves many jars besides the Spark libraries. > I think this can reduce the pressure on HDFS and the NodeManager (localizer).
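The idea amounts to a small packing step at submit time. The sketch below is an illustration under assumed names and paths, not Spark's actual deploy code: bundling many user jars into one archive means HDFS and the YARN NodeManager localizer handle a single file instead of hundreds.

```python
import pathlib
import tempfile
import zipfile


def pack_user_jars(jar_paths, archive_path):
    """Bundle many user jars into a single archive so the distributed
    cache uploads/localizes one file instead of one file per jar."""
    with zipfile.ZipFile(archive_path, "w") as zf:
        for jar in jar_paths:
            zf.write(jar, arcname=pathlib.Path(jar).name)
    return archive_path


# Demo with two hypothetical user jars in a temp directory.
tmp = pathlib.Path(tempfile.mkdtemp())
jars = []
for name in ("udfs.jar", "deps.jar"):
    p = tmp / name
    p.write_bytes(b"\x50\x4b")  # placeholder bytes, not a real jar
    jars.append(str(p))

packed = pack_user_jars(jars, str(tmp / "user-jars.zip"))
print(sorted(zipfile.ZipFile(packed).namelist()))  # ['deps.jar', 'udfs.jar']
```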
[jira] [Created] (SPARK-31016) [DEPLOY] Pack the user jars when submitting Spark Application
feiwang created SPARK-31016: --- Summary: [DEPLOY] Pack the user jars when submitting Spark Application Key: SPARK-31016 URL: https://issues.apache.org/jira/browse/SPARK-31016 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 3.0.0 Reporter: feiwang Now, Spark only packs the jars under $SPARK_HOME/jars. How about packing the user jars when submitting a Spark application? I think it can reduce the pressure on HDFS and the NodeManager (localizer).
[jira] [Updated] (SPARK-30472) [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType.
[ https://issues.apache.org/jira/browse/SPARK-30472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30472: Summary: [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType. (was: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.) > [SQL] ANSI SQL: Throw exception on format invalid and overflow when casting String to IntegerType. > -- > > Key: SPARK-30472 > URL: https://issues.apache.org/jira/browse/SPARK-30472 > Project: Spark > Issue Type: Sub-task > Components: SQL > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Minor >
[jira] [Created] (SPARK-30472) ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow.
feiwang created SPARK-30472: --- Summary: ANSI SQL: Cast String to Integer Type, throw exception on format invalid and overflow. Key: SPARK-30472 URL: https://issues.apache.org/jira/browse/SPARK-30472 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: feiwang
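The requested ANSI behavior can be illustrated in miniature. This is not Spark code; it is a toy strict cast showing the contrast with pre-ANSI Spark, which silently returns NULL for an invalid or overflowing string-to-int cast:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def ansi_cast_string_to_int(s: str) -> int:
    """Strict, ANSI-style cast: an invalid format raises ValueError and an
    out-of-range value raises OverflowError, instead of yielding NULL."""
    v = int(s.strip())  # raises ValueError on invalid format, e.g. "abc"
    if not (INT_MIN <= v <= INT_MAX):
        raise OverflowError(f"{s!r} is out of range for a 32-bit integer")
    return v


print(ansi_cast_string_to_int("123"))  # 123
```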
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When we compare a StringType and an IntegerType, e.g. '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; // result is null {code} was: When we compare a StringType and an IntegerType, e.g. '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code}
> Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Major > > When we compare a StringType and an IntegerType, e.g. > '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): > currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of the string may exceed Int.MaxValue, and then the result is > corrupted.
> For example: > {code:java} > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; // result is null > {code}
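The corruption can be reproduced in miniature: with IntegerType as the common comparison type, '2147483648' overflows on the cast and the predicate evaluates to NULL, whereas widening to a 64-bit (or decimal) common type gives the expected answer. The helper names below are illustrative, not Spark's:

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1


def compare_with_int_common_type(s: str, n: int):
    """Mimics findCommonTypeForBinaryComparison returning IntegerType: the
    string is cast to a 32-bit int first; overflow yields NULL (None here),
    so the whole comparison is NULL rather than True."""
    v = int(s)
    if not (INT_MIN <= v <= INT_MAX):
        return None  # the cast overflows -> NULL under SQL semantics
    return v > n


def compare_with_long_common_type(s: str, n: int):
    """Safer: widen both sides to a larger common type before comparing."""
    return int(s) > n


print(compare_with_int_common_type("2147483648", 0))   # None (the bug)
print(compare_with_long_common_type("2147483648", 0))  # True
```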
[jira] [Updated] (SPARK-30471) Fix issue when compare string and IntegerType
[ https://issues.apache.org/jira/browse/SPARK-30471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-30471: Description: When we compare a StringType and an IntegerType, e.g. '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) is IntegerType. But the value of the string may exceed Int.MaxValue, and then the result is corrupted. For example: {code:java} CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS STRING)) AS ta(id); SELECT * FROM ta WHERE id > 0; {code}
> Fix issue when compare string and IntegerType > - > > Key: SPARK-30471 > URL: https://issues.apache.org/jira/browse/SPARK-30471 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.0.0 > Reporter: feiwang > Priority: Major > > When we compare a StringType and an IntegerType, e.g. > '2147483648' (StringType, which exceeds Int.MaxValue) > 0 (IntegerType): > currently the result of findCommonTypeForBinaryComparison(StringType, IntegerType) > is IntegerType. > But the value of the string may exceed Int.MaxValue, and then the result is > corrupted. > For example: > {code:java} > CREATE TEMPORARY VIEW ta AS SELECT * FROM VALUES(CAST ('2147483648' AS > STRING)) AS ta(id); > SELECT * FROM ta WHERE id > 0; > {code}
[jira] [Created] (SPARK-30471) Fix issue when compare string and IntegerType
feiwang created SPARK-30471: --- Summary: Fix issue when compare string and IntegerType Key: SPARK-30471 URL: https://issues.apache.org/jira/browse/SPARK-30471 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: feiwang
[jira] [Updated] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
[ https://issues.apache.org/jira/browse/SPARK-29857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29857: Description: When there are many applications in the Spark history server, rendering the history summary page is heavy; we can enable deferRender to tune it. See details at https://datatables.net/examples/ajax/defer_render.html
> [WEB UI] Support defer render the spark history summary page. > -- > > Key: SPARK-29857 > URL: https://issues.apache.org/jira/browse/SPARK-29857 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > > When there are many applications in the Spark history server, rendering the > history summary page is heavy; we can enable deferRender to tune it. > See details at https://datatables.net/examples/ajax/defer_render.html
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Exception information cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code} was: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)].
Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code}
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Exception information > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) > : +- Project [id#220] > : +- SubqueryAlias `default`.`tb` > : +- Relation[id#220] parquet > +- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code}
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception. > {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)].
> Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) > : +- Project [id#220] > : +- SubqueryAlias `default`.`tb` > : +- Relation[id#220] parquet > +- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here public String getFoo() { return foo; } {code} was: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception.
> {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Some comments here > public String getFoo() > { > return foo; > } > {code}
[jira] [Updated] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
[ https://issues.apache.org/jira/browse/SPARK-29860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29860: Description: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here cannot resolve '(default.ta.`id` IN (listquery()))' due to data type mismatch: The data type of one or more elements in the left hand side of an IN subquery is not compatible with the data type of the output of the subquery Mismatched columns: [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] Left side: [decimal(18,0)]. Right side: [decimal(19,0)].;; 'Project [*] +- 'Filter id#219 IN (list#218 []) : +- Project [id#220] : +- SubqueryAlias `default`.`tb` : +- Relation[id#220] parquet +- SubqueryAlias `default`.`ta` +- Relation[id#219] parquet {code} was: The following statement throws an exception. {code:java} sql("create table ta(id Decimal(18,0)) using parquet") sql("create table tb(id Decimal(19,0)) using parquet") sql("select * from ta where id in (select id from tb)").show() {code} {code:java} // Some comments here public String getFoo() { return foo; } {code}
> [SQL] Fix data type mismatch issue for inSubQuery > - > > Key: SPARK-29860 > URL: https://issues.apache.org/jira/browse/SPARK-29860 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Major > > The following statement throws an exception.
> {code:java} > sql("create table ta(id Decimal(18,0)) using parquet") > sql("create table tb(id Decimal(19,0)) using parquet") > sql("select * from ta where id in (select id from tb)").show() > {code} > {code:java} > // Some comments here > cannot resolve '(default.ta.`id` IN (listquery()))' due to data type > mismatch: > The data type of one or more elements in the left hand side of an IN subquery > is not compatible with the data type of the output of the subquery > Mismatched columns: > [(default.ta.`id`:decimal(18,0), default.tb.`id`:decimal(19,0))] > Left side: > [decimal(18,0)]. > Right side: > [decimal(19,0)].;; > 'Project [*] > +- 'Filter id#219 IN (list#218 []) > : +- Project [id#220] > : +- SubqueryAlias `default`.`tb` > : +- Relation[id#220] parquet > +- SubqueryAlias `default`.`ta` > +- Relation[id#219] parquet > {code}
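The direction of a fix can be sketched as choosing a common decimal type wide enough for both sides before type-checking the IN comparison. The rule below is a deliberate simplification of decimal widening (it only caps precision at Spark's maximum of 38), and the function name is illustrative:

```python
def widen_decimals(p1: int, s1: int, p2: int, s2: int):
    """Return a (precision, scale) wide enough to hold both
    decimal(p1, s1) and decimal(p2, s2)."""
    scale = max(s1, s2)
    integral = max(p1 - s1, p2 - s2)  # digits left of the decimal point
    return (min(integral + scale, 38), scale)  # 38 = Spark's max precision


# decimal(18,0) vs decimal(19,0): both fit in decimal(19,0), so the
# IN-subquery comparison could be resolved instead of rejected.
print(widen_decimals(18, 0, 19, 0))  # (19, 0)
```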
[jira] [Created] (SPARK-29860) [SQL] Fix data type mismatch issue for inSubQuery
feiwang created SPARK-29860: --- Summary: [SQL] Fix data type mismatch issue for inSubQuery Key: SPARK-29860 URL: https://issues.apache.org/jira/browse/SPARK-29860 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Created] (SPARK-29857) [WEB UI] Support defer render the spark history summary page.
feiwang created SPARK-29857: --- Summary: [WEB UI] Support defer render the spark history summary page. Key: SPARK-29857 URL: https://issues.apache.org/jira/browse/SPARK-29857 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: As shown in the attachment, if a task fails during reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important to users; it can help detect data skew. was: If a task fails during reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important to users; it can help detect data skew. !screenshot-1.png!
> [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > Attachments: screenshot-1.png > > > As shown in the attachment, if a task fails during reading shuffle data or > because of executor loss, its shuffle read size is shown as 0. > But this size is important to users; it can help detect data skew.
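The UI change amounts to including failed tasks when aggregating shuffle-read bytes. A toy aggregation with made-up field names, purely to illustrate why counting failed tasks' partial reads matters for spotting skew:

```python
def total_shuffle_read_bytes(tasks, include_failed=True):
    """Sum shuffle-read bytes across tasks. Counting failed tasks'
    partial reads keeps skewed partitions visible even when the
    reading task died mid-fetch."""
    return sum(
        t["shuffle_read_bytes"]
        for t in tasks
        if include_failed or t["status"] == "SUCCESS"
    )


tasks = [
    {"status": "SUCCESS", "shuffle_read_bytes": 1_000},
    {"status": "FAILED", "shuffle_read_bytes": 9_000_000},  # the skewed partition
]
print(total_shuffle_read_bytes(tasks, include_failed=False))  # 1000
print(total_shuffle_read_bytes(tasks))                        # 9001000
```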
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Description: If a task fails during reading shuffle data or because of executor loss, its shuffle read size is shown as 0. But this size is important to users; it can help detect data skew. !screenshot-1.png!
> [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > Attachments: screenshot-1.png > > > If a task fails during reading shuffle data or because of executor loss, its > shuffle read size is shown as 0. > But this size is important to users; it can help detect data skew. > !screenshot-1.png!
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Attachment: screenshot-1.png > [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor > Attachments: screenshot-1.png > >
[jira] [Updated] (SPARK-29689) [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
[ https://issues.apache.org/jira/browse/SPARK-29689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29689: Summary: [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size (was: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size) > [WEB-UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size > --- > > Key: SPARK-29689 > URL: https://issues.apache.org/jira/browse/SPARK-29689 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 2.4.4 > Reporter: feiwang > Priority: Minor >
[jira] [Created] (SPARK-29689) [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size
feiwang created SPARK-29689: --- Summary: [UI] When task failed during reading shuffle data or other failure, enable show total shuffle read size Key: SPARK-29689 URL: https://issues.apache.org/jira/browse/SPARK-29689 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Comment Edited] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations
[ https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960244#comment-16960244 ] feiwang edited comment on SPARK-27736 at 10/26/19 2:30 AM: --- Hi, we met this issue recently. [~joshrosen] [~tgraves] How about implementing a simple solution: * Let ExternalShuffleClient query whether an executor is registered in the ESS * when a FetchFailedException is thrown, check whether this executor is registered in the ESS * if not, we should remove all outputs of executors that are not registered on this host. If that is OK, I can implement it. was (Author: hzfeiwang): Hi, we met this issue recently. [~joshrosen] [~tgraves] How about implementing a simple solution: * Let ExternalShuffleClient query whether an executor is registered in the ESS * when removing an executor, check whether this executor is registered in the ESS * if not, we should remove all outputs of executors that are not registered on this host. If that is OK, I can implement it.
> Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations > - > > Key: SPARK-27736 > URL: https://issues.apache.org/jira/browse/SPARK-27736 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 2.4.0 > Reporter: Josh Rosen > Priority: Minor > > This ticket describes a fault-tolerance edge-case which can cause Spark jobs > to fail if a single external shuffle service process reboots and fails to > recover the list of registered executors (something which can happen when > using YARN if NodeManager recovery is disabled) _and_ the Spark job has a > large number of executors per host. > I believe this problem can be worked around today via a change of > configurations, but I'm filing this issue to (a) better document this > problem, and (b) propose either a change of default configurations or > additional DAGScheduler logic to better handle this failure mode. > h2.
Problem description > The external shuffle service process is _mostly_ stateless except for a map > tracking the set of registered applications and executors. > When processing a shuffle fetch request, the shuffle services first checks > whether the requested block ID's executor is registered; if it's not > registered then the shuffle service throws an exception like > {code:java} > java.lang.RuntimeException: Executor is not registered > (appId=application_1557557221330_6891, execId=428){code} > and this exception becomes a {{FetchFailed}} error in the executor requesting > the shuffle block. > In normal operation this error should not occur because executors shouldn't > be mis-routing shuffle fetch requests. However, this _can_ happen if the > shuffle service crashes and restarts, causing it to lose its in-memory > executor registration state. With YARN this state can be recovered from disk > if YARN NodeManager recovery is enabled (using the mechanism added in > SPARK-9439), but I don't believe that we perform state recovery in Standalone > and Mesos modes (see SPARK-24223). > If state cannot be recovered then map outputs cannot be served (even though > the files probably still exist on disk). In theory, this shouldn't cause > Spark jobs to fail because we can always redundantly recompute lost / > unfetchable map outputs. > However, in practice this can cause total job failures in deployments where > the node with the failed shuffle service was running a large number of > executors: by default, the DAGScheduler unregisters map outputs _only from > individual executor whose shuffle blocks could not be fetched_ (see > [code|https://github.com/apache/spark/blame/bfb3ffe9b33a403a1f3b6f5407d34a477ce62c85/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1643]), > so it can take several rounds of failed stage attempts to fail and clear > output from all executors on the faulty host. 
If the number of executors on a > host is greater than the stage retry limit then this can exhaust stage retry > attempts and cause job failures. > This "multiple rounds of recomputation to discover all failed executors on a > host" problem was addressed by SPARK-19753, which added a > {{spark.files.fetchFailure.unRegisterOutputOnHost}} configuration which > promotes executor fetch failures into host-wide fetch failures (clearing > output from all neighboring executors upon a single failure). However, that > configuration is {{false}} by default. > h2. Potential solutions > I have a few ideas about how we can improve this situation: > - Update the [YARN external shuffle service > documentation|https://spark.apache.org/docs/latest/running-on-yarn.html#configuring-the-external-shuffle-service] > to recommend
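The retry arithmetic described above can be sketched outside Spark. This is a hypothetical, simplified model (none of it is Spark's actual DAGScheduler code): per-executor unregistration needs one failed stage attempt per executor on the bad host, while the host-wide behaviour added by SPARK-19753 clears everything in one round.

```python
def rounds_to_clear(executors_on_host, unregister_whole_host):
    """Return how many failed stage attempts it takes until no stale map
    output from the bad host remains registered (toy model, not Spark)."""
    registered = set(executors_on_host)
    rounds = 0
    while registered:
        rounds += 1
        if unregister_whole_host:
            # spark.files.fetchFailure.unRegisterOutputOnHost=true behaviour:
            # one FetchFailed clears every executor on the host.
            registered.clear()
        else:
            # Default behaviour: each FetchFailed clears only one executor.
            registered.pop()
    return rounds

execs = [f"exec-{i}" for i in range(8)]
print(rounds_to_clear(execs, unregister_whole_host=False))  # 8 rounds
print(rounds_to_clear(execs, unregister_whole_host=True))   # 1 round
```

With eight executors on the host and a default stage retry limit of four, the per-executor mode exhausts retries before the host is fully cleared, which is the job-failure mode the ticket describes.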
[jira] [Commented] (SPARK-27736) Improve handling of FetchFailures caused by ExternalShuffleService losing track of executor registrations
[ https://issues.apache.org/jira/browse/SPARK-27736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960244#comment-16960244 ] feiwang commented on SPARK-27736: - Hi, we met this issue recently. [~joshrosen] [~tgraves] How about implementing a simple solution: * Let externalShuffleClient can query whether a executor is registered in ESS * when remove executor, check whether this executor is registered in ESS * if not, we should remove all outputs of executors that are not registered on this host. If it is Ok, I can implement it. > Improve handling of FetchFailures caused by ExternalShuffleService losing > track of executor registrations > - > > Key: SPARK-27736 > URL: https://issues.apache.org/jira/browse/SPARK-27736 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Josh Rosen >Priority: Minor >
[jira] [Updated] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It suggests that each partition processes at most that many bytes in Spark SQL. As shown in the attachment, the value of spark.sql.files.maxPartitionBytes is 128MB. For stage 1, its input is 16.3TB, but there are only 6400 tasks. I checked the code; it is only effective for data source tables. So, its description is confusing. The same applies to all the `spark.sql.files.*` descriptions. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. As shown in the attachment, the value of spark.sql.files.maxPartitionBytes is 128MB. For stage 1, its input is 16.3TB, but there are only 6400 tasks. I checked the code, it is only effective for data source table. So, its description is confused. > [SQL][DOC] The descriptions of `spark.sql.files.*` are confused. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi, the description of `spark.sql.files.maxPartitionBytes` is shown below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It suggests that each partition processes at most that many bytes in Spark > SQL. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code; it is only effective for data source tables. > So, its description is confusing. > The same applies to all the `spark.sql.files.*` descriptions. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
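A rough, hypothetical sketch of the packing behaviour the option name implies, loosely modelled on Spark's FilePartition splitting. Here `open_cost` stands in for spark.sql.files.openCostInBytes; the real implementation also factors in parallelism and, as the ticket points out, only applies to data source (file-based) tables:

```python
def pack_partitions(file_sizes, max_partition_bytes, open_cost=4 * 1024 * 1024):
    """Greedily pack file splits into read partitions (simplified sketch,
    not Spark's actual FilePartition.getFilePartitions code)."""
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        # Each file contributes its size plus a per-file "open cost".
        padded = size + open_cost
        if current and current_bytes + padded > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += padded
    if current:
        partitions.append(current)
    return partitions

# 10 files of 100 MB with a 128 MB cap -> one file per read partition.
parts = pack_partitions([100 * 1024 * 1024] * 10, 128 * 1024 * 1024)
print(len(parts))  # 10
```

For a Hive table read through the non-file-source path, no such packing happens at all, which is why a 16.3 TB input can end up in only 6400 tasks despite a 128 MB setting.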
[jira] [Updated] (SPARK-29542) [SQL][DOC] The descriptions of `spark.sql.files.*` are confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Summary: [SQL][DOC] The descriptions of `spark.sql.files.*` are confused. (was: [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.) > [SQL][DOC] The descriptions of `spark.sql.files.*` are confused. > > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. As shown in the attachment, the value of spark.sql.files.maxPartitionBytes is 128MB. For stage 1, its input is 16.3TB, but there are only 6400 tasks. I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, for stage 1, there are 6.1 I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > As shown in the attachment, the value of spark.sql.files.maxPartitionBytes > is 128MB. > For stage 1, its input is 16.3TB, but there are only 6400 tasks. > I checked the code, it is only effective for data source table. > So, its description is confused. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, for stage 1, there are 6.1 I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > But as shown in the attachment, for stage 1, there are 6.1 > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > But as shown in the attachment, it can not. > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, I checked the code, it is only effective for data source table. So, its description is confused. was: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value for spark sql. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value for spark sql. > But as shown in the attachment, > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Description: Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. {code:java} The maximum number of bytes to pack into a single partition when reading files. {code} It seems that it can ensure each partition at most process bytes of that value. But as shown in the attachment, it can not. I checked the code, it is only effective for data source table. So, its description is confused. > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > > Hi,the description of `spark.sql.files.maxPartitionBytes` is shown as below. > {code:java} > The maximum number of bytes to pack into a single partition when reading > files. > {code} > It seems that it can ensure each partition at most process bytes of that > value. > But as shown in the attachment, it can not. > I checked the code, it is only effective for data source table. > So, its description is confused. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
[ https://issues.apache.org/jira/browse/SPARK-29542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29542: Attachment: screenshot-1.png > [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. > - > > Key: SPARK-29542 > URL: https://issues.apache.org/jira/browse/SPARK-29542 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > Attachments: screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29542) [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused.
feiwang created SPARK-29542: --- Summary: [DOC] The description of `spark.sql.files.maxPartitionBytes` is confused. Key: SPARK-29542 URL: https://issues.apache.org/jira/browse/SPARK-29542 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 2.4.4 Reporter: feiwang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16954424#comment-16954424 ] feiwang commented on SPARK-29262: - I'll try to implement it. > DataFrameWriter insertIntoPartition function > > > Key: SPARK-29262 > URL: https://issues.apache.org/jira/browse/SPARK-29262 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: feiwang >Priority: Minor > > InsertIntoPartition is a useful function. > For SQL statements, the relevant syntax is: > {code:java} > insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ... > {code} > In the example above, I specify all the partition key values, so it must be a > static partition overwrite, regardless of whether dynamic partition overwrite > is enabled. > If dynamic partition overwrite is enabled, the SQL below will only overwrite > the relevant partitions, not the whole table. > If it is disabled, the whole table will be overwritten. > {code:java} > insert overwrite table tbl_a partition(p1,p2,...,pn) select ... > {code} > As of now, DataFrame does not support overwriting a specific partition. > It means that, for a partitioned table, an insert overwrite through a > DataFrame with dynamic partition overwrite disabled will always overwrite the > whole table. > So, we should support insertIntoPartition in DataFrameWriter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
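A toy model (not the Spark API) of the dynamic-vs-static overwrite behaviour the ticket describes, with a plain dict standing in for a partitioned table:

```python
def insert_overwrite(table, new_data, dynamic):
    """Toy model of 'insert overwrite' into a partitioned table
    (partition-key -> rows). Not Spark code."""
    if dynamic:
        # Dynamic partition overwrite: only partitions present in the
        # new data are replaced; other partitions are untouched.
        for part, rows in new_data.items():
            table[part] = rows
    else:
        # Static mode with no partition spec: the whole table is replaced.
        table.clear()
        table.update(new_data)
    return table

t = {"p=1": ["a"], "p=2": ["b"]}
print(insert_overwrite(dict(t), {"p=2": ["c"]}, dynamic=True))   # {'p=1': ['a'], 'p=2': ['c']}
print(insert_overwrite(dict(t), {"p=2": ["c"]}, dynamic=False))  # {'p=2': ['c']}
```

The proposed insertIntoPartition would let a DataFrame writer express the first behaviour for a chosen partition even when dynamic partition overwrite is disabled.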
[jira] [Updated] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037: Description: For InsertIntoHadoopFsRelation operations. Case A: Application appA insert-overwrites table table_a with static partition overwrite. But it was killed while committing tasks, because one task hung. Parts of its committed task output were kept under /path/table_a/_temporary/0/. Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/. It executes successfully, but it also commits the data left behind by the killed application to the destination dir. Case B: Application appA insert-overwrites table table_a. Application appB insert-overwrites table table_a, too. They execute concurrently, and they may both use /path/table_a/_temporary/0/ as their work path. Their results may be corrupted. was: When we insert overwrite a partition of table. For a stage, whose tasks commit output, a task saves output to a staging dir firstly, when this task complete, it will save output to committedTaskPath, when all tasks of this stage success, all task output under committedTaskPath will be moved to destination dir. However, when we kill an application, which is committing tasks' output, parts of tasks' results will be kept in committedTaskPath, which would not be cleared gracefully. Then we rerun this application and the new application will reuse this committedTaskPath dir. And when the task commit stage of new application success, all task output under this committedTaskPath, which contains parts of old application's task output , would be moved to destination dir and the result is duplicated. 
> [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.3 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > For InsertIntoHadoopFsRelation operations. > Case A: > Application appA insert-overwrites table table_a with static partition > overwrite. > But it was killed while committing tasks, because one task hung. > Parts of its committed task output were kept under > /path/table_a/_temporary/0/. > Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/. > It executes successfully, but it also commits the data left behind by the > killed application to the destination dir. > Case B: > Application appA insert-overwrites table table_a. > Application appB insert-overwrites table table_a, too. > They execute concurrently, and they may both use /path/table_a/_temporary/0/ > as their work path. > Their results may be corrupted. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
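Case A above can be sketched with a toy model of the shared staging directory (the layout is simplified; this is not Spark's real FileOutputCommitter code):

```python
# A dict stands in for the shared staging dir /path/table_a/_temporary/0/.
staging = {}  # path -> rows

def commit_task(task_id, rows):
    """Task commit: move task output under the shared staging dir."""
    staging[f"_temporary/0/task_{task_id}"] = rows

def commit_job():
    """Job commit: move *everything* under the staging dir to the
    destination, including leftovers from a killed earlier run."""
    out = [r for rows in staging.values() for r in rows]
    staging.clear()
    return out

# First run commits one task, then is killed before job commit / cleanup.
commit_task("attempt1_t0", ["row1"])
# The rerun reuses the same staging dir and writes the full result again.
commit_task("attempt2_t0", ["row1"])
final = commit_job()
print(sorted(final))  # ['row1', 'row1'] -> the row is duplicated
```

Making the staging directory unique per application attempt (rather than the fixed `_temporary/0`) would avoid both the duplicate-on-rerun case and the concurrent-writers case.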
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949229#comment-16949229 ] feiwang edited comment on SPARK-29302 at 10/11/19 7:46 AM: --- [~dangdangdang] Hi, I have thought a simple solution. We just need make the file name of a task be unique. And the OutputCommitCoordinator would decide which task file can be committed. But I don't have an appropriate unit test. was (Author: hzfeiwang): [~dangdangdang] Hi, I have thought a simple solution. We just need make the file name of a task be unique. And the OutputCommitCoordinator would decide which task file can be committed. But I don't have a appropriate unit test. > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949229#comment-16949229 ] feiwang commented on SPARK-29302: - [~dangdangdang] Hi, I have thought a simple solution. We just need make the file name of a task be unique. And the OutputCommitCoordinator would decide which task file can be committed. But I don't have a appropriate unit test. > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
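The proposal in the comment above can be sketched as follows. This is hypothetical code, not the actual Spark patch: each task attempt writes to a uniquely named file, and a coordinator lets exactly one attempt per partition commit.

```python
class OutputCommitCoordinator:
    """Toy commit coordinator: the first attempt to ask for a
    partition wins; any later (speculative) attempt is refused."""

    def __init__(self):
        self._winner = {}  # partition -> winning attempt id

    def can_commit(self, partition, attempt):
        return self._winner.setdefault(partition, attempt) == attempt

coord = OutputCommitCoordinator()
committed = []
for attempt in ("attempt-0", "attempt-1"):      # a task and its speculative copy
    filename = f"part-00000-{attempt}.parquet"  # unique name per attempt
    if coord.can_commit(partition=0, attempt=attempt):
        committed.append(filename)

print(committed)  # only attempt-0's file is committed
```

Because the two attempts never share a file name, they cannot concurrently write the same path, and the coordinator guarantees only one of them is promoted to the final output.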
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947451#comment-16947451 ] feiwang commented on SPARK-29302: - cc [~cloud_fan] > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947446#comment-16947446 ] feiwang edited comment on SPARK-29302 at 10/9/19 8:31 AM: -- Sorry for the late reply, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT above. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! was (Author: hzfeiwang): Sorry, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT above. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947446#comment-16947446 ] feiwang edited comment on SPARK-29302 at 10/9/19 8:30 AM: -- Sorry, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT above. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! was (Author: hzfeiwang): Sorry, I was on my National Day holiday for the past eight days. I just made a simple jobId in the UT below. In fact, it was created by a jobIdInstant. And for the tasks of a same job, they are same. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png! > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png, screenshot-2.png > > > Now, for a dynamic partition overwrite operation, the filename of a task > output is determinable. > So, if speculation is enabled, would a task conflict with its relative > speculation task? > Would the two tasks concurrent write a same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang reopened SPARK-29302: -
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947446#comment-16947446 ] feiwang commented on SPARK-29302: - Sorry, I was on my National Day holiday for the past eight days. I just used a simple jobId in the UT below. In fact, it is created from a jobIdInstant, and the tasks of the same job all share it. So, I think this is still an issue. !screenshot-1.png! !screenshot-2.png!
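The point about the jobIdInstant can be sketched in isolation (a hypothetical, simplified re-creation; the real helper is SparkHadoopWriterUtils.createJobID and its exact format may differ): the job ID is derived from a timestamp captured once per job plus a stage id, and the attempt number never enters into it, so a task and its speculative copy observe the same value.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}

object JobIdSketch {
  // Hypothetical re-creation of a timestamp-based job ID ("jobIdInstant").
  // The attempt number is deliberately absent, so an original task and its
  // speculative copy derive identical IDs.
  def createJobId(jobTime: Date, stageId: Int): String = {
    val fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)
    f"job_${fmt.format(jobTime)}_$stageId%04d"
  }

  def main(args: Array[String]): Unit = {
    val instant = new Date(0L) // captured once, when the job starts
    val idSeenByAttempt0 = createJobId(instant, 1)
    val idSeenByAttempt1 = createJobId(instant, 1)
    // Both attempts of the same task see the same job ID.
    assert(idSeenByAttempt0 == idSeenByAttempt1)
  }
}
```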
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: screenshot-1.png
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: (was: screenshot-1.png)
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: screenshot-2.png
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Attachment: screenshot-1.png
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941485#comment-16941485 ] feiwang commented on SPARK-29302: - Yes, they are in the same stage, so they have the same jobId.
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941475#comment-16941475 ] feiwang commented on SPARK-29295: - Yes, how can we resolve this issue? Only by upgrading the Hive version?
> Duplicate result when dropping partition of an external table and then overwriting
> --
>
> Key: SPARK-29295
> URL: https://issues.apache.org/jira/browse/SPARK-29295
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Major
>
> When we drop a partition of an external table and then overwrite it, if we set
> CONVERT_METASTORE_PARQUET=true (the default value), it will overwrite this partition.
> But when we set CONVERT_METASTORE_PARQUET=false, it will give a duplicate result.
> Here is reproduction code (you can add it into SQLQuerySuite in the hive module):
> {code:java}
> test("spark gives duplicate result when dropping a partition of an external partitioned table" +
>   " first and then overwriting it") {
>   withTable("test") {
>     withTempDir { f =>
>       sql("create external table test(id int) partitioned by (name string) stored as " +
>         s"parquet location '${f.getAbsolutePath}'")
>       withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
>         sql("insert overwrite table test partition(name='n1') select 1")
>         sql("ALTER TABLE test DROP PARTITION(name='n1')")
>         sql("insert overwrite table test partition(name='n1') select 2")
>         // Duplicate: the row from the first insert survives the overwrite.
>         checkAnswer(sql("select id from test where name = 'n1' order by id"),
>           Array(Row(1), Row(2)))
>       }
>       withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
>         sql("insert overwrite table test partition(name='n1') select 1")
>         sql("ALTER TABLE test DROP PARTITION(name='n1')")
>         sql("insert overwrite table test partition(name='n1') select 2")
>         checkAnswer(sql("select id from test where name = 'n1' order by id"),
>           Array(Row(2)))
>       }
>     }
>   }
> }
> {code}
[jira] [Comment Edited] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941472#comment-16941472 ] feiwang edited comment on SPARK-29302 at 10/1/19 2:24 AM: -- For dynamic partition overwrite, a determinable path is specified when a task executes. In the reproduction suite above, I create two task attempt contexts with the same task id but different attempt ids, and specify an output dir for the newTaskTempFile method. was (Author: hzfeiwang): For dynamic partition overwrite, when execute a task, a determinable path would be specified. In the reproduce suite above, I create two task attempt context with same task id and different attempt id. And specify a output dir for newTaskTempFile method.
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941472#comment-16941472 ] feiwang commented on SPARK-29302: - For dynamic partition overwrite, a determinable path is specified when a task executes. In the reproduction suite above, I create two task attempt contexts with the same task id but different attempt ids, and specify an output dir for the newTaskTempFile method.
[jira] [Commented] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16941467#comment-16941467 ] feiwang commented on SPARK-29302: - You can add the code below into FileFormatWriterSuite.
{code:java}
import java.util.Date

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.{TaskAttemptContext, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

import org.apache.spark.internal.io.{HadoopMapReduceCommitProtocol, SparkHadoopWriterUtils}

test("SPARK-29302: for dynamic partition overwrite, a task will concurrently write the same" +
  " file as its corresponding speculation task") {
  withTempDir { f =>
    val jobId = SparkHadoopWriterUtils.createJobID(new Date(), 1)
    val taskId = new TaskID(jobId, TaskType.MAP, 1)
    val taskAttemptId0 = new TaskAttemptID(taskId, 0)
    val taskAttemptId1 = new TaskAttemptID(taskId, 1)

    // Build a task attempt context for the given attempt of the same task.
    def newContext(attemptId: TaskAttemptID): TaskAttemptContext = {
      val hadoopConf = new Configuration()
      hadoopConf.set("mapreduce.job.id", jobId.toString)
      hadoopConf.set("mapreduce.task.id", attemptId.getTaskID.toString)
      hadoopConf.set("mapreduce.task.attempt.id", attemptId.toString)
      hadoopConf.setBoolean("mapreduce.task.ismap", true)
      hadoopConf.setInt("mapreduce.task.partition", 0)
      new TaskAttemptContextImpl(hadoopConf, attemptId)
    }

    val taskAttemptContext0 = newContext(taskAttemptId0)
    val taskAttemptContext1 = newContext(taskAttemptId1)

    val committer = new HadoopMapReduceCommitProtocol(jobId.toString, f.getAbsolutePath)
    val tf0 = committer.newTaskTempFile(taskAttemptContext0, Some(f.getAbsolutePath), "ext")
    val tf1 = committer.newTaskTempFile(taskAttemptContext1, Some(f.getAbsolutePath), "ext")
    // Both attempts resolve to the same output path.
    assert(tf0 == tf1)
  }
}
{code}
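The assert above passes because the committer derives the temp-file name only from the split (task) number and the job ID, never from the attempt number. A self-contained sketch of such a naming scheme (simplified and hypothetical, loosely modeled on the committer's filename generation):

```scala
object TaskFileNameSketch {
  // Simplified model of the committer's file naming: the name depends on the
  // split (task) number and the job ID only. Two attempts of the same task
  // therefore map to the same path, which is the collision reported above.
  def newTaskTempFile(jobId: String, split: Int, ext: String): String =
    f"part-$split%05d-$jobId$ext"

  def main(args: Array[String]): Unit = {
    val attempt0 = newTaskTempFile("job_20191001_0001", 1, ".ext")
    val attempt1 = newTaskTempFile("job_20191001_0001", 1, ".ext") // speculative copy
    assert(attempt0 == attempt1)
    println(attempt0) // part-00001-job_20191001_0001.ext
  }
}
```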
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Description: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if speculation is enabled, would a task conflict with its corresponding speculation task? Would the two tasks concurrently write the same file? was: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if speculation is enabled, would a task conflict with its relative speculation task?
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Description: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if speculation is enabled, would a task conflict with its relative speculation task? was: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if we enable
[jira] [Updated] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29302: Description: Now, for a dynamic partition overwrite operation, the filename of a task output is determinable. So, if we enable
[jira] [Created] (SPARK-29302) dynamic partition overwrite with speculation enabled
feiwang created SPARK-29302: --- Summary: dynamic partition overwrite with speculation enabled Key: SPARK-29302 URL: https://issues.apache.org/jira/browse/SPARK-29302 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940754#comment-16940754 ] feiwang commented on SPARK-29295: - Related Hive issue: https://issues.apache.org/jira/browse/HIVE-17063
[jira] [Issue Comment Deleted] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29295: Comment: was deleted (was: I have tested in 2.3.1 branch, without SPARK-25271, it will always give duplicate result. Thanks for SPARK-25271, it enable this statement use data source command if it is convertible.)
[jira] [Comment Edited] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940600#comment-16940600 ] feiwang edited comment on SPARK-29295 at 9/30/19 1:52 AM: -- cc [~cloud_fan] [~viirya] was (Author: hzfeiwang): cc [~cloud_fan]
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940600#comment-16940600 ] feiwang commented on SPARK-29295: - cc [~cloud_fan]
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940599#comment-16940599 ] feiwang commented on SPARK-29295: - I have tested on the 2.3.1 branch: without SPARK-25271, it always gives a duplicate result. Thanks to SPARK-25271, which enables this statement to use the data source command when the table is convertible.
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16940598#comment-16940598 ] feiwang commented on SPARK-29295: - When we set CONVERT_METASTORE_PARQUET=true, it will use InsertIntoHadoopFsRelationCommand to process this statement. When we set CONVERT_METASTORE_PARQUET=false, it will use InsertIntoHiveTable.
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29295:
Description:
When we drop a partition of an external table and then overwrite it with CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives duplicate results.
Here is reproduction code (it can be added to SQLQuerySuite in the hive module):
{code:java}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTable("test") {
    withTempDir { f =>
      sql("create external table test(id int) partitioned by (name string) stored as " +
        s"parquet location '${f.getAbsolutePath}'")
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
      withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
        sql("insert overwrite table test partition(name='n1') select 1")
        sql("ALTER TABLE test DROP PARTITION(name='n1')")
        sql("insert overwrite table test partition(name='n1') select 2")
        checkAnswer(sql("select id from test where name = 'n1' order by id"),
          Array(Row(2)))
      }
    }
  }
}
{code}

was:
When we drop a partition of an external table and then overwrite it with CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives duplicate results.
Here is reproduction code:
{code:java}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTempView("ta", "tb") {
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
      withTempDir { f =>
        sql("create external table ta(id int) partitioned by (name string) stored as " +
          s"parquet location '${f.getAbsolutePath}'")
        sql("insert overwrite table ta partition(name='n1') select 1")
        sql("ALTER TABLE ta DROP PARTITION(name='n1')")
        sql("insert overwrite table ta partition(name='n1') select 2")
        checkAnswer(sql("select id from ta where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
    }
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
      withTempDir { fa =>
        sql("create external table tb(id int) partitioned by (name string) stored as " +
          s"parquet location '${fa.getAbsolutePath}'")
        sql("insert overwrite table tb partition(name='n1') select 1")
        sql("ALTER TABLE tb DROP PARTITION(name='n1')")
        sql("insert overwrite table tb partition(name='n1') select 2")
        checkAnswer(sql("select id from tb where name = 'n1' order by id"),
          Row(2))
      }
    }
  }
}
{code}

> Duplicate result when dropping partition of an external table and then
> overwriting
> --
>
> Key: SPARK-29295
> URL: https://issues.apache.org/jira/browse/SPARK-29295
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Minor
>
> When we drop a partition of an external table and then overwrite it with
> CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten
> as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives
> duplicate results.
> Here is reproduction code (it can be added to SQLQuerySuite in the hive
> module):
> {code:java}
> test("spark gives duplicate result when dropping a partition of an external partitioned table" +
>   " first and then overwriting it") {
>   withTable("test") {
>     withTempDir { f =>
>       sql("create external table test(id int) partitioned by (name string) stored as " +
>         s"parquet location '${f.getAbsolutePath}'")
>       withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
>         sql("insert overwrite table test partition(name='n1') select 1")
>         sql("ALTER TABLE test DROP PARTITION(name='n1')")
>         sql("insert overwrite table test partition(name='n1') select 2")
>         checkAnswer(sql("select id from test where name = 'n1' order by id"),
>           Array(Row(1), Row(2)))
>       }
[jira] [Created] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
feiwang created SPARK-29295:
---
Summary: Duplicate result when dropping partition of an external table and then overwriting
Key: SPARK-29295
URL: https://issues.apache.org/jira/browse/SPARK-29295
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.4
Reporter: feiwang

When we drop a partition of an external table and then overwrite it with CONVERT_METASTORE_PARQUET=true (the default), the partition is overwritten as expected. But with CONVERT_METASTORE_PARQUET=false, the query gives duplicate results.
Here is reproduction code:
{code:java}
test("spark gives duplicate result when dropping a partition of an external partitioned table" +
  " first and then overwriting it") {
  withTempView("ta", "tb") {
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) {
      withTempDir { f =>
        sql("create external table ta(id int) partitioned by (name string) stored as " +
          s"parquet location '${f.getAbsolutePath}'")
        sql("insert overwrite table ta partition(name='n1') select 1")
        sql("ALTER TABLE ta DROP PARTITION(name='n1')")
        sql("insert overwrite table ta partition(name='n1') select 2")
        checkAnswer(sql("select id from ta where name = 'n1' order by id"),
          Array(Row(1), Row(2)))
      }
    }
    withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) {
      withTempDir { fa =>
        sql("create external table tb(id int) partitioned by (name string) stored as " +
          s"parquet location '${fa.getAbsolutePath}'")
        sql("insert overwrite table tb partition(name='n1') select 1")
        sql("ALTER TABLE tb DROP PARTITION(name='n1')")
        sql("insert overwrite table tb partition(name='n1') select 2")
        checkAnswer(sql("select id from tb where name = 'n1' order by id"),
          Row(2))
      }
    }
  }
}
{code}
[jira] [Updated] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29262:
Description:
InsertIntoPartition is a useful function. For SQL statements, the relevant syntax is:
{code:java}
insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
{code}
In the example above, all the partition key values are specified, so it must be a static partition overwrite, regardless of whether dynamic partition overwrite is enabled.
If dynamic partition overwrite is enabled, the SQL below overwrites only the relevant partitions, not the whole table. If it is disabled, the whole table is overwritten.
{code:java}
insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
{code}
As of now, DataFrame does not support overwriting a specific partition. This means that, for a partitioned table, an insert overwrite through a DataFrame with dynamic partition overwrite disabled always overwrites the whole table.
So, we should support insertIntoPartition for DataFrameWriter.

was:
InsertIntoPartition is a useful function. For SQL statements, the relevant syntax is:
{code:java}
insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
{code}
In the example above, all the partition key values are specified, so it must be a static partition overwrite, regardless of whether dynamic partition overwrite is enabled.
If dynamic partition overwrite is enabled, the SQL below overwrites only the relevant partitions, not the whole table.
{code:java}
insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
{code}
As of now, DataFrame does not support insertIntoPartition. This means that, for a partitioned table, an insert overwrite through a DataFrame with dynamic partition overwrite disabled always overwrites the whole table.
So, we should support insertIntoPartition for DataFrameWriter.
> DataFrameWriter insertIntoPartition function
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Minor
>
> InsertIntoPartition is a useful function.
> For SQL statements, the relevant syntax is:
> {code:java}
> insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
> {code}
> In the example above, all the partition key values are specified, so it must
> be a static partition overwrite, regardless of whether dynamic partition
> overwrite is enabled.
> If dynamic partition overwrite is enabled, the SQL below overwrites only the
> relevant partitions, not the whole table. If it is disabled, the whole table
> is overwritten.
> {code:java}
> insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
> {code}
> As of now, DataFrame does not support overwriting a specific partition.
> This means that, for a partitioned table, an insert overwrite through a
> DataFrame with dynamic partition overwrite disabled always overwrites the
> whole table.
> So, we should support insertIntoPartition for DataFrameWriter.
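Until DataFrameWriter gains a partition spec, the behavior contrast described above can be exercised today through the partitionOverwriteMode session config. A minimal sketch, assuming an active SparkSession `spark`, a DataFrame `df` whose columns match `tbl_a`, and that `tbl_a` is partitioned (names hypothetical):

{code:java}
// Dynamic mode: insertInto rewrites only the partitions present in df.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df.write.mode("overwrite").insertInto("tbl_a")

// Static mode (the default): the same call overwrites the whole table,
// which is exactly the limitation this ticket describes.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "static")
df.write.mode("overwrite").insertInto("tbl_a")
{code}

Note that neither mode lets the caller name a specific static partition the way the SQL partition clause does.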
[jira] [Updated] (SPARK-29262) DataFrameWriter insertIntoPartition function
[ https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29262:
Description:
InsertIntoPartition is a useful function. For SQL statements, the relevant syntax is:
{code:java}
insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
{code}
In the example above, all the partition key values are specified, so it must be a static partition overwrite, regardless of whether dynamic partition overwrite is enabled.
If dynamic partition overwrite is enabled, the SQL below overwrites only the relevant partitions, not the whole table.
{code:java}
insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
{code}
As of now, DataFrame does not support insertIntoPartition. This means that, for a partitioned table, an insert overwrite through a DataFrame with dynamic partition overwrite disabled always overwrites the whole table.
So, we should support insertIntoPartition for DataFrameWriter.

was:
Do we have a plan to support an insertIntoPartition function for DataFrameWriter? [~cloud_fan]

> DataFrameWriter insertIntoPartition function
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Minor
>
> InsertIntoPartition is a useful function.
> For SQL statements, the relevant syntax is:
> {code:java}
> insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
> {code}
> In the example above, all the partition key values are specified, so it must
> be a static partition overwrite, regardless of whether dynamic partition
> overwrite is enabled.
> If dynamic partition overwrite is enabled, the SQL below overwrites only the
> relevant partitions, not the whole table.
> {code:java}
> insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
> {code}
> As of now, DataFrame does not support insertIntoPartition.
> It means that, for a partitioned table, an insert overwrite through a
> DataFrame with dynamic partition overwrite disabled always overwrites the
> whole table.
> So, we should support insertIntoPartition for DataFrameWriter.
[jira] [Created] (SPARK-29262) DataFrameWriter insertIntoPartition function
feiwang created SPARK-29262:
---
Summary: DataFrameWriter insertIntoPartition function
Key: SPARK-29262
URL: https://issues.apache.org/jira/browse/SPARK-29262
Project: Spark
Issue Type: New Feature
Components: SQL
Affects Versions: 2.4.4
Reporter: feiwang

Do we have a plan to support an insertIntoPartition function for DataFrameWriter? [~cloud_fan]
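To make the request concrete, one possible shape for such an API (purely hypothetical; no such method exists on DataFrameWriter today) might be:

{code:java}
// Hypothetical sketch only. A partition spec could map each partition column
// to Some(staticValue), or to None when the value should be resolved
// dynamically from the data, mirroring the SQL partition clause.
def insertIntoPartition(
    tableName: String,
    partitionSpec: Map[String, Option[String]]): Unit = ???

// Mirroring: insert overwrite table tbl_a partition(p1=v1, p2) select ...
// df.write.mode("overwrite").insertIntoPartition("tbl_a",
//   Map("p1" -> Some("v1"), "p2" -> None))
{code}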
[jira] [Updated] (SPARK-29113) Some annotation errors in ApplicationCache.scala
[ https://issues.apache.org/jira/browse/SPARK-29113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29113: Issue Type: Documentation (was: Bug) > Some annotation errors in ApplicationCache.scala > > > Key: SPARK-29113 > URL: https://issues.apache.org/jira/browse/SPARK-29113 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29113) Some annotation errors in ApplicationCache.scala
feiwang created SPARK-29113: --- Summary: Some annotation errors in ApplicationCache.scala Key: SPARK-29113 URL: https://issues.apache.org/jira/browse/SPARK-29113 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.4 Reporter: feiwang
[jira] [Comment Edited] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930221#comment-16930221 ] feiwang edited comment on SPARK-28945 at 9/16/19 4:22 AM: -- [~advancedxy] Thanks. Hope SPARK-28945 can be merged soon. It is important for data quality. was (Author: hzfeiwang): [~advancedxy] Thanks. Hope SPARK-28945 can be merged soon. It is critical for data quality. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930221#comment-16930221 ] feiwang commented on SPARK-28945: - [~advancedxy] Thanks. Hope SPARK-28945 can be merged soon. It is critical for data quality. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930010#comment-16930010 ] feiwang edited comment on SPARK-28945 at 9/15/19 5:17 PM: -- [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email(https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) is related with this PR(https://github.com/apache/spark/pull/25795). When dynamicPartitionOverwrite is true, we should skip commitJob. was (Author: hzfeiwang): [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email(https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) can be resolved by this PR: (https://github.com/apache/spark/pull/25795). When dynamicPartitionOverwrite is true, we should skip commitJob. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. 
[jira] [Comment Edited] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930010#comment-16930010 ] feiwang edited comment on SPARK-28945 at 9/15/19 5:12 PM: -- [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email(https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E) can be resolved by this PR: (https://github.com/apache/spark/pull/25795). When dynamicPartitionOverwrite is true, we should skip commitJob. was (Author: hzfeiwang): [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E can be resolved by this PR: https://github.com/apache/spark/pull/25795. When dynamicPartitionOverwrite is true, we should skip commitJob. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. 
[jira] [Commented] (SPARK-28945) Allow concurrent writes to different partitions with dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-28945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930010#comment-16930010 ] feiwang commented on SPARK-28945: - [~cloud_fan] [~advancedxy] Hi, I think the exception shown in the email https://mail-archives.apache.org/mod_mbox/spark-dev/201909.mbox/%3CCANx3uAinvf2LdtKfWUsykCJ%2BkHh6oYy0Pt_5LvcTSURGmQKQwg%40mail.gmail.com%3E can be resolved by this PR: https://github.com/apache/spark/pull/25795. When dynamicPartitionOverwrite is true, we should skip commitJob. > Allow concurrent writes to different partitions with dynamic partition > overwrite > > > Key: SPARK-28945 > URL: https://issues.apache.org/jira/browse/SPARK-28945 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: koert kuipers >Priority: Minor > > It is desirable to run concurrent jobs that write to different partitions > within same baseDir using partitionBy and dynamic partitionOverwriteMode. > See for example here: > https://stackoverflow.com/questions/38964736/multiple-spark-jobs-appending-parquet-data-to-same-base-path-with-partitioning > Or the discussion here: > https://github.com/delta-io/delta/issues/9 > This doesnt seem that difficult. I suspect only changes needed are in > org.apache.spark.internal.io.HadoopMapReduceCommitProtocol, which already has > a flag for dynamicPartitionOverwrite. I got a quick test to work by disabling > all committer activity (committer.setupJob, committer.commitJob, etc.) when > dynamicPartitionOverwrite is true. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29043) [History Server]Only one replay thread of FsHistoryProvider work because of straggler
[ https://issues.apache.org/jira/browse/SPARK-29043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929939#comment-16929939 ] feiwang commented on SPARK-29043:
-
Thanks, [~kabhwan].

> [History Server]Only one replay thread of FsHistoryProvider work because of
> straggler
> -
>
> Key: SPARK-29043
> URL: https://issues.apache.org/jira/browse/SPARK-29043
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.4.4
> Reporter: feiwang
> Priority: Major
> Attachments: image-2019-09-11-15-09-22-912.png,
> image-2019-09-11-15-10-25-326.png, screenshot-1.png
>
> As shown in the attachment, we set spark.history.fs.numReplayThreads=30 for
> the Spark history server.
> However, only one replay thread does any work, because of a straggler.
> Let's check the code:
> https://github.com/apache/spark/blob/7f36cd2aa5e066a807d498b8c51645b136f08a75/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala#L509-L547
> There is a synchronous operation across all replay tasks.
[jira] [Commented] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929916#comment-16929916 ] feiwang commented on SPARK-29037:
-
[~advancedxy] Hi, I found that even with dynamicPartitionOverwrite, Spark may still give duplicate results for the case below, and I have created a pull request: https://github.com/apache/spark/pull/25795
Case: Application appA runs insert overwrite on table_a with static partition overwrite, but it is killed while committing tasks because one task hangs. Part of its committed task output is left under /path/table_a/_temporary/0/.
Then application appB runs insert overwrite on table_a with dynamic partition overwrite. It executes successfully, but it also commits the data under /path/table_a/_temporary/0/ to the destination dir.

> [Core] Spark gives duplicate result when an application was killed and rerun
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.3
> Reporter: feiwang
> Priority: Major
> Attachments: screenshot-1.png
>
> When we insert overwrite a partition of a table:
> For a stage whose tasks commit output, a task first saves its output to a
> staging dir; when the task completes, it saves its output to
> committedTaskPath; when all tasks of the stage succeed, all task output
> under committedTaskPath is moved to the destination dir.
> However, when we kill an application that is committing tasks' output, parts
> of the tasks' results are kept in committedTaskPath and are not cleared
> gracefully.
> Then we rerun this application, and the new application reuses this
> committedTaskPath dir.
> And when the task commit stage of the new application succeeds, all task
> output under this committedTaskPath, which contains part of the old
> application's task output, is moved to the destination dir, and the result
> is duplicated.
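The commit layout involved in the case above can be sketched as follows. Paths are illustrative only, following the usual FileOutputCommitter convention of a shared _temporary/0 job-attempt dir; the truncated segments are placeholders, not real names:

{code:java}
// /path/table_a/_temporary/0/                        job-attempt dir, reused by a rerun
// /path/table_a/_temporary/0/task_.../part-...       output committed by the killed appA
// /path/table_a/_temporary/0/_temporary/attempt_...  in-flight task attempts
//
// commitJob of the rerun (appB) promotes everything under _temporary/0/
// to /path/table_a/, so appA's leftover committed files are promoted too,
// producing the duplicate rows.
{code}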
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/15/19 4:41 AM: -- [~advancedxy] Thanks for your reply. I will learn more about dynamic partition. Thanks for your suggestion. was (Author: hzfeiwang): [~advancedxy] Thanks for your reply. I just checked the code, as shown below. https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106 {code:java} val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite && staticPartitions.size < partitionColumns.length {code} When partitionColumns==1, for the operation of inserting overwrite table partition, dynamicPartitionOverwrite is always false even DynamicOverwrite is enabled. I will learn more about dynamic partition. Thanks for your suggestion. > [Core] Spark gives duplicate result when an application was killed and rerun > > > Key: SPARK-29037 > URL: https://issues.apache.org/jira/browse/SPARK-29037 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.3.3 >Reporter: feiwang >Priority: Major > Attachments: screenshot-1.png > > > When we insert overwrite a partition of table. > For a stage, whose tasks commit output, a task saves output to a staging dir > firstly, when this task complete, it will save output to committedTaskPath, > when all tasks of this stage success, all task output under committedTaskPath > will be moved to destination dir. > However, when we kill an application, which is committing tasks' output, > parts of tasks' results will be kept in committedTaskPath, which would not be > cleared gracefully. > Then we rerun this application and the new application will reuse this > committedTaskPath dir. 
> And when the task commit stage of new application success, all task output > under this committedTaskPath, which contains parts of old application's task > output , would be moved to destination dir and the result is duplicated. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928596#comment-16928596 ] feiwang edited comment on SPARK-29037 at 9/12/19 4:17 PM:
-
[~advancedxy]
1. We re-submit the same application again. We hit this issue on insert overwrite table, so it is not feasible to resolve it on the user side.
Yes, this issue can be resolved on the Hadoop side, but that involves a new release of Hadoop. We can do it on the Spark side.
About an output check, I think it is not appropriate, because when several applications (each inserting overwrite into a partition of the same table) run at the same time, they may use the same committedTaskPath.
So, I think we could implement a Spark FileCommitProtocol referencing the implementation of `InsertIntoHiveTable` (org.apache.spark.sql.hive.execution.InsertIntoHiveTable).
For InsertIntoHiveTable, it first calls saveAsHiveFile (committing all tasks' output) to a hive-staging dir, as shown in the log below.
{code:java}
19/09/12 02:47:46 INFO FileOutputCommitter: Saved output of task 'attempt_20190912024744_0004_m_00_0' to hdfs://hercules-sub/user/b_hive_dba/fwang12_test/test_merge/.hive-staging_hive_2019-09-12_02-47-44_798_6385324183561649436-1/-ext-1/_temporary/0/task_20190912024744_0004_m_00
{code}
Then it loads this output into the Hive table.

was (Author: hzfeiwang):
[~advancedxy]
1. We re-submit the same application again. We hit this issue on insert overwrite table, so it is feasible to resolve it on the user side.
Yes, this issue can be resolved on the Hadoop side, but that involves a new release of Hadoop. We can do it on the Spark side.
About an output check, I think it is not appropriate, because when several applications (each inserting overwrite into a partition of the same table) run at the same time, they may use the same committedTaskPath.
So, I think we could implement a Spark FileCommitProtocol referencing the implementation of `InsertIntoHiveTable` (org.apache.spark.sql.hive.execution.InsertIntoHiveTable).
For InsertIntoHiveTable, it first calls saveAsHiveFile (committing all tasks' output) to a hive-staging dir, as shown in the log below.
{code:java}
19/09/12 02:47:46 INFO FileOutputCommitter: Saved output of task 'attempt_20190912024744_0004_m_00_0' to hdfs://hercules-sub/user/b_hive_dba/fwang12_test/test_merge/.hive-staging_hive_2019-09-12_02-47-44_798_6385324183561649436-1/-ext-1/_temporary/0/task_20190912024744_0004_m_00
{code}
Then it loads this output into the Hive table.

> [Core] Spark gives duplicate result when an application was killed and rerun
>
> Key: SPARK-29037
> URL: https://issues.apache.org/jira/browse/SPARK-29037
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.3
> Reporter: feiwang
> Priority: Major
> Attachments: screenshot-1.png
>
> When we insert overwrite a partition of a table:
> For a stage whose tasks commit output, a task first saves its output to a
> staging dir; when the task completes, it saves its output to
> committedTaskPath; when all tasks of the stage succeed, all task output
> under committedTaskPath is moved to the destination dir.
> However, when we kill an application that is committing tasks' output, parts
> of the tasks' results are kept in committedTaskPath and are not cleared
> gracefully.
> Then we rerun this application, and the new application reuses this
> committedTaskPath dir.
> And when the task commit stage of the new application succeeds, all task
> output under this committedTaskPath, which contains part of the old
> application's task output, is moved to the destination dir, and the result
> is duplicated.
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:58 PM:
--
[~advancedxy] Thanks for your reply. I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, staticPartitions.size equals partitionColumns.length, so dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
I will learn more about dynamic partitions. Thanks for your suggestion.
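The condition quoted from InsertIntoHadoopFsRelationCommand can be checked in isolation. The sketch below re-states the predicate over plain Scala values; the names mirror Spark's, but this is an illustration, not Spark code, and the SaveMode type here is a stand-in for org.apache.spark.sql.SaveMode.

```scala
// Illustration only: the dynamicPartitionOverwrite predicate from
// InsertIntoHadoopFsRelationCommand, re-stated over plain values.
sealed trait SaveMode
case object Overwrite extends SaveMode

def dynamicPartitionOverwrite(enableDynamicOverwrite: Boolean,
                              mode: SaveMode,
                              staticPartitions: Map[String, String],
                              partitionColumns: Seq[String]): Boolean =
  enableDynamicOverwrite && mode == Overwrite &&
    staticPartitions.size < partitionColumns.length

// INSERT OVERWRITE TABLE t PARTITION (dt='2019-09-12'): the single partition
// column is fully static, so the overwrite is not dynamic even when enabled.
val singleStatic = dynamicPartitionOverwrite(
  enableDynamicOverwrite = true,
  mode = Overwrite,
  staticPartitions = Map("dt" -> "2019-09-12"),
  partitionColumns = Seq("dt"))

// INSERT OVERWRITE TABLE t PARTITION (dt='2019-09-12', hr): one of the two
// partition columns is dynamic, so dynamic partition overwrite applies.
val mixed = dynamicPartitionOverwrite(
  enableDynamicOverwrite = true,
  mode = Overwrite,
  staticPartitions = Map("dt" -> "2019-09-12"),
  partitionColumns = Seq("dt", "hr"))
```

This confirms the observation in the comment: with a single, statically specified partition column, staticPartitions.size < partitionColumns.length can never hold.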
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:49 PM:
--
[~advancedxy] Thanks for your reply. I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:48 PM:
--
[~advancedxy] I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
So I think this issue is common whenever a table has only one partition column.
[jira] [Comment Edited] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang edited comment on SPARK-29037 at 9/12/19 3:46 PM:
--
[~advancedxy] I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
{code:java}
val dynamicPartitionOverwrite = enableDynamicOverwrite && mode == SaveMode.Overwrite &&
  staticPartitions.size < partitionColumns.length
{code}
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
[jira] [Commented] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928674#comment-16928674 ] feiwang commented on SPARK-29037:
-
[~advancedxy] I just checked the code, shown below.
https://github.com/apache/spark/blob/c56a012bc839cd2f92c2be41faea91d1acfba4eb/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala#L105-L106
When a table has only one partition column and the target partition is specified statically, dynamicPartitionOverwrite is always false for an insert overwrite of a table partition, even when dynamic overwrite is enabled.
[jira] [Issue Comment Deleted] (SPARK-29037) [Core] Spark gives duplicate result when an application was killed and rerun
[ https://issues.apache.org/jira/browse/SPARK-29037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] feiwang updated SPARK-29037:
Comment: was deleted
(was: In detail, I think we need to change the logic of {color:red}InsertIntoHadoopFsRelationCommand{color} by referencing the implementation of {color:red}InsertIntoHiveTable{color}.)