[jira] [Created] (SPARK-32572) Run all the tests at once, instead of having separate entrypoints.

2020-08-08 Thread Fokko Driesprong (Jira)
Fokko Driesprong created SPARK-32572:


 Summary: Run all the tests at once, instead of having separate 
entrypoints.
 Key: SPARK-32572
 URL: https://issues.apache.org/jira/browse/SPARK-32572
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Fokko Driesprong






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32572) Run all the tests at once, instead of having separate entrypoints.

2020-08-08 Thread Fokko Driesprong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-32572:
-
Description: 
Started with this comment thread: 
https://github.com/apache/spark/pull/29121/files#r456683561

Each test file is currently invoked separately and has its own entry point: 
[https://github.com/apache/spark/blob/master/python/pyspark/ml/tests/test_wrapper.py#L120]

We would replace the subprocess call at 
[https://github.com/apache/spark/blob/master/dev/run-tests.py#L470] with 
something that invokes all the Python tests in a single run.

> Run all the tests at once, instead of having separate entrypoints.
> --
>
> Key: SPARK-32572
> URL: https://issues.apache.org/jira/browse/SPARK-32572
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> Started with this comment thread: 
> https://github.com/apache/spark/pull/29121/files#r456683561
> Each test file is currently invoked separately and has its own entry point: 
> [https://github.com/apache/spark/blob/master/python/pyspark/ml/tests/test_wrapper.py#L120]
> We would replace the subprocess call at 
> [https://github.com/apache/spark/blob/master/dev/run-tests.py#L470] with 
> something that invokes all the Python tests in a single run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32573) Eliminate Anti Join when BuildSide is Empty

2020-08-08 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32573:
---

 Summary: Eliminate Anti Join when BuildSide is Empty
 Key: SPARK-32573
 URL: https://issues.apache.org/jira/browse/SPARK-32573
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
introduced several new types of HashedRelation
 * EmptyHashedRelation
 * EmptyHashedRelationWithAllNullKeys

They were limited to the NAAJ (null-aware anti join) scenario. As an 
improvement, EmptyHashedRelation could also be used in a normal anti join for 
an early stop, and in AQE we can even eliminate the anti join entirely when we 
know that the buildSide is empty.

 

This patch includes two changes:

In non-AQE, use EmptyHashedRelation to stop a common anti join early as well.

In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation 
(its shuffle write record count is 0).
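For illustration only, a deliberately simplified, logical-plan-level sketch of the second idea. The rule name and the use of an empty LocalRelation as the "known empty" signal are made up here; the actual change works on physical plans and, in AQE, on runtime shuffle statistics.

{code:java}
// Hypothetical, simplified rule: if the build (right) side of a LEFT ANTI join
// is known to be empty, no row of the left side can ever find a match, so the
// join can be replaced by its left child regardless of the join condition.
import org.apache.spark.sql.catalyst.plans.LeftAnti
import org.apache.spark.sql.catalyst.plans.logical.{Join, LocalRelation, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object EliminateAntiJoinOnEmptyBuildSide extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    // An empty LocalRelation stands in for "build side known to be empty".
    case Join(left, LocalRelation(_, data, _), LeftAnti, _, _) if data.isEmpty => left
  }
}
{code}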

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32573) Eliminate Anti Join when BuildSide is Empty

2020-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32573:


Assignee: (was: Apache Spark)

> Eliminate Anti Join when BuildSide is Empty
> ---
>
> Key: SPARK-32573
> URL: https://issues.apache.org/jira/browse/SPARK-32573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
> introduced several new types of HashedRelation
>  * EmptyHashedRelation
>  * EmptyHashedRelationWithAllNullKeys
> They were limited to the NAAJ (null-aware anti join) scenario. As an 
> improvement, EmptyHashedRelation could also be used in a normal anti join for 
> an early stop, and in AQE we can even eliminate the anti join entirely when 
> we know that the buildSide is empty.
>  
> This patch includes two changes:
> In non-AQE, use EmptyHashedRelation to stop a common anti join early as well.
> In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation 
> (its shuffle write record count is 0).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32573) Eliminate Anti Join when BuildSide is Empty

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173623#comment-17173623
 ] 

Apache Spark commented on SPARK-32573:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29389

> Eliminate Anti Join when BuildSide is Empty
> ---
>
> Key: SPARK-32573
> URL: https://issues.apache.org/jira/browse/SPARK-32573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
> introduced several new types of HashedRelation
>  * EmptyHashedRelation
>  * EmptyHashedRelationWithAllNullKeys
> They were limited to the NAAJ (null-aware anti join) scenario. As an 
> improvement, EmptyHashedRelation could also be used in a normal anti join for 
> an early stop, and in AQE we can even eliminate the anti join entirely when 
> we know that the buildSide is empty.
>  
> This patch includes two changes:
> In non-AQE, use EmptyHashedRelation to stop a common anti join early as well.
> In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation 
> (its shuffle write record count is 0).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32573) Eliminate Anti Join when BuildSide is Empty

2020-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32573:


Assignee: Apache Spark

> Eliminate Anti Join when BuildSide is Empty
> ---
>
> Key: SPARK-32573
> URL: https://issues.apache.org/jira/browse/SPARK-32573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Minor
>
> In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
> introduced several new types of HashedRelation
>  * EmptyHashedRelation
>  * EmptyHashedRelationWithAllNullKeys
> They were limited to the NAAJ (null-aware anti join) scenario. As an 
> improvement, EmptyHashedRelation could also be used in a normal anti join for 
> an early stop, and in AQE we can even eliminate the anti join entirely when 
> we know that the buildSide is empty.
>  
> This patch includes two changes:
> In non-AQE, use EmptyHashedRelation to stop a common anti join early as well.
> In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation 
> (its shuffle write record count is 0).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32573) Eliminate Anti Join when BuildSide is Empty

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173624#comment-17173624
 ] 

Apache Spark commented on SPARK-32573:
--

User 'leanken' has created a pull request for this issue:
https://github.com/apache/spark/pull/29389

> Eliminate Anti Join when BuildSide is Empty
> ---
>
> Key: SPARK-32573
> URL: https://issues.apache.org/jira/browse/SPARK-32573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
> introduced several new types of HashedRelation
>  * EmptyHashedRelation
>  * EmptyHashedRelationWithAllNullKeys
> They were limited to the NAAJ (null-aware anti join) scenario. As an 
> improvement, EmptyHashedRelation could also be used in a normal anti join for 
> an early stop, and in AQE we can even eliminate the anti join entirely when 
> we know that the buildSide is empty.
>  
> This patch includes two changes:
> In non-AQE, use EmptyHashedRelation to stop a common anti join early as well.
> In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation 
> (its shuffle write record count is 0).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark

2020-08-08 Thread JinxinTang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173633#comment-17173633
 ] 

JinxinTang commented on SPARK-32500:


I have found the root cause: `org.apache.spark.SparkContext#localProperties` 
is thread-local. The `spark.job.description` property is set by the stream 
execution thread, but when we save from Python through the Py4J server, the 
Py4J server thread belongs to the main thread group, so 
`org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart$#apply` 
cannot get the `desc` property from the Spark context.

When we operate from Scala there is no problem, because everything runs on the 
stream execution thread. 
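For illustration, a minimal spark-shell sketch of the thread-local behavior described above, using only the public SparkContext API (not the internal code path); the second thread stands in for the Py4J server thread and prints null:

{code:java}
val sc = spark.sparkContext

// The "stream execution" stand-in sets the description in its own thread.
val setter = new Thread(() => sc.setLocalProperty("spark.job.description", "batch 42"))
setter.start(); setter.join()

// A different thread (standing in for the Py4J server thread) does not see it,
// because SparkContext local properties are kept per thread.
val reader = new Thread(() => println(sc.getLocalProperty("spark.job.description")))
reader.start(); reader.join()   // prints: null
{code}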

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeachBatch in PySpark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-26 at 6.50.39 PM.png, Screen Shot 
> 2020-07-30 at 9.04.21 PM.png, image-2020-08-01-10-21-51-246.png
>
>
> Query Id and Batch Id information is not available for jobs started by 
> structured streaming query when _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in pyspark. ForeachBatch in scala works 
> fine, and also other structured streaming sinks in pyspark work fine. I am 
> attaching a screenshot of jobs pages.
> I think job group is not set properly when _foreachBatch_ is used via 
> pyspark. I have a framework that depends on the _queryId_ and _batchId_ 
> information available in the job properties and so my framework doesn't work 
> for pyspark-foreachBatch use case.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32564) Inject data statistics to simulate plan generation on actual TPCDS data

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173634#comment-17173634
 ] 

Apache Spark commented on SPARK-32564:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29390

>  Inject data statistics to simulate plan generation on actual TPCDS data
> 
>
> Key: SPARK-32564
> URL: https://issues.apache.org/jira/browse/SPARK-32564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then 
> checks if plans can be generated correctly. But, the generated plans can be 
> different from actual ones because the input tables are empty (e.g., the 
> plans always use broadcast-hash joins, but actual ones use sort-merge joins 
> for larger tables). To mitigate the issue, this ticket targets at defining 
> data statistics constants extracted from generated TPCDS data in 
> `TPCDSTableStats`, then injects the statistics via 
> `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in 
> `TPCDSQuerySuite`.
> Please see a link below about how to extract the table statistics:
>  - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
> For example, the generated plans of TPCDS `q2` are different with/without 
> this fix:
> {code:java}
>  w/ this fix: q2 
> == Physical Plan ==
> * Sort (43)
> +- Exchange (42)
>  +- * Project (41)
>  +- * SortMergeJoin Inner (40)
>  :- * Sort (28)
>  : +- Exchange (27)
>  : +- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- * Sort (39)
>  +- Exchange (38)
>  +- * Project (37)
>  +- * BroadcastHashJoin Inner BuildRight (36)
>  :- * HashAggregate (30)
>  : +- ReusedExchange (29)
>  +- BroadcastExchange (35)
>  +- * Project (34)
>  +- * Filter (33)
>  +- * ColumnarToRow (32)
>  +- Scan parquet default.date_dim (31)
>  w/o this fix: q2 
> == Physical Plan ==
> * Sort (40)
> +- Exchange (39)
>  +- * Project (38)
>  +- * BroadcastHashJoin Inner BuildRight (37)
>  :- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- BroadcastExchange (36)
>  +- * Project (35)
>  +- * BroadcastHashJoin Inner BuildRight (34)
>  :- * HashAggregate (28)
>  : +- ReusedExchange (27)
>  +- BroadcastExchange (33)
>  +- * Project (32)
>  +- * Filter (31)
>  +- * ColumnarToRow (30)
>  +- Scan parquet default.date_dim (29)
>  {code}
>  
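The statistics injection mentioned in the description above boils down to calls like the following. A rough spark-shell sketch: the table name comes from the plans above, but the size and row-count numbers are placeholders rather than the real constants in `TPCDSTableStats`.

{code:java}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.CatalogStatistics

// Placeholder numbers; the real values are extracted from generated TPCDS data.
val stats = CatalogStatistics(sizeInBytes = BigInt(1441548L), rowCount = Some(BigInt(719384L)))
spark.sessionState.catalog.alterTableStats(TableIdentifier("web_sales"), Some(stats))
{code}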



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32564) Inject data statistics to simulate plan generation on actual TPCDS data

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173635#comment-17173635
 ] 

Apache Spark commented on SPARK-32564:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29390

>  Inject data statistics to simulate plan generation on actual TPCDS data
> 
>
> Key: SPARK-32564
> URL: https://issues.apache.org/jira/browse/SPARK-32564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then 
> checks if plans can be generated correctly. But, the generated plans can be 
> different from actual ones because the input tables are empty (e.g., the 
> plans always use broadcast-hash joins, but actual ones use sort-merge joins 
> for larger tables). To mitigate the issue, this ticket targets at defining 
> data statistics constants extracted from generated TPCDS data in 
> `TPCDSTableStats`, then injects the statistics via 
> `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in 
> `TPCDSQuerySuite`.
> Please see a link below about how to extract the table statistics:
>  - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
> For example, the generated plans of TPCDS `q2` are different with/without 
> this fix:
> {code:java}
>  w/ this fix: q2 
> == Physical Plan ==
> * Sort (43)
> +- Exchange (42)
>  +- * Project (41)
>  +- * SortMergeJoin Inner (40)
>  :- * Sort (28)
>  : +- Exchange (27)
>  : +- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- * Sort (39)
>  +- Exchange (38)
>  +- * Project (37)
>  +- * BroadcastHashJoin Inner BuildRight (36)
>  :- * HashAggregate (30)
>  : +- ReusedExchange (29)
>  +- BroadcastExchange (35)
>  +- * Project (34)
>  +- * Filter (33)
>  +- * ColumnarToRow (32)
>  +- Scan parquet default.date_dim (31)
>  w/o this fix: q2 
> == Physical Plan ==
> * Sort (40)
> +- Exchange (39)
>  +- * Project (38)
>  +- * BroadcastHashJoin Inner BuildRight (37)
>  :- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- BroadcastExchange (36)
>  +- * Project (35)
>  +- * BroadcastHashJoin Inner BuildRight (34)
>  :- * HashAggregate (28)
>  : +- ReusedExchange (27)
>  +- BroadcastExchange (33)
>  +- * Project (32)
>  +- * Filter (31)
>  +- * ColumnarToRow (30)
>  +- Scan parquet default.date_dim (29)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32564) Inject data statistics to simulate plan generation on actual TPCDS data

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173638#comment-17173638
 ] 

Apache Spark commented on SPARK-32564:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29391

>  Inject data statistics to simulate plan generation on actual TPCDS data
> 
>
> Key: SPARK-32564
> URL: https://issues.apache.org/jira/browse/SPARK-32564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then 
> checks if plans can be generated correctly. But, the generated plans can be 
> different from actual ones because the input tables are empty (e.g., the 
> plans always use broadcast-hash joins, but actual ones use sort-merge joins 
> for larger tables). To mitigate the issue, this ticket targets at defining 
> data statistics constants extracted from generated TPCDS data in 
> `TPCDSTableStats`, then injects the statistics via 
> `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in 
> `TPCDSQuerySuite`.
> Please see a link below about how to extract the table statistics:
>  - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
> For example, the generated plans of TPCDS `q2` are different with/without 
> this fix:
> {code:java}
>  w/ this fix: q2 
> == Physical Plan ==
> * Sort (43)
> +- Exchange (42)
>  +- * Project (41)
>  +- * SortMergeJoin Inner (40)
>  :- * Sort (28)
>  : +- Exchange (27)
>  : +- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- * Sort (39)
>  +- Exchange (38)
>  +- * Project (37)
>  +- * BroadcastHashJoin Inner BuildRight (36)
>  :- * HashAggregate (30)
>  : +- ReusedExchange (29)
>  +- BroadcastExchange (35)
>  +- * Project (34)
>  +- * Filter (33)
>  +- * ColumnarToRow (32)
>  +- Scan parquet default.date_dim (31)
>  w/o this fix: q2 
> == Physical Plan ==
> * Sort (40)
> +- Exchange (39)
>  +- * Project (38)
>  +- * BroadcastHashJoin Inner BuildRight (37)
>  :- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- BroadcastExchange (36)
>  +- * Project (35)
>  +- * BroadcastHashJoin Inner BuildRight (34)
>  :- * HashAggregate (28)
>  : +- ReusedExchange (27)
>  +- BroadcastExchange (33)
>  +- * Project (32)
>  +- * Filter (31)
>  +- * ColumnarToRow (30)
>  +- Scan parquet default.date_dim (29)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32564) Inject data statistics to simulate plan generation on actual TPCDS data

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173639#comment-17173639
 ] 

Apache Spark commented on SPARK-32564:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29391

>  Inject data statistics to simulate plan generation on actual TPCDS data
> 
>
> Key: SPARK-32564
> URL: https://issues.apache.org/jira/browse/SPARK-32564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then 
> checks if plans can be generated correctly. But, the generated plans can be 
> different from actual ones because the input tables are empty (e.g., the 
> plans always use broadcast-hash joins, but actual ones use sort-merge joins 
> for larger tables). To mitigate the issue, this ticket targets at defining 
> data statistics constants extracted from generated TPCDS data in 
> `TPCDSTableStats`, then injects the statistics via 
> `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in 
> `TPCDSQuerySuite`.
> Please see a link below about how to extract the table statistics:
>  - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
> For example, the generated plans of TPCDS `q2` are different with/without 
> this fix:
> {code:java}
>  w/ this fix: q2 
> == Physical Plan ==
> * Sort (43)
> +- Exchange (42)
>  +- * Project (41)
>  +- * SortMergeJoin Inner (40)
>  :- * Sort (28)
>  : +- Exchange (27)
>  : +- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- * Sort (39)
>  +- Exchange (38)
>  +- * Project (37)
>  +- * BroadcastHashJoin Inner BuildRight (36)
>  :- * HashAggregate (30)
>  : +- ReusedExchange (29)
>  +- BroadcastExchange (35)
>  +- * Project (34)
>  +- * Filter (33)
>  +- * ColumnarToRow (32)
>  +- Scan parquet default.date_dim (31)
>  w/o this fix: q2 
> == Physical Plan ==
> * Sort (40)
> +- Exchange (39)
>  +- * Project (38)
>  +- * BroadcastHashJoin Inner BuildRight (37)
>  :- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- BroadcastExchange (36)
>  +- * Project (35)
>  +- * BroadcastHashJoin Inner BuildRight (34)
>  :- * HashAggregate (28)
>  : +- ReusedExchange (27)
>  +- BroadcastExchange (33)
>  +- * Project (32)
>  +- * Filter (31)
>  +- * ColumnarToRow (30)
>  +- Scan parquet default.date_dim (29)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32574) Race condition in FsHistoryProvider listing iteration

2020-08-08 Thread Yan Xiaole (Jira)
Yan Xiaole created SPARK-32574:
--

 Summary: Race condition in FsHistoryProvider listing iteration
 Key: SPARK-32574
 URL: https://issues.apache.org/jira/browse/SPARK-32574
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Yan Xiaole


There seems to be a race condition in FsHistoryProvider when the listing is 
iterated under high concurrency:
{code:java}
val stale = listing.view(classOf[LogInfo])
  .index("lastProcessed")
  .last(newLastScanTime - 1)
  .asScala
  .toList{code}

`toList` iterates the items in the listing; if one of the listing entries has 
been deleted by a cleaner thread in the meantime, checkForLogs throws a 
`java.util.NoSuchElementException` and aborts the current execution.
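One possible way to make the scan tolerant of concurrently deleted entries; a sketch only, not necessarily the approach taken in the linked fix:

{code:java}
import scala.collection.JavaConverters._

// Entries may be removed by the cleaner thread while the KVStore view is being
// iterated; skip this round instead of letting checkForLogs abort.
val stale =
  try {
    listing.view(classOf[LogInfo])
      .index("lastProcessed")
      .last(newLastScanTime - 1)
      .asScala
      .toList
  } catch {
    case _: java.util.NoSuchElementException => Nil
  }
{code}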



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32574) Race condition in FsHistoryProvider listing iteration

2020-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32574:


Assignee: Apache Spark

> Race condition in FsHistoryProvider listing iteration
> -
>
> Key: SPARK-32574
> URL: https://issues.apache.org/jira/browse/SPARK-32574
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yan Xiaole
>Assignee: Apache Spark
>Priority: Major
>
> There seems to be a race condition in FsHistoryProvider when the listing is 
> iterated under high concurrency:
> {code:java}
> val stale = listing.view(classOf[LogInfo])
>   .index("lastProcessed")
>   .last(newLastScanTime - 1)
>   .asScala
>   .toList{code}
> `toList` iterates the items in the listing; if one of the listing entries has 
> been deleted by a cleaner thread in the meantime, checkForLogs throws a 
> `java.util.NoSuchElementException` and aborts the current execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32574) Race condition in FsHistoryProvider listing iteration

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173646#comment-17173646
 ] 

Apache Spark commented on SPARK-32574:
--

User 'yanxiaole' has created a pull request for this issue:
https://github.com/apache/spark/pull/29392

> Race condition in FsHistoryProvider listing iteration
> -
>
> Key: SPARK-32574
> URL: https://issues.apache.org/jira/browse/SPARK-32574
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yan Xiaole
>Priority: Major
>
> There seems to be a race condition in FsHistoryProvider when the listing is 
> iterated under high concurrency:
> {code:java}
> val stale = listing.view(classOf[LogInfo])
>   .index("lastProcessed")
>   .last(newLastScanTime - 1)
>   .asScala
>   .toList{code}
> `toList` iterates the items in the listing; if one of the listing entries has 
> been deleted by a cleaner thread in the meantime, checkForLogs throws a 
> `java.util.NoSuchElementException` and aborts the current execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32574) Race condition in FsHistoryProvider listing iteration

2020-08-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32574:


Assignee: (was: Apache Spark)

> Race condition in FsHistoryProvider listing iteration
> -
>
> Key: SPARK-32574
> URL: https://issues.apache.org/jira/browse/SPARK-32574
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yan Xiaole
>Priority: Major
>
> There seems to be a race condition in FsHistoryProvider when the listing is 
> iterated under high concurrency:
> {code:java}
> val stale = listing.view(classOf[LogInfo])
>   .index("lastProcessed")
>   .last(newLastScanTime - 1)
>   .asScala
>   .toList{code}
> `toList` iterates the items in the listing; if one of the listing entries has 
> been deleted by a cleaner thread in the meantime, checkForLogs throws a 
> `java.util.NoSuchElementException` and aborts the current execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32574) Race condition in FsHistoryProvider listing iteration

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173647#comment-17173647
 ] 

Apache Spark commented on SPARK-32574:
--

User 'yanxiaole' has created a pull request for this issue:
https://github.com/apache/spark/pull/29392

> Race condition in FsHistoryProvider listing iteration
> -
>
> Key: SPARK-32574
> URL: https://issues.apache.org/jira/browse/SPARK-32574
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yan Xiaole
>Priority: Major
>
> There seems to be a race condition in FsHistoryProvider when the listing is 
> iterated under high concurrency:
> {code:java}
> val stale = listing.view(classOf[LogInfo])
>   .index("lastProcessed")
>   .last(newLastScanTime - 1)
>   .asScala
>   .toList{code}
> `toList` iterates the items in the listing; if one of the listing entries has 
> been deleted by a cleaner thread in the meantime, checkForLogs throws a 
> `java.util.NoSuchElementException` and aborts the current execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32319) Disallow the use of unused imports

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32319:
-

Assignee: Fokko Driesprong

> Disallow the use of unused imports
> --
>
> Key: SPARK-32319
> URL: https://issues.apache.org/jira/browse/SPARK-32319
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
>
> We don't want to import stuff that we're not going to use, to reduce the 
> memory pressure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32319) Disallow the use of unused imports

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32319.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29121
[https://github.com/apache/spark/pull/29121]

> Disallow the use of unused imports
> --
>
> Key: SPARK-32319
> URL: https://issues.apache.org/jira/browse/SPARK-32319
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Assignee: Fokko Driesprong
>Priority: Major
> Fix For: 3.1.0
>
>
> We don't want to import stuff that we're not going to use, to reduce the 
> memory pressure.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32564) Inject data statistics to simulate plan generation on actual TPCDS data

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32564:
--
Fix Version/s: 3.0.1

>  Inject data statistics to simulate plan generation on actual TPCDS data
> 
>
> Key: SPARK-32564
> URL: https://issues.apache.org/jira/browse/SPARK-32564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> `TPCDSQuerySuite` currently computes plans with empty TPCDS tables, then 
> checks if plans can be generated correctly. But, the generated plans can be 
> different from actual ones because the input tables are empty (e.g., the 
> plans always use broadcast-hash joins, but actual ones use sort-merge joins 
> for larger tables). To mitigate the issue, this ticket targets at defining 
> data statistics constants extracted from generated TPCDS data in 
> `TPCDSTableStats`, then injects the statistics via 
> `spark.sessionState.catalog.alterTableStats` when defining TPCDS tables in 
> `TPCDSQuerySuite`.
> Please see a link below about how to extract the table statistics:
>  - https://gist.github.com/maropu/f553d32c323ee803d39e2f7fa0b5a8c3
> For example, the generated plans of TPCDS `q2` are different with/without 
> this fix:
> {code:java}
>  w/ this fix: q2 
> == Physical Plan ==
> * Sort (43)
> +- Exchange (42)
>  +- * Project (41)
>  +- * SortMergeJoin Inner (40)
>  :- * Sort (28)
>  : +- Exchange (27)
>  : +- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- * Sort (39)
>  +- Exchange (38)
>  +- * Project (37)
>  +- * BroadcastHashJoin Inner BuildRight (36)
>  :- * HashAggregate (30)
>  : +- ReusedExchange (29)
>  +- BroadcastExchange (35)
>  +- * Project (34)
>  +- * Filter (33)
>  +- * ColumnarToRow (32)
>  +- Scan parquet default.date_dim (31)
>  w/o this fix: q2 
> == Physical Plan ==
> * Sort (40)
> +- Exchange (39)
>  +- * Project (38)
>  +- * BroadcastHashJoin Inner BuildRight (37)
>  :- * Project (26)
>  : +- * BroadcastHashJoin Inner BuildRight (25)
>  : :- * HashAggregate (19)
>  : : +- Exchange (18)
>  : : +- * HashAggregate (17)
>  : : +- * Project (16)
>  : : +- * BroadcastHashJoin Inner BuildRight (15)
>  : : :- Union (9)
>  : : : :- * Project (4)
>  : : : : +- * Filter (3)
>  : : : : +- * ColumnarToRow (2)
>  : : : : +- Scan parquet default.web_sales (1)
>  : : : +- * Project (8)
>  : : : +- * Filter (7)
>  : : : +- * ColumnarToRow (6)
>  : : : +- Scan parquet default.catalog_sales (5)
>  : : +- BroadcastExchange (14)
>  : : +- * Project (13)
>  : : +- * Filter (12)
>  : : +- * ColumnarToRow (11)
>  : : +- Scan parquet default.date_dim (10)
>  : +- BroadcastExchange (24)
>  : +- * Project (23)
>  : +- * Filter (22)
>  : +- * ColumnarToRow (21)
>  : +- Scan parquet default.date_dim (20)
>  +- BroadcastExchange (36)
>  +- * Project (35)
>  +- * BroadcastHashJoin Inner BuildRight (34)
>  :- * HashAggregate (28)
>  : +- ReusedExchange (27)
>  +- BroadcastExchange (33)
>  +- * Project (32)
>  +- * Filter (31)
>  +- * ColumnarToRow (30)
>  +- Scan parquet default.date_dim (29)
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32571) yarnClient.killApplication(appId) is never called

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32571:
--
Priority: Major  (was: Critical)

> yarnClient.killApplication(appId) is never called
> -
>
> Key: SPARK-32571
> URL: https://issues.apache.org/jira/browse/SPARK-32571
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.4.0, 3.0.0
>Reporter: A Tester
>Priority: Major
>
> *Problem Statement:* 
> When an application is submitted with spark-submit in cluster mode on YARN, 
> the Spark application continues to run on the cluster even if spark-submit 
> itself is asked to shut down (Ctrl-C/SIGTERM/etc.).
> While there is code inside org.apache.spark.deploy.yarn.Client.scala that 
> would lead you to believe the Spark application on the cluster will shut 
> down, this code is not currently reachable.
> Example of behavior:
> spark-submit ...
>  or kill -15 
> spark-submit itself dies
> the job can still be found running on the cluster
>  
> *Expectation:*
> When spark-submit is monitoring a YARN app and spark-submit itself is asked 
> to shut down (SIGTERM, HUP, etc.), it should call 
> yarnClient.killApplication(appId) so that the actual Spark application 
> running on the cluster is killed.
>  
>  
> *Proposal*
> There is already a shutdown hook registered which cleans up temp files. 
> Could this be extended to call yarnClient.killApplication? 
> I believe the default behavior should be to request that YARN kill the 
> application; however, I can imagine use cases where you may still want it to 
> keep running. To facilitate these use cases, an option should be provided to 
> skip this hook.
>  
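For illustration, a rough sketch of the proposal, assuming it lives in Client.scala where `yarnClient`, `appId` and `sparkConf` are in scope; the opt-out configuration key is a made-up name, not an existing Spark setting:

{code:java}
import org.apache.spark.util.ShutdownHookManager

// Hypothetical opt-out flag; by default, ask YARN to kill the still-running
// application when spark-submit itself is shut down (SIGTERM, Ctrl-C, ...)
// while it is monitoring the app.
if (sparkConf.getBoolean("spark.yarn.submit.killAppOnShutdown", defaultValue = true)) {
  ShutdownHookManager.addShutdownHook { () =>
    yarnClient.killApplication(appId)
  }
}
{code}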



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32570) Thriftserver LDAP failed

2020-08-08 Thread Rohit Mishra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohit Mishra updated SPARK-32570:
-
Priority: Major  (was: Critical)

> Thriftserver LDAP failed
> 
>
> Key: SPARK-32570
> URL: https://issues.apache.org/jira/browse/SPARK-32570
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Jie Zhang
>Priority: Major
>
> I downloaded spark-2.4.6-bin-hadoop2.7.tgz, added a new file 
> conf/hive-site.xml with the following parameters, ran 
> sbin/start-thriftserver.sh, and then bin/beeline worked and was able to query 
> tables in our hive-metastore. 
> {code:java}
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://hive-metastore-service.company.com:9083</value>
> </property>
> <property>
>   <name>hive.metastore.schema.verification</name>
>   <value>false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:mysql://hive-metastore-db.company.com:3306/hive?createDatabaseIfNotExist=false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionDriverName</name>
>   <value>org.mariadb.jdbc.Driver</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionUserName</name>
>   <value>x</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionPassword</name>
>   <value>x</value>
> </property>
> <property>
>   <name>hive.metastore.connect.retries</name>
>   <value>15</value>
> </property>
> {code}
> In order to enable LDAP, I added these parameters to conf/hive-site.xml, 
> stopped and restarted the thriftserver, and then bin/beeline complained 
> about invalid credentials.
> I know my credentials work because I enabled LDAP on Hive-Server2 and it 
> worked. 
> {code:java}
> <property>
>   <name>hive.server2.authentication</name>
>   <value>LDAP</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.url</name>
>   <value>ldaps://ldap-server.company.com:636</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.baseDN</name>
>   <value>ou=People,dc=company,dc=com</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.userDNPattern</name>
>   <value>cn=%s,ou=People,dc=company,dc=com</value>
> </property>
> {code}
> The error message:
> {code:java}
> 20/08/07 21:05:39 ERROR TSaslTransport: SASL negotiation failure20/08/07 
> 21:05:39 ERROR TSaslTransport: SASL negotiation 
> failurejavax.security.sasl.SaslException: Error validating the login [Caused 
> by javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]] at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>  at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
> at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>  at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]] at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>  at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>  at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>  ... 8 more
> {code}
> Anything else I need to do in order to enable LDAP on Spark Thriftserver? 
> Thanks for your help. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32570) Thriftserver LDAP failed

2020-08-08 Thread Rohit Mishra (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173708#comment-17173708
 ] 

Rohit Mishra commented on SPARK-32570:
--

Thanks, [~maropu]. If I may add one more point: please refrain from marking 
the priority as Critical, as that level is reserved for the committers. 
*Major* is the maximum anyone should use.

> Thriftserver LDAP failed
> 
>
> Key: SPARK-32570
> URL: https://issues.apache.org/jira/browse/SPARK-32570
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.6
>Reporter: Jie Zhang
>Priority: Major
>
> I downloaded spark-2.4.6-bin-hadoop2.7.tgz, added a new file 
> conf/hive-site.xml with the following parameters, ran 
> sbin/start-thriftserver.sh, and then bin/beeline worked and was able to query 
> tables in our hive-metastore. 
> {code:java}
> <property>
>   <name>hive.metastore.uris</name>
>   <value>thrift://hive-metastore-service.company.com:9083</value>
> </property>
> <property>
>   <name>hive.metastore.schema.verification</name>
>   <value>false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionURL</name>
>   <value>jdbc:mysql://hive-metastore-db.company.com:3306/hive?createDatabaseIfNotExist=false</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionDriverName</name>
>   <value>org.mariadb.jdbc.Driver</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionUserName</name>
>   <value>x</value>
> </property>
> <property>
>   <name>javax.jdo.option.ConnectionPassword</name>
>   <value>x</value>
> </property>
> <property>
>   <name>hive.metastore.connect.retries</name>
>   <value>15</value>
> </property>
> {code}
> In order to enable LDAP, I added these parameters to conf/hive-site.xml, 
> stopped and restarted the thriftserver, and then bin/beeline complained 
> about invalid credentials.
> I know my credentials work because I enabled LDAP on Hive-Server2 and it 
> worked. 
> {code:java}
> <property>
>   <name>hive.server2.authentication</name>
>   <value>LDAP</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.url</name>
>   <value>ldaps://ldap-server.company.com:636</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.baseDN</name>
>   <value>ou=People,dc=company,dc=com</value>
> </property>
> <property>
>   <name>hive.server2.authentication.ldap.userDNPattern</name>
>   <value>cn=%s,ou=People,dc=company,dc=com</value>
> </property>
> {code}
> The error message:
> {code:java}
> 20/08/07 21:05:39 ERROR TSaslTransport: SASL negotiation failure20/08/07 
> 21:05:39 ERROR TSaslTransport: SASL negotiation 
> failurejavax.security.sasl.SaslException: Error validating the login [Caused 
> by javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]]] at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:109)
>  at 
> org.apache.thrift.transport.TSaslTransport$SaslParticipant.evaluateChallengeOrResponse(TSaslTransport.java:539)
>  at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:283) 
> at 
> org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41)
>  at 
> org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)
>  at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: 
> javax.security.sasl.AuthenticationException: Error validating LDAP user 
> [Caused by javax.naming.AuthenticationException: [LDAP: error code 49 - 
> Invalid Credentials]] at 
> org.apache.hive.service.auth.LdapAuthenticationProviderImpl.Authenticate(LdapAuthenticationProviderImpl.java:77)
>  at 
> org.apache.hive.service.auth.PlainSaslHelper$PlainServerCallbackHandler.handle(PlainSaslHelper.java:106)
>  at 
> org.apache.hive.service.auth.PlainSaslServer.evaluateResponse(PlainSaslServer.java:102)
>  ... 8 more
> {code}
> Anything else I need to do in order to enable LDAP on Spark Thriftserver? 
> Thanks for your help. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173709#comment-17173709
 ] 

Dongjoon Hyun commented on SPARK-32558:
---

Hi, [~ramks]. Thank you for reporting.

First of all, according to your log, your Hive and ORC tools are too old to read 
the new ORC version. This is a Hive bug; please file an Apache Hive JIRA issue for it.
{code}
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
{code}

Second, an ORC table should be created with `STORED AS ORC`. Your code is wrong 
because it never declares the table as ORC.
{code}
spark.sql("CREATE table df_table2(col1 string,col2 string)")
scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
org.apache.spark.sql.DataFrame = []
scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
org.apache.spark.sql.DataFrame = [col1: string, col2: string]
{code}

In particular, this is wrong because you are writing ORC data for a plain 
(non-ORC) table, `df_table2`.
{code}
scala> 
dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
{code}

BTW, please use the latest version of `orc-tools` like the following to verify 
ORC files.
{code}
$ orc-tools version
ORC 1.6.3
$ orc-tools meta /tmp/o
...
File Version: 0.12 with ORC_517
{code}

Also, you may want to set `spark.sql.hive.convertMetastoreOrc=false`; 
`spark.sql.orc.impl` alone is not enough because Spark converts the metastore 
table automatically.
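Putting the advice above together, a minimal spark-shell sketch, assuming Hive support is enabled and reusing the reporter's table and column names:

{code:java}
// Keep the Hive ORC serde path instead of the automatic native conversion.
spark.sql("SET spark.sql.hive.convertMetastoreOrc=false")

// Declare the table as ORC up front and write through the table,
// rather than saving ORC files for a table created without STORED AS ORC.
spark.sql("CREATE TABLE df_table2 (col1 STRING, col2 STRING) STORED AS ORC")
spark.sql("INSERT INTO df_table2 VALUES ('col1val1', 'col2val1')")
spark.sql("SELECT * FROM df_table2").show()
{code}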

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell .
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:

[jira] [Resolved] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32558.
---
Resolution: Invalid

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell.
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code} 
> Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails with the same exception to fetch the metadata even 

[jira] [Comment Edited] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173710#comment-17173710
 ] 

Dongjoon Hyun edited comment on SPARK-32558 at 8/8/20, 9:09 PM:


[~ramks]. Please see HIVE-16683. You are hitting HIVE-16683 which is fixed at 
Hive 2.2.0


was (Author: dongjoon):
[~ramks]. Please see HIVE-16683. You are hitting HIVE-16683.

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell.
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code} 
> Step 4) Copy the ORC files cre

[jira] [Commented] (SPARK-32558) ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 (work-around of using spark.sql.orc.impl=hive is also not working)

2020-08-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173710#comment-17173710
 ] 

Dongjoon Hyun commented on SPARK-32558:
---

[~ramks]. Please see HIVE-16683. You are hitting HIVE-16683.

> ORC target files that Spark_3.0 produces does not work with Hive_2.1.1 
> (work-around of using spark.sql.orc.impl=hive is also not working)
> -
>
> Key: SPARK-32558
> URL: https://issues.apache.org/jira/browse/SPARK-32558
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0 and Hadoop cluster having Hive_2.1.1 version. 
> (Linux Redhat)
>Reporter: Ramakrishna Prasad K S
>Priority: Major
>
> Steps to reproduce the issue:
> --- 
> Download Spark_3.0 from [https://spark.apache.org/downloads.html]
>  
> Step 1) Create ORC File by using the default Spark_3.0 Native API from spark 
> shell.
> {code}
> [linuxuser1@irlrhellinux1 bin]$ ./spark-shell
> Welcome to Spark version 3.0.0
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_191)
> Type in expressions to have them evaluated. Type :help for more information.
>  scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+------+
> |               key| value|
> +------------------+------+
> |spark.sql.orc.impl|native|
> +------------------+------+
>  
> scala> spark.sql("CREATE table df_table(col1 string,col2 string)") res1: 
> org.apache.spark.sql.DataFrame = []
> scala> spark.sql("insert into df_table values('col1val1','col2val1')")
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame = spark.sql("select * from df_table") dFrame: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> dFrame.show()
> +--------+--------+
> |    col1|    col2|
> +--------+--------+
> |col1val1|col2val1|
> +--------+--------+
> scala> 
> dFrame.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table")
> {code}
>  
> Step 2) Copy the ORC files created in Step(1) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the ORC files. As you see below, it 
> fails to fetch the metadata from the ORC file.
> {code}
> adpqa@irlhadoop1 bug]$ hive --orcfiledump 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> Processing data file 
> /tmp/df_table/part-0-6ce5f13f-a33a-4bc0-b82b-3a89c27a5ddd-c000.snappy.orc 
> [length: 414]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 7
> at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
> at org.apache.orc.impl.OrcTail.getWriterVersion(OrcTail.java:74)
> at org.apache.orc.impl.ReaderImpl.(ReaderImpl.java:385)
> at org.apache.orc.OrcFile.createReader(OrcFile.java:222)
> at org.apache.orc.tools.FileDump.getReader(FileDump.java:255)
> at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:328)
> at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:307)
> at org.apache.orc.tools.FileDump.main(FileDump.java:154)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:313)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:227)
> {code}
> Step 3) Now Create ORC File using the Hive API (as suggested by Spark in 
> [https://spark.apache.org/docs/latest/sql-migration-guide.html] by setting 
> spark.sql.orc.impl as hive)
> {code}
> scala> spark.sql("set spark.sql.orc.impl=hive")
> res6: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> spark.sql("set spark.sql.orc.impl").show()
> +------------------+-----+
> |               key|value|
> +------------------+-----+
> |spark.sql.orc.impl| hive|
> +------------------+-----+
> scala> spark.sql("CREATE table df_table2(col1 string,col2 string)")
> scala> spark.sql("insert into df_table2 values('col1val1','col2val1')") res8: 
> org.apache.spark.sql.DataFrame = []
> scala> val dFrame2 = spark.sql("select * from df_table2") dFrame2: 
> org.apache.spark.sql.DataFrame = [col1: string, col2: string]
> scala> 
> dFrame2.toDF().write.format("orc").save("/export/home/linuxuser1/spark-3.0.0-bin-hadoop2.7/tgt/df_table2")
> {code} 
> Step 4) Copy the ORC files created in Step(3) to HDFS /tmp on a Hadoop 
> cluster (which has Hive_2.1.1, for example CDH_6.x) and run the following 
> command to analyze or read metadata from the

[jira] [Assigned] (SPARK-32555) Add unique ID on QueryExecution to enable listeners to deduplicate

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-32555:
-

Assignee: Jungtaek Lim

> Add unique ID on QueryExecution to enable listeners to deduplicate 
> ---
>
> Key: SPARK-32555
> URL: https://issues.apache.org/jira/browse/SPARK-32555
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
>
> Somehow there's a chance that Spark calls QueryExecutionListener multiple times on
> the same QueryExecution instance (even with the same funcName for onSuccess).
> There's no unique ID on QueryExecution, hence it's a bit tricky if the
> listener would like to deal with the same query execution only once.
> This issue tracks the effort on adding unique ID on QueryExecution, so that 
> listener can leverage the ID to deduplicate callbacks.
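As a sketch of how a listener could use such an ID once it exists (the field name `qe.id` and the helper below are assumptions for illustration, not a confirmed API):

{code:scala}
import scala.collection.concurrent.TrieMap
import org.apache.spark.sql.execution.QueryExecution

// Runs the handler at most once per QueryExecution, keyed by the assumed
// unique identifier qe.id; duplicate callbacks for the same execution are ignored.
object OnceOnly {
  private val seen = TrieMap.empty[Long, Unit]

  def apply(qe: QueryExecution)(handle: QueryExecution => Unit): Unit = {
    if (seen.putIfAbsent(qe.id, ()).isEmpty) {
      handle(qe)
    }
  }
}
{code}

Inside a QueryExecutionListener, onSuccess/onFailure could then wrap their work in OnceOnly(qe) { ... }.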



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32555) Add unique ID on QueryExecution to enable listeners to deduplicate

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32555.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29372
[https://github.com/apache/spark/pull/29372]

> Add unique ID on QueryExecution to enable listeners to deduplicate 
> ---
>
> Key: SPARK-32555
> URL: https://issues.apache.org/jira/browse/SPARK-32555
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 3.1.0
>
>
> Somehow there's a chance that Spark calls QueryExecutionListener multiple times on
> the same QueryExecution instance (even with the same funcName for onSuccess).
> There's no unique ID on QueryExecution, hence it's a bit tricky if the
> listener would like to deal with the same query execution only once.
> This issue tracks the effort on adding unique ID on QueryExecution, so that 
> listener can leverage the ID to deduplicate callbacks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32432) Add support for reading ORC/Parquet files with SymlinkTextInputFormat

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32432:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add support for reading ORC/Parquet files with SymlinkTextInputFormat
> -
>
> Key: SPARK-32432
> URL: https://issues.apache.org/jira/browse/SPARK-32432
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Noritaka Sekiyama
>Priority: Major
>
> Hive style symlink (SymlinkTextInputFormat) is commonly used in different 
> analytic engines including prestodb and prestosql.
> Currently SymlinkTextInputFormat works with JSON/CSV files but does not work 
> with ORC/Parquet files in Apache Spark (and Apache Hive).
> On the other hand, prestodb and prestosql support SymlinkTextInputFormat with 
> ORC/Parquet files.
> This issue is to add support for reading ORC/Parquet files with 
> SymlinkTextInputFormat in Apache Spark.
>  
> Related links
>  * Hive's SymlinkTextInputFormat: 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/SymlinkTextInputFormat.java]
>  * prestosql's implementation to add support for reading avro files with 
> SymlinkTextInputFormat: 
> [https://github.com/vincentpoon/prestosql/blob/master/presto-hive/src/main/java/io/prestosql/plugin/hive/BackgroundHiveSplitLoader.java]
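For context, a symlink table of this kind is usually declared with the Hive DDL below; its LOCATION holds manifest text files whose lines are fully-qualified paths to the real data files. The table name, paths, and Parquet serde here are illustrative assumptions, and the DDL assumes a Hive-enabled SparkSession:

{code:scala}
spark.sql("""
  CREATE TABLE symlinked_parquet (id BIGINT, name STRING)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
  STORED AS
    INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION '/warehouse/symlinked_parquet_manifests'
""")

// Works today when the manifests point at JSON/CSV text files; this ticket asks
// for the same query to work when they point at ORC/Parquet files.
spark.sql("SELECT * FROM symlinked_parquet").show()
{code}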



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31703:
--
Priority: Blocker  (was: Critical)

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Blocker
>  Labels: BigEndian, correctness
> Attachments: Data_problem_Spark.gif
>
>
> Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) 
> so as to be able to read data stored in parquet format, we notice that values 
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According to the Parquet documentation, values are always stored
> using a little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}
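A minimal JVM-level sketch of why the declared byte order matters when decoding these plain-encoded values (it is not Spark's or Parquet's actual reader code):

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

// Parquet PLAIN encoding stores a DOUBLE as 8 little-endian bytes.
val leBytes = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN).putDouble(1.5).array()

// Reading without forcing the order (java.nio defaults to big endian)
// reinterprets the bytes and yields a garbage value...
val wrong = ByteBuffer.wrap(leBytes).getDouble

// ...while honoring the format's declared little-endian order recovers 1.5.
// A reader that relies on the platform's native order rather than the format's
// declared order therefore breaks on big-endian systems such as AIX or Linux on POWER.
val right = ByteBuffer.wrap(leBytes).order(ByteOrder.LITTLE_ENDIAN).getDouble
{code}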



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31703:
--
Target Version/s: 2.4.7, 3.0.1

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Blocker
>  Labels: BigEndian, correctness
> Attachments: Data_problem_Spark.gif
>
>
> Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) 
> so as to be able to read data stored in parquet format, we notice that values 
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According to the Parquet documentation, values are always stored
> using a little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-08-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173721#comment-17173721
 ] 

Dongjoon Hyun commented on SPARK-31703:
---

I raised the priority to `Blocker` with `Target Version` 2.4.7 and 3.0.1.

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Blocker
>  Labels: BigEndian, correctness
> Attachments: Data_problem_Spark.gif
>
>
> Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) 
> so as to be able to read data stored in parquet format, we notice that values 
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.
> According to the Parquet documentation, values are always stored
> using a little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters correctly

2020-08-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173723#comment-17173723
 ] 

Apache Spark commented on SPARK-32559:
--

User 'WangGuangxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/29393

> Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters
> correctly
> ---
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> The trim logic in the Cast expression introduced in
> [https://github.com/apache/spark/pull/26622] trims Chinese characters
> unexpectedly.
> For example, the SQL select cast("1中文" as float) gives 1 instead of null.
>  
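A small sketch of the distinction the trim step is expected to make (not the UTF8String implementation itself): only whitespace should be stripped before numeric parsing, so a CJK character must make the cast return null rather than be trimmed away.

{code:scala}
// Whitespace is safe to strip before parsing a number...
Character.isWhitespace(' ')   // true
// ...but '中' is not whitespace, so "1中文" is not a valid number and
// cast("1中文" as float) should yield null instead of 1.0.
Character.isWhitespace('中')  // false
{code}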



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-08 Thread yx91490 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17172889#comment-17172889
 ] 

yx91490 edited comment on SPARK-32536 at 8/9/20, 2:55 AM:
--

the method org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace() is
in standalone-metastore-1.21.2.3.1.4.0-315-hive3.jar; the source code is at
[Hive.deleteOldPathForReplace()|https://github.com/hortonworks/hive-release/blob/HDP-3.1.4.0-315-tag/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L4647]


was (Author: yx91490):
the method org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace() is 
in standalone-metastore-1.21.2.3.1.4.0-315-hive3.jar, but I cannot find the
source code.

btw, I will try to reproduce it this weekend:)

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when executing an insert overwrite table statement into a dynamic partition:
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> it seems that Spark doesn't check whether the partition's HDFS location exists
> before deleting it,
> while Hive can successfully execute the same SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32536) deleted not existing hdfs locations when use spark sql to execute "insert overwrite" statement to dynamic partition

2020-08-08 Thread yx91490 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173728#comment-17173728
 ] 

yx91490 commented on SPARK-32536:
-

I found that I mistook the issue condition. I reproduced it by deleting the HDFS
partition directory of the target table without dropping the partition, and then
I got the error.

It seems that Hive.cleanUpOneDirectoryForReplace() does not check whether the path
exists before calling fs.listStatus().

Should this be seen as a bug or as a user mistake?
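A hypothetical end-to-end reproduction based on the comment above (table names and paths follow the ticket; the exact sequence is an assumption):

{code:scala}
// 1. A first insert overwrite creates partition dt=2001 under the target table.
spark.sql("set hive.exec.dynamic.partition=true")
spark.sql("set hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("insert overwrite table tmp.id_name2 partition(dt) " +
  "select * from tmp.id_name where dt='2001'")

// 2. Remove the partition directory directly on HDFS, without dropping the
//    partition from the metastore, e.g.:
//    hdfs dfs -rm -r /user/hive/warehouse/tmp.db/id_name2/dt=2001

// 3. Re-running the same statement now fails: Hive's
//    cleanUpOneDirectoryForReplace() calls fs.listStatus() on the missing
//    directory and surfaces the FileNotFoundException instead of skipping it.
spark.sql("insert overwrite table tmp.id_name2 partition(dt) " +
  "select * from tmp.id_name where dt='2001'")
{code}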

> deleted not existing hdfs locations when use spark sql to execute "insert 
> overwrite" statement to dynamic partition
> ---
>
> Key: SPARK-32536
> URL: https://issues.apache.org/jira/browse/SPARK-32536
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32536.full.log
>
>
> when executing an insert overwrite table statement into a dynamic partition:
>  
> {code:java}
> set hive.exec.dynamic.partition=true;
> set hive.exec.dynamic.partition.mode=nostrict;
> insert overwrite table tmp.id_name2 partition(dt) select * from tmp.id_name 
> where dt='2001';
> {code}
> output log:
> {code:java}
> 20/08/05 14:38:05 ERROR Hive: Exception when loading partition with 
> parameters  
> partPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1/dt=2001,
>   table=id_name2,  partSpec={dt=2001},  loadFileType=REPLACE_ALL,  
> listBucketingLevel=0,  isAcid=false,  resetStatistics=false
> org.apache.hadoop.hive.ql.metadata.HiveException: Directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 could not be 
> cleaned up.
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4666)
> at org.apache.hadoop.hive.ql.metadata.Hive.replaceFiles(Hive.java:4597)
> at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:2132)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2588)
> at org.apache.hadoop.hive.ql.metadata.Hive$5.call(Hive.java:2579)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: File 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/dt=2001 does not exist.
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:1053)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.access$1000(DistributedFileSystem.java:131)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1113)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem$24.doCall(DistributedFileSystem.java:1110)
> at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:1120)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1868)
> at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1910)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.cleanUpOneDirectoryForReplace(Hive.java:4681)
> at 
> org.apache.hadoop.hive.ql.metadata.Hive.deleteOldPathForReplace(Hive.java:4661)
> ... 8 more
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Exception 
> when loading 1 in table id_name2 with 
> loadPath=hdfs://nameservice/user/hive/warehouse/tmp.db/id_name2/.hive-staging_hive_2020-08-05_14-38-00_715_3629476922121193803-1/-ext-1;
> {code}
> it seems that Spark doesn't check whether the partition's HDFS location exists
> before deleting it,
> while Hive can successfully execute the same SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32502) Please fix CVE related to Guava 14.0.1

2020-08-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173744#comment-17173744
 ] 

L. C. Hsieh commented on SPARK-32502:
-

Currently I'm working on some changes on the Hive side, including shading Guava and
upgrading Guava to 27. Once we have progress on the Hive side, we can then upgrade
the Guava version in Spark.

> Please fix CVE related to Guava 14.0.1
> --
>
> Key: SPARK-32502
> URL: https://issues.apache.org/jira/browse/SPARK-32502
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Rodney Aaron Stainback
>Priority: Major
>
> Please fix the following CVE related to Guava 14.0.1
> |cve|severity|cvss|
> |CVE-2018-10237|medium|5.9|
>  
> Our security team is trying to block us from using spark because of this issue
>  
> One thing that's very weird is I see from this [pom
> file|https://github.com/apache/spark/blob/v3.0.0/common/network-common/pom.xml]
> you reference guava but it's not clear what version.
> 
> But if I look on
> [maven|https://mvnrepository.com/artifact/org.apache.spark/spark-network-common_2.12/3.0.0]
> the guava reference is not showing up
> 
> Is this reference somehow being shaded into the network common jar?  It's not 
> clear to me.
> 
> Also, I've noticed code like [this
> file|https://github.com/apache/spark/blob/v3.0.0/common/network-common/src/main/java/org/apache/spark/network/util/LimitedInputStream.java]
> which is a copy-paste of some guava source code.
>  
> The CVE scanner we use Twistlock/Palo Alto Networks - Prisma Cloud Compute 
> Edition is very thorough and will find CVEs in copy-pasted code and shaded 
> jars.
>  
> Please fix this CVE so we can use spark



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32462) Don't save the previous search text for datatable

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-32462.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29265
[https://github.com/apache/spark/pull/29265]

> Don't save the previous search text for datatable
> -
>
> Key: SPARK-32462
> URL: https://issues.apache.org/jira/browse/SPARK-32462
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> DataTable is used in stage-page and executors-page for pagination and filter 
> tasks/executors by search text.
> In the current implementation, search text is saved so if we visit stage-page 
> for a job, the previous search text is filled in the textbox and the task 
> table is filtered.
> I'm sometimes surprised by this behavior as the stage-page lists no tasks 
> because tasks are filtered by the previous search text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32462) Reset previous search text for datatable

2020-08-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32462:
--
Summary: Reset previous search text for datatable  (was: Don't save the 
previous search text for datatable)

> Reset previous search text for datatable
> 
>
> Key: SPARK-32462
> URL: https://issues.apache.org/jira/browse/SPARK-32462
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.1.0
>
>
> DataTable is used in stage-page and executors-page for pagination and filter 
> tasks/executors by search text.
> In the current implementation, search text is saved so if we visit stage-page 
> for a job, the previous search text is filled in the textbox and the task 
> table is filtered.
> I'm sometimes surprised by this behavior as the stage-page lists no tasks 
> because tasks are filtered by the previous search text.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32563) spark-sql doesn't support insert into mixed static & dynamic partition

2020-08-08 Thread yx91490 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173746#comment-17173746
 ] 

yx91490 commented on SPARK-32563:
-

I tried in apache-spark-3.0 and cannot reproduce it.

> spark-sql doesn't support insert into mixed static & dynamic partition 
> ---
>
> Key: SPARK-32563
> URL: https://issues.apache.org/jira/browse/SPARK-32563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32563.log
>
>
> spark-sql doesn't support insert into mixed static & dynamic partition, for 
> example:
> source table :
> {code:java}
> CREATE TABLE `id_name`(`id` int, `name` string)
> PARTITIONED BY (`dt` string)
> {code}
> dest table:
> {code:java}
> CREATE TABLE `id_name_dt1_dt2`(`id` int, `name` string)
> PARTITIONED BY (`dt1` string, `dt2` string)
> {code}
> insert sql:
> {code:java}
> insert into table tmp.id_name_dt1_dt2 partition(dt1='beijing',dt2) select * 
> from tmp.id_name;
> {code}
> result:
> the data is not inserted and the destination table partition is not added,
> and there are two warnings:
> {code:java}
> 20/08/07 14:32:28 WARN warehouse: Cannot create partition spec from 
> hdfs://nameservice/; missing keys [dt1]
> 20/08/07 14:32:28 WARN FileOperations: Ignoring invalid DP directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name_dt1_dt2/.hive-staging_hive_2020-08-07_14-32-02_538_7897451753303149223-1/-ext-1/dt2=2002
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32563) spark-sql doesn't support insert into mixed static & dynamic partition

2020-08-08 Thread yx91490 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173746#comment-17173746
 ] 

yx91490 edited comment on SPARK-32563 at 8/9/20, 5:17 AM:
--

I tried in apache-spark-3.0.0 and cannot reproduce it.


was (Author: yx91490):
I tried in apache-spark-3.0 and cannot reproduce it.

> spark-sql doesn't support insert into mixed static & dynamic partition 
> ---
>
> Key: SPARK-32563
> URL: https://issues.apache.org/jira/browse/SPARK-32563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32563.log
>
>
> spark-sql doesn't support insert into mixed static & dynamic partition, for 
> example:
> source table :
> {code:java}
> CREATE TABLE `id_name`(`id` int, `name` string)
> PARTITIONED BY (`dt` string)
> {code}
> dest table:
> {code:java}
> CREATE TABLE `id_name_dt1_dt2`(`id` int, `name` string)
> PARTITIONED BY (`dt1` string, `dt2` string)
> {code}
> insert sql:
> {code:java}
> insert into table tmp.id_name_dt1_dt2 partition(dt1='beijing',dt2) select * 
> from tmp.id_name;
> {code}
> result:
> the data is not inserted and the destination table partition is not added,
> and there are two warnings:
> {code:java}
> 20/08/07 14:32:28 WARN warehouse: Cannot create partition spec from 
> hdfs://nameservice/; missing keys [dt1]
> 20/08/07 14:32:28 WARN FileOperations: Ignoring invalid DP directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name_dt1_dt2/.hive-staging_hive_2020-08-07_14-32-02_538_7897451753303149223-1/-ext-1/dt2=2002
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32571) yarnClient.killApplication(appId) is never called

2020-08-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173750#comment-17173750
 ] 

L. C. Hsieh commented on SPARK-32571:
-

I think that, by design, the Spark application in cluster mode is supposed to continue
to run even if the spark-submit process is killed. This is also the most common use
case for cluster mode. If you want to be able to stop the Spark application,
client mode provides that control.



> yarnClient.killApplication(appId) is never called
> -
>
> Key: SPARK-32571
> URL: https://issues.apache.org/jira/browse/SPARK-32571
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, YARN
>Affects Versions: 2.4.0, 3.0.0
>Reporter: A Tester
>Priority: Major
>
> *Problem Statement:* 
> When an application is submitted using spark-submit in cluster mode using 
> yarn, the spark application continues to run on the cluster, even if 
> spark-submit itself has been requested to shutdown (Ctrl-C/SIGTERM/etc.)
> While there is code inside org.apache.spark.deploy.yarn.Client.scala that 
> would lead you to believe the spark application on the cluster will shut 
> down, this code is not currently reachable.
> Example of behavior:
> spark-submit ...
> press Ctrl-C or kill -15 the spark-submit process
> spark-submit itself dies
> the job can still be found running on the cluster
>  
> *Expectation:*
> When spark-submit is monitoring a YARN app and spark-submit itself is
> requested to shut down (SIGTERM, HUP, etc.), it should call
> yarnClient.killApplication(appId) so that the actual Spark application
> running on the cluster is killed.
>  
>  
> *Proposal*
> There is already a shutdown hook registered which cleans up temp files.
> Could this be extended to call yarnClient.killApplication?
> I believe the default behavior should be to request YARN to kill the
> application; however, I can imagine use cases where you may still want it to
> keep running. To facilitate those use cases, an option should be provided to skip
> this hook.
>  
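A minimal sketch of the proposal, not the actual Client.scala code; it assumes a YarnClient and ApplicationId are already in scope, and the opt-out flag name is hypothetical:

{code:scala}
import org.apache.hadoop.yarn.api.records.ApplicationId
import org.apache.hadoop.yarn.client.api.YarnClient

// Registers a JVM shutdown hook so that killing spark-submit (SIGINT/SIGTERM)
// also asks YARN to kill the running application, unless the caller opts out.
def registerKillOnShutdown(yarnClient: YarnClient, appId: ApplicationId,
                           skipKillOnShutdown: Boolean): Unit = {
  if (!skipKillOnShutdown) {
    sys.addShutdownHook {
      yarnClient.killApplication(appId)
    }
  }
}
{code}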



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32563) spark-sql doesn't support insert into mixed static & dynamic partition

2020-08-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173751#comment-17173751
 ] 

L. C. Hsieh commented on SPARK-32563:
-

So it is not an issue anymore in 2.4, 3.0 branches, right? I'm not sure if we 
still plan to update 2.3 branch.

> spark-sql doesn't support insert into mixed static & dynamic partition 
> ---
>
> Key: SPARK-32563
> URL: https://issues.apache.org/jira/browse/SPARK-32563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32563.log
>
>
> spark-sql doesn't support insert into mixed static & dynamic partition, for 
> example:
> source table :
> {code:java}
> CREATE TABLE `id_name`(`id` int, `name` string)
> PARTITIONED BY (`dt` string)
> {code}
> dest table:
> {code:java}
> CREATE TABLE `id_name_dt1_dt2`(`id` int, `name` string)
> PARTITIONED BY (`dt1` string, `dt2` string)
> {code}
> insert sql:
> {code:java}
> insert into table tmp.id_name_dt1_dt2 partition(dt1='beijing',dt2) select * 
> from tmp.id_name;
> {code}
> result:
> the data is not inserted and the destination table partition is not added,
> and there are two warnings:
> {code:java}
> 20/08/07 14:32:28 WARN warehouse: Cannot create partition spec from 
> hdfs://nameservice/; missing keys [dt1]
> 20/08/07 14:32:28 WARN FileOperations: Ignoring invalid DP directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name_dt1_dt2/.hive-staging_hive_2020-08-07_14-32-02_538_7897451753303149223-1/-ext-1/dt2=2002
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32563) spark-sql doesn't support insert into mixed static & dynamic partition

2020-08-08 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173752#comment-17173752
 ] 

Takeshi Yamamuro commented on SPARK-32563:
--

Yea, I think so. Then I will close this. Thanks!

> spark-sql doesn't support insert into mixed static & dynamic partition 
> ---
>
> Key: SPARK-32563
> URL: https://issues.apache.org/jira/browse/SPARK-32563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32563.log
>
>
> spark-sql doesn't support insert into mixed static & dynamic partition, for 
> example:
> source table :
> {code:java}
> CREATE TABLE `id_name`(`id` int, `name` string)
> PARTITIONED BY (`dt` string)
> {code}
> dest table:
> {code:java}
> CREATE TABLE `id_name_dt1_dt2`(`id` int, `name` string)
> PARTITIONED BY (`dt1` string, `dt2` string)
> {code}
> insert sql:
> {code:java}
> insert into table tmp.id_name_dt1_dt2 partition(dt1='beijing',dt2) select * 
> from tmp.id_name;
> {code}
> result:
> the data is not inserted and the destination table partition is not added,
> and there are two warnings:
> {code:java}
> 20/08/07 14:32:28 WARN warehouse: Cannot create partition spec from 
> hdfs://nameservice/; missing keys [dt1]
> 20/08/07 14:32:28 WARN FileOperations: Ignoring invalid DP directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name_dt1_dt2/.hive-staging_hive_2020-08-07_14-32-02_538_7897451753303149223-1/-ext-1/dt2=2002
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32563) spark-sql doesn't support insert into mixed static & dynamic partition

2020-08-08 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32563.
--
Resolution: Not A Problem

> spark-sql doesn't support insert into mixed static & dynamic partition 
> ---
>
> Key: SPARK-32563
> URL: https://issues.apache.org/jira/browse/SPARK-32563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
> Environment: HDP version 2.3.2.3.1.4.0-315
>Reporter: yx91490
>Priority: Major
> Attachments: SPARK-32563.log
>
>
> spark-sql doesn't support insert into mixed static & dynamic partition, for 
> example:
> source table :
> {code:java}
> CREATE TABLE `id_name`(`id` int, `name` string)
> PARTITIONED BY (`dt` string)
> {code}
> dest table:
> {code:java}
> CREATE TABLE `id_name_dt1_dt2`(`id` int, `name` string)
> PARTITIONED BY (`dt1` string, `dt2` string)
> {code}
> insert sql:
> {code:java}
> insert into table tmp.id_name_dt1_dt2 partition(dt1='beijing',dt2) select * 
> from tmp.id_name;
> {code}
> result:
> the data is not inserted and the destination table partition is not added,
> and there are two warnings:
> {code:java}
> 20/08/07 14:32:28 WARN warehouse: Cannot create partition spec from 
> hdfs://nameservice/; missing keys [dt1]
> 20/08/07 14:32:28 WARN FileOperations: Ignoring invalid DP directory 
> hdfs://nameservice/user/hive/warehouse/tmp.db/id_name_dt1_dt2/.hive-staging_hive_2020-08-07_14-32-02_538_7897451753303149223-1/-ext-1/dt2=2002
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32481) Support truncate table to move the data to trash

2020-08-08 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173757#comment-17173757
 ] 

L. C. Hsieh commented on SPARK-32481:
-

Why does this need to be a subtask of SPARK-32480? Will SPARK-32480 contain more than
one task to do?

> Support truncate table to move the data to trash
> 
>
> Key: SPARK-32481
> URL: https://issues.apache.org/jira/browse/SPARK-32481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>
> *Instead of deleting the data, move the data to the trash. The data can then be
> deleted permanently from the trash, based on configuration.*
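A minimal sketch of what "move to trash" could look like, using Hadoop's Trash API; the fallback behavior and any new configuration key are assumptions, not the final design:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, Trash}

// Moves a table or partition location to the HDFS trash; falls back to a
// permanent delete when the trash is disabled (fs.trash.interval = 0).
def moveToTrashOrDelete(fs: FileSystem, path: Path, conf: Configuration): Unit = {
  val movedToTrash = Trash.moveToAppropriateTrash(fs, path, conf)
  if (!movedToTrash) {
    fs.delete(path, true)
  }
}
{code}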



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org