[jira] [Commented] (SPARK-29345) Add an API that allows a user to define and observe arbitrary metrics on batch and streaming queries

2020-06-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138120#comment-17138120
 ] 

Wenchen Fan commented on SPARK-29345:
-

I'm resolving it, as the only remaining item is a small improvement.

> Add an API that allows a user to define and observe arbitrary metrics on 
> batch and streaming queries
> 
>
> Key: SPARK-29345
> URL: https://issues.apache.org/jira/browse/SPARK-29345
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>
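For context, a minimal sketch of the usage this epic enabled, assuming the Dataset.observe API that shipped with Spark 3.0; the metric name and columns below are purely illustrative. Observed values are delivered through query execution and streaming query listeners rather than returned by the action itself.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ObserveExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    // Attach named metrics to a query; they are computed alongside the query
    // and reported to QueryExecutionListener / StreamingQueryListener.
    val df = spark.range(0, 1000).observe(
      "my_metrics",
      count(lit(1)).as("rows"),
      sum(col("id")).as("id_sum"))

    // Run an action so the metrics are actually computed.
    df.write.format("noop").mode("overwrite").save()
  }
}
{code}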




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29345) Add an API that allows a user to define and observe arbitrary metrics on batch and streaming queries

2020-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29345.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Add an API that allows a user to define and observe arbitrary metrics on 
> batch and streaming queries
> 
>
> Key: SPARK-29345
> URL: https://issues.apache.org/jira/browse/SPARK-29345
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30108) Add robust accumulator for observable metrics

2020-06-16 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138118#comment-17138118
 ] 

Wenchen Fan commented on SPARK-30108:
-

is there any progress on it?

> Add robust accumulator for observable metrics
> -
>
> Key: SPARK-30108
> URL: https://issues.apache.org/jira/browse/SPARK-30108
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Spark accumulators reflect the work that has been done, and not the data that 
> has been processed. There are situations where one tuple can be processed 
> multiple times, e.g. task/stage retries, speculation, determination of 
> ranges for global ordering, etc. For observed metrics we need the value of 
> the accumulator to be based on the data and not on processing.
> The current aggregating accumulator is already robust to some of these issues 
> (like task failure), but we need to add some additional checks to make sure 
> it is foolproof.
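As a hedged illustration of the "work done vs. data processed" distinction above (not code from the ticket): a plain accumulator updated inside a transformation counts every execution of the closure, so task retries or speculative duplicates can inflate it.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext

// A plain accumulator updated inside a transformation reflects work done:
// if a task is retried or runs speculatively, its rows are added again.
val processed = sc.longAccumulator("processedRows")
val data = sc.parallelize(1 to 1000, 8).map { x => processed.add(1); x }
data.count()

// Equals 1000 only if every task ran exactly once; a data-based (robust)
// accumulator would report 1000 regardless of retries or speculation.
println(processed.value)
{code}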



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138100#comment-17138100
 ] 

Apache Spark commented on SPARK-32003:
--

User 'wypoon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28848

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wing Yew Poon
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, there is no NodeManager available to 
> serve shuffle files). By right, such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to task failure, the stage is re-attempted. Tasks continue to fail due to 
> fetch failure from the lost executor's shuffle output. This time, since the 
> failed epoch for the executor is higher, the executor is removed again (this 
> doesn't really do anything, the executor was already removed when it was 
> lost) and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 attempts by default. The customer was unlucky and two nodes went down 
> during the stage, i.e., the same problem happened twice. So they used up 4 
> stage attempts and the stage failed and thus the job. 
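For reference, the "4 attempts by default" above corresponds to the scheduler's consecutive stage-attempt limit; a hedged sketch, assuming the standard spark.stage.maxConsecutiveAttempts setting (default 4). Raising it only papers over the missing unregistration; it is not a fix.

{code:scala}
import org.apache.spark.SparkConf

// Stopgap only: give the stage more consecutive attempts so that repeated
// fetch failures from a lost executor's shuffle output do not fail the job.
val conf = new SparkConf()
  .set("spark.stage.maxConsecutiveAttempts", "8")
{code}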



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32003:


Assignee: Apache Spark

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wing Yew Poon
>Assignee: Apache Spark
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, there is no NodeManager available to 
> serve shuffle files). By right, such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to task failure, the stage is re-attempted. Tasks continue to fail due to 
> fetch failure from the lost executor's shuffle output. This time, since the 
> failed epoch for the executor is higher, the executor is removed again (this 
> doesn't really do anything, the executor was already removed when it was 
> lost) and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 attempts by default. The customer was unlucky and two nodes went down 
> during the stage, i.e., the same problem happened twice. So they used up 4 
> stage attempts and the stage failed and thus the job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138099#comment-17138099
 ] 

Apache Spark commented on SPARK-32003:
--

User 'wypoon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28848

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wing Yew Poon
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, there is no NodeManager available to 
> serve shuffle files). By right, such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to task failure, the stage is re-attempted. Tasks continue to fail due to 
> fetch failure from the lost executor's shuffle output. This time, since the 
> failed epoch for the executor is higher, the executor is removed again (this 
> doesn't really do anything, the executor was already removed when it was 
> lost) and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 attempts by default. The customer was unlucky and two nodes went down 
> during the stage, i.e., the same problem happened twice. So they used up 4 
> stage attempts and the stage failed and thus the job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32003:


Assignee: (was: Apache Spark)

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wing Yew Poon
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, there is no NodeManager available to 
> serve shuffle files). By right, such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to task failure, the stage is re-attempted. Tasks continue to fail due to 
> fetch failure from the lost executor's shuffle output. This time, since the 
> failed epoch for the executor is higher, the executor is removed again (this 
> doesn't really do anything, the executor was already removed when it was 
> lost) and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 attempts by default. The customer was unlucky and two nodes went down 
> during the stage, i.e., the same problem happened twice. So they used up 4 
> stage attempts and the stage failed and thus the job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30876) Optimizer cannot infer from inferred constraints with join

2020-06-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138072#comment-17138072
 ] 

Yuming Wang commented on SPARK-30876:
-

Thank you [~navinvishy].

> Optimizer cannot infer from inferred constraints with join
> --
>
> Key: SPARK-30876
> URL: https://issues.apache.org/jira/browse/SPARK-30876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table t1(a int, b int, c int);
> create table t2(a int, b int, c int);
> create table t3(a int, b int, c int);
> select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and 
> t3.c = 1);
> {code}
> Spark 2.3+:
> {noformat}
> == Physical Plan ==
> *(4) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, true, [id=#102]
>+- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(3) Project
>  +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
> :- *(3) Project [b#10]
> :  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
> : :- *(3) Project [a#6]
> : :  +- *(3) Filter isnotnull(a#6)
> : : +- *(3) ColumnarToRow
> : :+- FileScan parquet default.t1[a#6] Batched: true, 
> DataFilters: [isnotnull(a#6)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: 
> struct
> : +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#87]
> :+- *(1) Project [b#10]
> :   +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1))
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t2[b#10] Batched: 
> true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], 
> ReadSchema: struct
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#96]
>+- *(2) Project [c#14]
>   +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1))
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t3[c#14] Batched: true, 
> DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], 
> ReadSchema: struct
> Time taken: 3.785 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2.x:
> {noformat}
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *Project
>  +- *SortMergeJoin [b#19], [c#23], Inner
> :- *Project [b#19]
> :  +- *SortMergeJoin [a#15], [b#19], Inner
> : :- *Sort [a#15 ASC NULLS FIRST], false, 0
> : :  +- Exchange hashpartitioning(a#15, 200)
> : : +- *Filter (isnotnull(a#15) && (a#15 = 1))
> : :+- HiveTableScan [a#15], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, 
> b#16, c#17]
> : +- *Sort [b#19 ASC NULLS FIRST], false, 0
> :+- Exchange hashpartitioning(b#19, 200)
> :   +- *Filter (isnotnull(b#19) && (b#19 = 1))
> :  +- HiveTableScan [b#19], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, 
> b#19, c#20]
> +- *Sort [c#23 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c#23, 200)
>   +- *Filter (isnotnull(c#23) && (c#23 = 1))
>  +- HiveTableScan [c#23], HiveTableRelation 
> `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, 
> b#22, c#23]
> Time taken: 0.728 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now

2020-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32011.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28845
[https://github.com/apache/spark/pull/28845]

> Remove warnings about pin-thread modes and guide collectWithJobGroups for now
> -
>
> Key: SPARK-32011
> URL: https://issues.apache.org/jira/browse/SPARK-32011
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Currently, users are guided to use the pinned thread mode when using multiple threads. 
> There's a thread leak issue in this mode (SPARK-32010).
> We shouldn't promote this mode in Spark 3.0 for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now

2020-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32011:


Assignee: Hyukjin Kwon

> Remove warnings about pin-thread modes and guide collectWithJobGroups for now
> -
>
> Key: SPARK-32011
> URL: https://issues.apache.org/jira/browse/SPARK-32011
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, users are guided to use the pinned thread mode when using multiple threads. 
> There's a thread leak issue in this mode (SPARK-32010).
> We shouldn't promote this mode in Spark 3.0 for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31989) Generate JSON files with rebasing switch points using smaller steps

2020-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31989:


Assignee: Maxim Gekk

> Generate JSON files with rebasing switch points using smaller steps
> ---
>
> Key: SPARK-31989
> URL: https://issues.apache.org/jira/browse/SPARK-31989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, the JSON files 'gregorian-julian-rebase-micros.json' and 
> 'julian-gregorian-rebase-micros.json' are generated with a step of 1 week. The 
> files can miss short spikes of diffs. The ticket aims to change the max step 
> from 1 week to 30 minutes, and speed up execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31989) Generate JSON files with rebasing switch points using smaller steps

2020-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31989.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28827
[https://github.com/apache/spark/pull/28827]

> Generate JSON files with rebasing switch points using smaller steps
> ---
>
> Key: SPARK-31989
> URL: https://issues.apache.org/jira/browse/SPARK-31989
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the JSON files 'gregorian-julian-rebase-micros.json' and 
> 'julian-gregorian-rebase-micros.json' are generated with a step of 1 week. The 
> files can miss short spikes of diffs. The ticket aims to change the max step 
> from 1 week to 30 minutes, and speed up execution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29148) Modify dynamic allocation manager for stage level scheduling

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138046#comment-17138046
 ] 

Apache Spark commented on SPARK-29148:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/28847

> Modify dynamic allocation manager for stage level scheduling
> 
>
> Key: SPARK-29148
> URL: https://issues.apache.org/jira/browse/SPARK-29148
> Project: Spark
>  Issue Type: Story
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> To support stage-level scheduling, the dynamic allocation manager has to 
> track the usage and need of executors per ResourceProfile.
> We will have to figure out what to do with the metrics.
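For readers unfamiliar with stage-level scheduling, a hedged sketch of the per-stage ResourceProfile usage the allocation manager has to track, assuming the Spark 3.1 org.apache.spark.resource API; the resource amounts and `sc` (a SparkContext) are assumptions.

{code:scala}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a profile requesting different executor/task resources for one stage.
val execReqs = new ExecutorResourceRequests().cores(4).memory("8g")
val taskReqs = new TaskResourceRequests().cpus(2)
val profile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build()

// `sc` is an assumed SparkContext. The dynamic allocation manager now has to
// request and track executors per profile, not just a single global count.
val rdd = sc.parallelize(1 to 1000, 8).withResources(profile)
rdd.count()
{code}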



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29148) Modify dynamic allocation manager for stage level scheduling

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138048#comment-17138048
 ] 

Apache Spark commented on SPARK-29148:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/28847

> Modify dynamic allocation manager for stage level scheduling
> 
>
> Key: SPARK-29148
> URL: https://issues.apache.org/jira/browse/SPARK-29148
> Project: Spark
>  Issue Type: Story
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.1.0
>
>
> To support stage-level scheduling, the dynamic allocation manager has to 
> track the usage and need of executors per ResourceProfile.
> We will have to figure out what to do with the metrics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32002) spark error while select nest data

2020-06-16 Thread Yiqun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Zhang updated SPARK-32002:

Priority: Major  (was: Minor)

> spark error while select nest data
> --
>
> Key: SPARK-32002
> URL: https://issues.apache.org/jira/browse/SPARK-32002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Yiqun Zhang
>Priority: Major
>
> nest-data.json
> {code:java}
> {"a": [{"b": [{"c": [1,2]}]}]}
> {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
> {code:java}
> val df: DataFrame = spark.read.json(testFile("nest-data.json"))
> df.createTempView("nest_table")
> sql("select a.b.c from nest_table").show()
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 
> 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
> integral type, however, ''c'' is of string type.; line 1 pos 7;
>  'Project [a#6.b[c] AS c#8]
>  +- SubqueryAlias `nest_table`
>   +- Relation[a#6] json
> Analysing the cause: the dataType of the a.b expression is matched against the 
> extractors for c. The extractor for a.b is GetArrayStructFields, so the type is 
> ArrayType(ArrayType(...)), which matches the GetArrayItem case, and the 
> extraction ("c") is treated as an ordinal.
> org.apache.spark.sql.catalyst.expressions.ExtractValue
> {code:java}
> def apply(
>   child: Expression,
>   extraction: Expression,
>   resolver: Resolver): Expression = {
>(child.dataType, extraction) match {
>   case (StructType(fields), NonNullLiteral(v, StringType)) =>
> val fieldName = v.toString
> val ordinal = findField(fields, fieldName, resolver)
> GetStructField(child, ordinal, Some(fieldName))  
>   case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, 
> StringType)) =>
> val fieldName = v.toString
> val ordinal = findField(fields, fieldName, resolver)
> GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
>   ordinal, fields.length, containsNull)  
>   case (_: ArrayType, _) => GetArrayItem(child, extraction)  
>   case (MapType(kt, _, _), _) => GetMapValue(child, extraction)  
>   case (otherType, _) =>
> val errorMsg = otherType match {
>   case StructType(_) =>
> s"Field name should be String Literal, but it's $extraction"
>   case other =>
> s"Can't extract value from $child: need struct type but got 
> ${other.catalogString}"
> }
> throw new AnalysisException(errorMsg)
> }
>   }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32012:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Incrementally create and materialize query stage to avoid unnecessary local 
> shuffle
> ---
>
> Key: SPARK-32012
> URL: https://issues.apache.org/jira/browse/SPARK-32012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> The current way of creating query stages in AQE is in batches. For example, 
> the children of a sort merge join are materialized as query stages in one 
> batch. AQE then applies its optimizations and may convert the sort merge join 
> into a broadcast join. Apart from the broadcast exchange, no exchange is 
> needed on the other side of the join, but that exchange has already been 
> materialized. Currently AQE wraps the materialized exchange with a local 
> reader, but this still incurs unnecessary I/O. We can avoid the unnecessary 
> local shuffle by creating query stages incrementally.
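A hedged illustration of the scenario described above, with assumed DataFrames `large` and `small` sharing a "key" column; treat the config keys as assumptions for Spark 3.0-era AQE. A sort merge join whose build side turns out to be small gets re-planned as a broadcast join, and the already-materialized shuffle on the other side is consumed through a local shuffle reader, which is exactly the I/O this ticket wants to avoid.

{code:scala}
// Enable adaptive query execution and its local shuffle reader.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")

// `large` and `small` are assumed DataFrames with a common "key" column.
val joined = large.join(small, "key")
joined.count()  // inspect the final plan in the UI: broadcast join + local reader
{code}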



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32012:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Incrementally create and materialize query stage to avoid unnecessary local 
> shuffle
> ---
>
> Key: SPARK-32012
> URL: https://issues.apache.org/jira/browse/SPARK-32012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> The current way of creating query stages in AQE is in batches. For example, 
> the children of a sort merge join are materialized as query stages in one 
> batch. AQE then applies its optimizations and may convert the sort merge join 
> into a broadcast join. Apart from the broadcast exchange, no exchange is 
> needed on the other side of the join, but that exchange has already been 
> materialized. Currently AQE wraps the materialized exchange with a local 
> reader, but this still incurs unnecessary I/O. We can avoid the unnecessary 
> local shuffle by creating query stages incrementally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138023#comment-17138023
 ] 

Apache Spark commented on SPARK-32012:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28846

> Incrementally create and materialize query stage to avoid unnecessary local 
> shuffle
> ---
>
> Key: SPARK-32012
> URL: https://issues.apache.org/jira/browse/SPARK-32012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> The current way of creating query stages in AQE is in batches. For example, 
> the children of a sort merge join are materialized as query stages in one 
> batch. AQE then applies its optimizations and may convert the sort merge join 
> into a broadcast join. Apart from the broadcast exchange, no exchange is 
> needed on the other side of the join, but that exchange has already been 
> materialized. Currently AQE wraps the materialized exchange with a local 
> reader, but this still incurs unnecessary I/O. We can avoid the unnecessary 
> local shuffle by creating query stages incrementally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138024#comment-17138024
 ] 

Apache Spark commented on SPARK-32012:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28846

> Incrementally create and materialize query stage to avoid unnecessary local 
> shuffle
> ---
>
> Key: SPARK-32012
> URL: https://issues.apache.org/jira/browse/SPARK-32012
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> The current way of creating query stages in AQE is in batches. For example, 
> the children of a sort merge join are materialized as query stages in one 
> batch. AQE then applies its optimizations and may convert the sort merge join 
> into a broadcast join. Apart from the broadcast exchange, no exchange is 
> needed on the other side of the join, but that exchange has already been 
> materialized. Currently AQE wraps the materialized exchange with a local 
> reader, but this still incurs unnecessary I/O. We can avoid the unnecessary 
> local shuffle by creating query stages incrementally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle

2020-06-16 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-32012:
---

 Summary: Incrementally create and materialize query stage to avoid 
unnecessary local shuffle
 Key: SPARK-32012
 URL: https://issues.apache.org/jira/browse/SPARK-32012
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


The current way of creating query stages in AQE is in batches. For example, the 
children of a sort merge join are materialized as query stages in one batch. 
AQE then applies its optimizations and may convert the sort merge join into a 
broadcast join. Apart from the broadcast exchange, no exchange is needed on the 
other side of the join, but that exchange has already been materialized. 
Currently AQE wraps the materialized exchange with a local reader, but this 
still incurs unnecessary I/O. We can avoid the unnecessary local shuffle by 
creating query stages incrementally.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138014#comment-17138014
 ] 

Apache Spark commented on SPARK-32011:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28845

> Remove warnings about pin-thread modes and guide collectWithJobGroups for now
> -
>
> Key: SPARK-32011
> URL: https://issues.apache.org/jira/browse/SPARK-32011
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, users are guided to use the pinned thread mode when using multiple threads. 
> There's a thread leak issue in this mode (SPARK-32010).
> We shouldn't promote this mode in Spark 3.0 for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32011:


Assignee: Apache Spark

> Remove warnings about pin-thread modes and guide collectWithJobGroups for now
> -
>
> Key: SPARK-32011
> URL: https://issues.apache.org/jira/browse/SPARK-32011
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, users are guided to use the pinned thread mode when using multiple threads. 
> There's a thread leak issue in this mode (SPARK-32010).
> We shouldn't promote this mode in Spark 3.0 for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32010) Thread leaks in pinned thread mode

2020-06-16 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138012#comment-17138012
 ] 

Hyukjin Kwon commented on SPARK-32010:
--

I have a kind of a POC fix already. Will open a PR later.

> Thread leaks in pinned thread mode
> --
>
> Key: SPARK-32010
> URL: https://issues.apache.org/jira/browse/SPARK-32010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-22340 introduced a pinned thread mode which keeps the Python thread and 
> the JVM thread in sync.
> However, it looks like the JVM threads are not finished even when the Python 
> thread is finished. This can be observed via YourKit when running multiple 
> jobs with multiple threads at the same time.
> The easiest reproducer is:
> {code}
> PYSPARK_PIN_THREAD=true ./bin/pyspark
> {code}
> {code}
> >>> from threading import Thread
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> spark._jvm._gateway_client.deque
> deque([, 
> , 
> , 
> , 
> ])
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> spark._jvm._gateway_client.deque
> deque([, 
> , 
> , 
> , 
> , 
> ])
> {code}
> The connection doesn't get closed, and it keeps the JVM thread running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138013#comment-17138013
 ] 

Apache Spark commented on SPARK-32011:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28845

> Remove warnings about pin-thread modes and guide collectWithJobGroups for now
> -
>
> Key: SPARK-32011
> URL: https://issues.apache.org/jira/browse/SPARK-32011
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, users are guided to use the pinned thread mode when using multiple threads. 
> There's a thread leak issue in this mode (SPARK-32010).
> We shouldn't promote this mode in Spark 3.0 for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32011:


Assignee: (was: Apache Spark)

> Remove warnings about pin-thread modes and guide collectWithJobGroups for now
> -
>
> Key: SPARK-32011
> URL: https://issues.apache.org/jira/browse/SPARK-32011
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, users are guided to use the pinned thread mode when using multiple threads. 
> There's a thread leak issue in this mode (SPARK-32010).
> We shouldn't promote this mode in Spark 3.0 for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32010) Thread leaks in pinned thread mode

2020-06-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32010:


 Summary: Thread leaks in pinned thread mode
 Key: SPARK-32010
 URL: https://issues.apache.org/jira/browse/SPARK-32010
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.1.0
Reporter: Hyukjin Kwon


SPARK-22340 introduced a pinned thread mode which keeps the Python thread and 
the JVM thread in sync.

However, it looks like the JVM threads are not finished even when the Python 
thread is finished. This can be observed via YourKit when running multiple jobs 
with multiple threads at the same time.

The easiest reproducer is:

{code}
PYSPARK_PIN_THREAD=true ./bin/pyspark
{code}

{code}
>>> from threading import Thread
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([, 
, 
, 
, 
])
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([, 
, 
, 
, 
, 
])
{code}

The connection doesn't get closed, and it keeps the JVM thread running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32010) Thread leaks in pinned thread mode

2020-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32010:


Assignee: Hyukjin Kwon

> Thread leaks in pinned thread mode
> --
>
> Key: SPARK-32010
> URL: https://issues.apache.org/jira/browse/SPARK-32010
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-22340 introduced a pinned thread mode which keeps the Python thread and 
> the JVM thread in sync.
> However, it looks like the JVM threads are not finished even when the Python 
> thread is finished. This can be observed via YourKit when running multiple 
> jobs with multiple threads at the same time.
> The easiest reproducer is:
> {code}
> PYSPARK_PIN_THREAD=true ./bin/pyspark
> {code}
> {code}
> >>> from threading import Thread
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> spark._jvm._gateway_client.deque
> deque([, 
> , 
> , 
> , 
> ])
> >>> Thread(target=lambda: spark.range(1000).collect()).start()
> >>> spark._jvm._gateway_client.deque
> deque([, 
> , 
> , 
> , 
> , 
> ])
> {code}
> The connection doesn't get closed, and it keeps the JVM thread running.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31337) Support MS Sql Kerberos login in JDBC connector

2020-06-16 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-31337.

Fix Version/s: 3.1.0
 Assignee: Gabor Somogyi
   Resolution: Fixed

> Support MS Sql Kerberos login in JDBC connector
> ---
>
> Key: SPARK-31337
> URL: https://issues.apache.org/jira/browse/SPARK-31337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now

2020-06-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-32011:


 Summary: Remove warnings about pin-thread modes and guide 
collectWithJobGroups for now
 Key: SPARK-32011
 URL: https://issues.apache.org/jira/browse/SPARK-32011
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Currently, users are guided to use the pinned thread mode when using multiple threads. 
There's a thread leak issue in this mode (SPARK-32010).

We shouldn't promote this mode in Spark 3.0 for now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137996#comment-17137996
 ] 

Apache Spark commented on SPARK-32009:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28844

> remove deprecated method BisectingKMeansModel.computeCost
> -
>
> Key: SPARK-32009
> URL: https://issues.apache.org/jira/browse/SPARK-32009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> remove BisectingKMeans.computeCost which is deprecated in 3.0
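For users migrating off the removed method, a hedged sketch of a commonly suggested alternative, ClusteringEvaluator (silhouette); note this is a different metric than the old sum-of-squared-distances cost, and `dataset` is an assumed DataFrame with a "features" vector column.

{code:scala}
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// `dataset` is an assumed DataFrame with a "features" vector column.
val model = new BisectingKMeans().setK(4).setSeed(1L).fit(dataset)
val predictions = model.transform(dataset)

// Silhouette-based evaluation instead of the removed computeCost.
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"silhouette = $silhouette")
{code}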



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32009:


Assignee: Apache Spark

> remove deprecated method BisectingKMeansModel.computeCost
> -
>
> Key: SPARK-32009
> URL: https://issues.apache.org/jira/browse/SPARK-32009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> remove BisectingKMeans.computeCost which is deprecated in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137995#comment-17137995
 ] 

Apache Spark commented on SPARK-32009:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28844

> remove deprecated method BisectingKMeansModel.computeCost
> -
>
> Key: SPARK-32009
> URL: https://issues.apache.org/jira/browse/SPARK-32009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> remove BisectingKMeans.computeCost which is deprecated in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32009:


Assignee: (was: Apache Spark)

> remove deprecated method BisectingKMeansModel.computeCost
> -
>
> Key: SPARK-32009
> URL: https://issues.apache.org/jira/browse/SPARK-32009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> remove BisectingKMeans.computeCost which is deprecated in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost

2020-06-16 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-32009:
---
Summary: remove deprecated method BisectingKMeansModel.computeCost  (was: 
remove deprecated method BisectingKMeans.computeCost)

> remove deprecated method BisectingKMeansModel.computeCost
> -
>
> Key: SPARK-32009
> URL: https://issues.apache.org/jira/browse/SPARK-32009
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> remove BisectingKMeans.computeCost which is deprecated in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31777) CrossValidator supports user-supplied folds

2020-06-16 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-31777.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28704
[https://github.com/apache/spark/pull/28704]

> CrossValidator supports user-supplied folds 
> 
>
> Key: SPARK-31777
> URL: https://issues.apache.org/jira/browse/SPARK-31777
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> As a user, I can specify how CrossValidator should create folds by specifying 
> a foldCol, which should be integer type with range [0, numFolds). If foldCol 
> is specified, Spark won't do a random k-fold split. This is useful if there is 
> custom logic to create folds, e.g., splitting randomly by user instead of 
> randomly by individual events.
> This is similar to SPARK-16206, which is for the RDD-based APIs.
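A hedged sketch of the user-supplied-folds feature as described above (CrossValidator.setFoldCol in Spark 3.1); the estimator, parameter grid, and the `training` DataFrame with its integer "fold" column are illustrative assumptions.

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

// `training` is an assumed DataFrame whose integer "fold" column holds values
// in [0, numFolds); those assignments replace the random k-fold split.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
  .setFoldCol("fold")

val cvModel = cv.fit(training)
{code}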



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31777) CrossValidator supports user-supplied folds

2020-06-16 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-31777:
---

Assignee: L. C. Hsieh

> CrossValidator supports user-supplied folds 
> 
>
> Key: SPARK-31777
> URL: https://issues.apache.org/jira/browse/SPARK-31777
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Assignee: L. C. Hsieh
>Priority: Major
>
> As a user, I can specify how CrossValidator should create folds by specifying 
> a foldCol, which should be integer type with range [0, numFolds). If foldCol 
> is specified, Spark won't do a random k-fold split. This is useful if there is 
> custom logic to create folds, e.g., splitting randomly by user instead of 
> randomly by individual events.
> This is similar to SPARK-16206, which is for the RDD-based APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32009) remove deprecated method BisectingKMeans.computeCost

2020-06-16 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-32009:
--

 Summary: remove deprecated method BisectingKMeans.computeCost
 Key: SPARK-32009
 URL: https://issues.apache.org/jira/browse/SPARK-32009
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 3.1.0
Reporter: Huaxin Gao


remove BisectingKMeans.computeCost which is deprecated in 3.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32008) 3.0.0 release build fails

2020-06-16 Thread Shivaram Venkataraman (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137937#comment-17137937
 ] 

Shivaram Venkataraman commented on SPARK-32008:
---

It looks like the R vignette build failed and looking at the error message this 
seems related to https://github.com/rstudio/rmarkdown/issues/1831 -- I think it 
should work fine if you try to use R version >= 3.6

> 3.0.0 release build fails
> -
>
> Key: SPARK-32008
> URL: https://issues.apache.org/jira/browse/SPARK-32008
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation
>Affects Versions: 3.0.0
>Reporter: Philipp Dallig
>Priority: Major
>
> Hi,
> I am trying to build the Spark 3.0.0 release myself.
> I got the following error.
> {code}  
> 20/06/16 15:20:49 WARN PrefixSpan: Input data is not cached.
> 20/06/16 15:20:50 WARN Instrumentation: [b307b568] regParam is zero, which 
> might cause numerical instability and overfitting.
> Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
> 'vignetteInfo' is not an exported object from 'namespace:tools'
> Execution halted
> {code}
> I can reproduce this error with a small Dockerfile.
> {code}
> FROM ubuntu:18.04 as builder
> ENV MVN_VERSION=3.6.3 \
> M2_HOME=/opt/apache-maven \
> MAVEN_HOME=/opt/apache-maven \
> MVN_HOME=/opt/apache-maven \
> 
> MVN_SHA512=c35a1803a6e70a126e80b2b3ae33eed961f83ed74d18fcd16909b2d44d7dada3203f1ffe726c17ef8dcca2dcaa9fca676987befeadc9b9f759967a8cb77181c0
>  \
> MAVEN_OPTS="-Xmx3g -XX:ReservedCodeCacheSize=1g" \
> R_HOME=/usr/lib/R \
> GIT_REPO=https://github.com/apache/spark.git \
> GIT_BRANCH=v3.0.0 \
> SPARK_DISTRO_NAME=hadoop3.2 \
> SPARK_LOCAL_HOSTNAME=localhost
> # Preparation
> RUN /usr/bin/apt-get update && \
> # APT
> INSTALL_PKGS="openjdk-8-jdk-headless git wget python3 python3-pip 
> python3-setuptools r-base r-base-dev pandoc pandoc-citeproc 
> libcurl4-openssl-dev libssl-dev libxml2-dev texlive qpdf language-pack-en" && 
> \
> DEBIAN_FRONTEND=noninteractive /usr/bin/apt-get -y install 
> --no-install-recommends $INSTALL_PKGS && \
> rm -rf /var/lib/apt/lists/* && \
> Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 
> 'testthat', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \
> # Maven
> /usr/bin/wget -nv -O apache-maven.tar.gz 
> "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download=maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz;
>  && \
> echo "${MVN_SHA512} apache-maven.tar.gz" > apache-maven.sha512 && \
> sha512sum --strict -c apache-maven.sha512 && \
> tar -xvzf apache-maven.tar.gz -C /opt && \
> rm -v apache-maven.sha512 apache-maven.tar.gz && \
> /bin/ln -vs /opt/apache-maven-${MVN_VERSION} /opt/apache-maven && \
> /bin/ln -vs /opt/apache-maven/bin/mvn /usr/bin/mvn
> # Spark Distribution Build
> RUN mkdir -p /workspace && \
> cd /workspace && \
> git clone --branch ${GIT_BRANCH} ${GIT_REPO} && \
> cd /workspace/spark && \
> ./dev/make-distribution.sh --name ${SPARK_DISTRO_NAME} --pip --r --tgz 
> -Psparkr -Phadoop-3.2 -Phive-2.3 -Phive-thriftserver -Pyarn -Pkubernetes
> {code}
> I am very grateful to all helpers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf

2020-06-16 Thread Kouhei Sutou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137931#comment-17137931
 ] 

Kouhei Sutou commented on SPARK-31998:
--

FYI: The corresponding change in Apache Arrow: 
https://github.com/apache/arrow/pull/6729


> Change package references for ArrowBuf
> --
>
> Key: SPARK-31998
> URL: https://issues.apache.org/jira/browse/SPARK-31998
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Liya Fan
>Priority: Major
>
> Recently, we have moved class ArrowBuf from package io.netty.buffer to 
> org.apache.arrow.memory. So after upgrading the Arrow library, we need to update 
> the references to ArrowBuf with the correct package name.
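In practice the change is an import/package update; a minimal sketch (the helper function below is purely illustrative).

{code:scala}
// Before the Arrow upgrade the class was imported as:
//   import io.netty.buffer.ArrowBuf
// After the upgrade it lives in Arrow's own memory package:
import org.apache.arrow.memory.ArrowBuf

// Illustrative only: code that merely references the type needs no other change.
def describe(buf: ArrowBuf): String = s"ArrowBuf(capacity=${buf.capacity()})"
{code}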



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming or just loading from a file data source, I've 
encountered a number of occasions where I want to be able to stream from a 
folder containing any number of historical files in CSV format.  When I start 
reading from a folder, however, I might only care about files that were created 
after a certain time.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path. In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
via an option. Even without a filter specified, this check would cost less 
effort than applying a filter by itself, since we are only comparing primitives.
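
A rough sketch of the comparison this option implies (the helper below is illustrative only, not the actual _InMemoryFileIndex_ code; it assumes the option value parses as an ISO-8601 local datetime interpreted as UTC):
{code:scala}
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.hadoop.fs.FileStatus

// Hypothetical helper: keep only files modified at or after the cutoff given
// by a "fileModifiedDate" option such as "2020-05-01T12:00:00" (UTC).
def filterByModifiedDate(files: Seq[FileStatus], fileModifiedDate: String): Seq[FileStatus] = {
  val cutoffMillis = LocalDateTime.parse(fileModifiedDate)
    .toInstant(ZoneOffset.UTC)
    .toEpochMilli
  // FileStatus.getModificationTime returns epoch milliseconds, so each file
  // only needs a plain primitive comparison.
  files.filter(_.getModificationTime >= cutoffMillis)
}
{code}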

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be a "_fileModifiedDate_" option accepting a UTC datetime like 
the one below.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

 

+*Consumers seeking to replicate or achieve this behavior*+

*Stack Overflow - (spark structured streaming file source read from a certain 
partition onwards)*
[https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards]

*Stack Overflow - (Spark Structured Streaming File Source Starting Offset)*
[https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset/51399134#51399134]

  was:
When using structured streaming or just loading from a file data source, I've 
encountered a number of occasions where I want to be able to stream from a 
folder containing any number of historical files in CSV format.  When I start 
reading from a folder, however, I might only care about files that were created 
after a certain time.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path. In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
via an option. Even without a filter specified, this check would cost less 
effort than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be a "_fileModifiedDate_" option accepting a UTC datetime like 
the one below.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

 

Stack Overflow -(spark structured streaming file source read from a 

[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming or just loading from a file data source, I've 
encountered a number of occasions where I want to be able to stream from a 
folder containing any number of historical files in CSV format.  When I start 
reading from a folder, however, I might only care about files that were created 
after a certain time.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path. In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
via an option. Even without a filter specified, this check would cost less 
effort than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be a "_fileModifiedDate_" option accepting a UTC datetime like 
the one below.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

 

*Stack Overflow - (spark structured streaming file source read from a certain 
partition onwards)*
[https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards]

*Stack Overflow - (Spark Structured Streaming File Source Starting Offset)*
[https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset/51399134#51399134]

  was:
When using structured streaming or just loading from a file data source, I've 
encountered a number of occasions where I want to be able to stream from a 
folder containing any number of historical files in CSV format.  When I start 
reading from a folder, however, I might only care about files that were created 
after a certain time.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path. In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
via an option. Even without a filter specified, this check would cost less 
effort than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be a "_fileModifiedDate_" option accepting a UTC datetime like 
the one below.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

 

Stack Overflow -[spark structured streaming file source read from a certain 
partition onwards

[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming or just loading from a file data source, I've 
encountered a number of occasions where I want to be able to stream from a 
folder containing any number of historical files in CSV format.  When I start 
reading from a folder, however, I might only care about files that were created 
after a certain time.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path. In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
via an option. Even without a filter specified, this check would cost less 
effort than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be a "_fileModifiedDate_" option accepting a UTC datetime like 
the one below.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

 

*Stack Overflow - (spark structured streaming file source read from a certain 
partition onwards)*
[https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards]

*Stack Overflow - (Spark Structured Streaming File Source Starting Offset)*
[https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset/51399134#51399134]

  was:
When using structured streaming or just loading from a file data source, I've 
encountered a number of occasions where I want to be able to stream from a 
folder containing any number of historical files in CSV format.  When I start 
reading from a folder, however, I might only care about files that were created 
after a certain time.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path. In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified 
via an option. Even without a filter specified, this check would cost less 
effort than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be a "_fileModifiedDate_" option accepting a UTC datetime like 
the one below.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.


> Provide option to load files after a specified date when reading from a 
> folder path
> 

[jira] [Created] (SPARK-32008) 3.0.0 release build fails

2020-06-16 Thread Philipp Dallig (Jira)
Philipp Dallig created SPARK-32008:
--

 Summary: 3.0.0 release build fails
 Key: SPARK-32008
 URL: https://issues.apache.org/jira/browse/SPARK-32008
 Project: Spark
  Issue Type: Bug
  Components: Build, Documentation
Affects Versions: 3.0.0
Reporter: Philipp Dallig


Hi,
I try to build the spark release 3.0.0 by myself.

I got the following error.
{code}  
20/06/16 15:20:49 WARN PrefixSpan: Input data is not cached.
20/06/16 15:20:50 WARN Instrumentation: [b307b568] regParam is zero, which 
might cause numerical instability and overfitting.
Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics:
'vignetteInfo' is not an exported object from 'namespace:tools'
Execution halted
{code}

I can reproduce this error with a small Dockerfile.
{code}
FROM ubuntu:18.04 as builder

ENV MVN_VERSION=3.6.3 \
M2_HOME=/opt/apache-maven \
MAVEN_HOME=/opt/apache-maven \
MVN_HOME=/opt/apache-maven \

MVN_SHA512=c35a1803a6e70a126e80b2b3ae33eed961f83ed74d18fcd16909b2d44d7dada3203f1ffe726c17ef8dcca2dcaa9fca676987befeadc9b9f759967a8cb77181c0
 \
MAVEN_OPTS="-Xmx3g -XX:ReservedCodeCacheSize=1g" \
R_HOME=/usr/lib/R \
GIT_REPO=https://github.com/apache/spark.git \
GIT_BRANCH=v3.0.0 \
SPARK_DISTRO_NAME=hadoop3.2 \
SPARK_LOCAL_HOSTNAME=localhost

# Preparation
RUN /usr/bin/apt-get update && \
# APT
INSTALL_PKGS="openjdk-8-jdk-headless git wget python3 python3-pip 
python3-setuptools r-base r-base-dev pandoc pandoc-citeproc 
libcurl4-openssl-dev libssl-dev libxml2-dev texlive qpdf language-pack-en" && \
DEBIAN_FRONTEND=noninteractive /usr/bin/apt-get -y install 
--no-install-recommends $INSTALL_PKGS && \
rm -rf /var/lib/apt/lists/* && \
Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 
'testthat', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \
# Maven
/usr/bin/wget -nv -O apache-maven.tar.gz 
"https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download=maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz;
 && \
echo "${MVN_SHA512} apache-maven.tar.gz" > apache-maven.sha512 && \
sha512sum --strict -c apache-maven.sha512 && \
tar -xvzf apache-maven.tar.gz -C /opt && \
rm -v apache-maven.sha512 apache-maven.tar.gz && \
/bin/ln -vs /opt/apache-maven-${MVN_VERSION} /opt/apache-maven && \
/bin/ln -vs /opt/apache-maven/bin/mvn /usr/bin/mvn

# Spark Distribution Build
RUN mkdir -p /workspace && \
cd /workspace && \
git clone --branch ${GIT_BRANCH} ${GIT_REPO} && \
cd /workspace/spark && \
./dev/make-distribution.sh --name ${SPARK_DISTRO_NAME} --pip --r --tgz 
-Psparkr -Phadoop-3.2 -Phive-2.3 -Phive-thriftserver -Pyarn -Pkubernetes
{code}

I am very grateful to all helpers.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32007) Spark Driver Supervise does not work reliably

2020-06-16 Thread Suraj Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suraj Sharma updated SPARK-32007:
-
Environment: 
|Java Version|1.8.0_121 (Oracle Corporation)|
|Java Home|/usr/java/jdk1.8.0_121/jre|
|Scala Version|version 2.11.12|
|OS|Amazon Linux|
h4.  

  was:
||Name||Value||
|Java Version|1.8.0_121 (Oracle Corporation)|
|Java Home|/usr/java/jdk1.8.0_121/jre|
|Scala Version|version 2.11.12|
|OS|Amazon Linux|
h4.  


> Spark Driver Supervise does not work reliably
> -
>
> Key: SPARK-32007
> URL: https://issues.apache.org/jira/browse/SPARK-32007
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: |Java Version|1.8.0_121 (Oracle Corporation)|
> |Java Home|/usr/java/jdk1.8.0_121/jre|
> |Scala Version|version 2.11.12|
> |OS|Amazon Linux|
> h4.  
>Reporter: Suraj Sharma
>Priority: Critical
>
> I have a standalone cluster setup. I DO NOT have a streaming use case. I use 
> AWS EC2 machines to have spark master and worker processes.
> *Problem*: If a Spark worker machine running some drivers and executors dies, 
> then the drivers are not spawned again on other healthy machines.
> *Below are my findings:*
> ||Action/Behaviour||Executor||Driver||
> |Worker Machine Stop|Relaunches on an active machine|NO Relaunch|
> |kill -9 to process|Relaunches on other machines|Relaunches on other machines|
> |kill to process|Relaunches on other machines|Relaunches on other machines|
> *Cluster Setup:*
>  # I have a spark standalone cluster
>  # {{spark.driver.supervise=true}}
>  # Spark Master HA is enabled and is backed by zookeeper
>  # Spark version = 2.4.4
>  # I am using a systemd script for the spark worker process



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32007) Spark Driver Supervise does not work reliably

2020-06-16 Thread Suraj Sharma (Jira)
Suraj Sharma created SPARK-32007:


 Summary: Spark Driver Supervise does not work reliably
 Key: SPARK-32007
 URL: https://issues.apache.org/jira/browse/SPARK-32007
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 2.4.4
 Environment: ||Name||Value||
|Java Version|1.8.0_121 (Oracle Corporation)|
|Java Home|/usr/java/jdk1.8.0_121/jre|
|Scala Version|version 2.11.12|
|OS|Amazon Linux|
h4.  
Reporter: Suraj Sharma


I have a standalone cluster setup. I DO NOT have a streaming use case. I use 
AWS EC2 machines to have spark master and worker processes.

*Problem*: If a Spark worker machine running some drivers and executors dies, 
then the drivers are not spawned again on other healthy machines.

*Below are my findings:*
||Action/Behaviour||Executor||Driver||
|Worker Machine Stop|Relaunches on an active machine|NO Relaunch|
|kill -9 to process|Relaunches on other machines|Relaunches on other machines|
|kill to process|Relaunches on other machines|Relaunches on other machines|

*Cluster Setup:*
 # I have a spark standalone cluster
 # {{spark.driver.supervise=true}}
 # Spark Master HA is enabled and is backed by zookeeper
 # Spark version = 2.4.4
 # I am using a systemd script for the spark worker process
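
For reference, a minimal sketch of how a driver is typically submitted with supervision in standalone cluster mode (the master URL, jar path, and main class below are placeholders, not the actual job):
{code:scala}
import org.apache.spark.launcher.SparkLauncher

// Launch a driver in cluster mode with supervision enabled; the standalone
// master is expected to restart the driver if it exits abnormally.
val handle = new SparkLauncher()
  .setMaster("spark://master-host:7077")   // placeholder master URL
  .setDeployMode("cluster")
  .setConf("spark.driver.supervise", "true")
  .setAppResource("/path/to/app.jar")      // placeholder application jar
  .setMainClass("com.example.Main")        // placeholder main class
  .startApplication()
{code}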



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32006:


Assignee: Apache Spark

> Create date/timestamp formatters once before collect in `hiveResultString()`
> 
>
> Key: SPARK-32006
> URL: https://issues.apache.org/jira/browse/SPARK-32006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark 2.4 re-uses one instance of SimpleDateFormat while formatting 
> timestamps in toHiveString. Currently, toHiveString() creates a 
> timestampFormatter for each value. Even with caching, this causes additional 
> overhead compared to Spark 2.4. This ticket aims to create an instance of 
> TimestampFormatter before collect in hiveResultString().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137852#comment-17137852
 ] 

Apache Spark commented on SPARK-32006:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28842

> Create date/timestamp formatters once before collect in `hiveResultString()`
> 
>
> Key: SPARK-32006
> URL: https://issues.apache.org/jira/browse/SPARK-32006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Spark 2.4 re-uses one instance of SimpleDateFormat while formatting 
> timestamps in toHiveString. Currently, toHiveString() creates a 
> timestampFormatter for each value. Even with caching, this causes additional 
> overhead compared to Spark 2.4. This ticket aims to create an instance of 
> TimestampFormatter before collect in hiveResultString().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32006:


Assignee: (was: Apache Spark)

> Create date/timestamp formatters once before collect in `hiveResultString()`
> 
>
> Key: SPARK-32006
> URL: https://issues.apache.org/jira/browse/SPARK-32006
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Spark 2.4 re-uses one instance of SimpleDateFormat while formatting 
> timestamps in toHiveString. Currently, toHiveString() creates a 
> timestampFormatter for each value. Even with caching, this causes additional 
> overhead compared to Spark 2.4. This ticket aims to create an instance of 
> TimestampFormatter before collect in hiveResultString().



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32005) Add aggregate functions for computing percentiles on weighted data

2020-06-16 Thread Devesh Parekh (Jira)
Devesh Parekh created SPARK-32005:
-

 Summary: Add aggregate functions for computing percentiles on 
weighted data
 Key: SPARK-32005
 URL: https://issues.apache.org/jira/browse/SPARK-32005
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Devesh Parekh


SPARK-30569 adds percentile_approx functions for computing percentiles for a 
column with equal weights. It would be useful to have variants that also take a 
weight column.
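
A sketch of the current API and of a crude workaround for small integer weights (the weighted variant itself does not exist yet; the data, accuracy value, and column names below are only illustrative):
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_repeat, explode, lit}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("weighted-percentile-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq((1.0, 3), (5.0, 1), (9.0, 2)).toDF("value", "weight")

// Existing behaviour: every row counts once.
df.selectExpr("percentile_approx(value, 0.5, 10000) AS median").show()

// Crude workaround for integer weights: replicate each row `weight` times.
// This does not scale and cannot express fractional weights, which is why a
// native weighted variant would be useful.
df.withColumn("rep", explode(array_repeat(lit(1), $"weight")))
  .selectExpr("percentile_approx(value, 0.5, 10000) AS weighted_median")
  .show()
{code}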



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`

2020-06-16 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-32006:
--

 Summary: Create date/timestamp formatters once before collect in 
`hiveResultString()`
 Key: SPARK-32006
 URL: https://issues.apache.org/jira/browse/SPARK-32006
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.1.0
Reporter: Maxim Gekk


Spark 2.4 re-uses one instance of SimpleDateFormat while formatting timestamps 
in toHiveString. Currently, toHiveString() creates a timestampFormatter for each 
value. Even with caching, this causes additional overhead compared to Spark 2.4. 
This ticket aims to create an instance of TimestampFormatter before collect in 
hiveResultString().
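
To illustrate the intended "build once, reuse per row" pattern with plain java.time (this is only a sketch, not Spark's internal TimestampFormatter API; the pattern and sample values are assumptions):
{code:scala}
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Build the formatter once, outside the per-row loop...
val formatter = DateTimeFormatter
  .ofPattern("yyyy-MM-dd HH:mm:ss")
  .withZone(ZoneId.of("UTC"))

// ...then reuse it for every collected value (epoch microseconds here,
// matching how Spark stores timestamps internally).
val micros = Seq(0L, 1592265600000000L)
val rendered = micros.map { us =>
  formatter.format(Instant.ofEpochSecond(us / 1000000L, (us % 1000000L) * 1000L))
}
{code}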



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32004) Drop references to slave

2020-06-16 Thread Holden Karau (Jira)
Holden Karau created SPARK-32004:


 Summary: Drop references to slave
 Key: SPARK-32004
 URL: https://issues.apache.org/jira/browse/SPARK-32004
 Project: Spark
  Issue Type: Improvement
  Components: Mesos, Spark Core, SQL, Structured Streaming
Affects Versions: 3.1.0
Reporter: Holden Karau


We have a lot of references to "slave" in the code base which don't match the 
terminology in the rest of our code base, and we should clean them up. In many 
situations it would be clearer to use "executor", "worker", or "replica" 
depending on the context (so this is not just a search and replace, but requires 
actually reading through the code and making it consistent).

 

We may want to (in a follow on) explore renaming master to something more 
precise.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31099) Create migration script for metastore_db

2020-06-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31099.
-

> Create migration script for metastore_db
> 
>
> Key: SPARK-31099
> URL: https://issues.apache.org/jira/browse/SPARK-31099
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> When an existing Derby database (in ./metastore_db) created by the Hive 
> 1.2.x profile is present, upgrading it to the Hive 2.3.x profile fails.
> Repro steps:
> 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive 
> -Phive-thriftserver. Make sure there's no existing ./metastore_db directory 
> in the repo.
> 2. Run bin/spark-shell, and then spark.sql("show databases"). This will 
> populate the ./metastore_db directory, where the Derby-based metastore 
> database is hosted. This database is populated from Hive 1.2.x.
> 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops 
> the Hive 1.2 profile, which makes it use the default Hive 2.3 profile)
> 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby 
> database created in Step (2), which triggers an upgrade step, and that's 
> where the following error will be reported.
> 5. Delete the ./metastore_db and re-run Step (4). The error is no longer 
> reported.
> {code:java}
> 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS 
> ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN 
> ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has 
> been specified as NOT NULL and either the DEFAULT clause was not specified or 
> was specified as DEFAULT NULL.
> java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column 
> 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT 
> clause was not specified or was specified as DEFAULT NULL.
>   at 
> org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
>   at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown 
> Source)
>   at 
> org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown 
> Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
>   at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879)
>   at 
> org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830)
>   at 
> org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896)
>   at 
> org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672)
>   at 
> org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865)
>   at 
> org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347)
>   at org.datanucleus.store.query.Query.executeQuery(Query.java:1816)
>   at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744)
>   at org.datanucleus.store.query.Query.execute(Query.java:1726)
>   at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
>   at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:144)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303)
>   at 
> 

[jira] [Resolved] (SPARK-31929) local cache size exceeding "spark.history.store.maxDiskUsage" triggered "java.io.IOException" in history server on Windows

2020-06-16 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31929.
--
Fix Version/s: 3.1.0
 Assignee: zhli
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/28769

> local cache size exceeding "spark.history.store.maxDiskUsage" triggered 
> "java.io.IOException" in history server on Windows
> --
>
> Key: SPARK-31929
> URL: https://issues.apache.org/jira/browse/SPARK-31929
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
> Environment: System: Windows
> Config: 
> spark.history.retainedApplications 200
> spark.history.store.maxDiskUsage 2g
> spark.history.store.path d://cache_hs
>Reporter: zhli
>Assignee: zhli
>Priority: Minor
> Fix For: 3.1.0
>
>
> h2.  
> h2. HTTP ERROR 500
> Problem accessing /history/app-20190711215551-0001/stages/. Reason:
> Server Error
>  
> h3. Caused by:
> java.io.IOException: Unable to delete file: 
> d:\cache_hs\apps\app-20190711215551-0001.ldb\MANIFEST-07 at 
> org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2381) at 
> org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1679) at 
> org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1575) at 
> org.apache.spark.deploy.history.HistoryServerDiskManager.org$apache$spark$deploy$history$HistoryServerDiskManager$$deleteStore(HistoryServerDiskManager.scala:198)
>  at 
> org.apache.spark.deploy.history.HistoryServerDiskManager.$anonfun$release$1(HistoryServerDiskManager.scala:161)
>  at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.java:23) 
> at scala.Option.foreach(Option.scala:407) at 
> org.apache.spark.deploy.history.HistoryServerDiskManager.release(HistoryServerDiskManager.scala:156)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$loadDiskStore$1(FsHistoryProvider.scala:1163)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$loadDiskStore$1$adapted(FsHistoryProvider.scala:1157)
>  at scala.Option.foreach(Option.scala:407) at 
> org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1157)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:363)
>  at 
> org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:191)
>  at 
> org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163)
>  at 
> org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:135)
>  at 
> org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161)
>  at 
> org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:56)
>  at 
> org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:52)
>  at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:89)
>  at 
> org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:101)
>  at 
> org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:248)
>  at 
> org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:101)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at 
> javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at 
> org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) 
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
>  at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) 
> at 
> org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
>  at 
> org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>  at 
> org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>  at 
> 

[jira] [Updated] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Wing Yew Poon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wing Yew Poon updated SPARK-32003:
--
Affects Version/s: 3.0.0

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Wing Yew Poon
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, there is no NodeManager available to 
> serve shuffle files). By right, such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to task failure, the stage is re-attempted. Tasks continue to fail due to 
> fetch failure from the lost executor's shuffle output. This time, since the 
> failed epoch for the executor is higher, the executor is removed again (this 
> doesn't really do anything, the executor was already removed when it was 
> lost) and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 attempts by default. The customer was unlucky and two nodes went down 
> during the stage, i.e., the same problem happened twice. So they used up 4 
> stage attempts and the stage failed and thus the job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Wing Yew Poon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136792#comment-17136792
 ] 

Wing Yew Poon commented on SPARK-32003:
---

I will open a PR soon with a solution.

> Shuffle files for lost executor are not unregistered if fetch failure occurs 
> after executor is lost
> ---
>
> Key: SPARK-32003
> URL: https://issues.apache.org/jira/browse/SPARK-32003
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.4.6
>Reporter: Wing Yew Poon
>Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is 
> running. (They are running Spark on YARN with the external shuffle service 
> enabled.) An executor is lost (apparently the only one running on the node). 
> This executor lost event is handled in the DAGScheduler, which removes the 
> executor from its BlockManagerMaster. At this point, there is no 
> unregistering of shuffle files for the executor or the node. Soon after, 
> tasks trying to fetch shuffle files output by that executor fail with 
> FetchFailed (because the node is down, there is no NodeManager available to 
> serve shuffle files). By right, such fetch failures should cause the shuffle 
> files for the executor to be unregistered, but they do not.
> Due to task failure, the stage is re-attempted. Tasks continue to fail due to 
> fetch failure from the lost executor's shuffle output. This time, since the 
> failed epoch for the executor is higher, the executor is removed again (this 
> doesn't really do anything, the executor was already removed when it was 
> lost) and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output. We 
> get 4 attempts by default. The customer was unlucky and two nodes went down 
> during the stage, i.e., the same problem happened twice. So they used up 4 
> stage attempts and the stage failed and thus the job. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31996) Specify the version of ChromeDriver and RemoteWebDriver which can work with guava 14.0.1

2020-06-16 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta reassigned SPARK-31996:
--

Assignee: (was: Kousuke Saruta)

> Specify the version of ChromeDriver and RemoteWebDriver which can work with 
> guava 14.0.1
> 
>
> Key: SPARK-31996
> URL: https://issues.apache.org/jira/browse/SPARK-31996
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> SPARK-31765 upgraded HtmlUnit due to a security reason and Selenium was also 
> needed to be upgraded to work with the upgraded HtmlUnit.
> After upgrading Selenium, ChromeDriver and RemoteWebDriver are implicitly 
> upgraded because of the dependency, and the implicitly upgraded modules can't 
> work with guava 14.0.1 due to an API incompatibility, so we need to run 
> ChromeUISeleniumSuite with a guava version specified like 
> -Dguava.version=25.0-jre.
> {code:java}
> $ build/sbt -Dguava.version=25.0-jre 
> -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver 
> -Dtest.default.exclude.tags= "testOnly 
> org.apache.spark.ui.UISeleniumSuite"{code}
> It's a little bit inconvenient, so let's use an older version which can work 
> with guava 14.0.1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements

2020-06-16 Thread Sonia Dalwani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136773#comment-17136773
 ] 

Sonia Dalwani edited comment on SPARK-26833 at 6/16/20, 4:08 PM:
-

Documentation can be changed like the following blog

[http://blog.brainlounge.de/memoryleaks/getting-started-with-spark-on-kubernetes/]

It gives better explanation for beginners with respect to user submission 
credentials. 


was (Author: sdalwani):
Documentation cab be changed like the following blog

[http://blog.brainlounge.de/memoryleaks/getting-started-with-spark-on-kubernetes/]

It gives better explanation for beginners with respect to user submission 
credentials. 

> Kubernetes RBAC documentation is unclear on exact RBAC requirements
> ---
>
> Key: SPARK-26833
> URL: https://issues.apache.org/jira/browse/SPARK-26833
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Rob Vesse
>Priority: Major
>
> I've seen a couple of users get bitten by this in informal discussions on 
> GitHub and Slack.  Basically the user sets up the service account and 
> configures Spark to use it as described in the documentation but then when 
> they try and run a job they encounter an error like the following:
> {quote}2019-02-05 20:29:02 WARN  WatchConnectionManager:185 - Exec Failure: 
> HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: 
> User "system:anonymous" cannot watch pods in the namespace "default"
> java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: pods 
> "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot 
> watch pods in the namespace "default"{quote}
> This error stems from the fact that the configured service account is only 
> used by the driver pod and not by the submission client.  The submission 
> client wants to do driver pod monitoring, which it does with the user's 
> submission credentials *NOT* the service account as the user might expect.
> It seems like there are two ways to resolve this issue:
> * Improve the documentation to clarify the current situation
> * Ensure that if a service account is configured we always use it even on the 
> submission client
> The former is the easy fix, the latter is more invasive and may have other 
> knock on effects so we should start with the former and discuss the 
> feasibility of the latter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements

2020-06-16 Thread Sonia Dalwani (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136773#comment-17136773
 ] 

Sonia Dalwani commented on SPARK-26833:
---

Documentation can be changed like the following blog

[http://blog.brainlounge.de/memoryleaks/getting-started-with-spark-on-kubernetes/]

It gives a better explanation for beginners with respect to user submission 
credentials. 

> Kubernetes RBAC documentation is unclear on exact RBAC requirements
> ---
>
> Key: SPARK-26833
> URL: https://issues.apache.org/jira/browse/SPARK-26833
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Rob Vesse
>Priority: Major
>
> I've seen a couple of users get bitten by this in informal discussions on 
> GitHub and Slack.  Basically the user sets up the service account and 
> configures Spark to use it as described in the documentation but then when 
> they try and run a job they encounter an error like the following:
> {quote}2019-02-05 20:29:02 WARN  WatchConnectionManager:185 - Exec Failure: 
> HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: 
> User "system:anonymous" cannot watch pods in the namespace "default"
> java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden'
> Exception in thread "main" 
> io.fabric8.kubernetes.client.KubernetesClientException: pods 
> "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot 
> watch pods in the namespace "default"{quote}
> This error stems from the fact that the configured service account is only 
> used by the driver pod and not by the submission client.  The submission 
> client wants to do driver pod monitoring, which it does with the user's 
> submission credentials *NOT* the service account as the user might expect.
> It seems like there are two ways to resolve this issue:
> * Improve the documentation to clarify the current situation
> * Ensure that if a service account is configured we always use it even on the 
> submission client
> The former is the easy fix, the latter is more invasive and may have other 
> knock on effects so we should start with the former and discuss the 
> feasibility of the latter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost

2020-06-16 Thread Wing Yew Poon (Jira)
Wing Yew Poon created SPARK-32003:
-

 Summary: Shuffle files for lost executor are not unregistered if 
fetch failure occurs after executor is lost
 Key: SPARK-32003
 URL: https://issues.apache.org/jira/browse/SPARK-32003
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 2.4.6
Reporter: Wing Yew Poon


A customer's cluster has a node that goes down while a Spark application is 
running. (They are running Spark on YARN with the external shuffle service 
enabled.) An executor is lost (apparently the only one running on the node). 
This executor lost event is handled in the DAGScheduler, which removes the 
executor from its BlockManagerMaster. At this point, there is no unregistering 
of shuffle files for the executor or the node. Soon after, tasks trying to 
fetch shuffle files output by that executor fail with FetchFailed (because the 
node is down, there is no NodeManager available to serve shuffle files). By 
right, such fetch failures should cause the shuffle files for the executor to 
be unregistered, but they do not.

Due to task failure, the stage is re-attempted. Tasks continue to fail due to 
fetch failure from the lost executor's shuffle output. This time, since the 
failed epoch for the executor is higher, the executor is removed again (this 
doesn't really do anything, the executor was already removed when it was lost) 
and this time the shuffle output is unregistered.

So it takes two stage attempts instead of one to clear the shuffle output. We 
get 4 attempts by default. The customer was unlucky and two nodes went down 
during the stage, i.e., the same problem happened twice. So they used up 4 
stage attempts and the stage failed and thus the job. 
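
The four attempts mentioned above come from the spark.stage.maxConsecutiveAttempts limit (4 by default). As a stop-gap only, and not a fix for the missing unregistration, that limit can be raised; a minimal sketch:
{code:scala}
import org.apache.spark.sql.SparkSession

// Stop-gap, not a fix: give the stage more retry headroom so a second node
// loss during the same stage does not abort the whole job. Master/deploy
// settings are assumed to come from spark-submit as usual.
val spark = SparkSession.builder()
  .appName("shuffle-fetch-retry-headroom")
  .config("spark.stage.maxConsecutiveAttempts", "8")
  .getOrCreate()
{code}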



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30876) Optimizer cannot infer from inferred constraints with join

2020-06-16 Thread Navin Viswanath (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136753#comment-17136753
 ] 

Navin Viswanath commented on SPARK-30876:
-

Hi [~yumwang] I'd be happy to take a look at this if that's ok.

> Optimizer cannot infer from inferred constraints with join
> --
>
> Key: SPARK-30876
> URL: https://issues.apache.org/jira/browse/SPARK-30876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table t1(a int, b int, c int);
> create table t2(a int, b int, c int);
> create table t3(a int, b int, c int);
> select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and 
> t3.c = 1);
> {code}
> Spark 2.3+:
> {noformat}
> == Physical Plan ==
> *(4) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, true, [id=#102]
>+- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *(3) Project
>  +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
> :- *(3) Project [b#10]
> :  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
> : :- *(3) Project [a#6]
> : :  +- *(3) Filter isnotnull(a#6)
> : : +- *(3) ColumnarToRow
> : :+- FileScan parquet default.t1[a#6] Batched: true, 
> DataFilters: [isnotnull(a#6)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: 
> struct
> : +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#87]
> :+- *(1) Project [b#10]
> :   +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1))
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t2[b#10] Batched: 
> true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], 
> ReadSchema: struct
> +- BroadcastExchange 
> HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), 
> [id=#96]
>+- *(2) Project [c#14]
>   +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1))
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t3[c#14] Batched: true, 
> DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous...,
>  PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], 
> ReadSchema: struct
> Time taken: 3.785 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2.x:
> {noformat}
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>+- *HashAggregate(keys=[], functions=[partial_count(1)])
>   +- *Project
>  +- *SortMergeJoin [b#19], [c#23], Inner
> :- *Project [b#19]
> :  +- *SortMergeJoin [a#15], [b#19], Inner
> : :- *Sort [a#15 ASC NULLS FIRST], false, 0
> : :  +- Exchange hashpartitioning(a#15, 200)
> : : +- *Filter (isnotnull(a#15) && (a#15 = 1))
> : :+- HiveTableScan [a#15], HiveTableRelation 
> `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, 
> b#16, c#17]
> : +- *Sort [b#19 ASC NULLS FIRST], false, 0
> :+- Exchange hashpartitioning(b#19, 200)
> :   +- *Filter (isnotnull(b#19) && (b#19 = 1))
> :  +- HiveTableScan [b#19], HiveTableRelation 
> `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, 
> b#19, c#20]
> +- *Sort [c#23 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c#23, 200)
>   +- *Filter (isnotnull(c#23) && (c#23 = 1))
>  +- HiveTableScan [c#23], HiveTableRelation 
> `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, 
> b#22, c#23]
> Time taken: 0.728 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31710) Fail casting numeric to timestamp by default

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136674#comment-17136674
 ] 

Apache Spark commented on SPARK-31710:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28843

> Fail casting numeric to timestamp by default
> 
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hdp:2.7.7
> spark:2.4.5
>Reporter: philipse
>Assignee: philipse
>Priority: Major
> Fix For: 3.1.0
>
>
> Hi Team
> Steps to reproduce.
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) 
> from test;
> {code}
> let's check the result. 
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> --the result is wrong
> Case 2:
> *select 234,cast(id as TIMESTAMP) from test;*
>  
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>  at java.sql.Timestamp.valueOf(Timestamp.java:237)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
>  at org.apache.hive.beeline.Rows$Row.(Rows.java:166)
>  at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43)
>  at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
>  at org.apache.hive.beeline.Commands.execute(Commands.java:826)
>  at org.apache.hive.beeline.Commands.sql(Commands.java:670)
>  at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
>  at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
>  at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
>  at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
>  at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
>  Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0)
>  
> I tried Hive; it works well, and the conversion is correct:
> {code:java}
> select 234,cast(id as TIMESTAMP) from test;
>  234   2020-04-08 11:56:28
> {code}
> Two questions:
> q1:
> if we forbid this conversion, should we keep all cases consistent?
> q2:
> if we allow the conversion in some cases, should we check the magnitude of 
> the long value? The code below always multiplies by 1,000,000 to convert 
> seconds to microseconds, no matter how large the value is; if a value of the 
> wrong magnitude is converted to a timestamp, we could raise an error instead.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>  
> Thanks!
>  
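A quick arithmetic sketch of why case 1 lands in the year 52238 (plain Scala, using only the value from the report above; not Spark internals): 1586318188000 is an epoch instant in milliseconds, but the numeric-to-timestamp cast reads it as seconds.

{code:java}
// 1586318188000 ms corresponds to 2020-04-08, but read as *seconds* the
// instant drifts roughly 50,000 years into the future.
val value = 1586318188000L
val yearsIfSeconds = value / (365.2425 * 24 * 3600)   // value interpreted as seconds
println((1970 + yearsIfSeconds).toInt)                // ~52238, matching the wrong output
{code}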



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31710) Fail casting numeric to timestamp by default

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136672#comment-17136672
 ] 

Apache Spark commented on SPARK-31710:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/28843

> Fail casting numeric to timestamp by default
> 
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hdp:2.7.7
> spark:2.4.5
>Reporter: philipse
>Assignee: philipse
>Priority: Major
> Fix For: 3.1.0
>
>
> Hi Team
> Steps to reproduce.
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) 
> from test;
> {code}
> let's check the result. 
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> --the result is wrong
> Case 2:
> *select 234,cast(id as TIMESTAMP) from test;*
>  
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>  at java.sql.Timestamp.valueOf(Timestamp.java:237)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
>  at org.apache.hive.beeline.Rows$Row.(Rows.java:166)
>  at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43)
>  at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
>  at org.apache.hive.beeline.Commands.execute(Commands.java:826)
>  at org.apache.hive.beeline.Commands.sql(Commands.java:670)
>  at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
>  at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
>  at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
>  at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
>  at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
>  Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0)
>  
> I tried Hive; it works well, and the conversion is correct:
> {code:java}
> select 234,cast(id as TIMESTAMP) from test;
>  234   2020-04-08 11:56:28
> {code}
> Two questions:
> q1:
> if we forbid this conversion, should we keep all cases consistent?
> q2:
> if we allow the conversion in some cases, should we check the magnitude of 
> the long value? The code below always multiplies by 1,000,000 to convert 
> seconds to microseconds, no matter how large the value is; if a value of the 
> wrong magnitude is converted to a timestamp, we could raise an error instead.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>  
> Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32002) spark error while select nest data

2020-06-16 Thread Yiqun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Zhang updated SPARK-32002:

Description: 
nest-data.json
{code:java}
{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
{code:java}
val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

{code}
{color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 
'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
integral type, however, ''c'' is of string type.; line 1 pos 7;{color}
 {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color}
 {color:#ff}+- SubqueryAlias `nest_table`{color}
 {color:#ff} +- Relation[a#6|#6] json{color}

Analysing the cause: the extractor for c is chosen by matching the data type of 
the a.b expression, but a.b is itself extracted with GetArrayStructFields, so its 
type is ArrayType(ArrayType(StructType(...))), which falls into the generic 
ArrayType case and resolves to GetArrayItem, treating the extraction ("c") as an 
ordinal.

org.apache.spark.sql.catalyst.expressions.ExtractValue
{code:java}
def apply(
  child: Expression,
  extraction: Expression,
  resolver: Resolver): Expression = {
   (child.dataType, extraction) match {
  case (StructType(fields), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetStructField(child, ordinal, Some(fieldName))  
  case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, 
StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
  ordinal, fields.length, containsNull)  
  case (_: ArrayType, _) => GetArrayItem(child, extraction)  
  case (MapType(kt, _, _), _) => GetMapValue(child, extraction)  
  case (otherType, _) =>
val errorMsg = otherType match {
  case StructType(_) =>
s"Field name should be String Literal, but it's $extraction"
  case other =>
s"Can't extract value from $child: need struct type but got 
${other.catalogString}"
}
throw new AnalysisException(errorMsg)
}
  }{code}
 

  was:
nest-data.json
{code:java}
{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
{code:java}
val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

{code}
{color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 
'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
integral type, however, ''c'' is of string type.; line 1 pos 7;{color}
 {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color}
 {color:#ff}+- SubqueryAlias `nest_table`{color}
 {color:#ff} +- Relation[a#6|#6] json{color}

Analysing the cause: the extractor for c is chosen by matching the data type of 
the a.b expression, but a.b is itself extracted with GetArrayStructFields, so its 
type is ArrayType(ArrayType(StructType(...))), which falls into the generic 
ArrayType case and resolves to GetArrayItem, treating the extraction ("c") as an 
ordinal.
{code:java}
def apply(
  child: Expression,
  extraction: Expression,
  resolver: Resolver): Expression = {
   (child.dataType, extraction) match {
  case (StructType(fields), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetStructField(child, ordinal, Some(fieldName))  
  case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, 
StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
  ordinal, fields.length, containsNull)  
  case (_: ArrayType, _) => GetArrayItem(child, extraction)  
  case (MapType(kt, _, _), _) => GetMapValue(child, extraction)  
  case (otherType, _) =>
val errorMsg = otherType match {
  case StructType(_) =>
s"Field name should be String Literal, but it's $extraction"
  case other =>
s"Can't extract value from $child: need struct type but got 
${other.catalogString}"
}
throw new AnalysisException(errorMsg)
}
  }{code}
 


> spark error while select nest data
> --
>
> Key: SPARK-32002
> URL: https://issues.apache.org/jira/browse/SPARK-32002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Yiqun Zhang
>Priority: Minor
>
> nest-data.json
> {code:java}
> {"a": [{"b": [{"c": [1,2]}]}]}
> {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
> {code:java}
> val df: DataFrame = 

[jira] [Updated] (SPARK-32002) spark error while select nest data

2020-06-16 Thread Yiqun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Zhang updated SPARK-32002:

Description: 
nest-data.json
{code:java}
{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
{code:java}
val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

{code}
{color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 
'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
integral type, however, ''c'' is of string type.; line 1 pos 7;{color}
 {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color}
 {color:#ff}+- SubqueryAlias `nest_table`{color}
 {color:#ff} +- Relation[a#6|#6] json{color}

Analysing the cause: the extractor for c is chosen by matching the data type of 
the a.b expression, but a.b is itself extracted with GetArrayStructFields, so its 
type is ArrayType(ArrayType(StructType(...))), which falls into the generic 
ArrayType case and resolves to GetArrayItem, treating the extraction ("c") as an 
ordinal.
{code:java}
def apply(
  child: Expression,
  extraction: Expression,
  resolver: Resolver): Expression = {
   (child.dataType, extraction) match {
  case (StructType(fields), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetStructField(child, ordinal, Some(fieldName))  
  case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, 
StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
  ordinal, fields.length, containsNull)  
  case (_: ArrayType, _) => GetArrayItem(child, extraction)  
  case (MapType(kt, _, _), _) => GetMapValue(child, extraction)  
  case (otherType, _) =>
val errorMsg = otherType match {
  case StructType(_) =>
s"Field name should be String Literal, but it's $extraction"
  case other =>
s"Can't extract value from $child: need struct type but got 
${other.catalogString}"
}
throw new AnalysisException(errorMsg)
}
  }{code}
 

  was:
nest-data.json
{code:java}
{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
{code:java}
val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

{code}
{color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 
'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
integral type, however, ''c'' is of string type.; line 1 pos 7;{color}
 {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color}
 {color:#ff}+- SubqueryAlias `nest_table`{color}
 {color:#ff} +- Relation[a#6|#6] json{color}

Analysing the cause: the extractor for c is chosen by matching the data type of 
the a.b expression, but a.b is itself extracted with GetArrayStructFields, so its 
type is ArrayType(ArrayType(StructType(...))), which falls into the generic 
ArrayType case and resolves to GetArrayItem, treating the extraction ("c") as an 
ordinal.
{code:java}
def apply(
  child: Expression,
  extraction: Expression,
  resolver: Resolver): Expression = {(child.dataType, extraction) match 
{
  case (StructType(fields), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetStructField(child, ordinal, Some(fieldName))  
  case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, 
StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
  ordinal, fields.length, containsNull)  
  case (_: ArrayType, _) => GetArrayItem(child, extraction)  
  case (MapType(kt, _, _), _) => GetMapValue(child, extraction)  
  case (otherType, _) =>
val errorMsg = otherType match {
  case StructType(_) =>
s"Field name should be String Literal, but it's $extraction"
  case other =>
s"Can't extract value from $child: need struct type but got 
${other.catalogString}"
}
throw new AnalysisException(errorMsg)
}
  }{code}
 


> spark error while select nest data
> --
>
> Key: SPARK-32002
> URL: https://issues.apache.org/jira/browse/SPARK-32002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Yiqun Zhang
>Priority: Minor
>
> nest-data.json
> {code:java}
> {"a": [{"b": [{"c": [1,2]}]}]}
> {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
> {code:java}
> val df: DataFrame = spark.read.json(testFile("nest-data.json"))
> df.createTempView("nest_table")
> sql("select a.b.c 

[jira] [Updated] (SPARK-32002) spark error while select nest data

2020-06-16 Thread Yiqun Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yiqun Zhang updated SPARK-32002:

Description: 
nest-data.json
{code:java}
{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
{code:java}
val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

{code}
{color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 
'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
integral type, however, ''c'' is of string type.; line 1 pos 7;{color}
 {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color}
 {color:#ff}+- SubqueryAlias `nest_table`{color}
 {color:#ff} +- Relation[a#6|#6] json{color}

Analysing the cause: the extractor for c is chosen by matching the data type of 
the a.b expression, but a.b is itself extracted with GetArrayStructFields, so its 
type is ArrayType(ArrayType(StructType(...))), which falls into the generic 
ArrayType case and resolves to GetArrayItem, treating the extraction ("c") as an 
ordinal.
{code:java}
def apply(
  child: Expression,
  extraction: Expression,
  resolver: Resolver): Expression = {(child.dataType, extraction) match 
{
  case (StructType(fields), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetStructField(child, ordinal, Some(fieldName))  
  case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, 
StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
  ordinal, fields.length, containsNull)  
  case (_: ArrayType, _) => GetArrayItem(child, extraction)  
  case (MapType(kt, _, _), _) => GetMapValue(child, extraction)  
  case (otherType, _) =>
val errorMsg = otherType match {
  case StructType(_) =>
s"Field name should be String Literal, but it's $extraction"
  case other =>
s"Can't extract value from $child: need struct type but got 
${other.catalogString}"
}
throw new AnalysisException(errorMsg)
}
  }{code}
 

  was:
nest-data.json
{code:java}
{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
{code:java}
val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

{code}
{color:#FF}org.apache.spark.sql.AnalysisException: cannot resolve 
'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
integral type, however, ''c'' is of string type.; line 1 pos 7;{color}
{color:#FF}'Project [a#6.b[c] AS c#8]{color}
{color:#FF}+- SubqueryAlias `nest_table`{color}
{color:#FF} +- Relation[a#6] json{color}

Analysing the cause: the extractor for c is chosen by matching the data type of 
the a.b expression, but a.b is itself extracted with GetArrayStructFields, so its 
type is ArrayType(ArrayType(StructType(...))), which falls into the generic 
ArrayType case and resolves to GetArrayItem, treating the extraction ("c") as an 
ordinal.
{code:java}
def apply(
  child: Expression,
  extraction: Expression,
  resolver: Resolver): Expression = {(child.dataType, extraction) match 
{
  case (StructType(fields), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetStructField(child, ordinal, Some(fieldName))  case 
(ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
  ordinal, fields.length, containsNull)  case (_: ArrayType, _) => 
GetArrayItem(child, extraction)  case (MapType(kt, _, _), _) => 
GetMapValue(child, extraction)  case (otherType, _) =>
val errorMsg = otherType match {
  case StructType(_) =>
s"Field name should be String Literal, but it's $extraction"
  case other =>
s"Can't extract value from $child: need struct type but got 
${other.catalogString}"
}
throw new AnalysisException(errorMsg)
}
  }{code}
 


> spark error while select nest data
> --
>
> Key: SPARK-32002
> URL: https://issues.apache.org/jira/browse/SPARK-32002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Yiqun Zhang
>Priority: Minor
>
> nest-data.json
> {code:java}
> {"a": [{"b": [{"c": [1,2]}]}]}
> {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
> {code:java}
> val df: DataFrame = spark.read.json(testFile("nest-data.json"))
> df.createTempView("nest_table")
> sql("select a.b.c from nest_table").show()
> {code}
> 

[jira] [Created] (SPARK-32002) spark error while select nest data

2020-06-16 Thread Yiqun Zhang (Jira)
Yiqun Zhang created SPARK-32002:
---

 Summary: spark error while select nest data
 Key: SPARK-32002
 URL: https://issues.apache.org/jira/browse/SPARK-32002
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Yiqun Zhang


nest-data.json
{code:java}
{"a": [{"b": [{"c": [1,2]}]}]}
{"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code}
{code:java}
val df: DataFrame = spark.read.json(testFile("nest-data.json"))
df.createTempView("nest_table")
sql("select a.b.c from nest_table").show()

{code}
{color:#FF}org.apache.spark.sql.AnalysisException: cannot resolve 
'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires 
integral type, however, ''c'' is of string type.; line 1 pos 7;{color}
{color:#FF}'Project [a#6.b[c] AS c#8]{color}
{color:#FF}+- SubqueryAlias `nest_table`{color}
{color:#FF} +- Relation[a#6] json{color}

Analysing the cause: the extractor for c is chosen by matching the data type of 
the a.b expression, but a.b is itself extracted with GetArrayStructFields, so its 
type is ArrayType(ArrayType(StructType(...))), which falls into the generic 
ArrayType case and resolves to GetArrayItem, treating the extraction ("c") as an 
ordinal.
{code:java}
def apply(
  child: Expression,
  extraction: Expression,
  resolver: Resolver): Expression = {(child.dataType, extraction) match 
{
  case (StructType(fields), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetStructField(child, ordinal, Some(fieldName))  case 
(ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) =>
val fieldName = v.toString
val ordinal = findField(fields, fieldName, resolver)
GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
  ordinal, fields.length, containsNull)  case (_: ArrayType, _) => 
GetArrayItem(child, extraction)  case (MapType(kt, _, _), _) => 
GetMapValue(child, extraction)  case (otherType, _) =>
val errorMsg = otherType match {
  case StructType(_) =>
s"Field name should be String Literal, but it's $extraction"
  case other =>
s"Can't extract value from $child: need struct type but got 
${other.catalogString}"
}
throw new AnalysisException(errorMsg)
}
  }{code}
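
A hedged workaround sketch while resolution behaves as described (assumes Spark 2.4+ where flatten is available, and the df defined above): flattening a.b to a single array of structs first keeps the field extraction out of the GetArrayItem branch.

{code:java}
import org.apache.spark.sql.functions._

// array<array<struct<c:...>>> -> array<struct<c:...>> -> extract c per struct.
// Column names come from the nest-data.json example above.
df.select(flatten(col("a.b")).getField("c").alias("c")).show(false)
{code}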
 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31984) Make micros rebasing functions via local timestamps pure

2020-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31984:
---

Assignee: Maxim Gekk

> Make micros rebasing functions via local timestamps pure
> 
>
> Key: SPARK-31984
> URL: https://issues.apache.org/jira/browse/SPARK-31984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The functions rebaseGregorianToJulianMicros(zoneId: ZoneId, ...) and 
> rebaseJulianToGregorianMicros(zoneId: ZoneId, ...) accept the zone id as the 
> first parameter but use it only while forming ZonedDateTime and ignore it in 
> the Java 7 GregorianCalendar. The Calendar instance uses the default JVM time 
> zone internally. This causes the following problems:
> # The functions depend on global state: calling the functions from different 
> threads can return wrong results if the default JVM time zone is changed 
> during the execution.
> # It is impossible to speed up generation of JSON files with diff/switch via 
> parallelisation.
> # The functions don't fully use the passed ZoneId.
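
A minimal sketch of the direction the description implies, not the actual patch (the helper name is an assumption): build the Java 7 calendar from the caller-supplied ZoneId so the rebasing no longer reads the JVM default time zone.

{code:java}
import java.time.ZoneId
import java.util.{GregorianCalendar, TimeZone}

// Bind the calendar to the passed zone instead of the JVM default, keeping the
// rebase function pure with respect to global state.
def calendarIn(zoneId: ZoneId): GregorianCalendar = {
  val cal = new GregorianCalendar(TimeZone.getTimeZone(zoneId))
  cal.clear()  // start from a clean calendar before setting the rebased fields
  cal
}
{code}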



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31984) Make micros rebasing functions via local timestamps pure

2020-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31984.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28824
[https://github.com/apache/spark/pull/28824]

> Make micros rebasing functions via local timestamps pure
> 
>
> Key: SPARK-31984
> URL: https://issues.apache.org/jira/browse/SPARK-31984
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.1.0
>
>
> The functions rebaseGregorianToJulianMicros(zoneId: ZoneId, ...) and 
> rebaseJulianToGregorianMicros(zoneId: ZoneId, ...) accept the zone id as the 
> first parameter but use it only while forming ZonedDateTime and ignore it in 
> the Java 7 GregorianCalendar. The Calendar instance uses the default JVM time 
> zone internally. This causes the following problems:
> # The functions depend on global state: calling the functions from different 
> threads can return wrong results if the default JVM time zone is changed 
> during the execution.
> # It is impossible to speed up generation of JSON files with diff/switch via 
> parallelisation.
> # The functions don't fully use the passed ZoneId.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31981) Keep TimestampType when taking an average of a Timestamp

2020-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31981:

Fix Version/s: (was: 3.1.0)

> Keep TimestampType when taking an average of a Timestamp
> 
>
> Key: SPARK-31981
> URL: https://issues.apache.org/jira/browse/SPARK-31981
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Fokko Driesprong
>Priority: Major
>
> Currently, when you take an average of a Timestamp, you'll end up with a 
> Double, representing the seconds since epoch. This is because of old Hive 
> behavior. I strongly believe that it is better to return a Timestamp.
> root@8c4241b617ec:/# psql postgres postgres
> psql (12.3 (Debian 12.3-1.pgdg100+1))
> Type "help" for help.
> postgres=# CREATE TABLE timestamp_demo (ts TIMESTAMP);
> CREATE TABLE
> postgres=# INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
> INSERT 0 1
> postgres=# INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
> INSERT 0 1
> postgres=# INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
> INSERT 0 1
> postgres=# SELECT AVG(ts) FROM timestamp_demo;
> ERROR: function avg(timestamp without time zone) does not exist
> LINE 1: SELECT AVG(ts) FROM timestamp_demo;
>  
> root@bab43a5731e8:/# mysql
> Welcome to the MySQL monitor. Commands end with ; or \g.
> Your MySQL connection id is 9
> Server version: 8.0.20 MySQL Community Server - GPL
> Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.
> Oracle is a registered trademark of Oracle Corporation and/or its
> affiliates. Other names may be trademarks of their respective
> owners.
> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
> mysql> CREATE TABLE timestamp_demo (ts TIMESTAMP);
> Query OK, 0 rows affected (0.05 sec)
> mysql> INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11');
> Query OK, 1 row affected (0.01 sec)
> mysql> INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11');
> Query OK, 1 row affected (0.01 sec)
> mysql> INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11');
> Query OK, 1 row affected (0.01 sec)
> mysql> SELECT AVG(ts) FROM timestamp_demo;
> +-+
> | AVG(ts) |
> +-+
> | 20180101182211. |
> +-+
> 1 row in set (0.00 sec)
>  
>  
>  
>  
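
A hedged workaround under the current behavior (table name taken from the example above; this illustrates today's semantics, not the proposed change): since AVG over a TIMESTAMP yields a DOUBLE of epoch seconds, casting the result back restores a TIMESTAMP.

{code:java}
// CAST(ts AS DOUBLE) is epoch seconds and casting a DOUBLE back to TIMESTAMP
// interprets it as seconds, so the average round-trips to a timestamp.
spark.sql(
  "SELECT CAST(AVG(CAST(ts AS DOUBLE)) AS TIMESTAMP) AS avg_ts FROM timestamp_demo"
).show(false)
{code}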



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-16 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-32001:
--
Description: JDBC connection providers must be loaded independently just 
like delegation token providers.

> Create Kerberos authentication provider API in JDBC connector
> -
>
> Key: SPARK-32001
> URL: https://issues.apache.org/jira/browse/SPARK-32001
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>
> JDBC connection providers must be loaded independently just like delegation 
> token providers.
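
A hypothetical sketch of what an independently loaded provider API could look like (trait, object and method names are illustrative, not the actual Spark API); implementations would be discovered via ServiceLoader, the same mechanism used for delegation token providers.

{code:java}
import java.sql.{Connection, Driver}
import java.util.ServiceLoader
import scala.collection.JavaConverters._

// Illustrative contract: a provider declares whether it can handle a given
// driver/options pair (e.g. a Kerberos-secured database) and opens the
// connection itself when it can.
trait JdbcConnectionProvider {
  def canHandle(driver: Driver, options: Map[String, String]): Boolean
  def getConnection(driver: Driver, options: Map[String, String]): Connection
}

object JdbcConnectionProviders {
  // Load every implementation declared under META-INF/services, just like the
  // delegation token providers mentioned above.
  def load(): Seq[JdbcConnectionProvider] =
    ServiceLoader.load(classOf[JdbcConnectionProvider]).asScala.toSeq
}
{code}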



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector

2020-06-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136579#comment-17136579
 ] 

Gabor Somogyi commented on SPARK-31857:
---

My plan is to support external providers which will be implemented here: 
https://issues.apache.org/jira/browse/SPARK-32001


> Support Azure SQLDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31857
> URL: https://issues.apache.org/jira/browse/SPARK-31857
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector

2020-06-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136580#comment-17136580
 ] 

Gabor Somogyi commented on SPARK-31815:
---

My plan is to support external providers which will be implemented here: 
https://issues.apache.org/jira/browse/SPARK-32001


> Support Hive Kerberos login in JDBC connector
> -
>
> Key: SPARK-31815
> URL: https://issues.apache.org/jira/browse/SPARK-31815
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector

2020-06-16 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136578#comment-17136578
 ] 

Gabor Somogyi commented on SPARK-31884:
---

My plan is to support external providers which will be implemented here: 
https://issues.apache.org/jira/browse/SPARK-32001


> Support MongoDB Kerberos login in JDBC connector
> 
>
> Key: SPARK-31884
> URL: https://issues.apache.org/jira/browse/SPARK-31884
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jobit mathew
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector

2020-06-16 Thread Gabor Somogyi (Jira)
Gabor Somogyi created SPARK-32001:
-

 Summary: Create Kerberos authentication provider API in JDBC 
connector
 Key: SPARK-32001
 URL: https://issues.apache.org/jira/browse/SPARK-32001
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Gabor Somogyi






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-16 Thread Lin Gang Deng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136550#comment-17136550
 ] 

Lin Gang Deng commented on SPARK-31955:
---

[~dengzh], the version of Beeline in Spark 2.4 and earlier is 1.2.1; Spark 3.0 
upgrades Beeline to 2.3.7.

For Spark, the issue was fixed in 3.0.

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
> Fix For: 3.0.0
>
>
> I submitted a SQL file via Beeline and the returned result is wrong. After 
> many tests, it was found that the SQL executed by Spark discards the last 
> line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-16 Thread Zhihua Deng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136535#comment-17136535
 ] 

Zhihua Deng commented on SPARK-31955:
-

The issue seems to have been fixed in 
[HIVE-10541|https://issues.apache.org/jira/browse/HIVE-10541].

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
> Fix For: 3.0.0
>
>
> I submitted a SQL file via Beeline and the returned result is wrong. After 
> many tests, it was found that the SQL executed by Spark discards the last 
> line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-16 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136527#comment-17136527
 ] 

Yuming Wang commented on SPARK-31955:
-

The issue is fixed by upgrading Beeline to 2.3.7. How to reproduce this issue:

*Prepare table*
{code:sql}
create table test_beeline using parquet as select id from range(5);
{code}

*Prepare SQL*
{code:sql}
echo -en "select * from test_beeline\n where id=2;" >> test.sql
{code}

*Spark 2.4*:
{noformat}
[root@spark-3267648 spark-2.4.4-bin-hadoop2.7]# bin/beeline -u 
"jdbc:hive2://localhost:1" -f /root/spark-3.0.0-bin-hadoop3.2/test.sql
Connecting to jdbc:hive2://localhost:1
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connected to: Spark SQL (version 3.0.0)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:1> select * from test_beeline
0: jdbc:hive2://localhost:1>  where id=2;+-+--+
| id  |
+-+--+
| 0   |
| 2   |
| 1   |
| 3   |
| 4   |
+-+--+
5 rows selected (5.622 seconds)
0: jdbc:hive2://localhost:1>  where id=2;
Closing: 0: jdbc:hive2://localhost:1
{noformat}

*Spark 3.0*:
{noformat}
[root@spark-3267648 spark-3.0.0-bin-hadoop3.2]# bin/beeline -u 
"jdbc:hive2://localhost:1" -f test.sql
log4j:WARN No appenders could be found for logger 
(org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
info.
Connecting to jdbc:hive2://localhost:1
Connected to: Spark SQL (version 3.0.0)
Driver: Hive JDBC (version 2.3.7)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:1> select * from test_beeline
. . . . . . . . . . . . . . . .>  where id=2;
+-+
| id  |
+-+
| 2   |
+-+
1 row selected (7.749 seconds)
0: jdbc:hive2://localhost:1> Closing: 0: jdbc:hive2://localhost:1
{noformat}





> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
> Fix For: 3.0.0
>
>
> I submitted a SQL file via Beeline and the returned result is wrong. After 
> many tests, it was found that the SQL executed by Spark discards the last 
> line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reopened SPARK-31955:
-

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
> Fix For: 3.0.0
>
>
> I submitted a SQL file via Beeline and the returned result is wrong. After 
> many tests, it was found that the SQL executed by Spark discards the last 
> line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-31955.
-
Resolution: Fixed

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
> Fix For: 3.0.0
>
>
> I submitted a SQL file via Beeline and the returned result is wrong. After 
> many tests, it was found that the SQL executed by Spark discards the last 
> line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline

2020-06-16 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31955:

Fix Version/s: 3.0.0

> Beeline discard the last line of the sql file when submited to  thriftserver 
> via beeline
> 
>
> Key: SPARK-31955
> URL: https://issues.apache.org/jira/browse/SPARK-31955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4
>Reporter: Lin Gang Deng
>Priority: Major
> Fix For: 3.0.0
>
>
> I submitted a SQL file via Beeline and the returned result is wrong. After 
> many tests, it was found that the SQL executed by Spark discards the last 
> line. This appears to be a Beeline bug in parsing the SQL file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31997) Should drop test_udtf table when SingleSessionSuite completed

2020-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31997:


Assignee: Yang Jie

> Should drop test_udtf table when SingleSessionSuite completed
> -
>
> Key: SPARK-31997
> URL: https://issues.apache.org/jira/browse/SPARK-31997
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> If we execute mvn test SingleSessionSuite and HiveThriftBinaryServerSuite in 
> order, the test case "SPARK-11595 ADD JAR with input path having URL scheme" 
> in HiveThriftBinaryServerSuite will fail as follows:
>  
> {code:java}
> - SPARK-11595 ADD JAR with input path having URL scheme *** FAILED *** 
> java.sql.SQLException: Error running query: 
> org.apache.spark.sql.AnalysisException: Can not create the managed 
> table('`default`.`test_udtf`'). The associated 
> location('file:/home/yarn/spark_ut/spark_ut/baidu/inf-spark/spark-source/sql/hive-thriftserver/spark-warehouse/test_udtf')
>  already exists.; at 
> org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)
>  at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65(HiveThriftServer2Suites.scala:603)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65$adapted(HiveThriftServer2Suites.scala:603)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63(HiveThriftServer2Suites.scala:603)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63$adapted(HiveThriftServer2Suites.scala:573)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3(HiveThriftServer2Suites.scala:1074)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3$adapted(HiveThriftServer2Suites.scala:1074)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> {code}
>  
> because SingleSessionSuite runs `create table test_udtf` and does not drop it 
> when the test completes, while HiveThriftBinaryServerSuite wants to re-create 
> this table.
> If we execute mvn test for HiveThriftBinaryServerSuite and SingleSessionSuite 
> in that order, both test suites succeed, but we shouldn't rely on their 
> execution order.
>  
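A minimal cleanup sketch, not the merged patch (assumes ScalaTest's BeforeAndAfterAll; `withJdbcStatement` stands in for whatever JDBC helper the suite already uses): drop the table when SingleSessionSuite finishes so later suites can recreate it.

{code:java}
// Hedged sketch: remove the shared table so suite execution order no longer
// matters. The statement helper name is an assumption.
override protected def afterAll(): Unit = {
  try {
    withJdbcStatement { statement =>
      statement.execute("DROP TABLE IF EXISTS test_udtf")
    }
  } finally {
    super.afterAll()
  }
}
{code}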



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31997) Should drop test_udtf table when SingleSessionSuite completed

2020-06-16 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31997.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 28838
[https://github.com/apache/spark/pull/28838]

> Should drop test_udtf table when SingleSessionSuite completed
> -
>
> Key: SPARK-31997
> URL: https://issues.apache.org/jira/browse/SPARK-31997
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> If we execute mvn test SingleSessionSuite and HiveThriftBinaryServerSuite in 
> order, the test case "SPARK-11595 ADD JAR with input path having URL scheme" 
> in HiveThriftBinaryServerSuite will fail as follows:
>  
> {code:java}
> - SPARK-11595 ADD JAR with input path having URL scheme *** FAILED *** 
> java.sql.SQLException: Error running query: 
> org.apache.spark.sql.AnalysisException: Can not create the managed 
> table('`default`.`test_udtf`'). The associated 
> location('file:/home/yarn/spark_ut/spark_ut/baidu/inf-spark/spark-source/sql/hive-thriftserver/spark-warehouse/test_udtf')
>  already exists.; at 
> org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385)
>  at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65(HiveThriftServer2Suites.scala:603)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65$adapted(HiveThriftServer2Suites.scala:603)
>  at scala.collection.immutable.List.foreach(List.scala:392) at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63(HiveThriftServer2Suites.scala:603)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63$adapted(HiveThriftServer2Suites.scala:573)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3(HiveThriftServer2Suites.scala:1074)
>  at 
> org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3$adapted(HiveThriftServer2Suites.scala:1074)
>  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
> {code}
>  
> because SingleSessionSuite runs `create table test_udtf` and does not drop it 
> when the test completes, while HiveThriftBinaryServerSuite wants to re-create 
> this table.
> If we execute mvn test for HiveThriftBinaryServerSuite and SingleSessionSuite 
> in that order, both test suites succeed, but we shouldn't rely on their 
> execution order.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136495#comment-17136495
 ] 

Apache Spark commented on SPARK-31962:
--

User 'cchighman' has created a pull request for this issue:
https://github.com/apache/spark/pull/28841

> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming or just loading from a file data source, I've 
> encountered a number of occasions where I want to be able to stream from a 
> folder containing any number of historical files in CSV format.  When I start 
> reading from a folder, however, I might only care about files that were 
> created after a certain time.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> In 
> [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
>  there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
> containing an implicit _modificationDate_ property.  We may already iterate 
> the resulting files if a filter is applied to the path.  In this case, it's 
> trivial to do a primitive comparison against _modificationDate_ and a date 
> specified from an option.  Without the filter specified, we would be 
> expending less effort than if the filter were applied by itself, since we are 
> comparing primitives.  
> Having the ability to specify a timestamp option when loading files from a 
> path would minimize complexity for consumers who load files or do structured 
> streaming from a folder path but have no interest in reading what could be 
> thousands of irrelevant files.
> One example could be "_fileModifiedDate_" accepting a UTC datetime like 
> below.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .option("fileModifiedDate", "2020-05-01T12:00:00")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> If this option is specified, the expected behavior would be that files within 
> the _"/mnt/Deltas/"_ path must have been modified at or later than the 
> specified time in order to be consumed for purposes of reading files from a 
> folder path or via structured streaming.
>  I have unit tests passing under _FileIndexSuite_ in the 
> _spark.sql.execution.datasources_ package.
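
A hedged sketch of the comparison the proposed option boils down to (helper name and option parsing are illustrative, not the actual pull request): keep only files whose modification time is at or after the configured UTC instant.

{code:java}
import java.time.{LocalDateTime, ZoneOffset}
import org.apache.hadoop.fs.FileStatus

// Parse the option value once (e.g. "2020-05-01T12:00:00", read as UTC), then
// compare it against each file's modification time in epoch milliseconds.
def modifiedAtOrAfter(files: Seq[FileStatus], fileModifiedDate: String): Seq[FileStatus] = {
  val thresholdMillis =
    LocalDateTime.parse(fileModifiedDate).toInstant(ZoneOffset.UTC).toEpochMilli
  files.filter(_.getModificationTime >= thresholdMillis)
}
{code}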



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136494#comment-17136494
 ] 

Apache Spark commented on SPARK-31962:
--

User 'cchighman' has created a pull request for this issue:
https://github.com/apache/spark/pull/28841

> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming or just loading from a file data source, I've 
> encountered a number of occasions where I want to be able to stream from a 
> folder containing any number of historical files in CSV format.  When I start 
> reading from a folder, however, I might only care about files that were 
> created after a certain time.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> In 
> [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
>  there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
> containing an implicit _modificationDate_ property.  We may already iterate 
> the resulting files if a filter is applied to the path.  In this case, it's 
> trivial to do a primitive comparison against _modificationDate_ and a date 
> specified from an option.  Without the filter specified, we would be 
> expending less effort than if the filter were applied by itself, since we are 
> comparing primitives.  
> Having the ability to specify a timestamp option when loading files from a 
> path would minimize complexity for consumers who load files or do structured 
> streaming from a folder path but have no interest in reading what could be 
> thousands of irrelevant files.
> One example could be "_fileModifiedDate_" accepting a UTC datetime like 
> below.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .option("fileModifiedDate", "2020-05-01T12:00:00")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> If this option is specified, the expected behavior would be that files within 
> the _"/mnt/Deltas/"_ path must have been modified at or later than the 
> specified time in order to be consumed for purposes of reading files from a 
> folder path or via structured streaming.
>  I have unit tests passing under _FileIndexSuite_ in the 
> _spark.sql.execution.datasources_ package.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31962:


Assignee: (was: Apache Spark)

> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming or just loading from a file data source, I've 
> encountered a number of occasions where I want to be able to stream from a 
> folder containing any number of historical files in CSV format.  When I start 
> reading from a folder, however, I might only care about files that were 
> created after a certain time.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> In 
> [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
>  there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
> containing an implicit _modificationDate_ property.  We may already iterate 
> the resulting files if a filter is applied to the path.  In this case, it's 
> trivial to do a primitive comparison against _modificationDate_ and a date 
> specified from an option.  Without the filter specified, we would be 
> expending less effort than if the filter were applied by itself, since we are 
> comparing primitives.  
> Having the ability to specify a timestamp option when loading files from a 
> path would minimize complexity for consumers who load files or do structured 
> streaming from a folder path but have no interest in reading what could be 
> thousands of irrelevant files.
> One example could be "_fileModifiedDate_" accepting a UTC datetime like 
> below.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .option("fileModifiedDate", "2020-05-01T12:00:00")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> If this option is specified, the expected behavior would be that files within 
> the _"/mnt/Deltas/"_ path must have been modified at or later than the 
> specified time in order to be consumed for purposes of reading files from a 
> folder path or via structured streaming.
>  I have unit tests passing under _FileIndexSuite_ in the 
> _spark.sql.execution.datasources_ package.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31962:


Assignee: Apache Spark

> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Assignee: Apache Spark
>Priority: Minor
>
> When using structured streaming or just loading from a file data source, I've 
> encountered a number of occasions where I want to be able to stream from a 
> folder containing any number of historical files in CSV format.  When I start 
> reading from a folder, however, I might only care about files that were 
> created after a certain time.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> In 
> [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
>  there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
> containing an implicit _modificationDate_ property.  We may already iterate 
> the resulting files if a filter is applied to the path.  In this case, it's 
> trivial to do a primitive comparison between _modificationDate_ and a date 
> specified via an option.  Without a filter specified, this comparison would 
> still cost less than applying a filter by itself, since we are only comparing 
> primitives.
> Providing an option to specify a timestamp when loading files from a path 
> would minimize complexity for consumers who load files or do structured 
> streaming from a folder path but have no interest in reading what could be 
> thousands of files that are not relevant.
> One example could be "_fileModifiedDate_" accepting a UTC datetime like 
> below.
> {code:java}
> spark.read
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .option("fileModifiedDate", "2020-05-01T12:00:00")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> If this option is specified, the expected behavior would be that files within 
> the _"/mnt/Deltas/"_ path must have been modified at or later than the 
> specified time in order to be consumed for purposes of reading files from a 
> folder path or via structured streaming.
>  I have unit tests passing under _FileIndexSuite_ in the 
> _spark.sql.execution.datasources_ package.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming or just loading from a file data source, I've 
encountered a number of occasions where I want to be able to stream from a 
folder containing any number of historical files in CSV format.  When I start 
reading from a folder, however, I might only care about files that were created 
after a certain time.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified via 
an option.  Without a filter specified, this comparison would still cost less 
than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be "_fileModifiedDate_" accepting a UTC datetime like below.
{code:java}
spark.read
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

  was:
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified via 
an option.  Without a filter specified, this comparison would still cost less 
than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be "_fileModifiedDate_" accepting a UTC datetime like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.


> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming or just loading from a file data 

[jira] [Assigned] (SPARK-31999) Add refresh function command

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31999:


Assignee: Apache Spark

> Add refresh function command
> 
>
> Key: SPARK-31999
> URL: https://issues.apache.org/jira/browse/SPARK-31999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31999) Add refresh function command

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31999:


Assignee: (was: Apache Spark)

> Add refresh function command
> 
>
> Key: SPARK-31999
> URL: https://issues.apache.org/jira/browse/SPARK-31999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Description: 
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified via 
an option.  Without a filter specified, this comparison would still cost less 
than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be "_fileModifiedDate_" accepting a UTC datetime like below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("fileModifiedDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.

  was:
When using structured streaming with a FileDataSource, I've encountered a 
number of occasions where I want to be able to stream from a folder containing 
any number of historical files in CSV format.  When I start reading from a 
folder, however, I might only care about files that were created after a 
certain time.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .format("csv")
 .load("/mnt/Deltas")
{code}
In 
[https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
 there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
containing an implicit _modificationDate_ property.  We may already iterate the 
resulting files if a filter is applied to the path.  In this case, it's trivial 
to do a primitive comparison between _modificationDate_ and a date specified via 
an option.  Without a filter specified, this comparison would still cost less 
than applying a filter by itself, since we are only comparing primitives.

Providing an option to specify a timestamp when loading files from a path would 
minimize complexity for consumers who load files or do structured streaming from 
a folder path but have no interest in reading what could be thousands of files 
that are not relevant.

One example could be "_filesModifiedAfterDate_" accepting a UTC datetime like 
below.
{code:java}
spark.readStream
 .option("header", "true")
 .option("delimiter", "\t")
 .option("filesModifiedAfterDate", "2020-05-01T12:00:00")
 .format("csv")
 .load("/mnt/Deltas")
{code}
If this option is specified, the expected behavior would be that files within 
the _"/mnt/Deltas/"_ path must have been modified at or later than the 
specified time in order to be consumed for purposes of reading files from a 
folder path or via structured streaming.

 I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
_spark.sql.execution.datasources_ package.


> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming with a 

[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path

2020-06-16 Thread Christopher Highman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Highman updated SPARK-31962:

Component/s: (was: Structured Streaming)

> Provide option to load files after a specified date when reading from a 
> folder path
> ---
>
> Key: SPARK-31962
> URL: https://issues.apache.org/jira/browse/SPARK-31962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Christopher Highman
>Priority: Minor
>
> When using structured streaming with a FileDataSource, I've encountered a 
> number of occasions where I want to be able to stream from a folder 
> containing any number of historical files in CSV format.  When I start 
> reading from a folder, however, I might only care about files that were 
> created after a certain time.
> {code:java}
> spark.readStream
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> In 
> [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala],
>  there is a method, _listLeafFiles,_ which builds _FileStatus_ objects 
> containing an implicit _modificationDate_ property.  We may already iterate 
> the resulting files if a filter is applied to the path.  In this case, it's 
> trivial to do a primitive comparison between _modificationDate_ and a date 
> specified via an option.  Without a filter specified, this comparison would 
> still cost less than applying a filter by itself, since we are only comparing 
> primitives.
> Providing an option to specify a timestamp when loading files from a path 
> would minimize complexity for consumers who load files or do structured 
> streaming from a folder path but have no interest in reading what could be 
> thousands of files that are not relevant.
> One example could be "_filesModifiedAfterDate_" accepting a UTC datetime 
> like below.
> {code:java}
> spark.readStream
>  .option("header", "true")
>  .option("delimiter", "\t")
>  .option("filesModifiedAfterDate", "2020-05-01T12:00:00")
>  .format("csv")
>  .load("/mnt/Deltas")
> {code}
> If this option is specified, the expected behavior would be that files within 
> the _"/mnt/Deltas/"_ path must have been modified at or later than the 
> specified time in order to be consumed for purposes of reading files from a 
> folder path or via structured streaming.
>  I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the 
> _spark.sql.execution.datasources_ package.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31999) Add refresh function command

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136458#comment-17136458
 ] 

Apache Spark commented on SPARK-31999:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28840
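Since the ticket carries no description, here is a purely hypothetical usage sketch of what a refresh function command might look like from user code; the exact syntax and behavior are whatever the linked pull request defines, not this example, and "my_db.my_udf" is made up.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("refresh-function-sketch")
  .getOrCreate()

// Hypothetical: re-read a persistent function's definition from the catalog
// after its backing JAR or class has changed outside this session.
spark.sql("REFRESH FUNCTION my_db.my_udf")
{code}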

> Add refresh function command
> 
>
> Key: SPARK-31999
> URL: https://issues.apache.org/jira/browse/SPARK-31999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31999) Add refresh function command

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136460#comment-17136460
 ] 

Apache Spark commented on SPARK-31999:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/28840

> Add refresh function command
> 
>
> Key: SPARK-31999
> URL: https://issues.apache.org/jira/browse/SPARK-31999
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32000:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Fix the flaky testcase for partially launched task in barrier-mode.
> ---
>
> Key: SPARK-32000
> URL: https://issues.apache.org/jira/browse/SPARK-32000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> I noticed sometimes the testcase for SPARK-31485 fails.
> The reason should be related to the locality wait. 
> If the scheduler waits for a resource offer that meets a task's preferred 
> location until the process-local time limit expires, and no resource can be 
> offered at that locality level, the scheduler gives up the preferred 
> location. In this case, such a task can be assigned to an off-preferred 
> location.
> In the testcase for SPARK-31485, there are two tasks and only one task is 
> supposed to be assigned per scheduling round, but both tasks can be assigned 
> in the situation described above, so the testcase fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.

2020-06-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136452#comment-17136452
 ] 

Apache Spark commented on SPARK-32000:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28839

> Fix the flaky testcase for partially launched task in barrier-mode.
> ---
>
> Key: SPARK-32000
> URL: https://issues.apache.org/jira/browse/SPARK-32000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> I noticed sometimes the testcase for SPARK-31485 fails.
> The reason should be related to the locality wait. 
> If the scheduler waits for a resource offer that meets a task's preferred 
> location until the process-local time limit expires, and no resource can be 
> offered at that locality level, the scheduler gives up the preferred 
> location. In this case, such a task can be assigned to an off-preferred 
> location.
> In the testcase for SPARK-31485, there are two tasks and only one task is 
> supposed to be assigned per scheduling round, but both tasks can be assigned 
> in the situation described above, so the testcase fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.

2020-06-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32000:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Fix the flaky testcase for partially launched task in barrier-mode.
> ---
>
> Key: SPARK-32000
> URL: https://issues.apache.org/jira/browse/SPARK-32000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> I noticed sometimes the testcase for SPARK-31485 fails.
> The reason should be related to the locality wait. 
> If the scheduler waits for a resource offer that meets a task's preferred 
> location until the process-local time limit expires, and no resource can be 
> offered at that locality level, the scheduler gives up the preferred 
> location. In this case, such a task can be assigned to an off-preferred 
> location.
> In the testcase for SPARK-31485, there are two tasks and only one task is 
> supposed to be assigned per scheduling round, but both tasks can be assigned 
> in the situation described above, so the testcase fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.

2020-06-16 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-32000:
---
Issue Type: Bug  (was: Improvement)

> Fix the flaky testcase for partially launched task in barrier-mode.
> ---
>
> Key: SPARK-32000
> URL: https://issues.apache.org/jira/browse/SPARK-32000
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> I noticed sometimes the testcase for SPARK-31485 fails.
> The reason should be related to the locality wait. 
> If the scheduler waits for a resource offer that meets a task's preferred 
> location until the process-local time limit expires, and no resource can be 
> offered at that locality level, the scheduler gives up the preferred 
> location. In this case, such a task can be assigned to an off-preferred 
> location.
> In the testcase for SPARK-31485, there are two tasks and only one task is 
> supposed to be assigned per scheduling round, but both tasks can be assigned 
> in the situation described above, so the testcase fails.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.

2020-06-16 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-32000:
--

 Summary: Fix the flaky testcase for partially launched task in 
barrier-mode.
 Key: SPARK-32000
 URL: https://issues.apache.org/jira/browse/SPARK-32000
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Tests
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


I noticed sometimes the testcase for SPARK-31485 fails.
The reason should be related to the locality wait. 

If the scheduler waits for a resource offer that meets a task's preferred location 
until the process-local time limit expires, and no resource can be offered at that 
locality level, the scheduler gives up the preferred location. In this case, such a 
task can be assigned to an off-preferred location.

In the testcase for SPARK-31485, there are two tasks and only one task is supposed 
to be assigned per scheduling round, but both tasks can be assigned in the 
situation described above, so the testcase fails.
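One common way to make such a test deterministic is sketched below: keep the locality wait far above the test's runtime so the scheduler never relaxes the PROCESS_LOCAL constraint mid-test. This is a sketch under that assumption, not necessarily the fix taken in the associated pull request.
{code:scala}
import org.apache.spark.SparkConf

// Sketch only: `spark.locality.wait` is a real Spark setting, but whether the
// actual fix touches it is determined by the pull request, not this example.
val conf = new SparkConf()
  .setMaster("local-cluster[2, 1, 1024]")
  .setAppName("barrier-partial-launch-test")
  // Keep the wait far longer than the test runs, so the scheduler never gives
  // up the preferred location and hands out both tasks in one round.
  .set("spark.locality.wait", "60s")
{code}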



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31710) Fail casting numeric to timestamp by default

2020-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31710.
-
Fix Version/s: 3.1.0
 Assignee: philipse
   Resolution: Fixed

> Fail casting numeric to timestamp by default
> 
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hdp:2.7.7
> spark:2.4.5
>Reporter: philipse
>Assignee: philipse
>Priority: Major
> Fix For: 3.1.0
>
>
> Hi Team
> Steps to reproduce.
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) 
> from test;
> {code}
> let's check the result. 
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> --the result is wrong
> Case 2:
> *select 234,cast(id as TIMESTAMP) from test;*
>  
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>  at java.sql.Timestamp.valueOf(Timestamp.java:237)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
>  at org.apache.hive.beeline.Rows$Row.(Rows.java:166)
>  at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43)
>  at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
>  at org.apache.hive.beeline.Commands.execute(Commands.java:826)
>  at org.apache.hive.beeline.Commands.sql(Commands.java:670)
>  at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
>  at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
>  at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
>  at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
>  at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
>  Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0)
>  
> I tried Hive; it works well, and the conversion is correct.
> {code:java}
> select 234,cast(id as TIMESTAMP) from test;
>  234   2020-04-08 11:56:28
> {code}
> Two questions:
> q1:
> if we forbid this conversion, should we keep all cases consistent?
> q2:
> if we allow the conversion in some cases, should we check the length of the 
> long value? The code seems to force the conversion to microseconds by 
> multiplying by 1000000 no matter how long the data is; if a value of the 
> wrong magnitude is converted to a timestamp, we could raise an error.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>  
> Thanks!
>  
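For context on why case 1 above lands in the year 52238 (a sketch, not the change this ticket makes): 1586318188000 is milliseconds since the epoch, while Spark's CAST from a numeric to TIMESTAMP interprets the value as seconds, so the uncorrected cast overshoots by a factor of 1000. Scaling the value down first gives the expected instant:
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("numeric-to-timestamp-sketch")
  .getOrCreate()

// 1586318188000 ms since the epoch is 2020-04-08 03:56:28 UTC. Dividing by
// 1000 before the cast (in versions where the cast is still permitted)
// produces the intended timestamp instead of the year-52238 value.
spark.sql("SELECT CAST(1586318188000 / 1000 AS TIMESTAMP) AS ts").show(false)
{code}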



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31710) Fail casting numeric to timestamp by default

2020-06-16 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31710:

Summary: Fail casting numeric to timestamp by default  (was: result is the 
not the same when query and execute jobs)

> Fail casting numeric to timestamp by default
> 
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: hdp:2.7.7
> spark:2.4.5
>Reporter: philipse
>Priority: Major
>
> Hi Team
> Steps to reproduce.
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) 
> from test;
> {code}
> let's check the result. 
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> --the result is wrong
> Case 2:
> *select 234,cast(id as TIMESTAMP) from test;*
>  
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd 
> hh:mm:ss[.fffffffff]
>  at java.sql.Timestamp.valueOf(Timestamp.java:237)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
>  at 
> org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
>  at org.apache.hive.beeline.Rows$Row.(Rows.java:166)
>  at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43)
>  at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
>  at org.apache.hive.beeline.Commands.execute(Commands.java:826)
>  at org.apache.hive.beeline.Commands.sql(Commands.java:670)
>  at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
>  at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
>  at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
>  at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
>  at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498)
>  at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
>  at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
>  Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0)
>  
> I tried Hive; it works well, and the conversion is correct.
> {code:java}
> select 234,cast(id as TIMESTAMP) from test;
>  234   2020-04-08 11:56:28
> {code}
> Two questions:
> q1:
> if we forbid this conversion, should we keep all cases consistent?
> q2:
> if we allow the conversion in some cases, should we check the length of the 
> long value? The code seems to force the conversion to microseconds by 
> multiplying by 1000000 no matter how long the data is; if a value of the 
> wrong magnitude is converted to a timestamp, we could raise an error.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>  
> Thanks!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


