[jira] [Commented] (SPARK-29345) Add an API that allows a user to define and observe arbitrary metrics on batch and streaming queries
[ https://issues.apache.org/jira/browse/SPARK-29345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138120#comment-17138120 ]

Wenchen Fan commented on SPARK-29345:
-------------------------------------

I'm resolving it as the only remaining thing is a small improvement.

> Add an API that allows a user to define and observe arbitrary metrics on
> batch and streaming queries
>
>                 Key: SPARK-29345
>                 URL: https://issues.apache.org/jira/browse/SPARK-29345
>             Project: Spark
>          Issue Type: Epic
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Herman van Hövell
>            Assignee: Herman van Hövell
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29345) Add an API that allows a user to define and observe arbitrary metrics on batch and streaming queries
[ https://issues.apache.org/jira/browse/SPARK-29345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-29345.
---------------------------------
    Fix Version/s: 3.0.0
       Resolution: Fixed

> Add an API that allows a user to define and observe arbitrary metrics on
> batch and streaming queries
>
>                 Key: SPARK-29345
>                 URL: https://issues.apache.org/jira/browse/SPARK-29345
>             Project: Spark
>          Issue Type: Epic
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Herman van Hövell
>            Assignee: Herman van Hövell
>            Priority: Major
>             Fix For: 3.0.0
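To make the resolved feature concrete, here is a minimal pure-Python sketch of the "observable metrics" idea: named aggregate expressions are evaluated over the rows a query actually processes, and the results are pushed to listeners. All names here (`ObservedQuery`, `observe`, `add_listener`) are illustrative stand-ins, not the actual Spark API surface added by this ticket.

```python
# Hedged sketch of observable metrics: metrics are defined on the query and
# computed from the processed data itself, then reported to listeners.
from typing import Callable, Dict, Iterable, List


class ObservedQuery:
    def __init__(self, rows: Iterable[dict]):
        self._rows = list(rows)
        self._metrics: Dict[str, Callable[[List[dict]], float]] = {}
        self._listeners: List[Callable[[Dict[str, float]], None]] = []

    def observe(self, name, agg):
        # Register a named metric computed from the rows the query processes.
        self._metrics[name] = agg
        return self

    def add_listener(self, cb):
        self._listeners.append(cb)

    def collect(self):
        # Run the "query", then publish metrics derived from the data.
        result = self._rows
        observed = {n: agg(result) for n, agg in self._metrics.items()}
        for cb in self._listeners:
            cb(observed)
        return result


q = ObservedQuery([{"v": 1}, {"v": 2}])
q.observe("row_count", len).observe("sum_v", lambda rows: sum(r["v"] for r in rows))
seen = {}
q.add_listener(seen.update)
q.collect()
print(seen)  # {'row_count': 2, 'sum_v': 3}
```

The point of the design is that the metric values are a by-product of the data flowing through the query, which is what makes them usable for both batch and streaming.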
[jira] [Commented] (SPARK-30108) Add robust accumulator for observable metrics
[ https://issues.apache.org/jira/browse/SPARK-30108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138118#comment-17138118 ]

Wenchen Fan commented on SPARK-30108:
-------------------------------------

Is there any progress on it?

> Add robust accumulator for observable metrics
> ---------------------------------------------
>
>                 Key: SPARK-30108
>                 URL: https://issues.apache.org/jira/browse/SPARK-30108
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Herman van Hövell
>            Priority: Major
>
> Spark accumulators reflect the work that has been done, not the data that
> has been processed. There are situations where one tuple can be processed
> multiple times, e.g. task/stage retries, speculation, determination of
> ranges for a global ordering, etc. For observed metrics we need the value
> of the accumulator to be based on the data and not on the processing.
> The current aggregating accumulator is already robust to some of these
> issues (like task failure), but we need to add some additional checks to
> make sure it is foolproof.
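The distinction between "work done" and "data processed" can be sketched in plain Python: a robust accumulator keeps one value per partition and lets a retry overwrite, rather than add to, the earlier contribution. This is a hypothetical illustration of the requirement, not Spark's accumulator implementation.

```python
# Hedged sketch: a sum accumulator that is robust to task retries and
# speculation by keying contributions on partition id, so reprocessing the
# same partition replaces its earlier value instead of double-counting.

class RobustSumAccumulator:
    def __init__(self):
        self._per_partition = {}  # partition id -> value from its last run

    def update(self, partition_id: int, value: float) -> None:
        # A retry of the same partition overwrites the previous contribution.
        self._per_partition[partition_id] = value

    @property
    def value(self) -> float:
        return sum(self._per_partition.values())


acc = RobustSumAccumulator()
acc.update(0, 10.0)
acc.update(1, 5.0)
acc.update(0, 10.0)  # task retry: same data reprocessed
print(acc.value)     # 15.0, not 25.0
```

A naive accumulator would report 25.0 here, i.e. it would reflect the processing (three task runs) rather than the data (two partitions).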
[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138100#comment-17138100 ]

Apache Spark commented on SPARK-32003:
--------------------------------------

User 'wypoon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28848

> Shuffle files for lost executor are not unregistered if fetch failure
> occurs after executor is lost
> ---------------------------------------------------------------------
>
>                 Key: SPARK-32003
>                 URL: https://issues.apache.org/jira/browse/SPARK-32003
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: Wing Yew Poon
>            Priority: Major
>
> A customer's cluster has a node that goes down while a Spark application is
> running. (They are running Spark on YARN with the external shuffle service
> enabled.) An executor is lost (apparently the only one running on the node).
> This executor lost event is handled in the DAGScheduler, which removes the
> executor from its BlockManagerMaster. At this point, there is no
> unregistering of shuffle files for the executor or the node. Soon after,
> tasks trying to fetch shuffle files output by that executor fail with
> FetchFailed (because the node is down, there is no NodeManager available to
> serve shuffle files). By rights, such fetch failures should cause the
> shuffle files for the executor to be unregistered, but they do not.
> Due to task failure, the stage is re-attempted. Tasks continue to fail due
> to fetch failure from the lost executor's shuffle output. This time, since
> the failed epoch for the executor is higher, the executor is removed again
> (this doesn't really do anything, the executor was already removed when it
> was lost) and this time the shuffle output is unregistered.
> So it takes two stage attempts instead of one to clear the shuffle output.
> We get 4 attempts by default. The customer was unlucky and two nodes went
> down during the stage, i.e., the same problem happened twice. So they used
> up all 4 stage attempts and the stage failed, and thus the job.
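The direction of the fix described above can be sketched as a toy bookkeeping model. The class and method names below are hypothetical, not Spark's DAGScheduler internals: the key idea is that a fetch failure pointing at an executor already known to be lost should unregister its shuffle output immediately, on the first stage attempt.

```python
# Hedged sketch of the bug's fix direction: track lost executors, and when a
# fetch failure implicates one of them, drop its shuffle output right away
# instead of waiting for a second stage attempt to notice.

class ShuffleTracker:
    def __init__(self):
        self.outputs = {}          # executor id -> set of shuffle ids it served
        self.lost_executors = set()

    def register(self, executor_id, shuffle_id):
        self.outputs.setdefault(executor_id, set()).add(shuffle_id)

    def on_executor_lost(self, executor_id):
        # The executor is gone, but its host might still serve files via the
        # external shuffle service, so nothing is unregistered yet.
        self.lost_executors.add(executor_id)

    def on_fetch_failure(self, executor_id):
        # A fetch failure from an already-lost executor means its files are
        # truly unreachable: unregister them now so the retry recomputes them.
        if executor_id in self.lost_executors:
            self.outputs.pop(executor_id, None)


t = ShuffleTracker()
t.register("exec-1", 0)
t.on_executor_lost("exec-1")
t.on_fetch_failure("exec-1")
print(t.outputs)  # {} -> shuffle output cleared on the first attempt
```

With this behavior the stage retry recomputes the lost map output immediately, so one node failure costs one stage attempt instead of two.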
[jira] [Assigned] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32003:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32003:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-30876) Optimizer cannot infer from inferred constraints with join
[ https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138072#comment-17138072 ]

Yuming Wang commented on SPARK-30876:
-------------------------------------

Thank you [~navinvishy].

> Optimizer cannot infer from inferred constraints with join
> ----------------------------------------------------------
>
>                 Key: SPARK-30876
>                 URL: https://issues.apache.org/jira/browse/SPARK-30876
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> How to reproduce this issue:
> {code:sql}
> create table t1(a int, b int, c int);
> create table t2(a int, b int, c int);
> create table t3(a int, b int, c int);
> select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and t3.c = 1);
> {code}
> Spark 2.3+:
> {noformat}
> == Physical Plan ==
> *(4) HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition, true, [id=#102]
>    +- *(3) HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *(3) Project
>          +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight
>             :- *(3) Project [b#10]
>             :  +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight
>             :     :- *(3) Project [a#6]
>             :     :  +- *(3) Filter isnotnull(a#6)
>             :     :     +- *(3) ColumnarToRow
>             :     :        +- FileScan parquet default.t1[a#6] Batched: true, DataFilters: [isnotnull(a#6)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct
>             :     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#87]
>             :        +- *(1) Project [b#10]
>             :           +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1))
>             :              +- *(1) ColumnarToRow
>             :                 +- FileScan parquet default.t2[b#10] Batched: true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], ReadSchema: struct
>             +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#96]
>                +- *(2) Project [c#14]
>                   +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1))
>                      +- *(2) ColumnarToRow
>                         +- FileScan parquet default.t3[c#14] Batched: true, DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], ReadSchema: struct
> Time taken: 3.785 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2.x:
> {noformat}
> == Physical Plan ==
> *HashAggregate(keys=[], functions=[count(1)])
> +- Exchange SinglePartition
>    +- *HashAggregate(keys=[], functions=[partial_count(1)])
>       +- *Project
>          +- *SortMergeJoin [b#19], [c#23], Inner
>             :- *Project [b#19]
>             :  +- *SortMergeJoin [a#15], [b#19], Inner
>             :     :- *Sort [a#15 ASC NULLS FIRST], false, 0
>             :     :  +- Exchange hashpartitioning(a#15, 200)
>             :     :     +- *Filter (isnotnull(a#15) && (a#15 = 1))
>             :     :        +- HiveTableScan [a#15], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, b#16, c#17]
>             :     +- *Sort [b#19 ASC NULLS FIRST], false, 0
>             :        +- Exchange hashpartitioning(b#19, 200)
>             :           +- *Filter (isnotnull(b#19) && (b#19 = 1))
>             :              +- HiveTableScan [b#19], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, b#19, c#20]
>             +- *Sort [c#23 ASC NULLS FIRST], false, 0
>                +- Exchange hashpartitioning(c#23, 200)
>                   +- *Filter (isnotnull(c#23) && (c#23 = 1))
>                      +- HiveTableScan [c#23], HiveTableRelation `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, b#22, c#23]
> Time taken: 0.728 seconds, Fetched 1 row(s)
> {noformat}
> Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't.
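The inference the ticket asks for is transitive propagation of equality constraints: from t1.a = t2.b, t2.b = t3.c and t3.c = 1, the literal filter should reach all three columns. A union-find structure captures this mechanically; the snippet below illustrates the idea only and is not Spark's ConstraintHelper code.

```python
# Sketch: group join columns into equivalence classes with union-find, then
# apply a literal constraint on one column to every column in its class.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)


uf = UnionFind()
for left, right in [("t1.a", "t2.b"), ("t2.b", "t3.c")]:
    uf.union(left, right)

# t3.c = 1 implies the same literal filter on every column in its class.
cols = ["t1.a", "t2.b", "t3.c"]
inferred = [c for c in cols if uf.find(c) == uf.find("t3.c")]
print(inferred)  # ['t1.a', 't2.b', 't3.c']
```

In the Spark 2.2 plan this shows up as a pushed-down `(a#15 = 1)` filter before the exchange; in 2.3+ the filter on t1.a is missing, which is the regression being reported.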
[jira] [Resolved] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now
[ https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32011.
----------------------------------
    Fix Version/s: 3.1.0
                   3.0.1
       Resolution: Fixed

Issue resolved by pull request 28845
[https://github.com/apache/spark/pull/28845]

> Remove warnings about pin-thread modes and guide collectWithJobGroups for now
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-32011
>                 URL: https://issues.apache.org/jira/browse/SPARK-32011
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.0.1, 3.1.0
>
> Currently, it guides users to use pin-thread mode to use multiple threads.
> There's a thread leak issue in this mode (SPARK-32010).
> We shouldn't promote this mode in Spark 3.0 for now.
[jira] [Assigned] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now
[ https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-32011:
------------------------------------
    Assignee: Hyukjin Kwon
[jira] [Assigned] (SPARK-31989) Generate JSON files with rebasing switch points using smaller steps
[ https://issues.apache.org/jira/browse/SPARK-31989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-31989:
------------------------------------
    Assignee: Maxim Gekk

> Generate JSON files with rebasing switch points using smaller steps
> -------------------------------------------------------------------
>
>                 Key: SPARK-31989
>                 URL: https://issues.apache.org/jira/browse/SPARK-31989
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Maxim Gekk
>            Assignee: Maxim Gekk
>            Priority: Major
>
> Currently, the JSON files 'gregorian-julian-rebase-micros.json' and
> 'julian-gregorian-rebase-micros.json' are generated with a step of 1 week.
> The files can miss short spikes of diffs. The ticket aims to change the max
> step from 1 week to 30 minutes, and to speed up execution.
[jira] [Resolved] (SPARK-31989) Generate JSON files with rebasing switch points using smaller steps
[ https://issues.apache.org/jira/browse/SPARK-31989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-31989.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 28827
[https://github.com/apache/spark/pull/28827]
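Why a smaller step matters can be shown with a toy scan: if the offset difference between two calendars changes only inside a short window, a 1-week sampling step walks right past it, while a 30-minute step records both transitions. The diff function below is a made-up stand-in for the real Julian/Gregorian rebase difference.

```python
# Hedged sketch of switch-point generation: sample a diff function over a
# range and record every point where its value changes. The step size bounds
# how short a "spike" can be and still be detected.
from datetime import datetime, timedelta


def toy_diff(ts: datetime) -> int:
    # Pretend the calendars disagree by 1 hour only during a short window.
    return 3600 if datetime(2020, 1, 1, 2) <= ts < datetime(2020, 1, 1, 3) else 0


def switch_points(start, end, step):
    points, prev = [], toy_diff(start)
    t = start + step
    while t <= end:
        cur = toy_diff(t)
        if cur != prev:
            points.append((t, cur))
            prev = cur
        t += step
    return points


day = datetime(2020, 1, 1), datetime(2020, 1, 2)
coarse = switch_points(*day, timedelta(weeks=1))   # step too big: sees nothing
fine = switch_points(*day, timedelta(minutes=30))  # catches both transitions
print(len(coarse), len(fine))  # 0 2
```

The recorded (timestamp, diff) pairs are what the generated JSON files store, so a finer step makes the files capture short-lived offset spikes at the cost of a longer generation scan.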
[jira] [Commented] (SPARK-29148) Modify dynamic allocation manager for stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-29148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138046#comment-17138046 ]

Apache Spark commented on SPARK-29148:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/28847

> Modify dynamic allocation manager for stage level scheduling
> ------------------------------------------------------------
>
>                 Key: SPARK-29148
>                 URL: https://issues.apache.org/jira/browse/SPARK-29148
>             Project: Spark
>          Issue Type: Story
>          Components: Scheduler, Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>            Priority: Major
>             Fix For: 3.1.0
>
> To support stage level scheduling, the dynamic allocation manager has to
> track the usage and the number of executors needed per ResourceProfile.
> We will have to figure out what to do with the metrics.
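What "per ResourceProfile" tracking means for dynamic allocation can be sketched briefly. The class and numbers below are hypothetical, not Spark's ExecutorAllocationManager: the point is that the executor target becomes a map keyed by resource profile rather than a single global number.

```python
# Hedged sketch: dynamic allocation keeps a pending-task count and derives an
# executor target separately for each resource profile.
from collections import defaultdict


class AllocationManager:
    def __init__(self, tasks_per_executor=4):
        self.tasks_per_executor = tasks_per_executor
        self.pending = defaultdict(int)  # profile id -> pending tasks

    def on_tasks_submitted(self, profile_id, n):
        self.pending[profile_id] += n

    def targets(self):
        # One executor target per profile, rounded up (ceiling division).
        return {p: -(-n // self.tasks_per_executor)
                for p, n in self.pending.items()}


mgr = AllocationManager()
mgr.on_tasks_submitted(0, 10)   # default profile
mgr.on_tasks_submitted(1, 3)    # e.g. a profile requesting GPUs
print(mgr.targets())  # {0: 3, 1: 1}
```

Requesting and releasing executors then happens per profile, which is also why the ticket notes the existing (global) allocation metrics need rethinking.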
[jira] [Updated] (SPARK-32002) spark error while select nest data
[ https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yiqun Zhang updated SPARK-32002:
--------------------------------
    Priority: Major  (was: Minor)

> spark error while select nest data
> ----------------------------------
>
>                 Key: SPARK-32002
>                 URL: https://issues.apache.org/jira/browse/SPARK-32002
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Yiqun Zhang
>            Priority: Major
>
> nest-data.json
> {code:java}
> {"a": [{"b": [{"c": [1,2]}]}]}
> {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}
> {code}
> {code:java}
> val df: DataFrame = spark.read.json(testFile("nest-data.json"))
> df.createTempView("nest_table")
> sql("select a.b.c from nest_table").show()
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;
> 'Project [a#6.b[c] AS c#8]
> +- SubqueryAlias `nest_table`
>    +- Relation[a#6] json
>
> Analysing the cause: for {{a.b.c}}, ExtractValue matches the extractor for
> {{c}} against the data type of {{a.b}}. But {{a.b}} is extracted with
> GetArrayStructFields, whose result type is ArrayType(ArrayType(...)), which
> matches the GetArrayItem case, so the extraction ("c") is treated as an
> ordinal.
>
> org.apache.spark.sql.catalyst.expressions.ExtractValue:
> {code:java}
> def apply(
>     child: Expression,
>     extraction: Expression,
>     resolver: Resolver): Expression = {
>   (child.dataType, extraction) match {
>     case (StructType(fields), NonNullLiteral(v, StringType)) =>
>       val fieldName = v.toString
>       val ordinal = findField(fields, fieldName, resolver)
>       GetStructField(child, ordinal, Some(fieldName))
>     case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) =>
>       val fieldName = v.toString
>       val ordinal = findField(fields, fieldName, resolver)
>       GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
>         ordinal, fields.length, containsNull)
>     case (_: ArrayType, _) => GetArrayItem(child, extraction)
>     case (MapType(kt, _, _), _) => GetMapValue(child, extraction)
>     case (otherType, _) =>
>       val errorMsg = otherType match {
>         case StructType(_) =>
>           s"Field name should be String Literal, but it's $extraction"
>         case other =>
>           s"Can't extract value from $child: need struct type but got ${other.catalogString}"
>       }
>       throw new AnalysisException(errorMsg)
>   }
> }
> {code}
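The type mismatch in the pattern match above can be modeled in plain Python. Mapping a field over an array of structs yields a nested array, and extracting the next field from that nested array needs the lookup pushed one level deeper; falling into the by-ordinal case instead is exactly the reported error. This is a conceptual model, not Catalyst code.

```python
# Plain-Python model of the two extraction behaviors for `a.b.c`:
# field lookup over an array of structs vs. indexing an array by ordinal.

data = {"a": [{"b": [{"c": [1, 2]}]}]}


def get_array_struct_fields(arr, field):
    # ArrayType(StructType) case: map the field over each struct element.
    return [elem[field] for elem in arr]


a = data["a"]                        # array of structs
b = get_array_struct_fields(a, "b")  # -> array of arrays of structs
# b now models ArrayType(ArrayType(StructType)); a string key cannot index
# it directly (that would be the GetArrayItem/ordinal case), so the field
# lookup for "c" has to be applied one nesting level deeper:
c = [get_array_struct_fields(inner, "c") for inner in b]
print(c)  # [[[1, 2]]]
```

In Catalyst terms, the `ArrayType(ArrayType(...))` result of GetArrayStructFields matches only the generic `(_: ArrayType, _)` case, where `"c"` is expected to be an integral ordinal, hence the AnalysisException.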
[jira] [Assigned] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle
[ https://issues.apache.org/jira/browse/SPARK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32012:
------------------------------------
    Assignee: L. C. Hsieh  (was: Apache Spark)

> Incrementally create and materialize query stage to avoid unnecessary local
> shuffle
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-32012
>                 URL: https://issues.apache.org/jira/browse/SPARK-32012
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: L. C. Hsieh
>            Assignee: L. C. Hsieh
>            Priority: Major
>
> The current way of creating query stages in AQE is in batch. For example,
> the children of a sort merge join will be materialized as query stages in a
> batch. Then AQE brings the optimization in and optimizes the sort merge join
> to a broadcast join. Apart from the broadcast exchange, we don't need to do
> any exchange on the other side of the join, but we have already materialized
> that exchange. Currently AQE wraps the materialized exchange with a local
> reader, but that still brings unnecessary I/O. We can avoid the unnecessary
> local shuffle by incrementally creating query stages.
[jira] [Assigned] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle
[ https://issues.apache.org/jira/browse/SPARK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32012:
------------------------------------
    Assignee: Apache Spark  (was: L. C. Hsieh)
[jira] [Commented] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle
[ https://issues.apache.org/jira/browse/SPARK-32012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138023#comment-17138023 ]

Apache Spark commented on SPARK-32012:
--------------------------------------

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/28846
[jira] [Created] (SPARK-32012) Incrementally create and materialize query stage to avoid unnecessary local shuffle
L. C. Hsieh created SPARK-32012:
-----------------------------------

             Summary: Incrementally create and materialize query stage to avoid unnecessary local shuffle
                 Key: SPARK-32012
                 URL: https://issues.apache.org/jira/browse/SPARK-32012
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: L. C. Hsieh
            Assignee: L. C. Hsieh


The current way of creating query stages in AQE is in batch. For example, the
children of a sort merge join will be materialized as query stages in a batch.
Then AQE brings the optimization in and optimizes the sort merge join to a
broadcast join. Apart from the broadcast exchange, we don't need to do any
exchange on the other side of the join, but we have already materialized that
exchange. Currently AQE wraps the materialized exchange with a local reader,
but that still brings unnecessary I/O. We can avoid the unnecessary local
shuffle by incrementally creating query stages.
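The proposal can be sketched as a toy planner. All names and the threshold below are made up for illustration and do not reflect Spark's AQE classes: the key behavior is materializing one child first and deciding the join strategy from its actual size, before paying for the other child's shuffle exchange.

```python
# Hedged sketch of incremental stage creation: materialize the left child,
# check its runtime size, and only materialize the right child's exchange if
# a broadcast join is not possible.

BROADCAST_THRESHOLD = 10 * 1024 * 1024  # 10 MB, an assumed config value


def plan_join(left_stage, right_stage, materialize):
    left_size = materialize(left_stage)
    if left_size <= BROADCAST_THRESHOLD:
        # Broadcast join: the right side needs no shuffle exchange at all.
        return ("BroadcastHashJoin", left_stage, right_stage)
    # Only now pay for materializing the right side's shuffle.
    materialize(right_stage)
    return ("SortMergeJoin", left_stage, right_stage)


materialized = []


def fake_materialize(stage):
    # Record which stages were actually materialized; return a fake size.
    materialized.append(stage)
    return {"small": 1024, "big": 10**9}[stage]


plan = plan_join("small", "big", fake_materialize)
print(plan[0], materialized)  # BroadcastHashJoin ['small']
```

Batch-mode AQE would have appended both "small" and "big" to the materialized list before optimizing; the incremental order skips the big side's shuffle entirely, which is the I/O saving the ticket describes.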
[jira] [Commented] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now
[ https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138014#comment-17138014 ] Apache Spark commented on SPARK-32011: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28845 > Remove warnings about pin-thread modes and guide collectWithJobGroups for now > - > > Key: SPARK-32011 > URL: https://issues.apache.org/jira/browse/SPARK-32011 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, it guides users to use pin-thread mode to use multiple threads. > There's a thread leak issue in this mode (SPARK-32010). > We shouldn't promote this mode in Spark 3.0 for now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now
[ https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32011: Assignee: Apache Spark > Remove warnings about pin-thread modes and guide collectWithJobGroups for now > - > > Key: SPARK-32011 > URL: https://issues.apache.org/jira/browse/SPARK-32011 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently, it guides users to use pin-thread mode to use multiple threads. > There's a thread leak issue in this mode (SPARK-32010). > We shouldn't promote this mode in Spark 3.0 for now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32010) Thread leaks in pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-32010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138012#comment-17138012 ] Hyukjin Kwon commented on SPARK-32010: -- I have a kind of a POC fix already. Will open a PR later. > Thread leaks in pinned thread mode > -- > > Key: SPARK-32010 > URL: https://issues.apache.org/jira/browse/SPARK-32010 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-22340 introduced a pin thread mode which guarantees you to sync Python > thread and JVM thread. > However, looks like the JVM threads are not finished even when the Python > thread is finished. It can be debugged via YourKit, and run multiple jobs > with multiple threads at the same time. > Easiest reproducer is: > {code} > PYSPARK_PIN_THREAD=true ./bin/pyspark > {code} > {code} > >>> from threading import Thread > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> spark._jvm._gateway_client.deque > deque([, > , > , > , > ]) > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> spark._jvm._gateway_client.deque > deque([, > , > , > , > , > ]) > {code} > The connection doesn't get closed, and it holds JVM thread running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now
[ https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17138013#comment-17138013 ] Apache Spark commented on SPARK-32011: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28845 > Remove warnings about pin-thread modes and guide collectWithJobGroups for now > - > > Key: SPARK-32011 > URL: https://issues.apache.org/jira/browse/SPARK-32011 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, it guides users to use pin-thread mode to use multiple threads. > There's a thread leak issue in this mode (SPARK-32010). > We shouldn't promote this mode in Spark 3.0 for now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now
[ https://issues.apache.org/jira/browse/SPARK-32011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32011: Assignee: (was: Apache Spark) > Remove warnings about pin-thread modes and guide collectWithJobGroups for now > - > > Key: SPARK-32011 > URL: https://issues.apache.org/jira/browse/SPARK-32011 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, it guides users to use pin-thread mode to use multiple threads. > There's a thread leak issue in this mode (SPARK-32010). > We shouldn't promote this mode in Spark 3.0 for now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32010) Thread leaks in pinned thread mode
Hyukjin Kwon created SPARK-32010: Summary: Thread leaks in pinned thread mode Key: SPARK-32010 URL: https://issues.apache.org/jira/browse/SPARK-32010 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.1.0 Reporter: Hyukjin Kwon SPARK-22340 introduced a pinned thread mode that keeps each Python thread in sync with a JVM thread. However, the JVM threads do not finish even after the corresponding Python thread has finished. This can be observed via YourKit by running multiple jobs from multiple threads at the same time. The easiest reproducer is: {code} PYSPARK_PIN_THREAD=true ./bin/pyspark {code} {code} >>> from threading import Thread >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([, , , , ]) >>> Thread(target=lambda: spark.range(1000).collect()).start() >>> spark._jvm._gateway_client.deque deque([, , , , , ]) {code} The connection does not get closed, which keeps the JVM thread running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32010) Thread leaks in pinned thread mode
[ https://issues.apache.org/jira/browse/SPARK-32010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32010: Assignee: Hyukjin Kwon > Thread leaks in pinned thread mode > -- > > Key: SPARK-32010 > URL: https://issues.apache.org/jira/browse/SPARK-32010 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-22340 introduced a pin thread mode which guarantees you to sync Python > thread and JVM thread. > However, looks like the JVM threads are not finished even when the Python > thread is finished. It can be debugged via YourKit, and run multiple jobs > with multiple threads at the same time. > Easiest reproducer is: > {code} > PYSPARK_PIN_THREAD=true ./bin/pyspark > {code} > {code} > >>> from threading import Thread > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> spark._jvm._gateway_client.deque > deque([, > , > , > , > ]) > >>> Thread(target=lambda: spark.range(1000).collect()).start() > >>> spark._jvm._gateway_client.deque > deque([, > , > , > , > , > ]) > {code} > The connection doesn't get closed, and it holds JVM thread running. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31337) Support MS Sql Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Masiero Vanzin resolved SPARK-31337. Fix Version/s: 3.1.0 Assignee: Gabor Somogyi Resolution: Fixed > Support MS Sql Kerberos login in JDBC connector > --- > > Key: SPARK-31337 > URL: https://issues.apache.org/jira/browse/SPARK-31337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32011) Remove warnings about pin-thread modes and guide collectWithJobGroups for now
Hyukjin Kwon created SPARK-32011: Summary: Remove warnings about pin-thread modes and guide collectWithJobGroups for now Key: SPARK-32011 URL: https://issues.apache.org/jira/browse/SPARK-32011 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Currently, the documentation guides users to the pinned thread mode when they need multiple threads. There is a thread leak issue in this mode (SPARK-32010), so we should not promote it in Spark 3.0 for now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137996#comment-17137996 ] Apache Spark commented on SPARK-32009: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28844 > remove deprecated method BisectingKMeansModel.computeCost > - > > Key: SPARK-32009 > URL: https://issues.apache.org/jira/browse/SPARK-32009 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > remove BisectingKMeans.computeCost which is deprecated in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32009: Assignee: Apache Spark > remove deprecated method BisectingKMeansModel.computeCost > - > > Key: SPARK-32009 > URL: https://issues.apache.org/jira/browse/SPARK-32009 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > remove BisectingKMeans.computeCost which is deprecated in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137995#comment-17137995 ] Apache Spark commented on SPARK-32009: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28844 > remove deprecated method BisectingKMeansModel.computeCost > - > > Key: SPARK-32009 > URL: https://issues.apache.org/jira/browse/SPARK-32009 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > remove BisectingKMeans.computeCost which is deprecated in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32009: Assignee: (was: Apache Spark) > remove deprecated method BisectingKMeansModel.computeCost > - > > Key: SPARK-32009 > URL: https://issues.apache.org/jira/browse/SPARK-32009 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > remove BisectingKMeans.computeCost which is deprecated in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32009) remove deprecated method BisectingKMeansModel.computeCost
[ https://issues.apache.org/jira/browse/SPARK-32009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-32009: --- Summary: remove deprecated method BisectingKMeansModel.computeCost (was: remove deprecated method BisectingKMeans.computeCost) > remove deprecated method BisectingKMeansModel.computeCost > - > > Key: SPARK-32009 > URL: https://issues.apache.org/jira/browse/SPARK-32009 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > remove BisectingKMeans.computeCost which is deprecated in 3.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31777) CrossValidator supports user-supplied folds
[ https://issues.apache.org/jira/browse/SPARK-31777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-31777. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28704 [https://github.com/apache/spark/pull/28704] > CrossValidator supports user-supplied folds > > > Key: SPARK-31777 > URL: https://issues.apache.org/jira/browse/SPARK-31777 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.1.0 >Reporter: Xiangrui Meng >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > > As a user, I can specify how CrossValidator should create folds by specifying > a foldCol, which should be integer type with range [0, numFolds). If foldCol > is specified, Spark won't do random k-fold split. This is useful if there are > custom logics to create folds, e.g., random split by users instead of random > splits of events. > This is similar to SPARK-16206, which is for the RDD-based APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31777) CrossValidator supports user-supplied folds
[ https://issues.apache.org/jira/browse/SPARK-31777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-31777: --- Assignee: L. C. Hsieh > CrossValidator supports user-supplied folds > > > Key: SPARK-31777 > URL: https://issues.apache.org/jira/browse/SPARK-31777 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.1.0 >Reporter: Xiangrui Meng >Assignee: L. C. Hsieh >Priority: Major > > As a user, I can specify how CrossValidator should create folds by specifying > a foldCol, which should be integer type with range [0, numFolds). If foldCol > is specified, Spark won't do random k-fold split. This is useful if there are > custom logics to create folds, e.g., random split by users instead of random > splits of events. > This is similar to SPARK-16206, which is for the RDD-based APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32009) remove deprecated method BisectingKMeans.computeCost
Huaxin Gao created SPARK-32009: -- Summary: remove deprecated method BisectingKMeans.computeCost Key: SPARK-32009 URL: https://issues.apache.org/jira/browse/SPARK-32009 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.1.0 Reporter: Huaxin Gao Remove BisectingKMeans.computeCost, which has been deprecated since 3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32008) 3.0.0 release build fails
[ https://issues.apache.org/jira/browse/SPARK-32008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137937#comment-17137937 ] Shivaram Venkataraman commented on SPARK-32008: --- It looks like the R vignette build failed, and judging by the error message this seems related to https://github.com/rstudio/rmarkdown/issues/1831 -- I think it should work fine if you use R version >= 3.6 > 3.0.0 release build fails > - > > Key: SPARK-32008 > URL: https://issues.apache.org/jira/browse/SPARK-32008 > Project: Spark > Issue Type: Bug > Components: Build, Documentation >Affects Versions: 3.0.0 >Reporter: Philipp Dallig >Priority: Major > > Hi, > I tried to build the Spark 3.0.0 release myself. > I got the following error. > {code} > 20/06/16 15:20:49 WARN PrefixSpan: Input data is not cached. > 20/06/16 15:20:50 WARN Instrumentation: [b307b568] regParam is zero, which > might cause numerical instability and overfitting. > Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics: > 'vignetteInfo' is not an exported object from 'namespace:tools' > Execution halted > {code} > I can reproduce this error with a small Dockerfile.
> {code} > FROM ubuntu:18.04 as builder > ENV MVN_VERSION=3.6.3 \ > M2_HOME=/opt/apache-maven \ > MAVEN_HOME=/opt/apache-maven \ > MVN_HOME=/opt/apache-maven \ > > MVN_SHA512=c35a1803a6e70a126e80b2b3ae33eed961f83ed74d18fcd16909b2d44d7dada3203f1ffe726c17ef8dcca2dcaa9fca676987befeadc9b9f759967a8cb77181c0 > \ > MAVEN_OPTS="-Xmx3g -XX:ReservedCodeCacheSize=1g" \ > R_HOME=/usr/lib/R \ > GIT_REPO=https://github.com/apache/spark.git \ > GIT_BRANCH=v3.0.0 \ > SPARK_DISTRO_NAME=hadoop3.2 \ > SPARK_LOCAL_HOSTNAME=localhost > # Preparation > RUN /usr/bin/apt-get update && \ > # APT > INSTALL_PKGS="openjdk-8-jdk-headless git wget python3 python3-pip > python3-setuptools r-base r-base-dev pandoc pandoc-citeproc > libcurl4-openssl-dev libssl-dev libxml2-dev texlive qpdf language-pack-en" && > \ > DEBIAN_FRONTEND=noninteractive /usr/bin/apt-get -y install > --no-install-recommends $INSTALL_PKGS && \ > rm -rf /var/lib/apt/lists/* && \ > Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', > 'testthat', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \ > # Maven > /usr/bin/wget -nv -O apache-maven.tar.gz > "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download=maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz; > && \ > echo "${MVN_SHA512} apache-maven.tar.gz" > apache-maven.sha512 && \ > sha512sum --strict -c apache-maven.sha512 && \ > tar -xvzf apache-maven.tar.gz -C /opt && \ > rm -v apache-maven.sha512 apache-maven.tar.gz && \ > /bin/ln -vs /opt/apache-maven-${MVN_VERSION} /opt/apache-maven && \ > /bin/ln -vs /opt/apache-maven/bin/mvn /usr/bin/mvn > # Spark Distribution Build > RUN mkdir -p /workspace && \ > cd /workspace && \ > git clone --branch ${GIT_BRANCH} ${GIT_REPO} && \ > cd /workspace/spark && \ > ./dev/make-distribution.sh --name ${SPARK_DISTRO_NAME} --pip --r --tgz > -Psparkr -Phadoop-3.2 -Phive-2.3 -Phive-thriftserver -Pyarn -Pkubernetes > {code} > I am very grateful to all helpers. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31998) Change package references for ArrowBuf
[ https://issues.apache.org/jira/browse/SPARK-31998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137931#comment-17137931 ] Kouhei Sutou commented on SPARK-31998: -- FYI: The corresponding change in Apache Arrow: https://github.com/apache/arrow/pull/6729 > Change package references for ArrowBuf > -- > > Key: SPARK-31998 > URL: https://issues.apache.org/jira/browse/SPARK-31998 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Liya Fan >Priority: Major > > Recently, we have moved class ArrowBuf from package io.netty.buffer to > org.apache.arrow.memory. So after upgrading Arrow library, we need to update > the references to ArrowBuf with the correct package name. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming or just loading from a file data source, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself, since we are comparing primitives. An option to specify a timestamp when loading files from a path would minimize complexity for consumers who load files or do structured streaming from a folder path but have no interest in reading what could be thousands of irrelevant files. One example could be "_fileModifiedDate_" accepting a UTC datetime like below. 
{code:java} spark.read .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under F_ileIndexSuite_ in the _spark.sql.execution.datasources_ package. +*Consumers seeking to replicate or achieve this behavior*+ *Stack Overflow -(spark structured streaming file source read from a certain partition onwards)* [https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards] *Stack Overflow - (Spark Structured Streaming File Source Starting Offset)* [https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset/51399134#51399134] was: When using structured streaming or just loading from a file data source, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. 
Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_fileModifiedDate_" accepting a UTC datetime like below. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under F_ileIndexSuite_ in the _spark.sql.execution.datasources_ package. Stack Overflow -(spark structured streaming file source read from a
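The proposed comparison against _modificationDate_ can be sketched outside Spark in plain Python. `list_files_modified_after` is a hypothetical helper mirroring what a _fileModifiedDate_ option would do during file listing, not code from the linked pull request:

```python
import os
from datetime import timezone

def list_files_modified_after(folder, cutoff):
    """Return paths under `folder` whose mtime is at or after `cutoff`.

    Mirrors the proposed option: a primitive timestamp comparison made
    while listing leaf files, before any file data is read. `cutoff` is
    a naive datetime interpreted as UTC, matching the proposal.
    """
    cutoff_ts = cutoff.replace(tzinfo=timezone.utc).timestamp()
    selected = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            # Comparing two floats per file is cheap relative to reading
            # thousands of irrelevant files.
            if os.path.getmtime(path) >= cutoff_ts:
                selected.append(path)
    return selected
```

In the proposed API, `spark.read.option("fileModifiedDate", "2020-05-01T12:00:00")` would apply the same comparison inside `listLeafFiles`.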
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming or just loading from a file data source, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_fileModifiedDate_" accepting a UTC datetime like below. 
{code:java} spark.read .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under F_ileIndexSuite_ in the _spark.sql.execution.datasources_ package. Stack Overflow -(spark structured streaming file source read from a certain partition onwards )|[https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards]] Stack Overflow - (Spark Structured Streaming File Source Starting Offset)|[https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset/51399134#51399134]] was: When using structured streaming or just loading from a file data source, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. 
Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_fileModifiedDate_" accepting a UTC datetime like below. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under F_ileIndexSuite_ in the _spark.sql.execution.datasources_ package. Stack Overflow -[spark structured streaming file source read from a certain partition onwards
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming or just loading from a file data source, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, its trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option where specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example to could be "_fileModifiedDate_" accepting a UTC datetime like below. 
{code:java} spark.read .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. Stack Overflow - [spark structured streaming file source read from a certain partition onwards|https://stackoverflow.com/questions/58004832/spark-structured-streaming-file-source-read-from-a-certain-partition-onwards] Stack Overflow - [Spark Structured Streaming File Source Starting Offset|https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset/51399134#51399134] was: When using structured streaming or just loading from a file data source, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. 
Even without a filter specified, the added work would be minimal, since we are only comparing primitives. Providing an option to specify a timestamp when loading files from a path would minimize complexity for consumers who load files or run structured streaming from a folder path but have no interest in reading what could be thousands of irrelevant files. One example could be "_fileModifiedDate_", accepting a UTC datetime like below. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. > Provide option to load files after a specified date when reading from a > folder path >
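The proposed check reduces to a primitive comparison between each file's modification time and a user-supplied cutoff. A minimal Python sketch of that filtering step, assuming files are represented as (path, modification-time) pairs; the `filter_by_modified_date` helper is illustrative only, not Spark API:

```python
from datetime import datetime, timezone

def filter_by_modified_date(files, cutoff_iso):
    """Keep only files modified at or after the cutoff.

    `files` is a list of (path, modification_time_epoch_seconds) pairs,
    loosely mimicking the FileStatus objects built by listLeafFiles.
    The cutoff string is interpreted as a UTC datetime.
    """
    cutoff = datetime.fromisoformat(cutoff_iso).replace(tzinfo=timezone.utc).timestamp()
    return [path for path, mtime in files if mtime >= cutoff]

files = [
    ("/mnt/Deltas/old.csv", datetime(2020, 4, 1, tzinfo=timezone.utc).timestamp()),
    ("/mnt/Deltas/new.csv", datetime(2020, 5, 2, tzinfo=timezone.utc).timestamp()),
]
# Only the file modified after the cutoff survives the filter.
print(filter_by_modified_date(files, "2020-05-01T12:00:00"))  # → ['/mnt/Deltas/new.csv']
```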
[jira] [Created] (SPARK-32008) 3.0.0 release build fails
Philipp Dallig created SPARK-32008: -- Summary: 3.0.0 release build fails Key: SPARK-32008 URL: https://issues.apache.org/jira/browse/SPARK-32008 Project: Spark Issue Type: Bug Components: Build, Documentation Affects Versions: 3.0.0 Reporter: Philipp Dallig Hi, I tried to build the Spark 3.0.0 release myself and got the following error. {code} 20/06/16 15:20:49 WARN PrefixSpan: Input data is not cached. 20/06/16 15:20:50 WARN Instrumentation: [b307b568] regParam is zero, which might cause numerical instability and overfitting. Error: processing vignette 'sparkr-vignettes.Rmd' failed with diagnostics: 'vignetteInfo' is not an exported object from 'namespace:tools' Execution halted {code} I can reproduce this error with a small Dockerfile. {code} FROM ubuntu:18.04 as builder ENV MVN_VERSION=3.6.3 \ M2_HOME=/opt/apache-maven \ MAVEN_HOME=/opt/apache-maven \ MVN_HOME=/opt/apache-maven \ MVN_SHA512=c35a1803a6e70a126e80b2b3ae33eed961f83ed74d18fcd16909b2d44d7dada3203f1ffe726c17ef8dcca2dcaa9fca676987befeadc9b9f759967a8cb77181c0 \ MAVEN_OPTS="-Xmx3g -XX:ReservedCodeCacheSize=1g" \ R_HOME=/usr/lib/R \ GIT_REPO=https://github.com/apache/spark.git \ GIT_BRANCH=v3.0.0 \ SPARK_DISTRO_NAME=hadoop3.2 \ SPARK_LOCAL_HOSTNAME=localhost # Preparation RUN /usr/bin/apt-get update && \ # APT INSTALL_PKGS="openjdk-8-jdk-headless git wget python3 python3-pip python3-setuptools r-base r-base-dev pandoc pandoc-citeproc libcurl4-openssl-dev libssl-dev libxml2-dev texlive qpdf language-pack-en" && \ DEBIAN_FRONTEND=noninteractive /usr/bin/apt-get -y install --no-install-recommends $INSTALL_PKGS && \ rm -rf /var/lib/apt/lists/* && \ Rscript -e "install.packages(c('knitr', 'rmarkdown', 'devtools', 'testthat', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \ # Maven /usr/bin/wget -nv -O apache-maven.tar.gz "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=maven/maven-3/${MVN_VERSION}/binaries/apache-maven-${MVN_VERSION}-bin.tar.gz" && \ echo "${MVN_SHA512} 
apache-maven.tar.gz" > apache-maven.sha512 && \ sha512sum --strict -c apache-maven.sha512 && \ tar -xvzf apache-maven.tar.gz -C /opt && \ rm -v apache-maven.sha512 apache-maven.tar.gz && \ /bin/ln -vs /opt/apache-maven-${MVN_VERSION} /opt/apache-maven && \ /bin/ln -vs /opt/apache-maven/bin/mvn /usr/bin/mvn # Spark Distribution Build RUN mkdir -p /workspace && \ cd /workspace && \ git clone --branch ${GIT_BRANCH} ${GIT_REPO} && \ cd /workspace/spark && \ ./dev/make-distribution.sh --name ${SPARK_DISTRO_NAME} --pip --r --tgz -Psparkr -Phadoop-3.2 -Phive-2.3 -Phive-thriftserver -Pyarn -Pkubernetes {code} I am very grateful to all helpers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32007) Spark Driver Supervise does not work reliably
[ https://issues.apache.org/jira/browse/SPARK-32007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suraj Sharma updated SPARK-32007: - Environment: |Java Version|1.8.0_121 (Oracle Corporation)| |Java Home|/usr/java/jdk1.8.0_121/jre| |Scala Version|version 2.11.12| |OS|Amazon Linux| h4. was: ||Name||Value|| |Java Version|1.8.0_121 (Oracle Corporation)| |Java Home|/usr/java/jdk1.8.0_121/jre| |Scala Version|version 2.11.12| |OS|Amazon Linux| h4. > Spark Driver Supervise does not work reliably > - > > Key: SPARK-32007 > URL: https://issues.apache.org/jira/browse/SPARK-32007 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.4 > Environment: |Java Version|1.8.0_121 (Oracle Corporation)| > |Java Home|/usr/java/jdk1.8.0_121/jre| > |Scala Version|version 2.11.12| > |OS|Amazon Linux| > h4. >Reporter: Suraj Sharma >Priority: Critical > > I have a standalone cluster setup. I DO NOT have a streaming use case. I use > AWS EC2 machines to have spark master and worker processes. > *Problem*: If a spark worker machine running some drivers and executor dies, > then the driver is not spawned again on other healthy machines. > *Below are my findings:* > ||Action/Behaviour||Executor||Driver|| > |Worker Machine Stop|Relaunches on an active machine|NO Relaunch| > |kill -9 to process|Relaunches on other machines|Relaunches on other machines| > |kill to process|Relaunches on other machines|Relaunches on other machines| > *Cluster Setup:* > # I have a spark standalone cluster > # {{spark.driver.supervise=true}} > # Spark Master HA is enabled and is backed by zookeeper > # Spark version = 2.4.4 > # I am using a systemd script for the spark worker process -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32007) Spark Driver Supervise does not work reliably
Suraj Sharma created SPARK-32007: Summary: Spark Driver Supervise does not work reliably Key: SPARK-32007 URL: https://issues.apache.org/jira/browse/SPARK-32007 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 2.4.4 Environment: ||Name||Value|| |Java Version|1.8.0_121 (Oracle Corporation)| |Java Home|/usr/java/jdk1.8.0_121/jre| |Scala Version|version 2.11.12| |OS|Amazon Linux| h4. Reporter: Suraj Sharma I have a standalone cluster setup. I DO NOT have a streaming use case. I use AWS EC2 machines to have spark master and worker processes. *Problem*: If a spark worker machine running some drivers and executor dies, then the driver is not spawned again on other healthy machines. *Below are my findings:* ||Action/Behaviour||Executor||Driver|| |Worker Machine Stop|Relaunches on an active machine|NO Relaunch| |kill -9 to process|Relaunches on other machines|Relaunches on other machines| |kill to process|Relaunches on other machines|Relaunches on other machines| *Cluster Setup:* # I have a spark standalone cluster # {{spark.driver.supervise=true}} # Spark Master HA is enabled and is backed by zookeeper # Spark version = 2.4.4 # I am using a systemd script for the spark worker process -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`
[ https://issues.apache.org/jira/browse/SPARK-32006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32006: Assignee: Apache Spark > Create date/timestamp formatters once before collect in `hiveResultString()` > > > Key: SPARK-32006 > URL: https://issues.apache.org/jira/browse/SPARK-32006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Spark 2.4 re-uses one instance of SimpleDateFormat while formatting > timestamps in toHiveString. Currently, toHiveString() creates > timestampFormatter per each value. Even w/ caching, it causes additional > overhead comparing to Spark 2.4. The ticket aims to create an instance of > TimestampFormatter before collect in hiveResultString() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`
[ https://issues.apache.org/jira/browse/SPARK-32006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17137852#comment-17137852 ] Apache Spark commented on SPARK-32006: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28842 > Create date/timestamp formatters once before collect in `hiveResultString()` > > > Key: SPARK-32006 > URL: https://issues.apache.org/jira/browse/SPARK-32006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Spark 2.4 re-uses one instance of SimpleDateFormat while formatting > timestamps in toHiveString. Currently, toHiveString() creates > timestampFormatter per each value. Even w/ caching, it causes additional > overhead comparing to Spark 2.4. The ticket aims to create an instance of > TimestampFormatter before collect in hiveResultString() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`
[ https://issues.apache.org/jira/browse/SPARK-32006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32006: Assignee: (was: Apache Spark) > Create date/timestamp formatters once before collect in `hiveResultString()` > > > Key: SPARK-32006 > URL: https://issues.apache.org/jira/browse/SPARK-32006 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Priority: Major > > Spark 2.4 re-uses one instance of SimpleDateFormat while formatting > timestamps in toHiveString. Currently, toHiveString() creates > timestampFormatter per each value. Even w/ caching, it causes additional > overhead comparing to Spark 2.4. The ticket aims to create an instance of > TimestampFormatter before collect in hiveResultString() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32005) Add aggregate functions for computing percentiles on weighted data
Devesh Parekh created SPARK-32005: - Summary: Add aggregate functions for computing percentiles on weighted data Key: SPARK-32005 URL: https://issues.apache.org/jira/browse/SPARK-32005 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0, 2.4.6 Reporter: Devesh Parekh SPARK-30569 adds percentile_approx functions for computing percentiles for a column with equal weights. It would be useful to have variants that also take a weight column. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
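A weighted percentile generalizes the unweighted case by walking the sorted values until the cumulative weight covers a fraction p of the total weight. A minimal exact sketch in Python for illustration only (Spark's percentile_approx uses an approximate algorithm, and the function name below is not a proposed API):

```python
def weighted_percentile(values, weights, p):
    """Exact weighted percentile: the smallest value whose cumulative
    weight reaches a fraction p of the total weight."""
    pairs = sorted(zip(values, weights))
    total = sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= p * total:
            return value
    return pairs[-1][0]

# With equal weights this matches a plain lower percentile.
print(weighted_percentile([1, 2, 3, 4], [1, 1, 1, 1], 0.5))   # → 2
# A heavy weight on 4 pulls the median up.
print(weighted_percentile([1, 2, 3, 4], [1, 1, 1, 10], 0.5))  # → 4
```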
[jira] [Created] (SPARK-32006) Create date/timestamp formatters once before collect in `hiveResultString()`
Maxim Gekk created SPARK-32006: -- Summary: Create date/timestamp formatters once before collect in `hiveResultString()` Key: SPARK-32006 URL: https://issues.apache.org/jira/browse/SPARK-32006 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.1.0 Reporter: Maxim Gekk Spark 2.4 re-uses one instance of SimpleDateFormat while formatting timestamps in toHiveString. Currently, toHiveString() creates a timestampFormatter for each value. Even with caching, this causes additional overhead compared to Spark 2.4. The ticket aims to create an instance of TimestampFormatter before collect in hiveResultString() -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
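The change described here is a straightforward hoisting of the formatter construction out of the per-value loop. A small Python sketch of the pattern, with a construction counter standing in for the cost of building a formatter (the `Formatter` class below is illustrative, not Spark's TimestampFormatter API):

```python
from datetime import datetime, timezone

class Formatter:
    """Stand-in for a timestamp formatter; constructions are counted to
    show how often the 'expensive' build happens."""
    constructions = 0

    def __init__(self, pattern):
        Formatter.constructions += 1
        self.pattern = pattern

    def format(self, epoch_seconds):
        return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime(self.pattern)

def to_strings_per_value(rows):
    # Before: a new formatter is built for every value.
    return [Formatter("%Y-%m-%d %H:%M:%S").format(r) for r in rows]

def to_strings_hoisted(rows):
    # After: the formatter is built once before iterating the rows.
    fmt = Formatter("%Y-%m-%d %H:%M:%S")
    return [fmt.format(r) for r in rows]

rows = [0, 86400, 172800]
Formatter.constructions = 0
to_strings_per_value(rows)
print(Formatter.constructions)  # → 3
Formatter.constructions = 0
to_strings_hoisted(rows)
print(Formatter.constructions)  # → 1
```

Both functions return identical strings; only the number of formatter constructions differs.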
[jira] [Created] (SPARK-32004) Drop references to slave
Holden Karau created SPARK-32004: Summary: Drop references to slave Key: SPARK-32004 URL: https://issues.apache.org/jira/browse/SPARK-32004 Project: Spark Issue Type: Improvement Components: Mesos, Spark Core, SQL, Structured Streaming Affects Versions: 3.1.0 Reporter: Holden Karau We have a lot of references to "slave" in the code base which don't match the terminology in the rest of our code base, and we should clean them up. In many situations it would be clearer to use "executor", "worker", or "replica" depending on the context (so this is not just a search-and-replace; we need to actually read through the code and make it consistent). We may want to (in a follow-on) explore renaming master to something more precise. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-31099. - > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. > {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. 
> at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at 
org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374) > at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144) > at > org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:303) > at >
[jira] [Resolved] (SPARK-31929) local cache size exceeding "spark.history.store.maxDiskUsage" triggered "java.io.IOException" in history server on Windows
[ https://issues.apache.org/jira/browse/SPARK-31929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31929. -- Fix Version/s: 3.1.0 Assignee: zhli Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28769 > local cache size exceeding "spark.history.store.maxDiskUsage" triggered > "java.io.IOException" in history server on Windows > -- > > Key: SPARK-31929 > URL: https://issues.apache.org/jira/browse/SPARK-31929 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.4.4, 3.0.0 > Environment: System: Windows > Config: > spark.history.retainedApplications 200 > spark.history.store.maxDiskUsage 2g > spark.history.store.path d://cache_hs >Reporter: zhli >Assignee: zhli >Priority: Minor > Fix For: 3.1.0 > > > h2. > h2. HTTP ERROR 500 > Problem accessing /history/app-20190711215551-0001/stages/. Reason: > Server Error > > h3. Caused by: > java.io.IOException: Unable to delete file: > d:\cache_hs\apps\app-20190711215551-0001.ldb\MANIFEST-07 at > org.apache.commons.io.FileUtils.forceDelete(FileUtils.java:2381) at > org.apache.commons.io.FileUtils.cleanDirectory(FileUtils.java:1679) at > org.apache.commons.io.FileUtils.deleteDirectory(FileUtils.java:1575) at > org.apache.spark.deploy.history.HistoryServerDiskManager.org$apache$spark$deploy$history$HistoryServerDiskManager$$deleteStore(HistoryServerDiskManager.scala:198) > at > org.apache.spark.deploy.history.HistoryServerDiskManager.$anonfun$release$1(HistoryServerDiskManager.scala:161) > at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.java:23) > at scala.Option.foreach(Option.scala:407) at > org.apache.spark.deploy.history.HistoryServerDiskManager.release(HistoryServerDiskManager.scala:156) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$loadDiskStore$1(FsHistoryProvider.scala:1163) > at > org.apache.spark.deploy.history.FsHistoryProvider.$anonfun$loadDiskStore$1$adapted(FsHistoryProvider.scala:1157) > at 
scala.Option.foreach(Option.scala:407) at > org.apache.spark.deploy.history.FsHistoryProvider.loadDiskStore(FsHistoryProvider.scala:1157) > at > org.apache.spark.deploy.history.FsHistoryProvider.getAppUI(FsHistoryProvider.scala:363) > at > org.apache.spark.deploy.history.HistoryServer.getAppUI(HistoryServer.scala:191) > at > org.apache.spark.deploy.history.ApplicationCache.$anonfun$loadApplicationEntry$2(ApplicationCache.scala:163) > at > org.apache.spark.deploy.history.ApplicationCache.time(ApplicationCache.scala:135) > at > org.apache.spark.deploy.history.ApplicationCache.org$apache$spark$deploy$history$ApplicationCache$$loadApplicationEntry(ApplicationCache.scala:161) > at > org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:56) > at > org.apache.spark.deploy.history.ApplicationCache$$anon$1.load(ApplicationCache.scala:52) > at > org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599) > at > org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > at > org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) > at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at > org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at > org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spark.deploy.history.ApplicationCache.get(ApplicationCache.scala:89) > at > org.apache.spark.deploy.history.ApplicationCache.withSparkUI(ApplicationCache.scala:101) > at > org.apache.spark.deploy.history.HistoryServer.org$apache$spark$deploy$history$HistoryServer$$loadAppUi(HistoryServer.scala:248) > at > org.apache.spark.deploy.history.HistoryServer$$anon$1.doGet(HistoryServer.scala:101) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at > 
javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at > org.sparkproject.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) > at > org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.sparkproject.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) > at > org.sparkproject.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) > at > org.sparkproject.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at >
[jira] [Updated] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wing Yew Poon updated SPARK-32003: -- Affects Version/s: 3.0.0 > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6, 3.0.0 >Reporter: Wing Yew Poon >Priority: Major > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. So they used up 4 > stage attempts and the stage failed and thus the job. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
[ https://issues.apache.org/jira/browse/SPARK-32003?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136792#comment-17136792 ] Wing Yew Poon commented on SPARK-32003: --- I will open a PR soon with a solution. > Shuffle files for lost executor are not unregistered if fetch failure occurs > after executor is lost > --- > > Key: SPARK-32003 > URL: https://issues.apache.org/jira/browse/SPARK-32003 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.4.6 >Reporter: Wing Yew Poon >Priority: Major > > A customer's cluster has a node that goes down while a Spark application is > running. (They are running Spark on YARN with the external shuffle service > enabled.) An executor is lost (apparently the only one running on the node). > This executor lost event is handled in the DAGScheduler, which removes the > executor from its BlockManagerMaster. At this point, there is no > unregistering of shuffle files for the executor or the node. Soon after, > tasks trying to fetch shuffle files output by that executor fail with > FetchFailed (because the node is down, there is no NodeManager available to > serve shuffle files). By right, such fetch failures should cause the shuffle > files for the executor to be unregistered, but they do not. > Due to task failure, the stage is re-attempted. Tasks continue to fail due to > fetch failure from the lost executor's shuffle output. This time, since the > failed epoch for the executor is higher, the executor is removed again (this > doesn't really do anything, the executor was already removed when it was > lost) and this time the shuffle output is unregistered. > So it takes two stage attempts instead of one to clear the shuffle output. We > get 4 attempts by default. The customer was unlucky and two nodes went down > during the stage, i.e., the same problem happened twice. So they used up 4 > stage attempts and the stage failed and thus the job. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
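The two-stage-attempt behavior described above can be modeled in a few lines: executor loss records a failed epoch without unregistering shuffle output, and a later fetch failure unregisters the output only when it carries a newer epoch. A heavily simplified Python model of that sequence (the `Scheduler` class and its methods are illustrative, not the actual DAGScheduler internals):

```python
class Scheduler:
    """Toy model of the reported behavior: executor loss bumps the failed
    epoch but (in the buggy flow) leaves shuffle output registered; a
    fetch failure clears the output only when it carries a newer epoch."""
    def __init__(self):
        self.epoch = 0
        self.failed_epoch = {}      # executor -> epoch at which it last failed
        self.shuffle_outputs = set()

    def executor_lost(self, executor):
        self.epoch += 1
        self.failed_epoch[executor] = self.epoch
        # Bug: shuffle output for the executor is NOT unregistered here.

    def fetch_failed(self, executor, task_epoch):
        # Output is cleared only if the failure is newer than the recorded loss.
        if task_epoch > self.failed_epoch.get(executor, -1):
            self.executor_lost(executor)
            self.shuffle_outputs.discard(executor)
            return True
        return False

s = Scheduler()
s.shuffle_outputs.add("exec-1")
s.executor_lost("exec-1")            # node goes down, epoch -> 1
first = s.fetch_failed("exec-1", 1)  # stage attempt 1: same epoch, no-op
second = s.fetch_failed("exec-1", 2) # stage attempt 2: newer epoch, clears output
print(first, second, s.shuffle_outputs)  # → False True set()
```

The first fetch failure is ignored because its epoch does not exceed the recorded failed epoch, which is why clearing the output costs an extra stage attempt.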
[jira] [Assigned] (SPARK-31996) Specify the version of ChromeDriver and RemoteWebDriver which can work with guava 14.0.1
[ https://issues.apache.org/jira/browse/SPARK-31996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta reassigned SPARK-31996: -- Assignee: (was: Kousuke Saruta) > Specify the version of ChromeDriver and RemoteWebDriver which can work with > guava 14.0.1 > > > Key: SPARK-31996 > URL: https://issues.apache.org/jira/browse/SPARK-31996 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Priority: Minor > > SPARK-31765 upgraded HtmlUnit for a security reason, and Selenium also > needed to be upgraded to work with the upgraded HtmlUnit. > After upgrading Selenium, ChromeDriver and RemoteWebDriver are implicitly > upgraded because of dependencies, and the implicitly upgraded modules can't > work with guava 14.0.1 due to an API incompatibility, so we need to run > ChromeUISeleniumSuite with a guava version specified like > -Dguava.version=25.0-jre. > {code:java} > $ build/sbt -Dguava.version=25.0-jre > -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver > -Dtest.default.exclude.tags= "testOnly > org.apache.spark.ui.UISeleniumSuite"{code} > It's a bit inconvenient, so let's use an older version which can work > with guava 14.0.1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements
[ https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136773#comment-17136773 ] Sonia Dalwani edited comment on SPARK-26833 at 6/16/20, 4:08 PM: - Documentation can be changed like the following blog [http://blog.brainlounge.de/memoryleaks/getting-started-with-spark-on-kubernetes/] It gives better explanation for beginners with respect to user submission credentials. was (Author: sdalwani): Documentation cab be changed like the following blog [http://blog.brainlounge.de/memoryleaks/getting-started-with-spark-on-kubernetes/] It gives better explanation for beginners with respect to user submission credentials. > Kubernetes RBAC documentation is unclear on exact RBAC requirements > --- > > Key: SPARK-26833 > URL: https://issues.apache.org/jira/browse/SPARK-26833 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Rob Vesse >Priority: Major > > I've seen a couple of users get bitten by this in informal discussions on > GitHub and Slack. Basically the user sets up the service account and > configures Spark to use it as described in the documentation but then when > they try and run a job they encounter an error like the following: > {quote}019-02-05 20:29:02 WARN WatchConnectionManager:185 - Exec Failure: > HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: > User "system:anonymous" cannot watch pods in the namespace "default" > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: pods > "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot > watch pods in the namespace "default"{quote} > This error stems from the fact that the configured service account is only > used by the driver pod and not by the submission client. 
The submission > client wants to do driver pod monitoring which it does with the user's > submission credentials *NOT* the service account as the user might expect. > It seems like there are two ways to resolve this issue: > * Improve the documentation to clarify the current situation > * Ensure that if a service account is configured we always use it even on the > submission client > The former is the easy fix, the latter is more invasive and may have other > knock-on effects so we should start with the former and discuss the > feasibility of the latter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26833) Kubernetes RBAC documentation is unclear on exact RBAC requirements
[ https://issues.apache.org/jira/browse/SPARK-26833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136773#comment-17136773 ] Sonia Dalwani commented on SPARK-26833: --- Documentation can be changed like the following blog [http://blog.brainlounge.de/memoryleaks/getting-started-with-spark-on-kubernetes/] It gives a better explanation for beginners with respect to user submission credentials. > Kubernetes RBAC documentation is unclear on exact RBAC requirements > --- > > Key: SPARK-26833 > URL: https://issues.apache.org/jira/browse/SPARK-26833 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Rob Vesse >Priority: Major > > I've seen a couple of users get bitten by this in informal discussions on > GitHub and Slack. Basically the user sets up the service account and > configures Spark to use it as described in the documentation but then when > they try and run a job they encounter an error like the following: > {quote}2019-02-05 20:29:02 WARN WatchConnectionManager:185 - Exec Failure: > HTTP 403, Status: 403 - pods "spark-pi-1549416541302-driver" is forbidden: > User "system:anonymous" cannot watch pods in the namespace "default" > java.net.ProtocolException: Expected HTTP 101 response but was '403 Forbidden' > Exception in thread "main" > io.fabric8.kubernetes.client.KubernetesClientException: pods > "spark-pi-1549416541302-driver" is forbidden: User "system:anonymous" cannot > watch pods in the namespace "default"{quote} > This error stems from the fact that the configured service account is only > used by the driver pod and not by the submission client. The submission > client wants to do driver pod monitoring which it does with the user's > submission credentials *NOT* the service account as the user might expect. 
> It seems like there are two ways to resolve this issue: > * Improve the documentation to clarify the current situation > * Ensure that if a service account is configured we always use it even on the > submission client > The former is the easy fix, the latter is more invasive and may have other > knock-on effects so we should start with the former and discuss the > feasibility of the latter. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32003) Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost
Wing Yew Poon created SPARK-32003: - Summary: Shuffle files for lost executor are not unregistered if fetch failure occurs after executor is lost Key: SPARK-32003 URL: https://issues.apache.org/jira/browse/SPARK-32003 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.4.6 Reporter: Wing Yew Poon A customer's cluster has a node that goes down while a Spark application is running. (They are running Spark on YARN with the external shuffle service enabled.) An executor is lost (apparently the only one running on the node). This executor lost event is handled in the DAGScheduler, which removes the executor from its BlockManagerMaster. At this point, there is no unregistering of shuffle files for the executor or the node. Soon after, tasks trying to fetch shuffle files output by that executor fail with FetchFailed (because the node is down, there is no NodeManager available to serve shuffle files). By rights, such fetch failures should cause the shuffle files for the executor to be unregistered, but they do not. Due to task failure, the stage is re-attempted. Tasks continue to fail due to fetch failure from the lost executor's shuffle output. This time, since the failed epoch for the executor is higher, the executor is removed again (this doesn't really do anything, the executor was already removed when it was lost) and this time the shuffle output is unregistered. So it takes two stage attempts instead of one to clear the shuffle output. We get 4 attempts by default. The customer was unlucky and two nodes went down during the stage, i.e., the same problem happened twice. So they used up 4 stage attempts and the stage failed, and thus the job failed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
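[Editor's note] The two-attempt behaviour described in the report can be illustrated with a toy model of the epoch bookkeeping (a hypothetical Python simplification, not Spark's DAGScheduler code): a heartbeat-based executor loss records a failed epoch without unregistering shuffle output, and a later fetch failure only unregisters if the failing task was launched at a strictly newer epoch.

```python
class ToyDAGScheduler:
    """Toy model of the epoch gating described in the report (not Spark code)."""

    def __init__(self):
        self.epoch = 0
        self.failed_epoch = {}                 # executor id -> epoch of last handled loss
        self.registered_shuffle = {"exec-1"}   # executors with registered shuffle output

    def executor_lost(self, exec_id):
        # Heartbeat-based loss: the executor is removed, but its shuffle files
        # stay registered (the external shuffle service is assumed to serve them).
        self.failed_epoch[exec_id] = self.epoch
        self.epoch += 1

    def fetch_failed(self, exec_id, task_epoch):
        # A fetch failure carries the epoch at which the failing task was
        # launched; it is ignored unless strictly newer than the recorded loss.
        if exec_id not in self.failed_epoch or task_epoch > self.failed_epoch[exec_id]:
            self.failed_epoch[exec_id] = task_epoch
            self.registered_shuffle.discard(exec_id)


sched = ToyDAGScheduler()
sched.executor_lost("exec-1")                 # node goes down; loss recorded at epoch 0
sched.fetch_failed("exec-1", task_epoch=0)    # stage attempt 1: epoch 0 is not newer, ignored
attempt1_cleared = "exec-1" not in sched.registered_shuffle
sched.fetch_failed("exec-1", task_epoch=1)    # stage attempt 2: newer epoch, finally unregisters
attempt2_cleared = "exec-1" not in sched.registered_shuffle
```

Under this toy model the shuffle output survives the first stage attempt and is only dropped on the second, matching the "two stage attempts instead of one" observation.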
[jira] [Commented] (SPARK-30876) Optimizer cannot infer from inferred constraints with join
[ https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136753#comment-17136753 ] Navin Viswanath commented on SPARK-30876: - Hi [~yumwang] I'd be happy to take a look at this if that's ok. > Optimizer cannot infer from inferred constraints with join > -- > > Key: SPARK-30876 > URL: https://issues.apache.org/jira/browse/SPARK-30876 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(a int, b int, c int); > create table t2(a int, b int, c int); > create table t3(a int, b int, c int); > select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and > t3.c = 1); > {code} > Spark 2.3+: > {noformat} > == Physical Plan == > *(4) HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition, true, [id=#102] >+- *(3) HashAggregate(keys=[], functions=[partial_count(1)]) > +- *(3) Project > +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight > :- *(3) Project [b#10] > : +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight > : :- *(3) Project [a#6] > : : +- *(3) Filter isnotnull(a#6) > : : +- *(3) ColumnarToRow > : :+- FileScan parquet default.t1[a#6] Batched: true, > DataFilters: [isnotnull(a#6)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: > struct > : +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#87] > :+- *(1) Project [b#10] > : +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1)) > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t2[b#10] Batched: > true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: > 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], > ReadSchema: struct > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#96] >+- *(2) Project [c#14] > +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1)) > +- *(2) ColumnarToRow > +- FileScan parquet default.t3[c#14] Batched: true, > DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], > ReadSchema: struct > Time taken: 3.785 seconds, Fetched 1 row(s) > {noformat} > Spark 2.2.x: > {noformat} > == Physical Plan == > *HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[partial_count(1)]) > +- *Project > +- *SortMergeJoin [b#19], [c#23], Inner > :- *Project [b#19] > : +- *SortMergeJoin [a#15], [b#19], Inner > : :- *Sort [a#15 ASC NULLS FIRST], false, 0 > : : +- Exchange hashpartitioning(a#15, 200) > : : +- *Filter (isnotnull(a#15) && (a#15 = 1)) > : :+- HiveTableScan [a#15], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, > b#16, c#17] > : +- *Sort [b#19 ASC NULLS FIRST], false, 0 > :+- Exchange hashpartitioning(b#19, 200) > : +- *Filter (isnotnull(b#19) && (b#19 = 1)) > : +- HiveTableScan [b#19], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, > b#19, c#20] > +- *Sort [c#23 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(c#23, 200) > +- *Filter (isnotnull(c#23) && (c#23 = 1)) > +- HiveTableScan [c#23], HiveTableRelation > `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, > b#22, c#23] > Time taken: 0.728 seconds, Fetched 1 row(s) > {noformat} > Spark 2.2 can infer {{(a#15 = 
1)}}, but Spark 2.3+ can't. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
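[Editor's note] The inference Spark 2.2 performs here — propagating t3.c = 1 through t2.b = t3.c and t1.a = t2.b — can be sketched as a transitive closure over equality constraints. Below is a union-find sketch of the idea in Python (illustrative only, not Catalyst's InferFiltersFromConstraints rule); column names mirror the reproduction above.

```python
def infer_constant_filters(equalities, constants):
    """equalities: list of (col, col) pairs; constants: dict col -> literal.
    Returns every column pinned to a literal via transitive equality."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in equalities:
        union(a, b)

    pinned = {}
    for col, lit in constants.items():
        root = find(col)
        for c in list(parent):
            if find(c) == root:
                pinned[c] = lit
    return pinned


# The join condition from the reproduction: t1.a = t2.b, t2.b = t3.c, t3.c = 1
pinned = infer_constant_filters(
    [("t1.a", "t2.b"), ("t2.b", "t3.c")], {"t3.c": 1})
```

With this closure, the filter `a = 1` can be pushed down to the scan of t1 — the pushdown visible in the Spark 2.2 plan but missing from the 2.3+ plan above.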
[jira] [Commented] (SPARK-31710) Fail casting numeric to timestamp by default
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136674#comment-17136674 ] Apache Spark commented on SPARK-31710: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28843 > Fail casting numeric to timestamp by default > > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Assignee: philipse >Priority: Major > Fix For: 3.1.0 > > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd > hh:mm:ss[.fffffffff] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at 
org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I tried Hive; it works well, and the conversion is correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this conversion, should we keep all cases the same? > q2: > if we allow the conversion in some cases, should we check the length of the > long, for the code seems to force the conversion to microseconds by > multiplying by 1000000 no matter how large the value is; if it converts to a > timestamp of the wrong magnitude, we can raise an error. > {code:java} > // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
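[Editor's note] A quick way to see why the value renders as the year 52238: 1586318188000 is an epoch value in *milliseconds*, but the cast interprets the bigint as *seconds* before scaling to microseconds. A Python sketch of the unit mismatch (illustrative arithmetic, not Spark's cast implementation):

```python
from datetime import datetime, timezone

raw = 1586318188000          # epoch milliseconds for 2020-04-08 (as inserted above)

# Interpreted as seconds — what the cast does — the value lands ~50,000 years
# past 1970, too far out even for Python's datetime, so estimate the year:
SECONDS_PER_YEAR = 31_556_952        # 365.2425 days
wrong_year = 1970 + raw // SECONDS_PER_YEAR   # matches the "52238-06-04" output

# Interpreted as milliseconds, the value is the date the reporter expected,
# consistent with the Hive output quoted above:
right = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
```

The `t * 1000000L` in `longToTimestamp` is the seconds-to-microseconds scaling the reporter is asking about; it multiplies unconditionally, so an input already in milliseconds overshoots by a factor of 1000.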
[jira] [Updated] (SPARK-32002) spark error while select nest data
[ https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Zhang updated SPARK-32002: Description: nest-data.json {code:java} {"a": [{"b": [{"c": [1,2]}]}]} {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} {code:java} val df: DataFrame = spark.read.json(testFile("nest-data.json")) df.createTempView("nest_table") sql("select a.b.c from nest_table").show() {code} {color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;{color} {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color} {color:#ff}+- SubqueryAlias `nest_table`{color} {color:#ff} +- Relation[a#6|#6] json{color} {color:#172b4d}Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match {color}GetArrayItem, extraction ("c") treat as an ordinal. 
org.apache.spark.sql.catalyst.expressions.ExtractValue {code:java} def apply( child: Expression, extraction: Expression, resolver: Resolver): Expression = { (child.dataType, extraction) match { case (StructType(fields), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetStructField(child, ordinal, Some(fieldName)) case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetArrayStructFields(child, fields(ordinal).copy(name = fieldName), ordinal, fields.length, containsNull) case (_: ArrayType, _) => GetArrayItem(child, extraction) case (MapType(kt, _, _), _) => GetMapValue(child, extraction) case (otherType, _) => val errorMsg = otherType match { case StructType(_) => s"Field name should be String Literal, but it's $extraction" case other => s"Can't extract value from $child: need struct type but got ${other.catalogString}" } throw new AnalysisException(errorMsg) } }{code} was: nest-data.json {code:java} {"a": [{"b": [{"c": [1,2]}]}]} {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} {code:java} val df: DataFrame = spark.read.json(testFile("nest-data.json")) df.createTempView("nest_table") sql("select a.b.c from nest_table").show() {code} {color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;{color} {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color} {color:#ff}+- SubqueryAlias `nest_table`{color} {color:#ff} +- Relation[a#6|#6] json{color} {color:#172b4d}Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match {color}GetArrayItem, extraction ("c") treat as an ordinal. 
{code:java} def apply( child: Expression, extraction: Expression, resolver: Resolver): Expression = { (child.dataType, extraction) match { case (StructType(fields), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetStructField(child, ordinal, Some(fieldName)) case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetArrayStructFields(child, fields(ordinal).copy(name = fieldName), ordinal, fields.length, containsNull) case (_: ArrayType, _) => GetArrayItem(child, extraction) case (MapType(kt, _, _), _) => GetMapValue(child, extraction) case (otherType, _) => val errorMsg = otherType match { case StructType(_) => s"Field name should be String Literal, but it's $extraction" case other => s"Can't extract value from $child: need struct type but got ${other.catalogString}" } throw new AnalysisException(errorMsg) } }{code} > spark error while select nest data > -- > > Key: SPARK-32002 > URL: https://issues.apache.org/jira/browse/SPARK-32002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Yiqun Zhang >Priority: Minor > > nest-data.json > {code:java} > {"a": [{"b": [{"c": [1,2]}]}]} > {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} > {code:java} > val df: DataFrame =
[jira] [Updated] (SPARK-32002) spark error while select nest data
[ https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Zhang updated SPARK-32002: Description: nest-data.json {code:java} {"a": [{"b": [{"c": [1,2]}]}]} {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} {code:java} val df: DataFrame = spark.read.json(testFile("nest-data.json")) df.createTempView("nest_table") sql("select a.b.c from nest_table").show() {code} {color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;{color} {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color} {color:#ff}+- SubqueryAlias `nest_table`{color} {color:#ff} +- Relation[a#6|#6] json{color} {color:#172b4d}Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match {color}GetArrayItem, extraction ("c") treat as an ordinal. 
{code:java} def apply( child: Expression, extraction: Expression, resolver: Resolver): Expression = { (child.dataType, extraction) match { case (StructType(fields), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetStructField(child, ordinal, Some(fieldName)) case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetArrayStructFields(child, fields(ordinal).copy(name = fieldName), ordinal, fields.length, containsNull) case (_: ArrayType, _) => GetArrayItem(child, extraction) case (MapType(kt, _, _), _) => GetMapValue(child, extraction) case (otherType, _) => val errorMsg = otherType match { case StructType(_) => s"Field name should be String Literal, but it's $extraction" case other => s"Can't extract value from $child: need struct type but got ${other.catalogString}" } throw new AnalysisException(errorMsg) } }{code} was: nest-data.json {code:java} {"a": [{"b": [{"c": [1,2]}]}]} {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} {code:java} val df: DataFrame = spark.read.json(testFile("nest-data.json")) df.createTempView("nest_table") sql("select a.b.c from nest_table").show() {code} {color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;{color} {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color} {color:#ff}+- SubqueryAlias `nest_table`{color} {color:#ff} +- Relation[a#6|#6] json{color} {color:#172b4d}Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match {color}GetArrayItem, extraction ("c") treat as an ordinal. 
{code:java} def apply( child: Expression, extraction: Expression, resolver: Resolver): Expression = {(child.dataType, extraction) match { case (StructType(fields), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetStructField(child, ordinal, Some(fieldName)) case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetArrayStructFields(child, fields(ordinal).copy(name = fieldName), ordinal, fields.length, containsNull) case (_: ArrayType, _) => GetArrayItem(child, extraction) case (MapType(kt, _, _), _) => GetMapValue(child, extraction) case (otherType, _) => val errorMsg = otherType match { case StructType(_) => s"Field name should be String Literal, but it's $extraction" case other => s"Can't extract value from $child: need struct type but got ${other.catalogString}" } throw new AnalysisException(errorMsg) } }{code} > spark error while select nest data > -- > > Key: SPARK-32002 > URL: https://issues.apache.org/jira/browse/SPARK-32002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Yiqun Zhang >Priority: Minor > > nest-data.json > {code:java} > {"a": [{"b": [{"c": [1,2]}]}]} > {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} > {code:java} > val df: DataFrame = spark.read.json(testFile("nest-data.json")) > df.createTempView("nest_table") > sql("select a.b.c
[jira] [Updated] (SPARK-32002) spark error while select nest data
[ https://issues.apache.org/jira/browse/SPARK-32002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yiqun Zhang updated SPARK-32002: Description: nest-data.json {code:java} {"a": [{"b": [{"c": [1,2]}]}]} {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} {code:java} val df: DataFrame = spark.read.json(testFile("nest-data.json")) df.createTempView("nest_table") sql("select a.b.c from nest_table").show() {code} {color:#ff}org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;{color} {color:#ff}'Project [a#6.b[c] AS c#8|#6.b[c] AS c#8]{color} {color:#ff}+- SubqueryAlias `nest_table`{color} {color:#ff} +- Relation[a#6|#6] json{color} {color:#172b4d}Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match {color}GetArrayItem, extraction ("c") treat as an ordinal. 
{code:java} def apply( child: Expression, extraction: Expression, resolver: Resolver): Expression = {(child.dataType, extraction) match { case (StructType(fields), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetStructField(child, ordinal, Some(fieldName)) case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetArrayStructFields(child, fields(ordinal).copy(name = fieldName), ordinal, fields.length, containsNull) case (_: ArrayType, _) => GetArrayItem(child, extraction) case (MapType(kt, _, _), _) => GetMapValue(child, extraction) case (otherType, _) => val errorMsg = otherType match { case StructType(_) => s"Field name should be String Literal, but it's $extraction" case other => s"Can't extract value from $child: need struct type but got ${other.catalogString}" } throw new AnalysisException(errorMsg) } }{code} was: nest-data.json {code:java} {"a": [{"b": [{"c": [1,2]}]}]} {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} {code:java} val df: DataFrame = spark.read.json(testFile("nest-data.json")) df.createTempView("nest_table") sql("select a.b.c from nest_table").show() {code} {color:#FF}org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;{color} {color:#FF}'Project [a#6.b[c] AS c#8]{color} {color:#FF}+- SubqueryAlias `nest_table`{color} {color:#FF} +- Relation[a#6] json{color} {color:#172b4d}Analyse the causes, a.b Expression dataType match extractor for c, but a.b extractor is GetArrayStructFields, ArrayType(ArrayType()) match {color}GetArrayItem, extraction ("c") treat as an ordinal. 
{code:java} def apply( child: Expression, extraction: Expression, resolver: Resolver): Expression = {(child.dataType, extraction) match { case (StructType(fields), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetStructField(child, ordinal, Some(fieldName)) case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) => val fieldName = v.toString val ordinal = findField(fields, fieldName, resolver) GetArrayStructFields(child, fields(ordinal).copy(name = fieldName), ordinal, fields.length, containsNull) case (_: ArrayType, _) => GetArrayItem(child, extraction) case (MapType(kt, _, _), _) => GetMapValue(child, extraction) case (otherType, _) => val errorMsg = otherType match { case StructType(_) => s"Field name should be String Literal, but it's $extraction" case other => s"Can't extract value from $child: need struct type but got ${other.catalogString}" } throw new AnalysisException(errorMsg) } }{code} > spark error while select nest data > -- > > Key: SPARK-32002 > URL: https://issues.apache.org/jira/browse/SPARK-32002 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Yiqun Zhang >Priority: Minor > > nest-data.json > {code:java} > {"a": [{"b": [{"c": [1,2]}]}]} > {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} > {code:java} > val df: DataFrame = spark.read.json(testFile("nest-data.json")) > df.createTempView("nest_table") > sql("select a.b.c from nest_table").show() > {code} >
[jira] [Created] (SPARK-32002) spark error while select nest data
Yiqun Zhang created SPARK-32002: --- Summary: spark error while select nest data Key: SPARK-32002 URL: https://issues.apache.org/jira/browse/SPARK-32002 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: Yiqun Zhang nest-data.json {code:java} {"a": [{"b": [{"c": [1,2]}]}]} {"a": [{"b": [{"c": [1]}, {"c": [2]}]}]}{code} {code:java} val df: DataFrame = spark.read.json(testFile("nest-data.json")) df.createTempView("nest_table") sql("select a.b.c from nest_table").show() {code} {color:#FF}org.apache.spark.sql.AnalysisException: cannot resolve 'nest_table.`a`.`b`['c']' due to data type mismatch: argument 2 requires integral type, however, ''c'' is of string type.; line 1 pos 7;{color} {color:#FF}'Project [a#6.b[c] AS c#8]{color} {color:#FF}+- SubqueryAlias `nest_table`{color} {color:#FF} +- Relation[a#6] json{color} {color:#172b4d}Analysing the cause: the a.b expression's dataType selects the extractor for c, but the a.b extractor is GetArrayStructFields, so the resulting ArrayType(ArrayType(...)) matches {color}GetArrayItem, and the extraction ("c") is treated as an ordinal. 
{code:java}
def apply(
    child: Expression,
    extraction: Expression,
    resolver: Resolver): Expression = {
  (child.dataType, extraction) match {
    case (StructType(fields), NonNullLiteral(v, StringType)) =>
      val fieldName = v.toString
      val ordinal = findField(fields, fieldName, resolver)
      GetStructField(child, ordinal, Some(fieldName))

    case (ArrayType(StructType(fields), containsNull), NonNullLiteral(v, StringType)) =>
      val fieldName = v.toString
      val ordinal = findField(fields, fieldName, resolver)
      GetArrayStructFields(child, fields(ordinal).copy(name = fieldName),
        ordinal, fields.length, containsNull)

    case (_: ArrayType, _) => GetArrayItem(child, extraction)

    case (MapType(kt, _, _), _) => GetMapValue(child, extraction)

    case (otherType, _) =>
      val errorMsg = otherType match {
        case StructType(_) =>
          s"Field name should be String Literal, but it's $extraction"
        case other =>
          s"Can't extract value from $child: need struct type but got ${other.catalogString}"
      }
      throw new AnalysisException(errorMsg)
  }
}
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
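[Editor's note] The mis-dispatch can be traced through the match above: `a` has type array&lt;struct&gt;, so extracting `b` hits the GetArrayStructFields case and `a.b` ends up typed array&lt;array&lt;struct&gt;&gt;; extracting `c` from that then misses the array-of-struct case and falls into the generic `case (_: ArrayType, _)` branch, which treats the string "c" as an index. A toy Python rendering of that dispatch (illustrative type algebra only, not Spark code):

```python
def dispatch(child_type, extraction):
    """Mimic the (child.dataType, extraction) match with a tiny type algebra:
    ("array", elem_type) and ("struct", field_names)."""
    kind = child_type[0]
    if kind == "struct" and isinstance(extraction, str):
        return "GetStructField"
    if kind == "array" and child_type[1][0] == "struct" and isinstance(extraction, str):
        return "GetArrayStructFields"
    if kind == "array":
        # array<array<struct>> with a string field name lands here, so "c" is
        # treated as an ordinal -> "argument 2 requires integral type"
        return "GetArrayItem"
    raise TypeError("unsupported extraction")


a_type  = ("array", ("struct", ["b"]))              # a   : array<struct<b:...>>
ab_type = ("array", ("array", ("struct", ["c"])))   # a.b : array<array<struct<c:...>>>

step1 = dispatch(a_type, "b")    # resolves fine
step2 = dispatch(ab_type, "c")   # falls into the generic array branch
```

In this toy model, `step1` resolves to GetArrayStructFields while `step2` falls through to GetArrayItem, reproducing the shape of the reported AnalysisException.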
[jira] [Assigned] (SPARK-31984) Make micros rebasing functions via local timestamps pure
[ https://issues.apache.org/jira/browse/SPARK-31984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31984: --- Assignee: Maxim Gekk > Make micros rebasing functions via local timestamps pure > > > Key: SPARK-31984 > URL: https://issues.apache.org/jira/browse/SPARK-31984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > The functions rebaseGregorianToJulianMicros(zoneId: ZoneId, ...) and > rebaseJulianToGregorianMicros(zoneId: ZoneId, ...) accept the zone id as the > first parameter but use it only while forming ZonedDateTime and ignore it in the > Java 7 GregorianCalendar. The Calendar instance uses the default JVM time > zone internally. This causes the following problems: > # The functions depend on the global state variable. And calling the > functions from different threads can return wrong results if the default > JVM time zone is changed during the execution. > # It is impossible to speed up generation of JSON files with diff/switch via > parallelisation. > # The functions don't fully use the passed ZoneId. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31984) Make micros rebasing functions via local timestamps pure
[ https://issues.apache.org/jira/browse/SPARK-31984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31984. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28824 [https://github.com/apache/spark/pull/28824] > Make micros rebasing functions via local timestamps pure > > > Key: SPARK-31984 > URL: https://issues.apache.org/jira/browse/SPARK-31984 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.1.0 > > > The functions rebaseGregorianToJulianMicros(zoneId: ZoneId, ...) and > rebaseJulianToGregorianMicros(zoneId: ZoneId, ...) accept the zone id as the > first parameter but use it only while forming ZonedDateTime and ignore it in the > Java 7 GregorianCalendar. The Calendar instance uses the default JVM time > zone internally. This causes the following problems: > # The functions depend on the global state variable. And calling the > functions from different threads can return wrong results if the default > JVM time zone is changed during the execution. > # It is impossible to speed up generation of JSON files with diff/switch via > parallelisation. > # The functions don't fully use the passed ZoneId. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
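The purity problem described above can be shown with a small Python analogy (illustrative only; Spark's rebasing code is Scala/Java). A conversion that accepts a zone argument but silently consults process-global state is not a pure function: its result can change if another thread alters the default zone mid-run, which is exactly the hazard with GregorianCalendar using the JVM default time zone.

```python
from datetime import datetime, timezone, timedelta

# Stands in for the mutable JVM default time zone (global state).
DEFAULT_TZ = timezone(timedelta(hours=0))

def to_local_impure(ts_utc, zone):
    # `zone` is accepted but ignored, like the Calendar code path:
    # the result depends on the global DEFAULT_TZ instead.
    return ts_utc.astimezone(DEFAULT_TZ)

def to_local_pure(ts_utc, zone):
    # Fully determined by its arguments -- safe under concurrency.
    return ts_utc.astimezone(zone)

ts = datetime(2020, 6, 1, 12, 0, tzinfo=timezone.utc)
paris = timezone(timedelta(hours=2))
print(to_local_impure(ts, paris).hour)  # 12 -- zone argument silently ignored
print(to_local_pure(ts, paris).hour)    # 14 -- respects the passed zone
```

Making the function pure (point 1 above) also unlocks point 2: independent calls with different zones can run in parallel without coordinating on global state.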
[jira] [Updated] (SPARK-31981) Keep TimestampType when taking an average of a Timestamp
[ https://issues.apache.org/jira/browse/SPARK-31981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31981: Fix Version/s: (was: 3.1.0) > Keep TimestampType when taking an average of a Timestamp > > > Key: SPARK-31981 > URL: https://issues.apache.org/jira/browse/SPARK-31981 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Fokko Driesprong >Priority: Major > > Currently, when you take an average of a Timestamp, you'll end up with a > Double, representing the seconds since epoch. This is because of old Hive > behavior. I strongly believe that it is better to return a Timestamp. > root@8c4241b617ec:/# psql postgres postgres > psql (12.3 (Debian 12.3-1.pgdg100+1)) > Type "help" for help. > postgres=# CREATE TABLE timestamp_demo (ts TIMESTAMP); > CREATE TABLE > postgres=# INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11'); > INSERT 0 1 > postgres=# INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11'); > INSERT 0 1 > postgres=# INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11'); > INSERT 0 1 > postgres=# SELECT AVG(ts) FROM timestamp_demo; > ERROR: function avg(timestamp without time zone) does not exist > LINE 1: SELECT AVG(ts) FROM timestamp_demo; > > root@bab43a5731e8:/# mysql > Welcome to the MySQL monitor. Commands end with ; or \g. > Your MySQL connection id is 9 > Server version: 8.0.20 MySQL Community Server - GPL > Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved. > Oracle is a registered trademark of Oracle Corporation and/or its > affiliates. Other names may be trademarks of their respective > owners. > Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. 
> mysql> CREATE TABLE timestamp_demo (ts TIMESTAMP); > Query OK, 0 rows affected (0.05 sec) > mysql> INSERT INTO timestamp_demo VALUES('2019-01-01 18:22:11'); > Query OK, 1 row affected (0.01 sec) > mysql> INSERT INTO timestamp_demo VALUES('2018-01-01 18:22:11'); > Query OK, 1 row affected (0.01 sec) > mysql> INSERT INTO timestamp_demo VALUES('2017-01-01 18:22:11'); > Query OK, 1 row affected (0.01 sec) > mysql> SELECT AVG(ts) FROM timestamp_demo; > +-+ > | AVG(ts) | > +-+ > | 20180101182211. | > +-+ > 1 row in set (0.00 sec) > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
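A rough sketch of the semantics the reporter proposes, using Python's datetime purely for illustration (Spark's implementation would differ): average the epoch seconds as today, but convert the result back to a timestamp instead of surfacing the raw double.

```python
from datetime import datetime, timezone

# The three rows inserted in the transcripts above.
rows = [
    datetime(2019, 1, 1, 18, 22, 11, tzinfo=timezone.utc),
    datetime(2018, 1, 1, 18, 22, 11, tzinfo=timezone.utc),
    datetime(2017, 1, 1, 18, 22, 11, tzinfo=timezone.utc),
]

# Current Spark behaviour: a double of mean epoch seconds.
mean_epoch = sum(t.timestamp() for t in rows) / len(rows)

# Proposed behaviour: convert back to a TimestampType value.
avg_ts = datetime.fromtimestamp(mean_epoch, tz=timezone.utc)
print(avg_ts.isoformat())  # midpoint of the three timestamps
```

Because the three inputs are evenly spaced one year apart (no leap year in between), the average lands exactly on the middle timestamp, 2018-01-01 18:22:11 UTC.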
[jira] [Updated] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-32001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi updated SPARK-32001: -- Description: JDBC connection providers must be loaded independently just like delegation token providers. > Create Kerberos authentication provider API in JDBC connector > - > > Key: SPARK-32001 > URL: https://issues.apache.org/jira/browse/SPARK-32001 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > > JDBC connection providers must be loaded independently just like delegation > token providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
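The "loaded independently just like delegation token providers" idea has the shape sketched below. All names here are hypothetical, not Spark's API: each provider declares which connections it can handle, providers are discovered into a registry (in Spark this would be a ServiceLoader-style mechanism; here a plain list), and the connector asks each in turn.

```python
from abc import ABC, abstractmethod

class ConnectionProvider(ABC):
    """Hypothetical provider SPI -- illustrative, not Spark's interface."""
    @abstractmethod
    def can_handle(self, url: str) -> bool: ...
    @abstractmethod
    def connect(self, url: str) -> str: ...

class KerberosPostgresProvider(ConnectionProvider):
    def can_handle(self, url):
        return url.startswith("jdbc:postgresql:")
    def connect(self, url):
        return f"kerberos connection to {url}"

class BasicProvider(ConnectionProvider):
    """Fallback provider: handles anything with plain authentication."""
    def can_handle(self, url):
        return True
    def connect(self, url):
        return f"basic connection to {url}"

# In a real system this registry would be populated by service discovery,
# so new databases can plug in without touching the connector itself.
PROVIDERS = [KerberosPostgresProvider(), BasicProvider()]

def get_connection(url):
    provider = next(p for p in PROVIDERS if p.can_handle(url))
    return provider.connect(url)

print(get_connection("jdbc:postgresql://db/app"))
print(get_connection("jdbc:mysql://db/app"))
```

The design point is that Kerberos support for a new database (Hive, MongoDB, Azure SQLDB in the sub-tasks below) becomes a new provider class rather than a change to the JDBC connector.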
[jira] [Commented] (SPARK-31857) Support Azure SQLDB Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136579#comment-17136579 ] Gabor Somogyi commented on SPARK-31857: --- My plan is to support external providers which will be implemented here: https://issues.apache.org/jira/browse/SPARK-32001 > Support Azure SQLDB Kerberos login in JDBC connector > > > Key: SPARK-31857 > URL: https://issues.apache.org/jira/browse/SPARK-31857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31815) Support Hive Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136580#comment-17136580 ] Gabor Somogyi commented on SPARK-31815: --- My plan is to support external providers which will be implemented here: https://issues.apache.org/jira/browse/SPARK-32001 > Support Hive Kerberos login in JDBC connector > - > > Key: SPARK-31815 > URL: https://issues.apache.org/jira/browse/SPARK-31815 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31884) Support MongoDB Kerberos login in JDBC connector
[ https://issues.apache.org/jira/browse/SPARK-31884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136578#comment-17136578 ] Gabor Somogyi commented on SPARK-31884: --- My plan is to support external providers which will be implemented here: https://issues.apache.org/jira/browse/SPARK-32001 > Support MongoDB Kerberos login in JDBC connector > > > Key: SPARK-31884 > URL: https://issues.apache.org/jira/browse/SPARK-31884 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jobit mathew >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32001) Create Kerberos authentication provider API in JDBC connector
Gabor Somogyi created SPARK-32001: - Summary: Create Kerberos authentication provider API in JDBC connector Key: SPARK-32001 URL: https://issues.apache.org/jira/browse/SPARK-32001 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136550#comment-17136550 ] Lin Gang Deng commented on SPARK-31955: --- [~dengzh], the version of beeline in Spark 2.4 and earlier is 1.2.1; Spark 3.0 upgrades beeline to 2.3.7. For Spark, the issue was fixed in 3.0. > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > Fix For: 3.0.0 > > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136535#comment-17136535 ] Zhihua Deng commented on SPARK-31955: - The issue seems to have been fixed in [HIVE-10541|https://issues.apache.org/jira/browse/HIVE-10541]. > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > Fix For: 3.0.0 > > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136527#comment-17136527 ] Yuming Wang commented on SPARK-31955: - The issue was fixed by upgrading beeline to 2.3.7. How to reproduce this issue: *Prepare table* {code:sql} create table test_beeline using parquet as select id from range(5); {code} *Prepare SQL* {code:sql} echo -en "select * from test_beeline\n where id=2;" >> test.sql {code} *Spark 2.4*: {noformat} [root@spark-3267648 spark-2.4.4-bin-hadoop2.7]# bin/beeline -u "jdbc:hive2://localhost:1" -f /root/spark-3.0.0-bin-hadoop3.2/test.sql Connecting to jdbc:hive2://localhost:1 log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Connected to: Spark SQL (version 3.0.0) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://localhost:1> select * from test_beeline 0: jdbc:hive2://localhost:1> where id=2;+-+--+ | id | +-+--+ | 0 | | 2 | | 1 | | 3 | | 4 | +-+--+ 5 rows selected (5.622 seconds) 0: jdbc:hive2://localhost:1> where id=2; Closing: 0: jdbc:hive2://localhost:1 {noformat} *Spark 3.0*: {noformat} [root@spark-3267648 spark-3.0.0-bin-hadoop3.2]# bin/beeline -u "jdbc:hive2://localhost:1" -f test.sql log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Connecting to jdbc:hive2://localhost:1 Connected to: Spark SQL (version 3.0.0) Driver: Hive JDBC (version 2.3.7) Transaction isolation: TRANSACTION_REPEATABLE_READ 0: jdbc:hive2://localhost:1> select * from test_beeline . . . . . . . . . . . . . . . 
.> where id=2; +-+ | id | +-+ | 2 | +-+ 1 row selected (7.749 seconds) 0: jdbc:hive2://localhost:1> Closing: 0: jdbc:hive2://localhost:1 {noformat} > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > Fix For: 3.0.0 > > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
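Note that the reproduction writes test.sql with `echo -en`, i.e. without a trailing newline. A plausible failure mode, sketched here in Python as an analogy (beeline's parser is Java and the real fix is in HIVE-10541), is a reader that only emits a line when it sees `\n`, silently dropping the final unterminated fragment, which would lose exactly the " where id=2;" clause.

```python
# The file from the reproduction above, with no trailing newline.
sql = "select * from test_beeline\n where id=2;"

def naive_lines(text):
    """Emit a line only on '\n' -- a common bug with unterminated files."""
    buf, out = "", []
    for ch in text:
        if ch == "\n":
            out.append(buf)
            buf = ""
        else:
            buf += ch
    return out   # forgets `buf`: the unterminated last line is dropped

print(naive_lines(sql))    # only the first line survives
print(sql.splitlines())    # a correct splitter keeps both lines
```

Run against the reproduction's input, the naive reader returns only `['select * from test_beeline']`, matching the Spark 2.4 transcript where the WHERE clause was ignored and all five rows came back.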
[jira] [Reopened] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reopened SPARK-31955: - > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > Fix For: 3.0.0 > > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-31955. - Resolution: Fixed > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > Fix For: 3.0.0 > > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31955) Beeline discard the last line of the sql file when submited to thriftserver via beeline
[ https://issues.apache.org/jira/browse/SPARK-31955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31955: Fix Version/s: 3.0.0 > Beeline discard the last line of the sql file when submited to thriftserver > via beeline > > > Key: SPARK-31955 > URL: https://issues.apache.org/jira/browse/SPARK-31955 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4 >Reporter: Lin Gang Deng >Priority: Major > Fix For: 3.0.0 > > > I submitted a sql file on beeline and the result returned is wrong. After > many tests, it was found that the sql executed by Spark would discard the > last line.This should be beeline's bug parsing sql file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31997) Should drop test_udtf table when SingleSessionSuite completed
[ https://issues.apache.org/jira/browse/SPARK-31997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31997: Assignee: Yang Jie > Should drop test_udtf table when SingleSessionSuite completed > - > > Key: SPARK-31997 > URL: https://issues.apache.org/jira/browse/SPARK-31997 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > If we execute mvn test SingleSessionSuite and HiveThriftBinaryServerSuite in > order, the test case "SPARK-11595 ADD JAR with input path having URL scheme" > in HiveThriftBinaryServerSuite will fail as follows: > > {code:java} > - SPARK-11595 ADD JAR with input path having URL scheme *** FAILED *** > java.sql.SQLException: Error running query: > org.apache.spark.sql.AnalysisException: Can not create the managed > table('`default`.`test_udtf`'). The associated > location('file:/home/yarn/spark_ut/spark_ut/baidu/inf-spark/spark-source/sql/hive-thriftserver/spark-warehouse/test_udtf') > already exists.; at > org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385) > at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65(HiveThriftServer2Suites.scala:603) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65$adapted(HiveThriftServer2Suites.scala:603) > at scala.collection.immutable.List.foreach(List.scala:392) at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63(HiveThriftServer2Suites.scala:603) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63$adapted(HiveThriftServer2Suites.scala:573) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3(HiveThriftServer2Suites.scala:1074) > at > 
org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3$adapted(HiveThriftServer2Suites.scala:1074) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > {code} > > because SingleSessionSuite does `create table test_udtf` and does not drop it when > the test completes, and HiveThriftBinaryServerSuite wants to re-create this table. > If we execute mvn test HiveThriftBinaryServerSuite and SingleSessionSuite in > order, both test suites succeed, but we shouldn't rely on their execution > order > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31997) Should drop test_udtf table when SingleSessionSuite completed
[ https://issues.apache.org/jira/browse/SPARK-31997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31997. -- Fix Version/s: 3.1.0 2.4.7 3.0.1 Resolution: Fixed Issue resolved by pull request 28838 [https://github.com/apache/spark/pull/28838] > Should drop test_udtf table when SingleSessionSuite completed > - > > Key: SPARK-31997 > URL: https://issues.apache.org/jira/browse/SPARK-31997 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.0.1, 2.4.7, 3.1.0 > > > If we execute mvn test SingleSessionSuite and HiveThriftBinaryServerSuite in > order, the test case "SPARK-11595 ADD JAR with input path having URL scheme" > in HiveThriftBinaryServerSuite will fail as follows: > > {code:java} > - SPARK-11595 ADD JAR with input path having URL scheme *** FAILED *** > java.sql.SQLException: Error running query: > org.apache.spark.sql.AnalysisException: Can not create the managed > table('`default`.`test_udtf`'). 
The associated > location('file:/home/yarn/spark_ut/spark_ut/baidu/inf-spark/spark-source/sql/hive-thriftserver/spark-warehouse/test_udtf') > already exists.; at > org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:385) > at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65(HiveThriftServer2Suites.scala:603) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$65$adapted(HiveThriftServer2Suites.scala:603) > at scala.collection.immutable.List.foreach(List.scala:392) at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63(HiveThriftServer2Suites.scala:603) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftBinaryServerSuite.$anonfun$new$63$adapted(HiveThriftServer2Suites.scala:573) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3(HiveThriftServer2Suites.scala:1074) > at > org.apache.spark.sql.hive.thriftserver.HiveThriftJdbcTest.$anonfun$withMultipleConnectionJdbcStatement$3$adapted(HiveThriftServer2Suites.scala:1074) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > {code} > > because SingleSessionSuite does `create table test_udtf` and does not drop it when > the test completes, and HiveThriftBinaryServerSuite wants to re-create this table. > If we execute mvn test HiveThriftBinaryServerSuite and SingleSessionSuite in > order, both test suites succeed, but we shouldn't rely on their execution > order > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
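The fix amounts to a teardown that removes the shared resource so suites pass in either order. A generic Python sketch of that discipline (the real change drops `test_udtf` when SingleSessionSuite completes; the names below are illustrative):

```python
# Stands in for the shared warehouse directory the two suites collide on.
TABLES = set()

def create_table(name):
    if name in TABLES:
        # Mirrors "The associated location ... already exists."
        raise RuntimeError(f"location for {name} already exists")
    TABLES.add(name)

def drop_table(name):
    TABLES.discard(name)

def single_session_suite():
    create_table("test_udtf")
    try:
        assert "test_udtf" in TABLES   # the suite's actual tests run here
    finally:
        drop_table("test_udtf")        # the missing cleanup the fix adds

def binary_server_suite():
    create_table("test_udtf")          # succeeds only if the other suite cleaned up
    drop_table("test_udtf")

# With teardown in place, execution order no longer matters.
single_session_suite()
binary_server_suite()
print("both suites pass in this order")
```

Without the `finally: drop_table(...)` line, running `single_session_suite` first leaves `test_udtf` behind and `binary_server_suite` raises, which is exactly the order dependence the issue describes.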
[jira] [Commented] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136495#comment-17136495 ] Apache Spark commented on SPARK-31962: -- User 'cchighman' has created a pull request for this issue: https://github.com/apache/spark/pull/28841 > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming or just loading from a file data source, I've > encountered a number of occasions where I want to be able to stream from a > folder containing any number of historical files in CSV format. When I start > reading from a folder, however, I might only care about files that were > created after a certain time. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], > there is a method, _listLeafFiles,_ which builds _FileStatus_ objects > containing an implicit _modificationDate_ property. We may already iterate > the resulting files if a filter is applied to the path. In this case, it's > trivial to do a primitive comparison against _modificationDate_ and a date > specified from an option. Without the filter specified, we would be > expending less effort than if the filter were applied by itself since we are > comparing primitives. 
> Having the ability to provide an option for specifying a timestamp when > loading files from a path would minimize complexity for consumers who > leverage the ability to load files or do structured streaming from a folder > path but do not have an interest in reading what could be thousands of files > that are not relevant. > One example could be "_fileModifiedDate_" accepting a UTC datetime like > below. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .option("fileModifiedDate", "2020-05-01T12:00:00") > .format("csv") > .load("/mnt/Deltas") > {code} > If this option is specified, the expected behavior would be that files within > the _"/mnt/Deltas/"_ path must have been modified at or later than the > specified time in order to be consumed for purposes of reading files from a > folder path or via structured streaming. > I have unit tests passing under _FileIndexSuite_ in the > _spark.sql.execution.datasources_ package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
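The comparison the proposed option would perform can be sketched with the standard library alone (illustrative; the actual PR works against Spark's FileStatus.modificationTime, and the `fileModifiedDate` name is the reporter's proposal, not a shipped option): list files under a path and keep only those modified at or after a cutoff.

```python
import os
import tempfile
from datetime import datetime, timezone

def list_files_modified_after(path, cutoff: datetime):
    """Keep files whose mtime is at or after the cutoff timestamp."""
    cutoff_s = cutoff.timestamp()
    return [e.path for e in os.scandir(path)
            if e.is_file() and e.stat().st_mtime >= cutoff_s]

# Demo: one stale file and one fresh file.
d = tempfile.mkdtemp()
old, new = os.path.join(d, "old.csv"), os.path.join(d, "new.csv")
for p in (old, new):
    with open(p, "w") as f:
        f.write("id\n1\n")
os.utime(old, (0, 0))   # pretend old.csv was last modified in 1970

cutoff = datetime(2020, 5, 1, 12, 0, tzinfo=timezone.utc)
kept = list_files_modified_after(d, cutoff)
print(sorted(os.path.basename(p) for p in kept))   # only the fresh file
```

As the description notes, when a path filter already forces iteration over the listed files, adding this primitive timestamp comparison per file is essentially free.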
[jira] [Commented] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136494#comment-17136494 ] Apache Spark commented on SPARK-31962: -- User 'cchighman' has created a pull request for this issue: https://github.com/apache/spark/pull/28841 > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming or just loading from a file data source, I've > encountered a number of occasions where I want to be able to stream from a > folder containing any number of historical files in CSV format. When I start > reading from a folder, however, I might only care about files that were > created after a certain time. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], > there is a method, _listLeafFiles,_ which builds _FileStatus_ objects > containing an implicit _modificationDate_ property. We may already iterate > the resulting files if a filter is applied to the path. In this case, it's > trivial to do a primitive comparison against _modificationDate_ and a date > specified from an option. Without the filter specified, we would be > expending less effort than if the filter were applied by itself since we are > comparing primitives. 
> Having the ability to provide an option for specifying a timestamp when > loading files from a path would minimize complexity for consumers who > leverage the ability to load files or do structured streaming from a folder > path but do not have an interest in reading what could be thousands of files > that are not relevant. > One example could be "_fileModifiedDate_" accepting a UTC datetime like > below. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .option("fileModifiedDate", "2020-05-01T12:00:00") > .format("csv") > .load("/mnt/Deltas") > {code} > If this option is specified, the expected behavior would be that files within > the _"/mnt/Deltas/"_ path must have been modified at or later than the > specified time in order to be consumed for purposes of reading files from a > folder path or via structured streaming. > I have unit tests passing under _FileIndexSuite_ in the > _spark.sql.execution.datasources_ package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31962: Assignee: (was: Apache Spark) > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming or just loading from a file data source, I've > encountered a number of occasions where I want to be able to stream from a > folder containing any number of historical files in CSV format. When I start > reading from a folder, however, I might only care about files that were > created after a certain time. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], > there is a method, _listLeafFiles,_ which builds _FileStatus_ objects > containing an implicit _modificationDate_ property. We may already iterate > the resulting files if a filter is applied to the path. In this case, it's > trivial to do a primitive comparison against _modificationDate_ and a date > specified from an option. Without the filter specified, we would be > expending less effort than if the filter were applied by itself since we are > comparing primitives. > Having the ability to provide an option for specifying a timestamp when > loading files from a path would minimize complexity for consumers who > leverage the ability to load files or do structured streaming from a folder > path but do not have an interest in reading what could be thousands of files > that are not relevant. 
> One example could be "_fileModifiedDate_" accepting a UTC datetime like > below. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .option("fileModifiedDate", "2020-05-01T12:00:00") > .format("csv") > .load("/mnt/Deltas") > {code} > If this option is specified, the expected behavior would be that files within > the _"/mnt/Deltas/"_ path must have been modified at or later than the > specified time in order to be consumed for purposes of reading files from a > folder path or via structured streaming. > I have unit tests passing under _FileIndexSuite_ in the > _spark.sql.execution.datasources_ package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31962: Assignee: Apache Spark > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Assignee: Apache Spark >Priority: Minor > > When using structured streaming or just loading from a file data source, I've > encountered a number of occasions where I want to be able to stream from a > folder containing any number of historical files in CSV format. When I start > reading from a folder, however, I might only care about files that were > created after a certain time. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], > there is a method, _listLeafFiles,_ which builds _FileStatus_ objects > containing an implicit _modificationDate_ property. We may already iterate > the resulting files if a filter is applied to the path. In this case, it's > trivial to do a primitive comparison against _modificationDate_ and a date > specified from an option. Without the filter specified, we would be > expending less effort than if the filter were applied by itself since we are > comparing primitives. > Having the ability to provide an option for specifying a timestamp when > loading files from a path would minimize complexity for consumers who > leverage the ability to load files or do structured streaming from a folder > path but do not have an interest in reading what could be thousands of files > that are not relevant. 
> One example could be "_fileModifiedDate_" accepting a UTC datetime like > below. > {code:java} > spark.read > .option("header", "true") > .option("delimiter", "\t") > .option("fileModifiedDate", "2020-05-01T12:00:00") > .format("csv") > .load("/mnt/Deltas") > {code} > If this option is specified, the expected behavior would be that files within > the _"/mnt/Deltas/"_ path must have been modified at or later than the > specified time in order to be consumed for purposes of reading files from a > folder path or via structured streaming. > I have unit tests passing under _FileIndexSuite_ in the > _spark.sql.execution.datasources_ package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming or just loading from a file data source, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.read .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option for specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example could be "_fileModifiedDate_" accepting a UTC datetime like below. 
{code:java} spark.read .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option for specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. 
One example could be "_fileModifiedDate_" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming or just loading from a file data
[jira] [Assigned] (SPARK-31999) Add refresh function command
[ https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31999: Assignee: Apache Spark > Add refresh function command > > > Key: SPARK-31999 > URL: https://issues.apache.org/jira/browse/SPARK-31999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31999) Add refresh function command
[ https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31999: Assignee: (was: Apache Spark) > Add refresh function command > > > Key: SPARK-31999 > URL: https://issues.apache.org/jira/browse/SPARK-31999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Description: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option for specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. One example could be "_fileModifiedDate_" accepting a UTC datetime like below. 
{code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("fileModifiedDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. was: When using structured streaming with a FileDataSource, I've encountered a number of occasions where I want to be able to stream from a folder containing any number of historical files in CSV format. When I start reading from a folder, however, I might only care about files that were created after a certain time. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .format("csv") .load("/mnt/Deltas") {code} In [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], there is a method, _listLeafFiles,_ which builds _FileStatus_ objects containing an implicit _modificationDate_ property. We may already iterate the resulting files if a filter is applied to the path. In this case, it's trivial to do a primitive comparison against _modificationDate_ and a date specified from an option. Without the filter specified, we would be expending less effort than if the filter were applied by itself since we are comparing primitives. Having the ability to provide an option for specifying a timestamp when loading files from a path would minimize complexity for consumers who leverage the ability to load files or do structured streaming from a folder path but do not have an interest in reading what could be thousands of files that are not relevant. 
One example could be "_filesModifiedAfterDate_" accepting a UTC datetime like below. {code:java} spark.readStream .option("header", "true") .option("delimiter", "\t") .option("filesModifiedAfterDate", "2020-05-01T12:00:00") .format("csv") .load("/mnt/Deltas") {code} If this option is specified, the expected behavior would be that files within the _"/mnt/Deltas/"_ path must have been modified at or later than the specified time in order to be consumed for purposes of reading files from a folder path or via structured streaming. I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the _spark.sql.execution.datasources_ package. > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a
[jira] [Updated] (SPARK-31962) Provide option to load files after a specified date when reading from a folder path
[ https://issues.apache.org/jira/browse/SPARK-31962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Highman updated SPARK-31962: Component/s: (was: Structured Streaming) > Provide option to load files after a specified date when reading from a > folder path > --- > > Key: SPARK-31962 > URL: https://issues.apache.org/jira/browse/SPARK-31962 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Christopher Highman >Priority: Minor > > When using structured streaming with a FileDataSource, I've encountered a > number of occasions where I want to be able to stream from a folder > containing any number of historical files in CSV format. When I start > reading from a folder, however, I might only care about files that were > created after a certain time. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .format("csv") > .load("/mnt/Deltas") > {code} > In > [https://github.com/apache/spark/blob/f3771c6b47d0b3aef10b86586289a1f675c7cfe2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala], > there is a method, _listLeafFiles,_ which builds _FileStatus_ objects > containing an implicit _modificationDate_ property. We may already iterate > the resulting files if a filter is applied to the path. In this case, it's > trivial to do a primitive comparison against _modificationDate_ and a date > specified from an option. Without the filter specified, we would be > expending less effort than if the filter were applied by itself since we are > comparing primitives. > Having the ability to provide an option for specifying a timestamp when > loading files from a path would minimize complexity for consumers who > leverage the ability to load files or do structured streaming from a folder > path but do not have an interest in reading what could be thousands of files > that are not relevant. 
> One example could be "_filesModifiedAfterDate_" accepting a UTC datetime > like below. > {code:java} > spark.readStream > .option("header", "true") > .option("delimiter", "\t") > .option("filesModifiedAfterDate", "2020-05-01T12:00:00") > .format("csv") > .load("/mnt/Deltas") > {code} > If this option is specified, the expected behavior would be that files within > the _"/mnt/Deltas/"_ path must have been modified at or later than the > specified time in order to be consumed for purposes of reading files from a > folder path or via structured streaming. > I have unit tests passing under _CSVSuite_ and _FileIndexSuite_ in the > _spark.sql.execution.datasources_ package. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31999) Add refresh function command
[ https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136458#comment-17136458 ] Apache Spark commented on SPARK-31999: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28840 > Add refresh function command > > > Key: SPARK-31999 > URL: https://issues.apache.org/jira/browse/SPARK-31999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31999) Add refresh function command
[ https://issues.apache.org/jira/browse/SPARK-31999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136460#comment-17136460 ] Apache Spark commented on SPARK-31999: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/28840 > Add refresh function command > > > Key: SPARK-31999 > URL: https://issues.apache.org/jira/browse/SPARK-31999 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: ulysses you >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.
[ https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32000: Assignee: Apache Spark (was: Kousuke Saruta) > Fix the flaky testcase for partially launched task in barrier-mode. > --- > > Key: SPARK-32000 > URL: https://issues.apache.org/jira/browse/SPARK-32000 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > I noticed that the testcase for SPARK-31485 sometimes fails. > The reason should be related to the locality wait. > If the scheduler waits for a resource offer which meets the preferred > location for a task until the time-limit of process-local but no resource can > be offered for that locality level, the scheduler will give up the preferred > location. In this case, such a task can be assigned to an off-preferred location. > In the testcase for SPARK-31485, there are two tasks and only one task is > supposed to be assigned in one schedule round, but both tasks can be > assigned in the situation mentioned above and the testcase will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.
[ https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136452#comment-17136452 ] Apache Spark commented on SPARK-32000: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/28839 > Fix the flaky testcase for partially launched task in barrier-mode. > --- > > Key: SPARK-32000 > URL: https://issues.apache.org/jira/browse/SPARK-32000 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > I noticed that the testcase for SPARK-31485 sometimes fails. > The reason should be related to the locality wait. > If the scheduler waits for a resource offer which meets the preferred > location for a task until the time-limit of process-local but no resource can > be offered for that locality level, the scheduler will give up the preferred > location. In this case, such a task can be assigned to an off-preferred location. > In the testcase for SPARK-31485, there are two tasks and only one task is > supposed to be assigned in one schedule round, but both tasks can be > assigned in the situation mentioned above and the testcase will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.
[ https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32000: Assignee: Kousuke Saruta (was: Apache Spark) > Fix the flaky testcase for partially launched task in barrier-mode. > --- > > Key: SPARK-32000 > URL: https://issues.apache.org/jira/browse/SPARK-32000 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > I noticed that the testcase for SPARK-31485 sometimes fails. > The reason should be related to the locality wait. > If the scheduler waits for a resource offer which meets the preferred > location for a task until the time-limit of process-local but no resource can > be offered for that locality level, the scheduler will give up the preferred > location. In this case, such a task can be assigned to an off-preferred location. > In the testcase for SPARK-31485, there are two tasks and only one task is > supposed to be assigned in one schedule round, but both tasks can be > assigned in the situation mentioned above and the testcase will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.
[ https://issues.apache.org/jira/browse/SPARK-32000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-32000: --- Issue Type: Bug (was: Improvement) > Fix the flaky testcase for partially launched task in barrier-mode. > --- > > Key: SPARK-32000 > URL: https://issues.apache.org/jira/browse/SPARK-32000 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > I noticed that the testcase for SPARK-31485 sometimes fails. > The reason should be related to the locality wait. > If the scheduler waits for a resource offer which meets the preferred > location for a task until the time-limit of process-local but no resource can > be offered for that locality level, the scheduler will give up the preferred > location. In this case, such a task can be assigned to an off-preferred location. > In the testcase for SPARK-31485, there are two tasks and only one task is > supposed to be assigned in one schedule round, but both tasks can be > assigned in the situation mentioned above and the testcase will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32000) Fix the flaky testcase for partially launched task in barrier-mode.
Kousuke Saruta created SPARK-32000: -- Summary: Fix the flaky testcase for partially launched task in barrier-mode. Key: SPARK-32000 URL: https://issues.apache.org/jira/browse/SPARK-32000 Project: Spark Issue Type: Improvement Components: Spark Core, Tests Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta I noticed that the testcase for SPARK-31485 sometimes fails. The reason should be related to the locality wait. If the scheduler waits for a resource offer which meets the preferred location for a task until the time-limit of process-local but no resource can be offered for that locality level, the scheduler will give up the preferred location. In this case, such a task can be assigned to an off-preferred location. In the testcase for SPARK-31485, there are two tasks and only one task is supposed to be assigned in one schedule round, but both tasks can be assigned in the situation mentioned above and the testcase will fail. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
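The locality-wait behavior described above can be sketched abstractly. The following is a hypothetical toy model of delay scheduling in plain Python, not Spark's scheduler API (the class and method names are invented for illustration): an offer below the preferred locality level is rejected until a wait budget is exhausted, after which the scheduler relaxes the requirement. A test assuming only one launch per round can therefore see extra launches once the budget has elapsed.

```python
# Toy model of delay scheduling (locality wait). Names are illustrative
# and do not correspond to Spark's actual TaskSetManager implementation.
PROCESS_LOCAL, NODE_LOCAL, ANY = 0, 1, 2

class DelayScheduler:
    def __init__(self, locality_wait):
        self.locality_wait = locality_wait  # seconds to insist on PROCESS_LOCAL
        self.elapsed = 0.0                  # time spent without a preferred offer

    def offer(self, level, dt):
        """Return True if a task may launch on an offer at `level`.

        `dt` is the time since the previous offer. Once the accumulated
        wait exceeds the budget, the preferred locality is given up and
        off-preferred offers are accepted too.
        """
        self.elapsed += dt
        if level == PROCESS_LOCAL:
            self.elapsed = 0.0  # preferred locality satisfied; reset the clock
            return True
        return self.elapsed > self.locality_wait
```

In this model, a test that fires two off-preferred offers in one "round" after a long pause launches both tasks, mirroring the flakiness described in the ticket.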
[jira] [Resolved] (SPARK-31710) Fail casting numeric to timestamp by default
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31710. - Fix Version/s: 3.1.0 Assignee: philipse Resolution: Fixed > Fail casting numeric to timestamp by default > > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Assignee: philipse >Priority: Major > Fix For: 3.1.0 > > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd > hh:mm:ss[.fffffffff] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.<init>(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at 
org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I tried Hive; it works well, and the conversion is correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this conversion, should we keep all cases the same? > q2: > if we allow the conversion in some cases, should we check the length of the > long value? The code seems to force the conversion to microseconds with times*1000000 no matter how long > the data is; if it converts to a timestamp with an incorrect length, we can raise > an error. > {code:java} > // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
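The wrong result in Case 1 is consistent with interpreting a millisecond epoch value as seconds. A quick check in plain Python (independent of Spark; the names below are just for illustration) shows that 1586318188000 read as seconds since the epoch lands around the year 52238, while dividing by 1000 first gives the 2020 date that Hive returns:

```python
from datetime import datetime, timezone

millis = 1586318188000  # the value inserted in the reproduction above

# Interpreted correctly as milliseconds since the Unix epoch:
correct = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

# Interpreted (incorrectly) as seconds: roughly how many years past 1970?
# Too large for datetime itself (year > 9999), so estimate arithmetically.
SECONDS_PER_YEAR = 365.2425 * 24 * 3600  # average Gregorian year
wrong_year = 1970 + int(millis / SECONDS_PER_YEAR)
```

Here `correct` is 2020-04-08 UTC, while `wrong_year` comes out near 52238, matching the bad partition value in the report.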
[jira] [Updated] (SPARK-31710) Fail casting numeric to timestamp by default
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31710: Summary: Fail casting numeric to timestamp by default (was: result is the not the same when query and execute jobs) > Fail casting numeric to timestamp by default > > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd > hh:mm:ss[.fffffffff] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.<init>(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at 
org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I tried Hive; it works well, and the conversion is correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this conversion, should we keep all cases the same? > q2: > if we allow the conversion in some cases, should we check the length of the > long value? The code seems to force the conversion to microseconds with times*1000000 no matter how long > the data is; if it converts to a timestamp with an incorrect length, we can raise > an error. > {code:java} > // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org