[jira] [Updated] (SPARK-32127) Check rules for MERGE INTO should use MergeAction.condition other than MergeAction.children
[ https://issues.apache.org/jira/browse/SPARK-32127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-32127: Summary: Check rules for MERGE INTO should use MergeAction.condition other than MergeAction.children (was: Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children) > Check rules for MERGE INTO should use MergeAction.condition other than > MergeAction.children > --- > > Key: SPARK-32127 > URL: https://issues.apache.org/jira/browse/SPARK-32127 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > [SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some > check rules for MERGE INTO one of which ensures the first MATCHED clause must > have a condition. However, it uses {{MergeAction.children}} in the checking > which is not accurate for the case, and it lets the below case pass the check: > {code:scala} > MERGE INTO testcat1.ns1.ns2.tbl AS target > xxx > WHEN MATCHED THEN UPDATE SET target.col2 = source.col2 > WHEN MATCHED THEN DELETE > xxx > {code} > We should use {{MergeAction.condition}} instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
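For illustration, here is a minimal sketch of the stricter check this ticket proposes, assuming simplified stand-ins for Spark's MergeAction and its condition field; this is not the actual CheckAnalysis code.
{code:scala}
// Simplified stand-in for Spark's MergeAction; only the condition matters for this check.
case class MergeAction(condition: Option[String], action: String)

def checkMatchedClauses(matchedActions: Seq[MergeAction]): Unit = {
  // Only the last MATCHED clause may omit its condition. Inspecting `condition`
  // directly avoids the imprecision of `children`, which also contains the
  // assignment expressions of an UPDATE action and can therefore be non-empty
  // even when no condition was given.
  matchedActions.dropRight(1).foreach { action =>
    require(action.condition.nonEmpty,
      "When there are multiple MATCHED clauses in a MERGE INTO statement, " +
        "only the last MATCHED clause can omit the condition.")
  }
}
{code}
With a check of this shape, the two unconditioned MATCHED clauses in the example above would be rejected instead of slipping through.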
[jira] [Updated] (SPARK-32127) Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children
[ https://issues.apache.org/jira/browse/SPARK-32127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-32127: Description: [SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some check rules for MERGE INTO one of which ensures the first MATCHED clause must have a condition. However, it uses {{MergeAction.children}} in the checking which is not accurate for the case, and it lets the below case pass the check: {code:scala} MERGE INTO testcat1.ns1.ns2.tbl AS target xxx WHEN MATCHED THEN UPDATE SET target.col2 = source.col2 WHEN MATCHED THEN DELETE xxx {code} We should use {{MergeAction.condition}} instead. was: [SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some check rules for MERGE INTO one of which ensures the first MATCHED clause must have a condition. However, it uses {MergeAction.children} in the checking which is not accurate for the case, and it lets the below case pass the check: {code:scala} MERGE INTO testcat1.ns1.ns2.tbl AS target xxx WHEN MATCHED THEN UPDATE SET target.col2 = source.col2 WHEN MATCHED THEN DELETE xxx {code} We should use {MergeAction.condition} instead. > Check rules for MERGE INTO should use MergeAction.condition other than > MeregAction.children > --- > > Key: SPARK-32127 > URL: https://issues.apache.org/jira/browse/SPARK-32127 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > [SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some > check rules for MERGE INTO one of which ensures the first MATCHED clause must > have a condition. However, it uses {{MergeAction.children}} in the checking > which is not accurate for the case, and it lets the below case pass the check: > {code:scala} > MERGE INTO testcat1.ns1.ns2.tbl AS target > xxx > WHEN MATCHED THEN UPDATE SET target.col2 = source.col2 > WHEN MATCHED THEN DELETE > xxx > {code} > We should use {{MergeAction.condition}} instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32127) Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children
Xianyin Xin created SPARK-32127: --- Summary: Check rules for MERGE INTO should use MergeAction.condition other than MeregAction.children Key: SPARK-32127 URL: https://issues.apache.org/jira/browse/SPARK-32127 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin [SPARK-30924|https://issues.apache.org/jira/browse/SPARK-30924] adds some check rules for MERGE INTO one of which ensures the first MATCHED clause must have a condition. However, it uses {MergeAction.children} in the checking which is not accurate for the case, and it lets the below case pass the check: {code:scala} MERGE INTO testcat1.ns1.ns2.tbl AS target xxx WHEN MATCHED THEN UPDATE SET target.col2 = source.col2 WHEN MATCHED THEN DELETE xxx {code} We should use {MergeAction.condition} instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32030) Support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO
[ https://issues.apache.org/jira/browse/SPARK-32030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-32030: Description: Now the {{MERGE INTO}} syntax is, {code:sql} MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [] [AS source_alias] ON [ WHEN MATCHED [ AND ] THEN ] [ WHEN MATCHED [ AND ] THEN ] [ WHEN NOT MATCHED [ AND ] THEN ]{code} It would be nice if we support unlimited {{MATCHED}} and {{NOT MATCHED}} clauses in {{MERGE INTO}} statement, because users may want to deal with different "{{AND }}"s, the result of which just like a series of "{{CASE WHEN}}"s. The expected syntax looks like {code:sql} MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [] [AS source_alias] ON [when_clause [, ...]] {code} where {{when_clause}} is {code:java} WHEN MATCHED [ AND ] THEN {code} or {code:java} WHEN NOT MATCHED [ AND ] THEN {code} was: Now the MERGE INTO syntax is, ``` MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [] [AS source_alias] ON [ WHEN MATCHED [ AND ] THEN ] [ WHEN MATCHED [ AND ] THEN ] [ WHEN NOT MATCHED [ AND ] THEN ] ``` It would be nice if we support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO statement, because users may want to deal with different "AND "s, the result of which just like a series of "CASE WHEN"s. The expected syntax looks like ``` MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [] [AS source_alias] ON [when_clause [, ...]] ``` where `when_clause` is ``` WHEN MATCHED [ AND ] THEN ``` or ``` WHEN NOT MATCHED [ AND ] THEN ``` > Support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO > --- > > Key: SPARK-32030 > URL: https://issues.apache.org/jira/browse/SPARK-32030 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Xianyin Xin >Priority: Major > > Now the {{MERGE INTO}} syntax is, > {code:sql} > MERGE INTO [db_name.]target_table [AS target_alias] > USING [db_name.]source_table [] [AS source_alias] > ON > [ WHEN MATCHED [ AND ] THEN ] > [ WHEN MATCHED [ AND ] THEN ] > [ WHEN NOT MATCHED [ AND ] THEN ]{code} > It would be nice if we support unlimited {{MATCHED}} and {{NOT MATCHED}} > clauses in {{MERGE INTO}} statement, because users may want to deal with > different "{{AND }}"s, the result of which just like a series of > "{{CASE WHEN}}"s. The expected syntax looks like > {code:sql} > MERGE INTO [db_name.]target_table [AS target_alias] > USING [db_name.]source_table [] [AS source_alias] > ON > [when_clause [, ...]] > {code} > where {{when_clause}} is > {code:java} > WHEN MATCHED [ AND ] THEN {code} > or > {code:java} > WHEN NOT MATCHED [ AND ] THEN {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
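For illustration, the kind of statement the proposed syntax would allow, issued through a hedged Spark SQL call; the table, column, and op values are made up.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("merge-example").getOrCreate()

// Hypothetical tables `target` and `source`; each MATCHED / NOT MATCHED clause
// carries its own AND condition, behaving like a series of CASE WHEN branches.
spark.sql("""
  MERGE INTO target AS t
  USING source AS s
  ON t.id = s.id
  WHEN MATCHED AND s.op = 'delete' THEN DELETE
  WHEN MATCHED AND s.op = 'update' THEN UPDATE SET t.value = s.value
  WHEN MATCHED THEN UPDATE SET t.value = s.value
  WHEN NOT MATCHED AND s.value > 0 THEN INSERT (id, value) VALUES (s.id, s.value)
  WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""")
{code}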
[jira] [Created] (SPARK-32030) Support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO
Xianyin Xin created SPARK-32030: --- Summary: Support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO Key: SPARK-32030 URL: https://issues.apache.org/jira/browse/SPARK-32030 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: Xianyin Xin Now the MERGE INTO syntax is, ``` MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [] [AS source_alias] ON [ WHEN MATCHED [ AND ] THEN ] [ WHEN MATCHED [ AND ] THEN ] [ WHEN NOT MATCHED [ AND ] THEN ] ``` It would be nice if we support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO statement, because users may want to deal with different "AND "s, the result of which just like a series of "CASE WHEN"s. The expected syntax looks like ``` MERGE INTO [db_name.]target_table [AS target_alias] USING [db_name.]source_table [] [AS source_alias] ON [when_clause [, ...]] ``` where `when_clause` is ``` WHEN MATCHED [ AND ] THEN ``` or ``` WHEN NOT MATCHED [ AND ] THEN ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29907) Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte.
Xianyin Xin created SPARK-29907: --- Summary: Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte. Key: SPARK-29907 URL: https://issues.apache.org/jira/browse/SPARK-29907 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin SPARK-27444 introduced `dmlStatementNoWith` so that any DML statement that needs CTE support can leverage it. It would be better if we moved the DELETE/UPDATE/MERGE rules to `dmlStatementNoWith`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
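For illustration, a hedged example of the CTE-prefixed DML that moving these rules under `dmlStatementNoWith` would let the parser accept; the table and column names are made up.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cte-delete-example").getOrCreate()

// Hypothetical tables; the WITH prefix only becomes reachable for DELETE/UPDATE/MERGE
// once their grammar rules live under dmlStatementNoWith.
spark.sql("""
  WITH stale AS (SELECT id FROM events WHERE ts < DATE '2019-01-01')
  DELETE FROM target WHERE id IN (SELECT id FROM stale)
""")
{code}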
[jira] [Updated] (SPARK-29835) Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE
[ https://issues.apache.org/jira/browse/SPARK-29835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-29835: Summary: Remove the unnecessary conversion from Statement to LogicalPlan for DELETE/UPDATE (was: Remove the unneeded conversion from Statement to LogicalPlan for DELETE/UPDATE) > Remove the unnecessary conversion from Statement to LogicalPlan for > DELETE/UPDATE > - > > Key: SPARK-29835 > URL: https://issues.apache.org/jira/browse/SPARK-29835 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > The current parse and analyze flow for DELETE is: 1, the SQL string will be > firstly parsed to `DeleteFromStatement`; 2, the `DeleteFromStatement` be > converted to `DeleteFromTable`. However, the SQL string can be parsed to > `DeleteFromTable` directly, where a `DeleteFromStatement` seems to be > redundant. > It is the same for UPDATE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29835) Remove the unneeded conversion from Statement to LogicalPlan for DELETE/UPDATE
Xianyin Xin created SPARK-29835: --- Summary: Remove the unneeded conversion from Statement to LogicalPlan for DELETE/UPDATE Key: SPARK-29835 URL: https://issues.apache.org/jira/browse/SPARK-29835 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin The current parse-and-analyze flow for DELETE is: 1) the SQL string is first parsed into `DeleteFromStatement`; 2) the `DeleteFromStatement` is then converted to `DeleteFromTable`. However, the SQL string can be parsed to `DeleteFromTable` directly, which makes the intermediate `DeleteFromStatement` redundant. The same applies to UPDATE. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
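For illustration, a simplified sketch of what parsing DELETE straight to the logical plan could look like, using stand-in classes rather than Spark's real AstBuilder and catalyst nodes.
{code:scala}
// Stand-ins for catalyst's UnresolvedRelation / DeleteFromTable; not the real classes.
case class UnresolvedRelation(multipartIdentifier: Seq[String])
case class DeleteFromTable(table: UnresolvedRelation, condition: Option[String])

// Building DeleteFromTable directly in the parser visitor removes the need for an
// intermediate DeleteFromStatement that would otherwise be converted during analysis.
def visitDeleteFromTable(tableName: Seq[String], whereClause: Option[String]): DeleteFromTable =
  DeleteFromTable(UnresolvedRelation(tableName), whereClause)

// Example: DELETE FROM ns1.tbl WHERE id = 1
val plan = visitDeleteFromTable(Seq("ns1", "tbl"), Some("id = 1"))
{code}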
[jira] [Commented] (SPARK-28303) Support DELETE/UPDATE/MERGE Operations in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16948160#comment-16948160 ] Xianyin Xin commented on SPARK-28303: - [~echangzhang] Thank you for sharing your work with me. It seems our goals are similar but not exactly the same, and the approaches are different. In fact, this ticket plans to add DELETE/UPDATE/MERGE support to DataSource V2, which can then be implemented by concrete datasources such as file-based sources (Parquet, for example), Kudu, and JDBC. Does this meet your requirement? > Support DELETE/UPDATE/MERGE Operations in DataSource V2 > --- > > Key: SPARK-28303 > URL: https://issues.apache.org/jira/browse/SPARK-28303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) > supports deleting/updating data. It's necessary to add related APIs in the > datasource V2 API sets. > For example, we suggest add the below interface in V2 API, > {code:java|title=SupportsDelete.java|borderStyle=solid} > public interface SupportsDelete { > WriteBuilder delete(Filter[] filters); > } > {code} > which can delete data by simple predicates (complicated cases like correlated > subquery is not considered currently). > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29420) support MERGE INTO in the parser and add the corresponding logical plan
[ https://issues.apache.org/jira/browse/SPARK-29420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin resolved SPARK-29420. - Resolution: Duplicate Resolve this as duplicate. Move to [#SPARK-28893]. > support MERGE INTO in the parser and add the corresponding logical plan > --- > > Key: SPARK-29420 > URL: https://issues.apache.org/jira/browse/SPARK-29420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28893) support MERGE INTO in the parser and add the corresponding logical plan
[ https://issues.apache.org/jira/browse/SPARK-28893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-28893: Summary: support MERGE INTO in the parser and add the corresponding logical plan (was: Support MERGE INTO in DataSource V2) > support MERGE INTO in the parser and add the corresponding logical plan > --- > > Key: SPARK-28893 > URL: https://issues.apache.org/jira/browse/SPARK-28893 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29420) support MERGE INTO in the parser and add the corresponding logical plan
Xianyin Xin created SPARK-29420: --- Summary: support MERGE INTO in the parser and add the corresponding logical plan Key: SPARK-29420 URL: https://issues.apache.org/jira/browse/SPARK-29420 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames
[ https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931017#comment-16931017 ] Xianyin Xin commented on SPARK-29049: - [~hyukjin.kwon], I updated the description. PR [https://github.com/apache/spark/pull/25626|https://github.com/apache/spark/pull/25626] will use it to normalize the attribute in {{Expression}}s. > Rename DataSourceStrategy#normalizeFilters to > DataSourceStrategy#normalizeAttrNames > --- > > Key: SPARK-29049 > URL: https://issues.apache.org/jira/browse/SPARK-29049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > DataSourceStrategy#normalizeFilters can also be used to normalize attributes > in {{Expression}}, not limit to {{Filter}}. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
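For illustration, a hedged sketch of what "normalizing attribute names" means here (rewriting attribute references in any expression to the exact-cased names of the relation's output), using stand-in classes rather than the real DataSourceStrategy code.
{code:scala}
// Stand-ins for catalyst attributes and expressions; not Spark's real classes.
case class Attr(name: String)
sealed trait Expr
case class AttrRef(attr: Attr) extends Expr
case class Literal(value: Any) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

// Works on any expression tree, not only on pushed-down Filters,
// which is the motivation for the proposed rename.
def normalizeAttrNames(expr: Expr, output: Seq[Attr]): Expr = expr match {
  case AttrRef(a) =>
    AttrRef(output.find(_.name.equalsIgnoreCase(a.name)).getOrElse(a))
  case EqualTo(l, r) =>
    EqualTo(normalizeAttrNames(l, output), normalizeAttrNames(r, output))
  case other => other
}
{code}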
[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames
[ https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-29049: Description: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in {{Expression}}, not limit to {{Filter}}. (was: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in \{{Expression}} s, not limit to \{{Filter}}s.) > Rename DataSourceStrategy#normalizeFilters to > DataSourceStrategy#normalizeAttrNames > --- > > Key: SPARK-29049 > URL: https://issues.apache.org/jira/browse/SPARK-29049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > DataSourceStrategy#normalizeFilters can also be used to normalize attributes > in {{Expression}}, not limit to {{Filter}}. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames
[ https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-29049: Description: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in {{Expression}}s, not limit to {{Filter}}s. (was: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in `Expression`s, not limit to `Filter`s.) > Rename DataSourceStrategy#normalizeFilters to > DataSourceStrategy#normalizeAttrNames > --- > > Key: SPARK-29049 > URL: https://issues.apache.org/jira/browse/SPARK-29049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > DataSourceStrategy#normalizeFilters can also be used to normalize attributes > in {{Expression}}s, not limit to {{Filter}}s. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames
[ https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-29049: Description: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in {{Expression}} s, not limit to {{Filter}} s. (was: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in {{Expression}}s, not limit to {{Filter}}s.) > Rename DataSourceStrategy#normalizeFilters to > DataSourceStrategy#normalizeAttrNames > --- > > Key: SPARK-29049 > URL: https://issues.apache.org/jira/browse/SPARK-29049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > DataSourceStrategy#normalizeFilters can also be used to normalize attributes > in {{Expression}} s, not limit to {{Filter}} s. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames
[ https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-29049: Description: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in \{{Expression}} s, not limit to \{{Filter}}s. (was: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in {{Expression}} s, not limit to {{Filter}} s.) > Rename DataSourceStrategy#normalizeFilters to > DataSourceStrategy#normalizeAttrNames > --- > > Key: SPARK-29049 > URL: https://issues.apache.org/jira/browse/SPARK-29049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > DataSourceStrategy#normalizeFilters can also be used to normalize attributes > in \{{Expression}} s, not limit to \{{Filter}}s. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames
[ https://issues.apache.org/jira/browse/SPARK-29049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-29049: Description: DataSourceStrategy#normalizeFilters can also be used to normalize attributes in `Expression`s, not limit to `Filter`s. > Rename DataSourceStrategy#normalizeFilters to > DataSourceStrategy#normalizeAttrNames > --- > > Key: SPARK-29049 > URL: https://issues.apache.org/jira/browse/SPARK-29049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > DataSourceStrategy#normalizeFilters can also be used to normalize attributes > in `Expression`s, not limit to `Filter`s. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29049) Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames
Xianyin Xin created SPARK-29049: --- Summary: Rename DataSourceStrategy#normalizeFilters to DataSourceStrategy#normalizeAttrNames Key: SPARK-29049 URL: https://issues.apache.org/jira/browse/SPARK-29049 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28950) FollowingUp: Change whereClause to be optional in DELETE
Xianyin Xin created SPARK-28950: --- Summary: FollowingUp: Change whereClause to be optional in DELETE Key: SPARK-28950 URL: https://issues.apache.org/jira/browse/SPARK-28950 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28950) [SPARK-28351] FollowingUp: Change whereClause to be optional in DELETE
[ https://issues.apache.org/jira/browse/SPARK-28950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-28950: Summary: [SPARK-28351] FollowingUp: Change whereClause to be optional in DELETE (was: FollowingUp: Change whereClause to be optional in DELETE) > [SPARK-28351] FollowingUp: Change whereClause to be optional in DELETE > -- > > Key: SPARK-28950 > URL: https://issues.apache.org/jira/browse/SPARK-28950 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28923) Deduplicate the codes 'multipartIdentifier' and 'identifierSeq'
[ https://issues.apache.org/jira/browse/SPARK-28923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin resolved SPARK-28923. - Resolution: Invalid > Deduplicate the codes 'multipartIdentifier' and 'identifierSeq' > --- > > Key: SPARK-28923 > URL: https://issues.apache.org/jira/browse/SPARK-28923 > Project: Spark > Issue Type: Request > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Minor > > In {{sqlbase.g4}}, {{multipartIdentifier}} and {{identifierSeq}} have the > same functionality. We'd better deduplicate them. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28923) Deduplicate the codes 'multipartIdentifier' and 'identifierSeq'
Xianyin Xin created SPARK-28923: --- Summary: Deduplicate the codes 'multipartIdentifier' and 'identifierSeq' Key: SPARK-28923 URL: https://issues.apache.org/jira/browse/SPARK-28923 Project: Spark Issue Type: Request Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin In {{sqlbase.g4}}, {{multipartIdentifier}} and {{identifierSeq}} have the same functionality. We'd better deduplicate them. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28893) Support MERGE INTO in DataSource V2
Xianyin Xin created SPARK-28893: --- Summary: Support MERGE INTO in DataSource V2 Key: SPARK-28893 URL: https://issues.apache.org/jira/browse/SPARK-28893 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28892) Support UPDATE in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-28892: Affects Version/s: (was: 2.4.3) 3.0.0 > Support UPDATE in DataSource V2 > --- > > Key: SPARK-28892 > URL: https://issues.apache.org/jira/browse/SPARK-28892 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28892) Support UPDATE in DataSource V2
Xianyin Xin created SPARK-28892: --- Summary: Support UPDATE in DataSource V2 Key: SPARK-28892 URL: https://issues.apache.org/jira/browse/SPARK-28892 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.3 Reporter: Xianyin Xin -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885019#comment-16885019 ] Xianyin Xin commented on SPARK-21067: - [~dricard],does the attached patch work in your case? > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.4.0, 2.4.3 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard >Priority: Major > Attachments: SPARK-21067.patch > > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (Sometime, it fails right away, sometime it > work for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which state that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.Dataset.(Dataset.scala:185) > at
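For reference, the staging-directory setting discussed in this report is an ordinary Hive conf that can be passed to the session. A hedged example follows; the path is the one quoted in the report and is illustrative only, not a verified fix for this issue.
{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: shows where the hive.exec.stagingdir setting mentioned above is applied.
val spark = SparkSession.builder()
  .appName("ctas-staging-example")
  .config("hive.exec.stagingdir", "/tmp/hive-staging/${user.name}")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE dricard.test AS SELECT 1 AS `col1`")
{code}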
[jira] [Updated] (SPARK-28303) Support DELETE/UPDATE/MERGE Operations in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-28303: Description: Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) supports deleting/updating data. It's necessary to add related APIs in the datasource V2 API sets. For example, we suggest add the below interface in V2 API, {code:java|title=SupportsDelete.java|borderStyle=solid} public interface SupportsDelete { WriteBuilder delete(Filter[] filters); } {code} which can delete data by simple predicates (complicated cases like correlated subquery is not considered currently). was: Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) supports deleting/updating data. It's necessary to add related APIs in the datasource V2 API sets. For example, we suggest add the below interface in V2 API, {code:title=SupportsDelete.java|borderStyle=solid} public interface SupportsDelete extends WriteBuilder { WriteBuilder delete(Filter[] filters); } {code} which can delete data by simple predicates (complicated cases like correlated subquery is not considered currently). > Support DELETE/UPDATE/MERGE Operations in DataSource V2 > --- > > Key: SPARK-28303 > URL: https://issues.apache.org/jira/browse/SPARK-28303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Xianyin Xin >Priority: Major > > Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) > supports deleting/updating data. It's necessary to add related APIs in the > datasource V2 API sets. > For example, we suggest add the below interface in V2 API, > {code:java|title=SupportsDelete.java|borderStyle=solid} > public interface SupportsDelete { > WriteBuilder delete(Filter[] filters); > } > {code} > which can delete data by simple predicates (complicated cases like correlated > subquery is not considered currently). > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
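For illustration, a hedged Scala sketch of how a source might implement the interface proposed above; SupportsDelete, WriteBuilder, and Filter are modeled as local stand-ins because the proposal is not a released API.
{code:scala}
// Local stand-ins mirroring the proposal above; not the shipped DataSource V2 API.
trait WriteBuilder
case class Filter(attribute: String, value: Any)

trait SupportsDelete {
  def delete(filters: Array[Filter]): WriteBuilder
}

// A concrete source (JDBC, Kudu, a file-based source, ...) would translate the
// pushed-down filters into its own delete operation here.
class ExampleDeleteBuilder extends WriteBuilder with SupportsDelete {
  override def delete(filters: Array[Filter]): WriteBuilder = {
    val predicate = filters.map(f => f.attribute + " = " + f.value).mkString(" AND ")
    println(s"deleting rows where $predicate")
    this
  }
}
{code}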
[jira] [Resolved] (SPARK-28350) Support DELETE in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin resolved SPARK-28350. - Resolution: Duplicate Resolve this, and create sub-task of [SPARK-28303|https://issues.apache.org/jira/browse/SPARK-28303] instead. > Support DELETE in DataSource V2 > --- > > Key: SPARK-28350 > URL: https://issues.apache.org/jira/browse/SPARK-28350 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > This ticket add the DELETE support for V2 datasources. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28351) Support DELETE in DataSource V2
Xianyin Xin created SPARK-28351: --- Summary: Support DELETE in DataSource V2 Key: SPARK-28351 URL: https://issues.apache.org/jira/browse/SPARK-28351 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin This ticket add the DELETE support for V2 datasources. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28350) Support DELETE in DataSource V2
Xianyin Xin created SPARK-28350: --- Summary: Support DELETE in DataSource V2 Key: SPARK-28350 URL: https://issues.apache.org/jira/browse/SPARK-28350 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Xianyin Xin This ticket add the DELETE support for V2 datasources. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28303) Support DELETE/UPDATE/MERGE Operations in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880258#comment-16880258 ] Xianyin Xin commented on SPARK-28303: - The SQL entry point is also needed. > Support DELETE/UPDATE/MERGE Operations in DataSource V2 > --- > > Key: SPARK-28303 > URL: https://issues.apache.org/jira/browse/SPARK-28303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Xianyin Xin >Priority: Major > > Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) > supports deleting/updating data. It's necessary to add related APIs in the > datasource V2 API sets. > For example, we suggest add the below interface in V2 API, > {code:title=SupportsDelete.java|borderStyle=solid} > public interface SupportsDelete extends WriteBuilder { > WriteBuilder delete(Filter[] filters); > } > {code} > which can delete data by simple predicates (complicated cases like correlated > subquery is not considered currently). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28303) Support DELETE/UPDATE/MERGE Operations in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-28303: Description: Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) supports deleting/updating data. It's necessary to add related APIs in the datasource V2 API sets. For example, we suggest add the below interface in V2 API, {code:title=SupportsDelete.java|borderStyle=solid} public interface SupportsDelete extends WriteBuilder { WriteBuilder delete(Filter[] filters); } {code} which can delete data by simple predicates (complicated cases like correlated subquery is not considered currently). was: Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) supports deleting/updating data. It's necessary to add related APIs in the datasource V2 API sets. For example, we suggest add the below interface in V2 API, ```java public interface SupportsDelete extends WriteBuilder { WriteBuilder delete(Filter[] filters); } ``` which can delete data by simple predicates (complicated cases like correlated subquery is not considered currently). > Support DELETE/UPDATE/MERGE Operations in DataSource V2 > --- > > Key: SPARK-28303 > URL: https://issues.apache.org/jira/browse/SPARK-28303 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Xianyin Xin >Priority: Major > > Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) > supports deleting/updating data. It's necessary to add related APIs in the > datasource V2 API sets. > For example, we suggest add the below interface in V2 API, > {code:title=SupportsDelete.java|borderStyle=solid} > public interface SupportsDelete extends WriteBuilder { > WriteBuilder delete(Filter[] filters); > } > {code} > which can delete data by simple predicates (complicated cases like correlated > subquery is not considered currently). > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28303) Support DELETE/UPDATE/MERGE Operations in DataSource V2
Xianyin Xin created SPARK-28303: --- Summary: Support DELETE/UPDATE/MERGE Operations in DataSource V2 Key: SPARK-28303 URL: https://issues.apache.org/jira/browse/SPARK-28303 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Xianyin Xin Now many datasources (delta, jdbc, hive with transaction support, kudu, etc) supports deleting/updating data. It's necessary to add related APIs in the datasource V2 API sets. For example, we suggest add the below interface in V2 API, ```java public interface SupportsDelete extends WriteBuilder { WriteBuilder delete(Filter[] filters); } ``` which can delete data by simple predicates (complicated cases like correlated subquery is not considered currently). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16865272#comment-16865272 ] Xianyin Xin commented on SPARK-27714: - [~nkollar], sorry for the late reply. Yes, it's similar to the implementation in Postgres. However, it is not a replacement for, or an alternative to, the current join-reorder logic (DP); it is a supplement to DP. DP is used when the number of joined tables is small (<12 in Spark today), while GA is used when the number of joined tables is large, because as the number of joined tables grows, DP spends a lot of time searching for the best join plan. GA accelerates that search. TPC-DS q64 is an example: our experiment shows the execution time decreased from 1300+s to 200+s for 10 TB TPC-DS q64 on an 18-node cluster. > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Priority: Major > > Now the join reorder logic is based on dynamic planning which can find the > most optimized plan theoretically, but the searching cost grows rapidly with > the # of joined tables grows. It would be better to introduce Genetic > algorithm (GA) to overcome this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
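For context, the DP-based reorder referred to above is governed by existing CBO settings; a hedged example of the relevant knobs follows. The GA-based reorder itself is only proposed here and has no configuration of its own yet.
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cbo-join-reorder").getOrCreate()

// Cost-based optimization and DP join reorder must both be enabled.
spark.conf.set("spark.sql.cbo.enabled", "true")
spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
// DP-based reorder only applies when the number of joined items is at or below
// this threshold (12 by default), which is the gap the GA proposal aims to fill.
spark.conf.set("spark.sql.cbo.joinReorder.dp.threshold", "12")
{code}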
[jira] [Commented] (SPARK-15348) Hive ACID
[ https://issues.apache.org/jira/browse/SPARK-15348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848881#comment-16848881 ] Xianyin Xin commented on SPARK-15348: - The starting point (or goal) of Delta Lake is not ACID but the "Data Lake"; ACID is just one of its features. The ACID designs of Hive and Delta are very different, and both have pros and cons. However, from Spark's perspective a Hive table and a Delta table are just two datasources, so pluggable ACID support for different datasources within one framework is an option. Maybe the DataSource V2 API can handle this. > Hive ACID > - > > Key: SPARK-15348 > URL: https://issues.apache.org/jira/browse/SPARK-15348 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.0, 2.3.0 >Reporter: Ran Haim >Priority: Major > > Spark does not support any feature of hive's transactional tables, > you cannot use spark to delete/update a table and it also has problems > reading the aggregated data when no compaction was done. > Also it seems that compaction is not supported - alter table ... partition > COMPACT 'major' -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27722) Remove UnsafeKeyValueSorter
[ https://issues.apache.org/jira/browse/SPARK-27722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840971#comment-16840971 ] Xianyin Xin commented on SPARK-27722: - When doing the move, I didn't find any reference to this class either. > Remove UnsafeKeyValueSorter > --- > > Key: SPARK-27722 > URL: https://issues.apache.org/jira/browse/SPARK-27722 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Priority: Minor > > We just moved the location of classes including {{UnsafeKeyValueSorter}}. > After further investigating, I don't find where {{UnsafeKeyValueSorter}} is > used. > If it is not used at all, shall we just remove it from codebase? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840370#comment-16840370 ] Xianyin Xin commented on SPARK-27714: - [~hyukjin.kwon] Thanks for reminding. [~hyukjin.kwon] [~viirya] Thank you for comments. I'm working on a doc, will post it later on. > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Xianyin Xin >Priority: Major > > Now the join reorder logic is based on dynamic planning which can find the > most optimized plan theoretically, but the searching cost grows rapidly with > the # of joined tables grows. It would be better to introduce Genetic > algorithm (GA) to overcome this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-27714: Description: Now the join reorder logic is based on dynamic planning which can find the most optimized plan theoretically, but the searching cost grows rapidly with the # of joined tables grows. It would be better to introduce Genetic algorithm (GA) to overcome this problem. (was: Now the join reorder logic is based on dynamic planning which can find the most optimized plan theoretically, but the searching cost grows rapidly with the # of joined tables grows. It would be better in introduce Genetic algorithm (GA) to overcome this problem.) > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Xianyin Xin >Priority: Major > > Now the join reorder logic is based on dynamic planning which can find the > most optimized plan theoretically, but the searching cost grows rapidly with > the # of joined tables grows. It would be better to introduce Genetic > algorithm (GA) to overcome this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27714) Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12
[ https://issues.apache.org/jira/browse/SPARK-27714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-27714: Summary: Support Join Reorder based on Genetic Algorithm when the # of joined tables > 12 (was: Support Join Reorder based on Genetic algorithm when the # of joined tables > 12) > Support Join Reorder based on Genetic Algorithm when the # of joined tables > > 12 > > > Key: SPARK-27714 > URL: https://issues.apache.org/jira/browse/SPARK-27714 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.3 >Reporter: Xianyin Xin >Priority: Major > > Now the join reorder logic is based on dynamic planning which can find the > most optimized plan theoretically, but the searching cost grows rapidly with > the # of joined tables grows. It would be better in introduce Genetic > algorithm (GA) to overcome this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27714) Support Join Reorder based on Genetic algorithm when the # of joined tables > 12
Xianyin Xin created SPARK-27714: --- Summary: Support Join Reorder based on Genetic algorithm when the # of joined tables > 12 Key: SPARK-27714 URL: https://issues.apache.org/jira/browse/SPARK-27714 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Xianyin Xin Now the join reorder logic is based on dynamic planning which can find the most optimized plan theoretically, but the searching cost grows rapidly with the # of joined tables grows. It would be better in introduce Genetic algorithm (GA) to overcome this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27713) Move RecordBinaryComparator and unsafe sorters from catalyst project to core
Xianyin Xin created SPARK-27713: --- Summary: Move RecordBinaryComparator and unsafe sorters from catalyst project to core Key: SPARK-27713 URL: https://issues.apache.org/jira/browse/SPARK-27713 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3 Reporter: Xianyin Xin `RecordBinaryComparator`, `UnsafeExternalRowSorter` and `UnsafeKeyValueSorter` currently live in the catalyst project and should be moved to core, as they are used only in the physical plan. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790119#comment-16790119 ] Xianyin Xin commented on SPARK-21067: - Yep, [~Moriarty279] , nice analysis. > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0, 2.4.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard >Priority: Major > Attachments: SPARK-21067.patch > > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (Sometime, it fails right away, sometime it > work for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which state that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.Dataset.(Dataset.scala:185) > at
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774675#comment-16774675 ] Xianyin Xin commented on SPARK-21067: - Upload a patch which is based on 2.3.2. > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard >Priority: Major > Attachments: SPARK-21067.patch > > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (Sometime, it fails right away, sometime it > work for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which state that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185) > at
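The report above points at SPARK-11021 and the {{hive.exec.stagingdir}} property. As a minimal, hedged sketch only (the property name and the \{user.name\}-style path come from the description; the SparkSession-based setup, application name, and literal user directory are assumptions, not the reporter's actual deployment), the staging directory can be supplied as a Hive setting when a Hive-enabled session is built:
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical setup: forwards hive.exec.stagingdir to the Hive client used
// by Spark SQL. The value mirrors the "/tmp/hive-staging/{user.name}" pattern
// quoted in the report, with a literal user name substituted for illustration.
val spark = SparkSession.builder()
  .appName("ctas-staging-dir-sketch")
  .config("hive.exec.stagingdir", "/tmp/hive-staging/dricard")
  .enableHiveSupport()
  .getOrCreate()
{code}
Placing the same key in hive-site.xml on the Thrift Server's classpath is the more conventional route; the sketch only illustrates where the property plugs in, not a fix for the move failure described here.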
[jira] [Updated] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xianyin Xin updated SPARK-21067: Attachment: SPARK-21067.patch > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard >Priority: Major > Attachments: SPARK-21067.patch > > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (Sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}". > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment, but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue.
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to destination hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at
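For contrast with the Thrift Server path, the reporter notes that running the same statements through spark.sql() did not reproduce the failure. A minimal sketch of that check, assuming a Hive-enabled Spark build (the schema, table, and statements are the ones from the "SQL to reproduce issue" block above; the session setup and application name are assumptions):
{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ctas-spark-sql-check")
  .enableHiveSupport()
  .getOrCreate()

// Same sequence as the reproduction SQL, issued directly through spark.sql()
// instead of the Thrift Server JDBC endpoint.
spark.sql("DROP SCHEMA IF EXISTS dricard CASCADE")
spark.sql("CREATE SCHEMA dricard")
spark.sql("CREATE TABLE dricard.test (col1 int)")
spark.sql("INSERT INTO TABLE dricard.test SELECT 1")
spark.sql("SELECT * FROM dricard.test").show()
spark.sql("DROP TABLE dricard.test")
spark.sql("CREATE TABLE dricard.test AS SELECT 1 AS `col1`")
spark.sql("SELECT * FROM dricard.test").show()
{code}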