[jira] [Created] (SPARK-36726) Upgrade Parquet to 1.12.1
Chao Sun created SPARK-36726: Summary: Upgrade Parquet to 1.12.1 Key: SPARK-36726 URL: https://issues.apache.org/jira/browse/SPARK-36726 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Reporter: Chao Sun Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36726) Upgrade Parquet to 1.12.1
[ https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36726: Assignee: Apache Spark > Upgrade Parquet to 1.12.1 > - > > Key: SPARK-36726 > URL: https://issues.apache.org/jira/browse/SPARK-36726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Apache Spark >Priority: Major > > Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36726) Upgrade Parquet to 1.12.1
[ https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413676#comment-17413676 ] Apache Spark commented on SPARK-36726: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/33969 > Upgrade Parquet to 1.12.1 > - > > Key: SPARK-36726 > URL: https://issues.apache.org/jira/browse/SPARK-36726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36726) Upgrade Parquet to 1.12.1
[ https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36726: Assignee: (was: Apache Spark) > Upgrade Parquet to 1.12.1 > - > > Key: SPARK-36726 > URL: https://issues.apache.org/jira/browse/SPARK-36726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Major > > Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow
[ https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413679#comment-17413679 ] pralabhkumar commented on SPARK-32285: -- Thx , will share the PR in some time > Add PySpark support for nested timestamps with arrow > > > Key: SPARK-32285 > URL: https://issues.apache.org/jira/browse/SPARK-32285 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Currently with arrow optimizations, there is post-processing done in pandas > for timestamp columns to localize timezone. This is not done for nested > columns with timestamps such as StructType or ArrayType. > Adding support for this is needed for Apache Arrow 1.0.0 upgrade due to use > of structs with timestamps in groupedby key over a window. > As a simple first step, timestamps with 1 level nesting could be done first > and this will satisfy the immediate need. > NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing > with pyarrow.array.cast, which could be easier done than in pandas. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
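For context on the pyarrow.array.cast idea mentioned in the description above, here is a minimal, hypothetical sketch (not the eventual implementation; the column name event_time and the timezone are illustrative) of attaching a timezone to flat and one-level-nested timestamps directly in pyarrow instead of post-processing in pandas:

{code:java}
import datetime

import pyarrow as pa

# Naive (timezone-less) microsecond timestamps, roughly what Spark ships over Arrow.
ts = pa.array([datetime.datetime(2021, 9, 12, 10, 30)], type=pa.timestamp("us"))

# Attach the session timezone by casting the type rather than localizing in pandas.
ts_tz = ts.cast(pa.timestamp("us", tz="America/Los_Angeles"))

# For one level of nesting, the child array can be cast and the struct rebuilt.
nested = pa.StructArray.from_arrays([ts], names=["event_time"])
nested_tz = pa.StructArray.from_arrays(
    [nested.field("event_time").cast(pa.timestamp("us", tz="America/Los_Angeles"))],
    names=["event_time"],
)
print(nested_tz.type)  # struct<event_time: timestamp[us, tz=America/Los_Angeles]>
{code}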
[jira] [Created] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic
Tongwei created SPARK-36727: --- Summary: Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic Key: SPARK-36727 URL: https://issues.apache.org/jira/browse/SPARK-36727 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.2 Reporter: Tongwei -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-33648) Moving file stage failure causes duplicated data
[ https://issues.apache.org/jira/browse/SPARK-33648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tongwei closed SPARK-33648. --- > Moving file stage failure causes duplicated data > --- > > Key: SPARK-33648 > URL: https://issues.apache.org/jira/browse/SPARK-33648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1 >Reporter: Tongwei >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tongwei updated SPARK-36727: Description: {code:java} // non-partitioned table overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; INSERT OVERWRITE TABLE tbl SELECT 0,1; INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; // partitioned table static overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 INT); INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE pt1=2021; {code} When we run the above query, an error will be thrown: "Cannot overwrite a path that is also being read from" > Support sql overwrite a path that is also being read from when > partitionOverwriteMode is dynamic > > > Key: SPARK-36727 > URL: https://issues.apache.org/jira/browse/SPARK-36727 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tongwei >Priority: Minor > > {code:java} > // non-partitioned table overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; > INSERT OVERWRITE TABLE tbl SELECT 0,1; > INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; > // partitioned table static overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 > INT); > INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; > INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE > pt1=2021; > {code} > When we run the above query, an error will be thrown: "Cannot overwrite a > path that is also being read from" > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tongwei updated SPARK-36727: Description: {code:java} // non-partitioned table overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; INSERT OVERWRITE TABLE tbl SELECT 0,1; INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; // partitioned table static overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 INT); INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE pt1=2021; {code} When we run the above query, an error will be thrown: "Cannot overwrite a path that is also being read from" We need to support this operation when the weather is good was: {code:java} // non-partitioned table overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; INSERT OVERWRITE TABLE tbl SELECT 0,1; INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; // partitioned table static overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 INT); INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE pt1=2021; {code} When we run the above query, an error will be thrown: "Cannot overwrite a path that is also being read from" > Support sql overwrite a path that is also being read from when > partitionOverwriteMode is dynamic > > > Key: SPARK-36727 > URL: https://issues.apache.org/jira/browse/SPARK-36727 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tongwei >Priority: Minor > > {code:java} > // non-partitioned table overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; > INSERT OVERWRITE TABLE tbl SELECT 0,1; > INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; > // partitioned table static overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 > INT); > INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; > INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE > pt1=2021; > {code} > When we run the above query, an error will be thrown: "Cannot overwrite a > path that is also being read from" > We need to support this operation when the weather is good > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36727) Support sql overwrite a path that is also being read from when partitionOverwriteMode is dynamic
[ https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tongwei updated SPARK-36727: Description: {code:java} // non-partitioned table overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; INSERT OVERWRITE TABLE tbl SELECT 0,1; INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; // partitioned table static overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 INT); INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE pt1=2021; {code} When we run the above query, an error will be thrown: "Cannot overwrite a path that is also being read from" We need to support this operation when the spark.sql.sources.partitionOverwriteMode is dynamic was: {code:java} // non-partitioned table overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; INSERT OVERWRITE TABLE tbl SELECT 0,1; INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; // partitioned table static overwrite CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 INT); INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE pt1=2021; {code} When we run the above query, an error will be thrown: "Cannot overwrite a path that is also being read from" We need to support this operation when the weather is good > Support sql overwrite a path that is also being read from when > partitionOverwriteMode is dynamic > > > Key: SPARK-36727 > URL: https://issues.apache.org/jira/browse/SPARK-36727 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2 >Reporter: Tongwei >Priority: Minor > > {code:java} > // non-partitioned table overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET; > INSERT OVERWRITE TABLE tbl SELECT 0,1; > INSERT OVERWRITE TABLE tbl SELECT * FROM tbl; > // partitioned table static overwrite > CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 > INT); > INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1,1 AS col2; > INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE > pt1=2021; > {code} > When we run the above query, an error will be thrown: "Cannot overwrite a > path that is also being read from" > We need to support this operation when the > spark.sql.sources.partitionOverwriteMode is dynamic -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
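For reference, a minimal sketch of the scenario the ticket targets, assuming a plain local SparkSession; as of 3.1.2 the last statement still fails with the quoted error even when dynamic mode is enabled:

{code:java}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Dynamic mode only replaces the partitions produced by the query, which is
    # what would make overwriting a path that is also being read from reasonable.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

spark.sql("CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (pt1 INT)")
spark.sql("INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT 0 AS col1, '1' AS col2")

# Self-overwrite of the partition being read: currently rejected with
# "Cannot overwrite a path that is also being read from".
spark.sql("INSERT OVERWRITE TABLE tbl PARTITION(pt1=2021) SELECT col1, col2 FROM tbl WHERE pt1 = 2021")
{code}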
[jira] [Created] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas
Bjørn Jørgensen created SPARK-36728: --- Summary: Can't create datetime object from anything other then year column Pyspark - koalas Key: SPARK-36728 URL: https://issues.apache.org/jira/browse/SPARK-36728 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.3.0 Reporter: Bjørn Jørgensen If I create a datetime object it must be from columns named year. df = ps.DataFrame(\{'year': [2015, 2016],df = ps.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5], 'hour': [2, 3], 'minute': [10, 30], 'second': [21,25]}) df.info() Int64Index: 2 entries, 1 to 0Data columns (total 6 columns): # Column Non-Null Count Dtype--- -- -- - 0 year 2 non-null int64 1 month 2 non-null int64 2 day 2 non-null int64 3 hour 2 non-null int64 4 minute 2 non-null int64 5 second 2 non-null int64dtypes: int64(6) df['date'] = ps.to_datetime(df[['year', 'month', 'day']]) df.info() Int64Index: 2 entries, 1 to 0Data columns (total 7 columns): # Column Non-Null Count Dtype --- -- -- - 0 year 2 non-null int64 1 month 2 non-null int64 2 day 2 non-null int64 3 hour 2 non-null int64 4 minute 2 non-null int64 5 second 2 non-null int64 6 date 2 non-null datetime64dtypes: datetime64(1), int64(6) df_test = ps.DataFrame(\{'testyear': [2015, 2016], 'testmonth': [2, 3], 'testday': [4, 5], 'hour': [2, 3], 'minute': [10, 30], 'second': [21,25]}) df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']]) ---KeyError Traceback (most recent call last)/tmp/ipykernel_73/904491906.py in > 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']]) /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key) 11853 return self.loc[:, key] 11854 elif is_list_like(key):> 11855 return self.loc[:, list(key)] 11856 raise NotImplementedError(key) 11857 /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key) 476 returns_series, 477 series_name,--> 478 ) = self._select_cols(cols_sel) 479 480 if cond is None and limit is None and returns_series: /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys) 322 return self._select_cols_else(cols_sel, missing_keys) 323 elif is_list_like(cols_sel):--> 324 return self._select_cols_by_iterable(cols_sel, missing_keys) 325 else: 326 return self._select_cols_else(cols_sel, missing_keys) /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys) 1352 if not found: 1353 if missing_keys is None:-> 1354 raise KeyError("['{}'] not in index".format(name_like_string(key))) 1355 else: 1356 missing_keys.append(key) KeyError: "['testyear'] not in index" df_test testyear testmonth testday hour minute second0 2015 2 4 2 10 211 2016 3 5 3 30 25 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas
[ https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-36728: Attachment: pyspark_date.txt > Can't create datetime object from anything other then year column Pyspark - > koalas > -- > > Key: SPARK-36728 > URL: https://issues.apache.org/jira/browse/SPARK-36728 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: pyspark_date.txt > > > If I create a datetime object it must be from columns named year. > > df = ps.DataFrame(\{'year': [2015, 2016],df = ps.DataFrame({'year': [2015, > 2016], 'month': [2, 3], 'day': [4, 5], > 'hour': [2, 3], 'minute': [10, 30], > 'second': [21,25]}) df.info() > Int64Index: 2 entries, 1 to 0Data > columns (total 6 columns): # Column Non-Null Count Dtype--- -- > -- - 0 year 2 non-null int64 1 month 2 > non-null int64 2 day 2 non-null int64 3 hour 2 non-null > int64 4 minute 2 non-null int64 5 second 2 non-null > int64dtypes: int64(6) > df['date'] = ps.to_datetime(df[['year', 'month', 'day']]) > df.info() > Int64Index: 2 entries, 1 to 0Data > columns (total 7 columns): # Column Non-Null Count Dtype --- -- > -- - 0 year 2 non-null int64 1 month > 2 non-null int64 2 day 2 non-null int64 3 hour > 2 non-null int64 4 minute 2 non-null int64 5 second > 2 non-null int64 6 date 2 non-null datetime64dtypes: > datetime64(1), int64(6) > df_test = ps.DataFrame(\{'testyear': [2015, 2016], > 'testmonth': [2, 3], 'testday': [4, 5], > 'hour': [2, 3], 'minute': [10, 30], > 'second': [21,25]}) df_test['date'] = ps.to_datetime(df[['testyear', > 'testmonth', 'testday']]) > ---KeyError > Traceback (most recent call > last)/tmp/ipykernel_73/904491906.py in > 1 df_test['date'] = > ps.to_datetime(df[['testyear', 'testmonth', 'testday']]) > /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key) 11853 > return self.loc[:, key] 11854 elif is_list_like(key):> > 11855 return self.loc[:, list(key)] 11856 raise > NotImplementedError(key) 11857 > /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key) 476 > returns_series, 477 series_name,--> 478 > ) = self._select_cols(cols_sel) 479 480 if cond > is None and limit is None and returns_series: > /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, > missing_keys) 322 return self._select_cols_else(cols_sel, > missing_keys) 323 elif is_list_like(cols_sel):--> 324 > return self._select_cols_by_iterable(cols_sel, missing_keys) 325 > else: 326 return self._select_cols_else(cols_sel, missing_keys) > /opt/spark/python/pyspark/pandas/indexing.py in > _select_cols_by_iterable(self, cols_sel, missing_keys) 1352 > if not found: 1353 if missing_keys is None:-> 1354 > raise KeyError("['{}'] not in > index".format(name_like_string(key))) 1355 else: 1356 > missing_keys.append(key) > KeyError: "['testyear'] not in index" > df_test > testyear testmonth testday hour minute second0 2015 2 4 2 10 211 2016 3 5 3 > 30 25 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68
Kousuke Saruta created SPARK-36729: -- Summary: Upgrade Netty from 4.1.63 to 4.1.68 Key: SPARK-36729 URL: https://issues.apache.org/jira/browse/SPARK-36729 Project: Spark Issue Type: Sub-task Components: Build Affects Versions: 3.3.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Recently Netty 4.1.68 was released, which includes official M1 Mac support. https://github.com/netty/netty/pull/11666 4.1.65 also includes a critical bug fix which Spark might be affected. https://github.com/netty/netty/issues/11209 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68
[ https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413718#comment-17413718 ] Apache Spark commented on SPARK-36729: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/33970 > Upgrade Netty from 4.1.63 to 4.1.68 > --- > > Key: SPARK-36729 > URL: https://issues.apache.org/jira/browse/SPARK-36729 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Recently Netty 4.1.68 was released, which includes official M1 Mac support. > https://github.com/netty/netty/pull/11666 > 4.1.65 also includes a critical bug fix which Spark might be affected. > https://github.com/netty/netty/issues/11209 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68
[ https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36729: Assignee: Apache Spark (was: Kousuke Saruta) > Upgrade Netty from 4.1.63 to 4.1.68 > --- > > Key: SPARK-36729 > URL: https://issues.apache.org/jira/browse/SPARK-36729 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Major > > Recently Netty 4.1.68 was released, which includes official M1 Mac support. > https://github.com/netty/netty/pull/11666 > 4.1.65 also includes a critical bug fix which Spark might be affected. > https://github.com/netty/netty/issues/11209 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68
[ https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413717#comment-17413717 ] Apache Spark commented on SPARK-36729: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/33970 > Upgrade Netty from 4.1.63 to 4.1.68 > --- > > Key: SPARK-36729 > URL: https://issues.apache.org/jira/browse/SPARK-36729 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Recently Netty 4.1.68 was released, which includes official M1 Mac support. > https://github.com/netty/netty/pull/11666 > 4.1.65 also includes a critical bug fix which Spark might be affected. > https://github.com/netty/netty/issues/11209 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68
[ https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36729: Assignee: Kousuke Saruta (was: Apache Spark) > Upgrade Netty from 4.1.63 to 4.1.68 > --- > > Key: SPARK-36729 > URL: https://issues.apache.org/jira/browse/SPARK-36729 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > Recently Netty 4.1.68 was released, which includes official M1 Mac support. > https://github.com/netty/netty/pull/11666 > 4.1.65 also includes a critical bug fix which Spark might be affected. > https://github.com/netty/netty/issues/11209 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other then year column Pyspark - koalas
[ https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413725#comment-17413725 ] dgd_contributor commented on SPARK-36728: - I think this is not a bug, same behavior in pandas. We need set name of columns like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’]) or plurals of the same. [docs|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#] {code:java} >>> df_test = pd.DataFrame( ... { ... 'testyear': [2015, 2016], ... 'testmonth': [2, 3], ... 'testday': [4, 5], ... } ... ) >>> pd.to_datetime(df_test[['testyear', 'testmonth', 'testday']]) Traceback (most recent call last): File "", line 1, in File "/Users/dgd/spark/python/venv/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 890, in to_datetime result = _assemble_from_unit_mappings(arg, errors, tz) File "/Users/dgd/spark/python/venv/lib/python3.8/site-packages/pandas/core/tools/datetimes.py", line 996, in _assemble_from_unit_mappings raise ValueError( ValueError: to assemble mappings requires at least that [year, month, day] be specified: [day,month,year] is missing >>> {code} > Can't create datetime object from anything other then year column Pyspark - > koalas > -- > > Key: SPARK-36728 > URL: https://issues.apache.org/jira/browse/SPARK-36728 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Bjørn Jørgensen >Priority: Major > Attachments: pyspark_date.txt > > > If I create a datetime object it must be from columns named year. > > df = ps.DataFrame(\{'year': [2015, 2016],df = ps.DataFrame({'year': [2015, > 2016], 'month': [2, 3], 'day': [4, 5], > 'hour': [2, 3], 'minute': [10, 30], > 'second': [21,25]}) df.info() > Int64Index: 2 entries, 1 to 0Data > columns (total 6 columns): # Column Non-Null Count Dtype--- -- > -- - 0 year 2 non-null int64 1 month 2 > non-null int64 2 day 2 non-null int64 3 hour 2 non-null > int64 4 minute 2 non-null int64 5 second 2 non-null > int64dtypes: int64(6) > df['date'] = ps.to_datetime(df[['year', 'month', 'day']]) > df.info() > Int64Index: 2 entries, 1 to 0Data > columns (total 7 columns): # Column Non-Null Count Dtype --- -- > -- - 0 year 2 non-null int64 1 month > 2 non-null int64 2 day 2 non-null int64 3 hour > 2 non-null int64 4 minute 2 non-null int64 5 second > 2 non-null int64 6 date 2 non-null datetime64dtypes: > datetime64(1), int64(6) > df_test = ps.DataFrame(\{'testyear': [2015, 2016], > 'testmonth': [2, 3], 'testday': [4, 5], > 'hour': [2, 3], 'minute': [10, 30], > 'second': [21,25]}) df_test['date'] = ps.to_datetime(df[['testyear', > 'testmonth', 'testday']]) > ---KeyError > Traceback (most recent call > last)/tmp/ipykernel_73/904491906.py in > 1 df_test['date'] = > ps.to_datetime(df[['testyear', 'testmonth', 'testday']]) > /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key) 11853 > return self.loc[:, key] 11854 elif is_list_like(key):> > 11855 return self.loc[:, list(key)] 11856 raise > NotImplementedError(key) 11857 > /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key) 476 > returns_series, 477 series_name,--> 478 > ) = self._select_cols(cols_sel) 479 480 if cond > is None and limit is None and returns_series: > /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, > missing_keys) 322 return self._select_cols_else(cols_sel, > missing_keys) 323 elif is_list_like(cols_sel):--> 324 > return self._select_cols_by_iterable(cols_sel, missing_keys) 325 > else: 326 return 
self._select_cols_else(cols_sel, missing_keys) > /opt/spark/python/pyspark/pandas/indexing.py in > _select_cols_by_iterable(self, cols_sel, missing_keys) 1352 > if not found: 1353 if missing_keys is None:-> 1354 > raise KeyError("['{}'] not in > index".format(name_like_string(key))) 1355 else: 1356 > missing_keys
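A minimal sketch of the workaround implied by the comment above: rename the columns to the canonical year/month/day names before assembling the date (assumes pyspark.pandas imported as ps, matching the report):

{code:java}
import pyspark.pandas as ps

df_test = ps.DataFrame({
    "testyear": [2015, 2016],
    "testmonth": [2, 3],
    "testday": [4, 5],
})

# to_datetime only assembles dates from columns named year/month/day (plus the
# optional hour/minute/second variants), so rename before calling it.
renamed = df_test.rename(
    columns={"testyear": "year", "testmonth": "month", "testday": "day"}
)
renamed["date"] = ps.to_datetime(renamed[["year", "month", "day"]])
print(renamed.head())
{code}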
[jira] [Resolved] (SPARK-36636) SparkContextSuite random failure in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-36636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-36636. -- Fix Version/s: 3.0.4 3.1.3 3.2.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/33963 > SparkContextSuite random failure in Scala 2.13 > -- > > Key: SPARK-36636 > URL: https://issues.apache.org/jira/browse/SPARK-36636 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.2.0, 3.1.3, 3.0.4 > > Attachments: image-2021-09-11-00-29-21-168.png, > image-2021-09-11-00-30-20-237.png, image-2021-09-11-00-39-43-752.png > > > run > {code:java} > build/mvn clean install -Pscala-2.13 -pl core -am{code} > or > {code:java} > build/mvn clean install -Pscala-2.13 -pl core -am -Dtest=none > -DwildcardSuites=org.apache.spark.SparkContextSuite > {code} > Some cases may fail as follows: > > {code:java} > - SPARK-33084: Add jar support Ivy URI -- test param key case sensitive *** > FAILED *** > java.lang.IllegalStateException: Cannot call methods on a stopped > SparkContext. > This stopped SparkContext was created at: > org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1155) > org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > org.scalatest.Transformer.apply(Transformer.scala:22) > org.scalatest.Transformer.apply(Transformer.scala:20) > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > scala.collection.immutable.List.foreach(List.scala:333) > The currently active SparkContext was created at: > org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1155) > org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > org.scalatest.Transformer.apply(Transformer.scala:22) > org.scalatest.Transformer.apply(Transformer.scala:20) > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190) > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > 
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62) > org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234) > org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227) > org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62) > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > scala.collection.immutable.List.foreach(List.scala:333) > at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:118) > at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1887) > at > org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2575) > at org.apache.spark.SparkContext.addJar(SparkContext.scala:2008) > at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928) > at > org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1156) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcome
[jira] [Updated] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68
[ https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-36729: --- Priority: Minor (was: Major) > Upgrade Netty from 4.1.63 to 4.1.68 > --- > > Key: SPARK-36729 > URL: https://issues.apache.org/jira/browse/SPARK-36729 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > Recently Netty 4.1.68 was released, which includes official M1 Mac support. > https://github.com/netty/netty/pull/11666 > 4.1.65 also includes a critical bug fix which Spark might be affected. > https://github.com/netty/netty/issues/11209 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36720) On overwrite mode, setting option truncate as true doesn't truncate the table
[ https://issues.apache.org/jira/browse/SPARK-36720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413734#comment-17413734 ] Balaji Balasubramaniam commented on SPARK-36720: Because even though I’m setting mode is overwrite and truncate option is set to true, I would expect it to truncate the table and not drop the table. Sent from Yahoo Mail for iPhone On Saturday, September 11, 2021, 7:00 PM, Hyukjin Kwon (Jira) wrote: [ https://issues.apache.org/jira/browse/SPARK-36720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413639#comment-17413639 ] Hyukjin Kwon commented on SPARK-36720: -- The error is from: {quote} com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [258]: insufficient privilege: Detailed info for this error can be found with guid '' {quote} Mind elabourating why is it an issue in PySpark or Apache Spark? -- This message was sent by Atlassian Jira (v8.3.4#803005) > On overwrite mode, setting option truncate as true doesn't truncate the table > - > > Key: SPARK-36720 > URL: https://issues.apache.org/jira/browse/SPARK-36720 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.1.1 >Reporter: Balaji Balasubramaniam >Priority: Major > > I'm using PySpark from AWS Glue job to write it to SAP HANA using jdbc. Our > requirement is to truncate and load data in HANA. > I've tried both of these options and on both cases, based on the stack trace, > it is trying to drop the table which is not allowed by security design. > #df_lake.write.format("jdbc").option("url", edw_jdbc_url).option("driver", > "com.sap.db.jdbc.Driver").option("dbtable", edw_jdbc_db_table).option("user", > edw_jdbc_userid).option("password", edw_jdbc_password).option("truncate", > "true").mode("append").save() > properties=\{"user": edw_jdbc_userid, "password": edw_jdbc_password, > "truncate":"true"} > df_lake.write.jdbc(url=edw_jdbc_url, table=edw_jdbc_db_table, > mode='overwrite', properties=properties) > > I've verified that the schema matches. I did the jdbc read and print out the > schema as well as printing the schema from the source table. 
> Schema from HANA: > root > |-- RTL_ACCT_ID: long (nullable = true) > |-- FINE_DINING_PROPOSED: string (nullable = true) > |-- FINE_WINE_PROPOSED: string (nullable = true) > |-- FINE_WINE_INF_PROPOSED: string (nullable = true) > |-- GOLD_SILVER_PROPOSED: string (nullable = true) > |-- PREMIUM_PROPOSED: string (nullable = true) > |-- GSP_PROPOSED: string (nullable = true) > |-- PROPOSED_CRAFT: string (nullable = true) > |-- FW_REASON: string (nullable = true) > |-- FWI_REASON: string (nullable = true) > |-- GS_REASON: string (nullable = true) > |-- PREM_REASON: string (nullable = true) > |-- FD_REASON: string (nullable = true) > |-- CRAFT_REASON: string (nullable = true) > |-- GSP_FLAG: string (nullable = true) > |-- GSP_REASON: string (nullable = true) > |-- ELIGIBILITY: string (nullable = true) > |-- DW_LD_S: timestamp (nullable = true) > Schema from the source table: > root > |-- RTL_ACCT_ID: long (nullable = true) > |-- FINE_DINING_PROPOSED: string (nullable = true) > |-- FINE_WINE_PROPOSED: string (nullable = true) > |-- FINE_WINE_INF_PROPOSED: string (nullable = true) > |-- GOLD_SILVER_PROPOSED: string (nullable = true) > |-- PREMIUM_PROPOSED: string (nullable = true) > |-- GSP_PROPOSED: string (nullable = true) > |-- PROPOSED_CRAFT: string (nullable = true) > |-- FW_REASON: string (nullable = true) > |-- FWI_REASON: string (nullable = true) > |-- GS_REASON: string (nullable = true) > |-- PREM_REASON: string (nullable = true) > |-- FD_REASON: string (nullable = true) > |-- CRAFT_REASON: string (nullable = true) > |-- GSP_FLAG: string (nullable = true) > |-- GSP_REASON: string (nullable = true) > |-- ELIGIBILITY: string (nullable = true) > |-- DW_LD_S: timestamp (nullable = true) > This is the stack trace > py4j.protocol.Py4JJavaError: An error occurred while calling o169.jdbc. > : com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [258]: > insufficient privilege: Detailed info for this error can be found with guid > '' > at > com.sap.db.jdbc.exceptions.SQLExceptionSapDB._newInstance(SQLExceptionSapDB.java:191) > at > com.sap.db.jdbc.exceptions.SQLExceptionSapDB.newInstance(SQLExceptionSapDB.java:42) > at > com.sap.db.jdbc.packet.HReplyPacket._buildExceptionChain(HReplyPacket.java:976) > at > com.sap.db.jdbc.packet.HReplyPacket.getSQLExceptionChain(HReplyPacket.java:157) > at > com.sap.db.jdbc.packet.HPartInfo.getSQLExceptionChain(HPartInfo.java:39) > at com.sap.db.jdbc.Co
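For reference, the JDBC truncate option is only consulted with mode overwrite (with append it is ignored), and whether Spark actually issues TRUNCATE rather than DROP/CREATE also depends on the JDBC dialect for the target database. A minimal sketch of the intended combination, with placeholder connection values standing in for the job's edw_jdbc_* settings:

{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_lake = spark.range(5).selectExpr("id AS RTL_ACCT_ID")  # stand-in for the real frame

(
    df_lake.write.format("jdbc")
    .option("url", "jdbc:sap://hana-host:30015/")      # placeholder for edw_jdbc_url
    .option("driver", "com.sap.db.jdbc.Driver")
    .option("dbtable", "SCHEMA.TARGET_TABLE")          # placeholder for edw_jdbc_db_table
    .option("user", "user")
    .option("password", "password")
    # truncate is only honoured together with overwrite mode; with append it is a no-op.
    .option("truncate", "true")
    .mode("overwrite")
    .save()
)
{code}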
[jira] [Resolved] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68
[ https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36729. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33970 [https://github.com/apache/spark/pull/33970] > Upgrade Netty from 4.1.63 to 4.1.68 > --- > > Key: SPARK-36729 > URL: https://issues.apache.org/jira/browse/SPARK-36729 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > Recently Netty 4.1.68 was released, which includes official M1 Mac support. > https://github.com/netty/netty/pull/11666 > 4.1.65 also includes a critical bug fix which Spark might be affected. > https://github.com/netty/netty/issues/11209 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36726) Upgrade Parquet to 1.12.1
[ https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun updated SPARK-36726: - Priority: Blocker (was: Major) > Upgrade Parquet to 1.12.1 > - > > Key: SPARK-36726 > URL: https://issues.apache.org/jira/browse/SPARK-36726 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Priority: Blocker > > Upgrade Apache Parquet to 1.12.1 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable
[ https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-36705: -- Target Version/s: 3.2.0 > Disable push based shuffle when IO encryption is enabled or serializer is not > relocatable > - > > Key: SPARK-36705 > URL: https://issues.apache.org/jira/browse/SPARK-36705 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Mridul Muralidharan >Priority: Blocker > > Push based shuffle is not compatible with io encryption or non-relocatable > serialization. > This is similar to SPARK-34790 > We have to disable push based shuffle if either of these two are true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36730) Use V2 Filter in V2 file source
Huaxin Gao created SPARK-36730: -- Summary: Use V2 Filter in V2 file source Key: SPARK-36730 URL: https://issues.apache.org/jira/browse/SPARK-36730 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Huaxin Gao Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36731) BrotliCodec doesn't support Apple Silicon on MacOS
Dongjoon Hyun created SPARK-36731: - Summary: BrotliCodec doesn't support Apple Silicon on MacOS Key: SPARK-36731 URL: https://issues.apache.org/jira/browse/SPARK-36731 Project: Spark Issue Type: Sub-task Components: Build, SQL Affects Versions: 3.3.0 Reporter: Dongjoon Hyun {code} [info] Caused by: java.lang.UnsatisfiedLinkError: Couldn't load native library 'brotli'. [LoaderResult: os.name="Mac OS X", os.arch="aarch64", os.version="11.5.2", java.vm.name="OpenJDK 64-Bit Server VM", java.vm.version="25.302-b08", java.vm.vendor="Azul Systems, Inc.", alreadyLoaded="null", loadedFromSystemLibraryPath="false", nativeLibName="libbrotli.dylib", temporaryLibFile="/Users/dongjoon/APACHE/spark-merge/target/tmp/brotli8243220902047076449/libbrotli.dylib", libNameWithinClasspath="/lib/darwin-aarch64/libbrotli.dylib", usedThisClassloader="false", usedSystemClassloader="false", java.library.path="/Users/dongjoon/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:."] [info] at org.meteogroup.jbrotli.libloader.BrotliLibraryLoader.loadBrotli(BrotliLibraryLoader.java:35) [info] at org.apache.hadoop.io.compress.BrotliCodec.(BrotliCodec.java:40) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36731) BrotliCodec doesn't support Apple Silicon on MacOS
[ https://issues.apache.org/jira/browse/SPARK-36731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413859#comment-17413859 ] Dongjoon Hyun commented on SPARK-36731: --- Thanks to SPARK-36670, we identified this issue. Thank you, [~viirya] > BrotliCodec doesn't support Apple Silicon on MacOS > -- > > Key: SPARK-36731 > URL: https://issues.apache.org/jira/browse/SPARK-36731 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > [info] Caused by: java.lang.UnsatisfiedLinkError: Couldn't load native > library 'brotli'. [LoaderResult: os.name="Mac OS X", os.arch="aarch64", > os.version="11.5.2", java.vm.name="OpenJDK 64-Bit Server VM", > java.vm.version="25.302-b08", java.vm.vendor="Azul Systems, Inc.", > alreadyLoaded="null", loadedFromSystemLibraryPath="false", > nativeLibName="libbrotli.dylib", > temporaryLibFile="/Users/dongjoon/APACHE/spark-merge/target/tmp/brotli8243220902047076449/libbrotli.dylib", > libNameWithinClasspath="/lib/darwin-aarch64/libbrotli.dylib", > usedThisClassloader="false", usedSystemClassloader="false", > java.library.path="/Users/dongjoon/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:."] > [info]at > org.meteogroup.jbrotli.libloader.BrotliLibraryLoader.loadBrotli(BrotliLibraryLoader.java:35) > [info]at > org.apache.hadoop.io.compress.BrotliCodec.(BrotliCodec.java:40) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36670) Add end-to-end codec test cases for main datasources
[ https://issues.apache.org/jira/browse/SPARK-36670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-36670: --- Assignee: L. C. Hsieh (was: Apache Spark) > Add end-to-end codec test cases for main datasources > > > Key: SPARK-36670 > URL: https://issues.apache.org/jira/browse/SPARK-36670 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.2.0 > > > We found there is no e2e test cases available for main datasources like > Parquet, Orc. It makes developers harder to identify possible bugs early. We > should add such tests in Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36700) BlockManager re-registration is broken due to deferred removal of BlockManager
[ https://issues.apache.org/jira/browse/SPARK-36700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413865#comment-17413865 ] wuyi commented on SPARK-36700: -- Reverted by [https://github.com/apache/spark/pull/33942] and backported to 3.2, 3.1, 3.0. > BlockManager re-registration is broken due to deferred removal of > BlockManager > --- > > Key: SPARK-36700 > URL: https://issues.apache.org/jira/browse/SPARK-36700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Priority: Blocker > > Due to the deferred removal of BlockManager (introduced in SPARK-35011), an > expected BlockManager re-registration could be refused as the inactive > BlockManager still exists in the map `blockManagerInfo`: > https://github.com/apache/spark/blob/9cefde8db373a3433b7e3ce328e4a2ce83b1aca2/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L551 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36700) BlockManager re-registration is broken due to deferred removal of BlockManager
[ https://issues.apache.org/jira/browse/SPARK-36700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-36700. -- Fix Version/s: 3.3.0 3.0.4 3.1.3 3.2.0 Assignee: wuyi Resolution: Fixed > BlockManager re-registration is broken due to deferred removal of > BlockManager > --- > > Key: SPARK-36700 > URL: https://issues.apache.org/jira/browse/SPARK-36700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0 >Reporter: wuyi >Assignee: wuyi >Priority: Blocker > Fix For: 3.2.0, 3.1.3, 3.0.4, 3.3.0 > > > Due to the deferred removal of BlockManager (introduced in SPARK-35011), an > expected BlockManager re-registration could be refused as the inactive > BlockManager still exists in the map `blockManagerInfo`: > https://github.com/apache/spark/blob/9cefde8db373a3433b7e3ce328e4a2ce83b1aca2/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L551 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36732) Upgrade ORC to 1.6.11
Dongjoon Hyun created SPARK-36732: - Summary: Upgrade ORC to 1.6.11 Key: SPARK-36732 URL: https://issues.apache.org/jira/browse/SPARK-36732 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 3.2.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36732) Upgrade ORC to 1.6.11
[ https://issues.apache.org/jira/browse/SPARK-36732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36732: Assignee: Apache Spark > Upgrade ORC to 1.6.11 > - > > Key: SPARK-36732 > URL: https://issues.apache.org/jira/browse/SPARK-36732 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36732) Upgrade ORC to 1.6.11
[ https://issues.apache.org/jira/browse/SPARK-36732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413874#comment-17413874 ] Apache Spark commented on SPARK-36732: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/33971 > Upgrade ORC to 1.6.11 > - > > Key: SPARK-36732 > URL: https://issues.apache.org/jira/browse/SPARK-36732 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36732) Upgrade ORC to 1.6.11
[ https://issues.apache.org/jira/browse/SPARK-36732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36732: Assignee: (was: Apache Spark) > Upgrade ORC to 1.6.11 > - > > Key: SPARK-36732 > URL: https://issues.apache.org/jira/browse/SPARK-36732 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36733) Perf issue in SchemaPruning when a struct has million fields
Kohki Nishio created SPARK-36733: Summary: Perf issue in SchemaPruning when a struct has million fields Key: SPARK-36733 URL: https://issues.apache.org/jira/browse/SPARK-36733 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2 Reporter: Kohki Nishio Seeing a significant performance degradation in query processing when a table contains a significantly large number of fields (>10K). Here's the stacktraces while processing a query {code:java} java.lang.Thread.State: RUNNABLE java.lang.Thread.State: RUNNABLE at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) at scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.map(TraversableLike.scala:285) at scala.collection.TraversableLike.map$(TraversableLike.scala:278) at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108) at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70) at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70) at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown Source) at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303) at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown Source) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302) at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296) at scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198) at scala.collection.TraversableLike.filter(TraversableLike.scala:394) at scala.collection.TraversableLike.filter$(TraversableLike.scala:394) at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70) at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75) at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown Source) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36733) Perf issue in SchemaPruning when a struct has million fields
[ https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413882#comment-17413882 ] Kohki Nishio commented on SPARK-36733: -- [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69] often time (as long as I observed), left struct and the right struct are the same one. And every call to {{StructType.fieldNames}} runs\{{ fields.map(_.name). }} this computation is quite expensive for 10K fields. {{ val filteredRightFieldNames = rightStruct.fieldNames}} {{ .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))}}{{ }} > Perf issue in SchemaPruning when a struct has million fields > > > Key: SPARK-36733 > URL: https://issues.apache.org/jira/browse/SPARK-36733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > > Seeing a significant performance degradation in query processing when a table > contains a significantly large number of fields (>10K). > Here's the stacktraces while processing a query > {code:java} > java.lang.Thread.State: RUNNABLE java.lang.Thread.State: RUNNABLE at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) at > scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.map(TraversableLike.scala:285) at > scala.collection.TraversableLike.map$(TraversableLike.scala:278) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at > org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108) at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown > Source) at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303) > at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown > Source) at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302) at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296) at > scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198) at > scala.collection.TraversableLike.filter(TraversableLike.scala:394) at > scala.collection.TraversableLike.filter$(TraversableLike.scala:394) at > scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198) at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown > Source) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
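The hot path the comment points at is visible in the stack trace: {{StructType.fieldNames}} rebuilds the name array with {{fields.map(_.name)}} on every call, and {{sortLeftFieldsByRight}} calls it inside the filter closure, so a 10K-field struct pays for that map roughly 10K times. Below is a minimal sketch of the obvious mitigation, hoisting the computation out of the closure; it is an illustration only, not the actual Spark patch, and the helper name is made up.

{code:scala}
import org.apache.spark.sql.types.StructType

// Illustrative only: compute leftStruct.fieldNames once instead of once per
// right-hand field. `resolver` is the analyzer's name resolver
// ((String, String) => Boolean); the helper name is hypothetical.
def filteredRightFieldNames(
    leftStruct: StructType,
    rightStruct: StructType,
    resolver: (String, String) => Boolean): Array[String] = {
  val leftNames = leftStruct.fieldNames   // materialized a single time
  rightStruct.fieldNames.filter(name => leftNames.exists(resolver(_, name)))
}
{code}

With a case-sensitive resolver this could go further (a hash-set lookup instead of the linear {{exists}}), but even the hoisting alone removes the repeated array materialization that dominates the trace.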
[jira] [Comment Edited] (SPARK-36733) Perf issue in SchemaPruning when a struct has million fields
[ https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413882#comment-17413882 ] Kohki Nishio edited comment on SPARK-36733 at 9/13/21, 3:23 AM: [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69] often time (as long as I observed), left struct and the right struct are the same one. And every call to {{StructType.fieldNames}} runs {{ fields.map(_.name). }} this computation is quite expensive for 10K fields. {{ val filteredRightFieldNames = rightStruct.fieldNames}} {{ .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))}}{{ }} was (Author: taroplus): [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69] often time (as long as I observed), left struct and the right struct are the same one. And every call to {{StructType.fieldNames}} runs\{{ fields.map(_.name). }} this computation is quite expensive for 10K fields. {{ val filteredRightFieldNames = rightStruct.fieldNames}} {{ .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))}}{{ }} > Perf issue in SchemaPruning when a struct has million fields > > > Key: SPARK-36733 > URL: https://issues.apache.org/jira/browse/SPARK-36733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > > Seeing a significant performance degradation in query processing when a table > contains a significantly large number of fields (>10K). > Here's the stacktraces while processing a query > {code:java} > java.lang.Thread.State: RUNNABLE java.lang.Thread.State: RUNNABLE at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) at > scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.map(TraversableLike.scala:285) at > scala.collection.TraversableLike.map$(TraversableLike.scala:278) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at > org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108) at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown > Source) at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303) > at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown > Source) at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302) at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296) at > scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198) at > scala.collection.TraversableLike.filter(TraversableLike.scala:394) at > 
scala.collection.TraversableLike.filter$(TraversableLike.scala:394) at > scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198) at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown > Source) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36733) Perf issue in SchemaPruning when a struct has many fields
[ https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kohki Nishio updated SPARK-36733: - Summary: Perf issue in SchemaPruning when a struct has many fields (was: Perf issue in SchemaPruning when a struct has million fields) > Perf issue in SchemaPruning when a struct has many fields > - > > Key: SPARK-36733 > URL: https://issues.apache.org/jira/browse/SPARK-36733 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kohki Nishio >Priority: Major > > Seeing a significant performance degradation in query processing when a table > contains a significantly large number of fields (>10K). > Here's the stacktraces while processing a query > {code:java} > java.lang.Thread.State: RUNNABLE java.lang.Thread.State: RUNNABLE at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) at > scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.map(TraversableLike.scala:285) at > scala.collection.TraversableLike.map$(TraversableLike.scala:278) at > scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at > org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108) at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown > Source) at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303) > at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown > Source) at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302) at > scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296) at > scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198) at > scala.collection.TraversableLike.filter(TraversableLike.scala:394) at > scala.collection.TraversableLike.filter$(TraversableLike.scala:394) at > scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198) at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75) > at > org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown > Source) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36734) Upgrade ORC to 1.5.13
Dongjoon Hyun created SPARK-36734: - Summary: Upgrade ORC to 1.5.13 Key: SPARK-36734 URL: https://issues.apache.org/jira/browse/SPARK-36734 Project: Spark Issue Type: Bug Components: Build, SQL Affects Versions: 3.1.2 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34183) DataSource V2: Support required distribution and ordering in SS
[ https://issues.apache.org/jira/browse/SPARK-34183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413895#comment-17413895 ] Anton Okolnychyi commented on SPARK-34183: -- Sorry for the delay, [~hyukjin.kwon] [~Gengliang.Wang]. I am planning to update the PR this week. I am not sure this should be a real blocker, however. I was asked to create this issue during the review. > DataSource V2: Support required distribution and ordering in SS > --- > > Key: SPARK-34183 > URL: https://issues.apache.org/jira/browse/SPARK-34183 > Project: Spark > Issue Type: Sub-task > Components: SQL, Structured Streaming >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We need to support a required distribution and ordering for SS. See the > discussion > [here|https://github.com/apache/spark/pull/31083#issuecomment-763214597]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36734) Upgrade ORC to 1.5.13
[ https://issues.apache.org/jira/browse/SPARK-36734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36734: Assignee: (was: Apache Spark) > Upgrade ORC to 1.5.13 > - > > Key: SPARK-36734 > URL: https://issues.apache.org/jira/browse/SPARK-36734 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.1.2 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36734) Upgrade ORC to 1.5.13
[ https://issues.apache.org/jira/browse/SPARK-36734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36734: Assignee: Apache Spark > Upgrade ORC to 1.5.13 > - > > Key: SPARK-36734 > URL: https://issues.apache.org/jira/browse/SPARK-36734 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.1.2 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36734) Upgrade ORC to 1.5.13
[ https://issues.apache.org/jira/browse/SPARK-36734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413896#comment-17413896 ] Apache Spark commented on SPARK-36734: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/33972 > Upgrade ORC to 1.5.13 > - > > Key: SPARK-36734 > URL: https://issues.apache.org/jira/browse/SPARK-36734 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.1.2 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34183) DataSource V2: Support required distribution and ordering in SS
[ https://issues.apache.org/jira/browse/SPARK-34183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413897#comment-17413897 ] Jungtaek Lim commented on SPARK-34183: -- Sorry, I totally missed this issue. My bad. I asked to file this as a blocker because there is uncertainty about the behavior of a streaming query when a data source requires distribution and ordering, and I wanted to see it addressed before releasing. For example, in the worst case, data sources requiring distribution/ordering may no longer work with SS at all, if the Spark implementation requires a "sort" across micro-batches. Would a data source be able to (and be expected to) indicate the type of the query (batch vs streaming) and optionally provide requirements of distribution/ordering? I'm not sure. Even if it is technically possible, we should guide data source implementers to do so. Otherwise it's going to be another surprising thing in Spark. > DataSource V2: Support required distribution and ordering in SS > --- > > Key: SPARK-34183 > URL: https://issues.apache.org/jira/browse/SPARK-34183 > Project: Spark > Issue Type: Sub-task > Components: SQL, Structured Streaming >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We need to support a required distribution and ordering for SS. See the > discussion > [here|https://github.com/apache/spark/pull/31083#issuecomment-763214597]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
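For context, the batch write path in Spark 3.2 already exposes this kind of declaration through the {{RequiresDistributionAndOrdering}} mix-in on {{Write}}; the open question in this ticket is how such a declaration should behave for streaming writes, where any required sort can realistically apply only within a micro-batch. The sketch below is a hedged illustration of the existing batch-side interface, not the streaming design under discussion; the class name and the "key" column are hypothetical.

{code:scala}
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.RequiresDistributionAndOrdering

// Hypothetical connector write declaring a clustering requirement on "key".
// Whether and how Spark should honor this per micro-batch for streaming
// writes is exactly what SPARK-34183 needs to settle.
class MyTableWrite extends RequiresDistributionAndOrdering {
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.column("key")))

  // No within-partition ordering required in this example.
  override def requiredOrdering(): Array[SortOrder] = Array.empty[SortOrder]
}
{code}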
[jira] [Created] (SPARK-36735) Adjust overhead of cached relation for DPP
L. C. Hsieh created SPARK-36735: --- Summary: Adjust overhead of cached relation for DPP Key: SPARK-36735 URL: https://issues.apache.org/jira/browse/SPARK-36735 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: L. C. Hsieh Currently we decide whether pruning with DPP is beneficial by simply summing up the size of all scan relations as the overhead. However, for cached relations, the overhead should be different from that of a non-cached relation. This proposes to use an adjusted overhead for cached relations with DPP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
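Roughly, the idea is that when the scan feeding the pruning filter is a cached relation, its size should count for less on the overhead side of the benefit check, since re-reading cached data is cheaper than re-scanning files. The sketch below only illustrates that shape; it is not the actual Spark change, and the discount factor is entirely made up.

{code:scala}
// Illustrative sketch, not the Spark patch. The benefit check for dynamic
// partition pruning compares estimated pruning savings against an overhead
// term; today that overhead is roughly the summed scan size. The hypothetical
// cachedDiscount shows how a cached relation could be weighted differently.
def pruningOverhead(scanSizeInBytes: Long, isCached: Boolean): Double = {
  val cachedDiscount = 0.2  // hypothetical: a cached scan counts for 20%
  if (isCached) scanSizeInBytes * cachedDiscount else scanSizeInBytes.toDouble
}

// DPP is considered beneficial when the estimated bytes pruned from the fact
// table exceed the (possibly discounted) overhead of the extra scan.
def pruningHasBenefit(estimatedBytesPruned: Double, overhead: Double): Boolean =
  estimatedBytesPruned > overhead
{code}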
[jira] [Assigned] (SPARK-36716) Join estimation support LeftExistence join type
[ https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36716: Assignee: Apache Spark > Join estimation support LeftExistence join type > --- > > Key: SPARK-36716 > URL: https://issues.apache.org/jira/browse/SPARK-36716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > Attachments: image-2021-09-10-17-17-33-082.png > > > Join estimation support LeftExistence join type. This can benefit tpcds q10. > Before: > {noformat} > TakeOrderedAndProject (51) > +- * HashAggregate (50) >+- Exchange (49) > +- * HashAggregate (48) > +- * Project (47) > +- * SortMergeJoin Inner (46) >:- * Sort (40) >: +- Exchange (39) >: +- * Project (38) >:+- * BroadcastHashJoin Inner BuildRight (37) >: :- * Project (31) >: : +- * Filter (30) >: : +- SortMergeJoin ExistenceJoin(exists#1) (29) >: ::- SortMergeJoin ExistenceJoin(exists#2) > (21) >: :: :- * SortMergeJoin LeftSemi (13) >: :: : :- * Sort (5) >: :: : : +- Exchange (4) >: :: : : +- * Filter (3) >: :: : :+- * ColumnarToRow (2) >: :: : : +- Scan parquet > default.customer (1) >: :: : +- * Sort (12) >: :: : +- Exchange (11) >: :: :+- * Project (10) >: :: : +- * BroadcastHashJoin > Inner BuildRight (9) >: :: : :- * ColumnarToRow (7) >: :: : : +- Scan parquet > default.store_sales (6) >: :: : +- ReusedExchange (8) >: :: +- * Sort (20) >: :: +- Exchange (19) >: ::+- * Project (18) >: :: +- * BroadcastHashJoin Inner > BuildRight (17) >: :: :- * ColumnarToRow (15) >: :: : +- Scan parquet > default.web_sales (14) >: :: +- ReusedExchange (16) >: :+- * Sort (28) >: : +- Exchange (27) >: : +- * Project (26) >: : +- * BroadcastHashJoin Inner > BuildRight (25) >: ::- * ColumnarToRow (23) >: :: +- Scan parquet > default.catalog_sales (22) >: :+- ReusedExchange (24) >: +- BroadcastExchange (36) >: +- * Project (35) >: +- * Filter (34) >:+- * ColumnarToRow (33) >: +- Scan parquet > default.customer_address (32) >+- * Sort (45) > +- Exchange (44) > +- * Filter (43) > +- * ColumnarToRow (42) >+- Scan parquet default.customer_demographics (41) > {noformat} > After: > {noformat} > TakeOrderedAndProject (48) > +- * HashAggregate (47) >+- Exchange (46) > +- * HashAggregate (45) > +- * Project (44) > +- * BroadcastHashJoin Inner BuildLeft (43) >:- BroadcastExchange (39) >: +- * Project (38) >: +- * BroadcastHashJoin Inner BuildRight (37) >::- * Project (31) >:: +- * Filter (30) >:: +- SortMergeJoin ExistenceJoin(exists#1) (29) >:::- SortMergeJoin ExistenceJoin(exists#2) (21) >::: :- * SortMergeJoin LeftSemi (13) >::: : :- * Sort (5) >::: : : +- Exchange (4) >::: : : +- * Filter (3) >::: : :+- * ColumnarToRow (2) >:
[jira] [Commented] (SPARK-36735) Adjust overhead of cached relation for DPP
[ https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413905#comment-17413905 ] Apache Spark commented on SPARK-36735: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33975 > Adjust overhead of cached relation for DPP > -- > > Key: SPARK-36735 > URL: https://issues.apache.org/jira/browse/SPARK-36735 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently we calculate if there is benefit of pruning with DPP by simply > summing up the size of all scan relations as the overhead. However, for > cached relations, the overhead should be different than a non-cached > relation. This proposes to use adjusted overhead for cached relation with DPP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36730) Use V2 Filter in V2 file source
[ https://issues.apache.org/jira/browse/SPARK-36730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413904#comment-17413904 ] Apache Spark commented on SPARK-36730: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/33973 > Use V2 Filter in V2 file source > --- > > Key: SPARK-36730 > URL: https://issues.apache.org/jira/browse/SPARK-36730 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > > Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36716) Join estimation support LeftExistence join type
[ https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413903#comment-17413903 ] Apache Spark commented on SPARK-36716: -- User '007akuan' has created a pull request for this issue: https://github.com/apache/spark/pull/33974 > Join estimation support LeftExistence join type > --- > > Key: SPARK-36716 > URL: https://issues.apache.org/jira/browse/SPARK-36716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > Attachments: image-2021-09-10-17-17-33-082.png > > > Join estimation support LeftExistence join type. This can benefit tpcds q10. > Before: > {noformat} > TakeOrderedAndProject (51) > +- * HashAggregate (50) >+- Exchange (49) > +- * HashAggregate (48) > +- * Project (47) > +- * SortMergeJoin Inner (46) >:- * Sort (40) >: +- Exchange (39) >: +- * Project (38) >:+- * BroadcastHashJoin Inner BuildRight (37) >: :- * Project (31) >: : +- * Filter (30) >: : +- SortMergeJoin ExistenceJoin(exists#1) (29) >: ::- SortMergeJoin ExistenceJoin(exists#2) > (21) >: :: :- * SortMergeJoin LeftSemi (13) >: :: : :- * Sort (5) >: :: : : +- Exchange (4) >: :: : : +- * Filter (3) >: :: : :+- * ColumnarToRow (2) >: :: : : +- Scan parquet > default.customer (1) >: :: : +- * Sort (12) >: :: : +- Exchange (11) >: :: :+- * Project (10) >: :: : +- * BroadcastHashJoin > Inner BuildRight (9) >: :: : :- * ColumnarToRow (7) >: :: : : +- Scan parquet > default.store_sales (6) >: :: : +- ReusedExchange (8) >: :: +- * Sort (20) >: :: +- Exchange (19) >: ::+- * Project (18) >: :: +- * BroadcastHashJoin Inner > BuildRight (17) >: :: :- * ColumnarToRow (15) >: :: : +- Scan parquet > default.web_sales (14) >: :: +- ReusedExchange (16) >: :+- * Sort (28) >: : +- Exchange (27) >: : +- * Project (26) >: : +- * BroadcastHashJoin Inner > BuildRight (25) >: ::- * ColumnarToRow (23) >: :: +- Scan parquet > default.catalog_sales (22) >: :+- ReusedExchange (24) >: +- BroadcastExchange (36) >: +- * Project (35) >: +- * Filter (34) >:+- * ColumnarToRow (33) >: +- Scan parquet > default.customer_address (32) >+- * Sort (45) > +- Exchange (44) > +- * Filter (43) > +- * ColumnarToRow (42) >+- Scan parquet default.customer_demographics (41) > {noformat} > After: > {noformat} > TakeOrderedAndProject (48) > +- * HashAggregate (47) >+- Exchange (46) > +- * HashAggregate (45) > +- * Project (44) > +- * BroadcastHashJoin Inner BuildLeft (43) >:- BroadcastExchange (39) >: +- * Project (38) >: +- * BroadcastHashJoin Inner BuildRight (37) >::- * Project (31) >:: +- * Filter (30) >:: +- SortMergeJoin ExistenceJoin(exists#1) (29) >:::- SortMergeJoin ExistenceJoin(exists#2) (21) >::: :- * SortMergeJoin LeftSemi (13) >::: : :- * Sort (5) >::: : : +- Exchange (4) >::: : : +- * Filter (3) >
[jira] [Assigned] (SPARK-36730) Use V2 Filter in V2 file source
[ https://issues.apache.org/jira/browse/SPARK-36730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36730: Assignee: (was: Apache Spark) > Use V2 Filter in V2 file source > --- > > Key: SPARK-36730 > URL: https://issues.apache.org/jira/browse/SPARK-36730 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Priority: Major > > Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36735) Adjust overhead of cached relation for DPP
[ https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36735: Assignee: (was: Apache Spark) > Adjust overhead of cached relation for DPP > -- > > Key: SPARK-36735 > URL: https://issues.apache.org/jira/browse/SPARK-36735 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently we calculate if there is benefit of pruning with DPP by simply > summing up the size of all scan relations as the overhead. However, for > cached relations, the overhead should be different than a non-cached > relation. This proposes to use adjusted overhead for cached relation with DPP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36730) Use V2 Filter in V2 file source
[ https://issues.apache.org/jira/browse/SPARK-36730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36730: Assignee: Apache Spark > Use V2 Filter in V2 file source > --- > > Key: SPARK-36730 > URL: https://issues.apache.org/jira/browse/SPARK-36730 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36716) Join estimation support LeftExistence join type
[ https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36716: Assignee: (was: Apache Spark) > Join estimation support LeftExistence join type > --- > > Key: SPARK-36716 > URL: https://issues.apache.org/jira/browse/SPARK-36716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > Attachments: image-2021-09-10-17-17-33-082.png > > > Join estimation support LeftExistence join type. This can benefit tpcds q10. > Before: > {noformat} > TakeOrderedAndProject (51) > +- * HashAggregate (50) >+- Exchange (49) > +- * HashAggregate (48) > +- * Project (47) > +- * SortMergeJoin Inner (46) >:- * Sort (40) >: +- Exchange (39) >: +- * Project (38) >:+- * BroadcastHashJoin Inner BuildRight (37) >: :- * Project (31) >: : +- * Filter (30) >: : +- SortMergeJoin ExistenceJoin(exists#1) (29) >: ::- SortMergeJoin ExistenceJoin(exists#2) > (21) >: :: :- * SortMergeJoin LeftSemi (13) >: :: : :- * Sort (5) >: :: : : +- Exchange (4) >: :: : : +- * Filter (3) >: :: : :+- * ColumnarToRow (2) >: :: : : +- Scan parquet > default.customer (1) >: :: : +- * Sort (12) >: :: : +- Exchange (11) >: :: :+- * Project (10) >: :: : +- * BroadcastHashJoin > Inner BuildRight (9) >: :: : :- * ColumnarToRow (7) >: :: : : +- Scan parquet > default.store_sales (6) >: :: : +- ReusedExchange (8) >: :: +- * Sort (20) >: :: +- Exchange (19) >: ::+- * Project (18) >: :: +- * BroadcastHashJoin Inner > BuildRight (17) >: :: :- * ColumnarToRow (15) >: :: : +- Scan parquet > default.web_sales (14) >: :: +- ReusedExchange (16) >: :+- * Sort (28) >: : +- Exchange (27) >: : +- * Project (26) >: : +- * BroadcastHashJoin Inner > BuildRight (25) >: ::- * ColumnarToRow (23) >: :: +- Scan parquet > default.catalog_sales (22) >: :+- ReusedExchange (24) >: +- BroadcastExchange (36) >: +- * Project (35) >: +- * Filter (34) >:+- * ColumnarToRow (33) >: +- Scan parquet > default.customer_address (32) >+- * Sort (45) > +- Exchange (44) > +- * Filter (43) > +- * ColumnarToRow (42) >+- Scan parquet default.customer_demographics (41) > {noformat} > After: > {noformat} > TakeOrderedAndProject (48) > +- * HashAggregate (47) >+- Exchange (46) > +- * HashAggregate (45) > +- * Project (44) > +- * BroadcastHashJoin Inner BuildLeft (43) >:- BroadcastExchange (39) >: +- * Project (38) >: +- * BroadcastHashJoin Inner BuildRight (37) >::- * Project (31) >:: +- * Filter (30) >:: +- SortMergeJoin ExistenceJoin(exists#1) (29) >:::- SortMergeJoin ExistenceJoin(exists#2) (21) >::: :- * SortMergeJoin LeftSemi (13) >::: : :- * Sort (5) >::: : : +- Exchange (4) >::: : : +- * Filter (3) >::: : :+- * ColumnarToRow (2) >::: : :
[jira] [Assigned] (SPARK-36735) Adjust overhead of cached relation for DPP
[ https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36735: Assignee: Apache Spark > Adjust overhead of cached relation for DPP > -- > > Key: SPARK-36735 > URL: https://issues.apache.org/jira/browse/SPARK-36735 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Currently we calculate if there is benefit of pruning with DPP by simply > summing up the size of all scan relations as the overhead. However, for > cached relations, the overhead should be different than a non-cached > relation. This proposes to use adjusted overhead for cached relation with DPP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36735) Adjust overhead of cached relation for DPP
[ https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413906#comment-17413906 ] Apache Spark commented on SPARK-36735: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33975 > Adjust overhead of cached relation for DPP > -- > > Key: SPARK-36735 > URL: https://issues.apache.org/jira/browse/SPARK-36735 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Priority: Major > > Currently we calculate if there is benefit of pruning with DPP by simply > summing up the size of all scan relations as the overhead. However, for > cached relations, the overhead should be different than a non-cached > relation. This proposes to use adjusted overhead for cached relation with DPP. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36716) Join estimation support LeftExistence join type
[ https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-36716: Attachment: (was: image-2021-09-10-17-17-33-082.png) > Join estimation support LeftExistence join type > --- > > Key: SPARK-36716 > URL: https://issues.apache.org/jira/browse/SPARK-36716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Priority: Major > > Join estimation support LeftExistence join type. This can benefit tpcds q10. > Before: > {noformat} > TakeOrderedAndProject (51) > +- * HashAggregate (50) >+- Exchange (49) > +- * HashAggregate (48) > +- * Project (47) > +- * SortMergeJoin Inner (46) >:- * Sort (40) >: +- Exchange (39) >: +- * Project (38) >:+- * BroadcastHashJoin Inner BuildRight (37) >: :- * Project (31) >: : +- * Filter (30) >: : +- SortMergeJoin ExistenceJoin(exists#1) (29) >: ::- SortMergeJoin ExistenceJoin(exists#2) > (21) >: :: :- * SortMergeJoin LeftSemi (13) >: :: : :- * Sort (5) >: :: : : +- Exchange (4) >: :: : : +- * Filter (3) >: :: : :+- * ColumnarToRow (2) >: :: : : +- Scan parquet > default.customer (1) >: :: : +- * Sort (12) >: :: : +- Exchange (11) >: :: :+- * Project (10) >: :: : +- * BroadcastHashJoin > Inner BuildRight (9) >: :: : :- * ColumnarToRow (7) >: :: : : +- Scan parquet > default.store_sales (6) >: :: : +- ReusedExchange (8) >: :: +- * Sort (20) >: :: +- Exchange (19) >: ::+- * Project (18) >: :: +- * BroadcastHashJoin Inner > BuildRight (17) >: :: :- * ColumnarToRow (15) >: :: : +- Scan parquet > default.web_sales (14) >: :: +- ReusedExchange (16) >: :+- * Sort (28) >: : +- Exchange (27) >: : +- * Project (26) >: : +- * BroadcastHashJoin Inner > BuildRight (25) >: ::- * ColumnarToRow (23) >: :: +- Scan parquet > default.catalog_sales (22) >: :+- ReusedExchange (24) >: +- BroadcastExchange (36) >: +- * Project (35) >: +- * Filter (34) >:+- * ColumnarToRow (33) >: +- Scan parquet > default.customer_address (32) >+- * Sort (45) > +- Exchange (44) > +- * Filter (43) > +- * ColumnarToRow (42) >+- Scan parquet default.customer_demographics (41) > {noformat} > After: > {noformat} > TakeOrderedAndProject (48) > +- * HashAggregate (47) >+- Exchange (46) > +- * HashAggregate (45) > +- * Project (44) > +- * BroadcastHashJoin Inner BuildLeft (43) >:- BroadcastExchange (39) >: +- * Project (38) >: +- * BroadcastHashJoin Inner BuildRight (37) >::- * Project (31) >:: +- * Filter (30) >:: +- SortMergeJoin ExistenceJoin(exists#1) (29) >:::- SortMergeJoin ExistenceJoin(exists#2) (21) >::: :- * SortMergeJoin LeftSemi (13) >::: : :- * Sort (5) >::: : : +- Exchange (4) >::: : : +- * Filter (3) >::: : :+- * ColumnarToRow (2) >::: : : +- Scan parquet > default.customer
[jira] [Commented] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable
[ https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413929#comment-17413929 ] Apache Spark commented on SPARK-36705: -- User 'rmcyang' has created a pull request for this issue: https://github.com/apache/spark/pull/33976 > Disable push based shuffle when IO encryption is enabled or serializer is not > relocatable > - > > Key: SPARK-36705 > URL: https://issues.apache.org/jira/browse/SPARK-36705 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Mridul Muralidharan >Priority: Blocker > > Push based shuffle is not compatible with io encryption or non-relocatable > serialization. > This is similar to SPARK-34790 > We have to disable push based shuffle if either of these two are true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
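The intended guard is simple; a hedged sketch (not the exact Spark change) follows. The {{serializerRelocatable}} parameter stands in for {{Serializer.supportsRelocationOfSerializedObjects}}, which is package-private inside Spark, and the config keys are the existing {{spark.shuffle.push.enabled}} and {{spark.io.encryption.enabled}}.

{code:scala}
import org.apache.spark.SparkConf

// Push-based shuffle must stay off when IO encryption is enabled or the
// configured serializer cannot relocate serialized objects (the same
// conditions that gate other shuffle optimizations, cf. SPARK-34790).
def pushBasedShuffleAllowed(conf: SparkConf, serializerRelocatable: Boolean): Boolean = {
  val pushEnabled  = conf.getBoolean("spark.shuffle.push.enabled", false)
  val ioEncryption = conf.getBoolean("spark.io.encryption.enabled", false)
  pushEnabled && !ioEncryption && serializerRelocatable
}
{code}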
[jira] [Assigned] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable
[ https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36705: Assignee: (was: Apache Spark) > Disable push based shuffle when IO encryption is enabled or serializer is not > relocatable > - > > Key: SPARK-36705 > URL: https://issues.apache.org/jira/browse/SPARK-36705 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Mridul Muralidharan >Priority: Blocker > > Push based shuffle is not compatible with io encryption or non-relocatable > serialization. > This is similar to SPARK-34790 > We have to disable push based shuffle if either of these two are true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable
[ https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36705: Assignee: Apache Spark > Disable push based shuffle when IO encryption is enabled or serializer is not > relocatable > - > > Key: SPARK-36705 > URL: https://issues.apache.org/jira/browse/SPARK-36705 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.2.0 >Reporter: Mridul Muralidharan >Assignee: Apache Spark >Priority: Blocker > > Push based shuffle is not compatible with io encryption or non-relocatable > serialization. > This is similar to SPARK-34790 > We have to disable push based shuffle if either of these two are true. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org