[jira] [Created] (SPARK-36726) Upgrade Parquet to 1.12.1

2021-09-12 Thread Chao Sun (Jira)
Chao Sun created SPARK-36726:


 Summary: Upgrade Parquet to 1.12.1
 Key: SPARK-36726
 URL: https://issues.apache.org/jira/browse/SPARK-36726
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Upgrade Apache Parquet to 1.12.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36726) Upgrade Parquet to 1.12.1

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36726:


Assignee: Apache Spark

> Upgrade Parquet to 1.12.1
> -
>
> Key: SPARK-36726
> URL: https://issues.apache.org/jira/browse/SPARK-36726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Upgrade Apache Parquet to 1.12.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36726) Upgrade Parquet to 1.12.1

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413676#comment-17413676
 ] 

Apache Spark commented on SPARK-36726:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/33969

> Upgrade Parquet to 1.12.1
> -
>
> Key: SPARK-36726
> URL: https://issues.apache.org/jira/browse/SPARK-36726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Upgrade Apache Parquet to 1.12.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36726) Upgrade Parquet to 1.12.1

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36726:


Assignee: (was: Apache Spark)

> Upgrade Parquet to 1.12.1
> -
>
> Key: SPARK-36726
> URL: https://issues.apache.org/jira/browse/SPARK-36726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Major
>
> Upgrade Apache Parquet to 1.12.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32285) Add PySpark support for nested timestamps with arrow

2021-09-12 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413679#comment-17413679
 ] 

pralabhkumar commented on SPARK-32285:
--

Thanks, will share the PR in some time.

> Add PySpark support for nested timestamps with arrow
> 
>
> Key: SPARK-32285
> URL: https://issues.apache.org/jira/browse/SPARK-32285
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, with Arrow optimizations, post-processing is done in pandas for 
> timestamp columns to localize the timezone. This is not done for nested 
> columns with timestamps, such as StructType or ArrayType.
> Adding support for this is needed for the Apache Arrow 1.0.0 upgrade, due to 
> the use of structs with timestamps in a groupby key over a window.
> As a simple first step, timestamps with one level of nesting could be handled 
> first; this satisfies the immediate need.
> NOTE: with Arrow 1.0.0, it might be possible to do the timezone processing 
> with pyarrow.array.cast, which could be easier than doing it in pandas.
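
As a rough illustration of that idea (a standalone sketch, not Spark's actual code path; the values, unit, and timezone below are made up), casting an Arrow timestamp array to a timezone-aware type looks like this:

{code:python}
import pyarrow as pa

# Two naive (timezone-less) microsecond timestamps.
arr = pa.array([1594857600000000, 1594944000000000], type=pa.timestamp("us"))

# Attach a timezone by casting; the stored values are interpreted as UTC.
localized = arr.cast(pa.timestamp("us", tz="America/Los_Angeles"))
print(localized.type)  # timestamp[us, tz=America/Los_Angeles]
{code}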



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36727) Support SQL overwrite of a path that is also being read from when partitionOverwriteMode is dynamic

2021-09-12 Thread Tongwei (Jira)
Tongwei created SPARK-36727:
---

 Summary: Support SQL overwrite of a path that is also being read from when partitionOverwriteMode is dynamic
 Key: SPARK-36727
 URL: https://issues.apache.org/jira/browse/SPARK-36727
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2
Reporter: Tongwei






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-33648) Moving file stage failure causes duplicated data

2021-09-12 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei closed SPARK-33648.
---

> Moving file stage failure causes duplicated data 
> ---
>
> Key: SPARK-33648
> URL: https://issues.apache.org/jira/browse/SPARK-33648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Tongwei
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36727) Support SQL overwrite of a path that is also being read from when partitionOverwriteMode is dynamic

2021-09-12 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-36727:

Description: 
{code:java}
-- non-partitioned table overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
INSERT OVERWRITE TABLE tbl SELECT 0, 1;
INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;

-- partitioned table static overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
{code}
When we run the above queries, the error "Cannot overwrite a path that is also being read from" is thrown.

 

 

> Support SQL overwrite of a path that is also being read from when 
> partitionOverwriteMode is dynamic
> 
>
> Key: SPARK-36727
> URL: https://issues.apache.org/jira/browse/SPARK-36727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tongwei
>Priority: Minor
>
> {code:java}
> -- non-partitioned table overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
> INSERT OVERWRITE TABLE tbl SELECT 0, 1;
> INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;
> -- partitioned table static overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
> INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
> INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
> {code}
> When we run the above queries, the error "Cannot overwrite a path that is also 
> being read from" is thrown.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36727) Support SQL overwrite of a path that is also being read from when partitionOverwriteMode is dynamic

2021-09-12 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-36727:

Description: 
{code:java}
-- non-partitioned table overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
INSERT OVERWRITE TABLE tbl SELECT 0, 1;
INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;

-- partitioned table static overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
{code}
When we run the above queries, the error "Cannot overwrite a path that is also being read from" is thrown.

We need to support this operation when the weather is good

 

  was:
{code:java}
-- non-partitioned table overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
INSERT OVERWRITE TABLE tbl SELECT 0, 1;
INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;

-- partitioned table static overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
{code}
When we run the above queries, the error "Cannot overwrite a path that is also being read from" is thrown.

 

 


> Support SQL overwrite of a path that is also being read from when 
> partitionOverwriteMode is dynamic
> 
>
> Key: SPARK-36727
> URL: https://issues.apache.org/jira/browse/SPARK-36727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tongwei
>Priority: Minor
>
> {code:java}
> -- non-partitioned table overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
> INSERT OVERWRITE TABLE tbl SELECT 0, 1;
> INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;
> -- partitioned table static overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
> INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
> INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
> {code}
> When we run the above queries, the error "Cannot overwrite a path that is also 
> being read from" is thrown.
> We need to support this operation when the weather is good
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36727) Support SQL overwrite of a path that is also being read from when partitionOverwriteMode is dynamic

2021-09-12 Thread Tongwei (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tongwei updated SPARK-36727:

Description: 
{code:java}
-- non-partitioned table overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
INSERT OVERWRITE TABLE tbl SELECT 0, 1;
INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;

-- partitioned table static overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
{code}
When we run the above queries, the error "Cannot overwrite a path that is also being read from" is thrown.

We need to support this operation when spark.sql.sources.partitionOverwriteMode is dynamic.
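
For reference, the mode in question is controlled by a session configuration; a minimal PySpark sketch of the scenario this ticket asks to support (table and column names taken from the example above):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Overwriting a partition while reading the same table currently fails with
# "Cannot overwrite a path that is also being read from".
spark.sql(
    "INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) "
    "SELECT col1, col2 FROM tbl WHERE p1 = 2021"
)
{code}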

  was:
{code:java}
-- non-partitioned table overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
INSERT OVERWRITE TABLE tbl SELECT 0, 1;
INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;

-- partitioned table static overwrite
CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
{code}
When we run the above queries, the error "Cannot overwrite a path that is also being read from" is thrown.

We need to support this operation when the weather is good

 


> Support SQL overwrite of a path that is also being read from when 
> partitionOverwriteMode is dynamic
> 
>
> Key: SPARK-36727
> URL: https://issues.apache.org/jira/browse/SPARK-36727
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Tongwei
>Priority: Minor
>
> {code:java}
> -- non-partitioned table overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET;
> INSERT OVERWRITE TABLE tbl SELECT 0, 1;
> INSERT OVERWRITE TABLE tbl SELECT * FROM tbl;
> -- partitioned table static overwrite
> CREATE TABLE tbl (col1 INT, col2 STRING) USING PARQUET PARTITIONED BY (p1 INT);
> INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT 0 AS col1, 1 AS col2;
> INSERT OVERWRITE TABLE tbl PARTITION (p1=2021) SELECT col1, col2 FROM tbl WHERE p1=2021;
> {code}
> When we run the above queries, the error "Cannot overwrite a path that is also 
> being read from" is thrown.
> We need to support this operation when spark.sql.sources.partitionOverwriteMode 
> is dynamic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36728) Can't create datetime object from anything other than year column PySpark - koalas

2021-09-12 Thread Jira
Bjørn Jørgensen created SPARK-36728:
---

 Summary: Can't create datetime object from anything other than year column PySpark - koalas
 Key: SPARK-36728
 URL: https://issues.apache.org/jira/browse/SPARK-36728
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Bjørn Jørgensen


If I create a datetime object, it must be from columns with the standard names ('year', 'month', 'day', ...).

df = ps.DataFrame({'year': [2015, 2016],
                   'month': [2, 3],
                   'day': [4, 5],
                   'hour': [2, 3],
                   'minute': [10, 30],
                   'second': [21, 25]})
df.info()

Int64Index: 2 entries, 1 to 0
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    2 non-null      int64
 1   month   2 non-null      int64
 2   day     2 non-null      int64
 3   hour    2 non-null      int64
 4   minute  2 non-null      int64
 5   second  2 non-null      int64
dtypes: int64(6)

df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
df.info()

Int64Index: 2 entries, 1 to 0
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   year    2 non-null      int64
 1   month   2 non-null      int64
 2   day     2 non-null      int64
 3   hour    2 non-null      int64
 4   minute  2 non-null      int64
 5   second  2 non-null      int64
 6   date    2 non-null      datetime64
dtypes: datetime64(1), int64(6)

df_test = ps.DataFrame({'testyear': [2015, 2016],
                        'testmonth': [2, 3],
                        'testday': [4, 5],
                        'hour': [2, 3],
                        'minute': [10, 30],
                        'second': [21, 25]})
df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_73/904491906.py in <module>
----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])

/opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
  11853             return self.loc[:, key]
  11854         elif is_list_like(key):
-> 11855             return self.loc[:, list(key)]
  11856         raise NotImplementedError(key)
  11857

/opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
    476                 returns_series,
    477                 series_name,
--> 478             ) = self._select_cols(cols_sel)
    479
    480             if cond is None and limit is None and returns_series:

/opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
    322             return self._select_cols_else(cols_sel, missing_keys)
    323         elif is_list_like(cols_sel):
--> 324             return self._select_cols_by_iterable(cols_sel, missing_keys)
    325         else:
    326             return self._select_cols_else(cols_sel, missing_keys)

/opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
   1352                 if not found:
   1353                     if missing_keys is None:
-> 1354                         raise KeyError("['{}'] not in index".format(name_like_string(key)))
   1355                     else:
   1356                         missing_keys.append(key)

KeyError: "['testyear'] not in index"

df_test
   testyear  testmonth  testday  hour  minute  second
0      2015          2        4     2      10      21
1      2016          3        5     3      30      25



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36728) Can't create datetime object from anything other than year column PySpark - koalas

2021-09-12 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bjørn Jørgensen updated SPARK-36728:

Attachment: pyspark_date.txt

> Can't create datetime object from anything other than year column PySpark - 
> koalas
> --
>
> Key: SPARK-36728
> URL: https://issues.apache.org/jira/browse/SPARK-36728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: pyspark_date.txt
>
>
> If I create a datetime object, it must be from columns with the standard names ('year', 'month', 'day', ...).
>
> df = ps.DataFrame({'year': [2015, 2016],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                    'hour': [2, 3],
>                    'minute': [10, 30],
>                    'second': [21, 25]})
> df.info()
>
> Int64Index: 2 entries, 1 to 0
> Data columns (total 6 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
> dtypes: int64(6)
>
> df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
> df.info()
>
> Int64Index: 2 entries, 1 to 0
> Data columns (total 7 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
>  6   date    2 non-null      datetime64
> dtypes: datetime64(1), int64(6)
>
> df_test = ps.DataFrame({'testyear': [2015, 2016],
>                         'testmonth': [2, 3],
>                         'testday': [4, 5],
>                         'hour': [2, 3],
>                         'minute': [10, 30],
>                         'second': [21, 25]})
> df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
>
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_73/904491906.py in <module>
> ----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
>
> /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
>   11853             return self.loc[:, key]
>   11854         elif is_list_like(key):
> -> 11855             return self.loc[:, list(key)]
>   11856         raise NotImplementedError(key)
>   11857
>
> /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
>     476                 returns_series,
>     477                 series_name,
> --> 478             ) = self._select_cols(cols_sel)
>     479
>     480             if cond is None and limit is None and returns_series:
>
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
>     322             return self._select_cols_else(cols_sel, missing_keys)
>     323         elif is_list_like(cols_sel):
> --> 324             return self._select_cols_by_iterable(cols_sel, missing_keys)
>     325         else:
>     326             return self._select_cols_else(cols_sel, missing_keys)
>
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
>    1352                 if not found:
>    1353                     if missing_keys is None:
> -> 1354                         raise KeyError("['{}'] not in index".format(name_like_string(key)))
>    1355                     else:
>    1356                         missing_keys.append(key)
>
> KeyError: "['testyear'] not in index"
>
> df_test
>    testyear  testmonth  testday  hour  minute  second
> 0      2015          2        4     2      10      21
> 1      2016          3        5     3      30      25



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68

2021-09-12 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36729:
--

 Summary: Upgrade Netty from 4.1.63 to 4.1.68
 Key: SPARK-36729
 URL: https://issues.apache.org/jira/browse/SPARK-36729
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Recently Netty 4.1.68 was released, which includes official M1 Mac support.
https://github.com/netty/netty/pull/11666

4.1.65 also includes a critical fix for a bug that might affect Spark.
https://github.com/netty/netty/issues/11209



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413718#comment-17413718
 ] 

Apache Spark commented on SPARK-36729:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33970

> Upgrade Netty from 4.1.63 to 4.1.68
> ---
>
> Key: SPARK-36729
> URL: https://issues.apache.org/jira/browse/SPARK-36729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Recently Netty 4.1.68 was released, which includes official M1 Mac support.
> https://github.com/netty/netty/pull/11666
> 4.1.65 also includes a critical fix for a bug that might affect Spark.
> https://github.com/netty/netty/issues/11209



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36729:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade Netty from 4.1.63 to 4.1.68
> ---
>
> Key: SPARK-36729
> URL: https://issues.apache.org/jira/browse/SPARK-36729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Major
>
> Recently Netty 4.1.68 was released, which includes official M1 Mac support.
> https://github.com/netty/netty/pull/11666
> 4.1.65 also includes a critical fix for a bug that might affect Spark.
> https://github.com/netty/netty/issues/11209



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413717#comment-17413717
 ] 

Apache Spark commented on SPARK-36729:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33970

> Upgrade Netty from 4.1.63 to 4.1.68
> ---
>
> Key: SPARK-36729
> URL: https://issues.apache.org/jira/browse/SPARK-36729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Recently Netty 4.1.68 was released, which includes official M1 Mac support.
> https://github.com/netty/netty/pull/11666
> 4.1.65 also includes a critical fix for a bug that might affect Spark.
> https://github.com/netty/netty/issues/11209



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36729:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade Netty from 4.1.63 to 4.1.68
> ---
>
> Key: SPARK-36729
> URL: https://issues.apache.org/jira/browse/SPARK-36729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> Recently Netty 4.1.68 was released, which includes official M1 Mac support.
> https://github.com/netty/netty/pull/11666
> 4.1.65 also includes a critical fix for a bug that might affect Spark.
> https://github.com/netty/netty/issues/11209



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36728) Can't create datetime object from anything other than year column PySpark - koalas

2021-09-12 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413725#comment-17413725
 ] 

dgd_contributor commented on SPARK-36728:
-

I think this is not a bug; pandas has the same behavior. We need to set the column 
names to ones like ['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] 
or plurals of the same. 
[docs|https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#]


{code:java}
>>> df_test = pd.DataFrame(
... {
... 'testyear': [2015, 2016],
... 'testmonth': [2, 3],
... 'testday': [4, 5],
... }
... ) 
>>> pd.to_datetime(df_test[['testyear', 'testmonth', 'testday']])
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/dgd/spark/python/venv/lib/python3.8/site-packages/pandas/core/tools/datetimes.py",
 line 890, in to_datetime
result = _assemble_from_unit_mappings(arg, errors, tz)
  File 
"/Users/dgd/spark/python/venv/lib/python3.8/site-packages/pandas/core/tools/datetimes.py",
 line 996, in _assemble_from_unit_mappings
raise ValueError(
ValueError: to assemble mappings requires at least that [year, month, day] be 
specified: [day,month,year] is missing
>>>
{code}
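
A possible workaround (a sketch only; it relies on pandas-on-Spark DataFrame.rename with a columns mapping, as in pandas) is to rename the columns to the expected names before assembling the date:

{code:python}
import pyspark.pandas as ps

df_test = ps.DataFrame({'testyear': [2015, 2016],
                        'testmonth': [2, 3],
                        'testday': [4, 5]})

# Rename to the names to_datetime knows how to assemble, then build the date.
renamed = df_test.rename(columns={'testyear': 'year',
                                  'testmonth': 'month',
                                  'testday': 'day'})
renamed['date'] = ps.to_datetime(renamed[['year', 'month', 'day']])
{code}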


> Can't create datetime object from anything other than year column PySpark - 
> koalas
> --
>
> Key: SPARK-36728
> URL: https://issues.apache.org/jira/browse/SPARK-36728
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Bjørn Jørgensen
>Priority: Major
> Attachments: pyspark_date.txt
>
>
> If I create a datetime object, it must be from columns with the standard names ('year', 'month', 'day', ...).
>
> df = ps.DataFrame({'year': [2015, 2016],
>                    'month': [2, 3],
>                    'day': [4, 5],
>                    'hour': [2, 3],
>                    'minute': [10, 30],
>                    'second': [21, 25]})
> df.info()
>
> Int64Index: 2 entries, 1 to 0
> Data columns (total 6 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
> dtypes: int64(6)
>
> df['date'] = ps.to_datetime(df[['year', 'month', 'day']])
> df.info()
>
> Int64Index: 2 entries, 1 to 0
> Data columns (total 7 columns):
>  #   Column  Non-Null Count  Dtype
> ---  ------  --------------  -----
>  0   year    2 non-null      int64
>  1   month   2 non-null      int64
>  2   day     2 non-null      int64
>  3   hour    2 non-null      int64
>  4   minute  2 non-null      int64
>  5   second  2 non-null      int64
>  6   date    2 non-null      datetime64
> dtypes: datetime64(1), int64(6)
>
> df_test = ps.DataFrame({'testyear': [2015, 2016],
>                         'testmonth': [2, 3],
>                         'testday': [4, 5],
>                         'hour': [2, 3],
>                         'minute': [10, 30],
>                         'second': [21, 25]})
> df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
>
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_73/904491906.py in <module>
> ----> 1 df_test['date'] = ps.to_datetime(df[['testyear', 'testmonth', 'testday']])
>
> /opt/spark/python/pyspark/pandas/frame.py in __getitem__(self, key)
>   11853             return self.loc[:, key]
>   11854         elif is_list_like(key):
> -> 11855             return self.loc[:, list(key)]
>   11856         raise NotImplementedError(key)
>   11857
>
> /opt/spark/python/pyspark/pandas/indexing.py in __getitem__(self, key)
>     476                 returns_series,
>     477                 series_name,
> --> 478             ) = self._select_cols(cols_sel)
>     479
>     480             if cond is None and limit is None and returns_series:
>
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols(self, cols_sel, missing_keys)
>     322             return self._select_cols_else(cols_sel, missing_keys)
>     323         elif is_list_like(cols_sel):
> --> 324             return self._select_cols_by_iterable(cols_sel, missing_keys)
>     325         else:
>     326             return self._select_cols_else(cols_sel, missing_keys)
>
> /opt/spark/python/pyspark/pandas/indexing.py in _select_cols_by_iterable(self, cols_sel, missing_keys)
>    1352                 if not found:
>    1353                     if missing_keys is None:
> -> 1354                         raise KeyError("['{}'] not in index".format(name_like_string(key)))
>    1355                     else:
>    1356                         missing_keys.append(key)

[jira] [Resolved] (SPARK-36636) SparkContextSuite random failure in Scala 2.13

2021-09-12 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-36636.
--
Fix Version/s: 3.0.4
   3.1.3
   3.2.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/33963

> SparkContextSuite random failure in Scala 2.13
> --
>
> Key: SPARK-36636
> URL: https://issues.apache.org/jira/browse/SPARK-36636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.2.0, 3.1.3, 3.0.4
>
> Attachments: image-2021-09-11-00-29-21-168.png, 
> image-2021-09-11-00-30-20-237.png, image-2021-09-11-00-39-43-752.png
>
>
> run
> {code:java}
> build/mvn clean install -Pscala-2.13 -pl core -am{code}
> or
> {code:java}
> build/mvn clean install -Pscala-2.13 -pl core -am -Dtest=none 
> -DwildcardSuites=org.apache.spark.SparkContextSuite
> {code}
> Some cases may fail as follows:
>  
> {code:java}
> - SPARK-33084: Add jar support Ivy URI -- test param key case sensitive *** 
> FAILED ***
>   java.lang.IllegalStateException: Cannot call methods on a stopped 
> SparkContext.
> This stopped SparkContext was created at:
> org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1155)
> org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> org.scalatest.Transformer.apply(Transformer.scala:22)
> org.scalatest.Transformer.apply(Transformer.scala:20)
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> scala.collection.immutable.List.foreach(List.scala:333)
> The currently active SparkContext was created at:
> org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1155)
> org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
> org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
> org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> org.scalatest.Transformer.apply(Transformer.scala:22)
> org.scalatest.Transformer.apply(Transformer.scala:20)
> org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
> org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
> org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
> org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
> org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
> org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
> org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
> org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
> org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
> org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
> org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
> scala.collection.immutable.List.foreach(List.scala:333)
>   at org.apache.spark.SparkContext.assertNotStopped(SparkContext.scala:118)
>   at org.apache.spark.SparkContext.getSchedulingMode(SparkContext.scala:1887)
>   at 
> org.apache.spark.SparkContext.postEnvironmentUpdate(SparkContext.scala:2575)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:2008)
>   at org.apache.spark.SparkContext.addJar(SparkContext.scala:1928)
>   at 
> org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1156)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcome

[jira] [Updated] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68

2021-09-12 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36729:
---
Priority: Minor  (was: Major)

> Upgrade Netty from 4.1.63 to 4.1.68
> ---
>
> Key: SPARK-36729
> URL: https://issues.apache.org/jira/browse/SPARK-36729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Recently Netty 4.1.68 was released, which includes official M1 Mac support.
> https://github.com/netty/netty/pull/11666
> 4.1.65 also includes a critical fix for a bug that might affect Spark.
> https://github.com/netty/netty/issues/11209



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36720) On overwrite mode, setting option truncate as true doesn't truncate the table

2021-09-12 Thread Balaji Balasubramaniam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413734#comment-17413734
 ] 

Balaji Balasubramaniam commented on SPARK-36720:


Because even though the mode is set to overwrite and the truncate option is set to 
true, I would expect it to truncate the table and not drop it. 
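
(For reference, the {{truncate}} option is only honored together with overwrite mode; a sketch of that combination, reusing the variable names from the description below and leaving aside whether the HANA user is allowed to TRUNCATE:)

{code:python}
df_lake.write \
    .format("jdbc") \
    .option("url", edw_jdbc_url) \
    .option("driver", "com.sap.db.jdbc.Driver") \
    .option("dbtable", edw_jdbc_db_table) \
    .option("user", edw_jdbc_userid) \
    .option("password", edw_jdbc_password) \
    .option("truncate", "true") \
    .mode("overwrite") \
    .save()
{code}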


Sent from Yahoo Mail for iPhone


On Saturday, September 11, 2021, 7:00 PM, Hyukjin Kwon (Jira)  
wrote:


    [ 
https://issues.apache.org/jira/browse/SPARK-36720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413639#comment-17413639
 ] 

Hyukjin Kwon commented on SPARK-36720:
--

The error is from:

{quote}
com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [258]: 
insufficient privilege: Detailed info for this error can be found with guid 
''
{quote}

Mind elaborating on why it is an issue in PySpark or Apache Spark?




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


> On overwrite mode, setting option truncate as true doesn't truncate the table
> -
>
> Key: SPARK-36720
> URL: https://issues.apache.org/jira/browse/SPARK-36720
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: Balaji Balasubramaniam
>Priority: Major
>
> I'm using PySpark from AWS Glue job to write it to SAP HANA using jdbc. Our 
> requirement is to truncate and load data in HANA.
> I've tried both of these options, and in both cases, based on the stack trace, 
> it tries to drop the table, which is not allowed by security design.
> #df_lake.write.format("jdbc").option("url", edw_jdbc_url).option("driver", 
> "com.sap.db.jdbc.Driver").option("dbtable", edw_jdbc_db_table).option("user", 
> edw_jdbc_userid).option("password", edw_jdbc_password).option("truncate", 
> "true").mode("append").save()
>  properties=\{"user": edw_jdbc_userid, "password": edw_jdbc_password, 
> "truncate":"true"}
> df_lake.write.jdbc(url=edw_jdbc_url, table=edw_jdbc_db_table, 
> mode='overwrite', properties=properties)
>  
> I've verified that the schema matches. I did the jdbc read and print out the 
> schema as well as printing the schema from the source table.
> Schema from HANA:
> root
>  |-- RTL_ACCT_ID: long (nullable = true)
>  |-- FINE_DINING_PROPOSED: string (nullable = true)
>  |-- FINE_WINE_PROPOSED: string (nullable = true)
>  |-- FINE_WINE_INF_PROPOSED: string (nullable = true)
>  |-- GOLD_SILVER_PROPOSED: string (nullable = true)
>  |-- PREMIUM_PROPOSED: string (nullable = true)
>  |-- GSP_PROPOSED: string (nullable = true)
>  |-- PROPOSED_CRAFT: string (nullable = true)
>  |-- FW_REASON: string (nullable = true)
>  |-- FWI_REASON: string (nullable = true)
>  |-- GS_REASON: string (nullable = true)
>  |-- PREM_REASON: string (nullable = true)
>  |-- FD_REASON: string (nullable = true)
>  |-- CRAFT_REASON: string (nullable = true)
>  |-- GSP_FLAG: string (nullable = true)
>  |-- GSP_REASON: string (nullable = true)
>  |-- ELIGIBILITY: string (nullable = true)
>  |-- DW_LD_S: timestamp (nullable = true)
> Schema from the source table: 
> root
>  |-- RTL_ACCT_ID: long (nullable = true)
>  |-- FINE_DINING_PROPOSED: string (nullable = true)
>  |-- FINE_WINE_PROPOSED: string (nullable = true)
>  |-- FINE_WINE_INF_PROPOSED: string (nullable = true)
>  |-- GOLD_SILVER_PROPOSED: string (nullable = true)
>  |-- PREMIUM_PROPOSED: string (nullable = true)
>  |-- GSP_PROPOSED: string (nullable = true)
>  |-- PROPOSED_CRAFT: string (nullable = true)
>  |-- FW_REASON: string (nullable = true)
>  |-- FWI_REASON: string (nullable = true)
>  |-- GS_REASON: string (nullable = true)
>  |-- PREM_REASON: string (nullable = true)
>  |-- FD_REASON: string (nullable = true)
>  |-- CRAFT_REASON: string (nullable = true)
>  |-- GSP_FLAG: string (nullable = true)
>  |-- GSP_REASON: string (nullable = true)
>  |-- ELIGIBILITY: string (nullable = true)
>  |-- DW_LD_S: timestamp (nullable = true)
> This is the stack trace
> py4j.protocol.Py4JJavaError: An error occurred while calling o169.jdbc.
> : com.sap.db.jdbc.exceptions.JDBCDriverException: SAP DBTech JDBC: [258]: 
> insufficient privilege: Detailed info for this error can be found with guid 
> ''
>   at 
> com.sap.db.jdbc.exceptions.SQLExceptionSapDB._newInstance(SQLExceptionSapDB.java:191)
>   at 
> com.sap.db.jdbc.exceptions.SQLExceptionSapDB.newInstance(SQLExceptionSapDB.java:42)
>   at 
> com.sap.db.jdbc.packet.HReplyPacket._buildExceptionChain(HReplyPacket.java:976)
>   at 
> com.sap.db.jdbc.packet.HReplyPacket.getSQLExceptionChain(HReplyPacket.java:157)
>   at 
> com.sap.db.jdbc.packet.HPartInfo.getSQLExceptionChain(HPartInfo.java:39)
>   at com.sap.db.jdbc.Co

[jira] [Resolved] (SPARK-36729) Upgrade Netty from 4.1.63 to 4.1.68

2021-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36729.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 33970
[https://github.com/apache/spark/pull/33970]

> Upgrade Netty from 4.1.63 to 4.1.68
> ---
>
> Key: SPARK-36729
> URL: https://issues.apache.org/jira/browse/SPARK-36729
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> Recently Netty 4.1.68 was released, which includes official M1 Mac support.
> https://github.com/netty/netty/pull/11666
> 4.1.65 also includes a critical fix for a bug that might affect Spark.
> https://github.com/netty/netty/issues/11209



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36726) Upgrade Parquet to 1.12.1

2021-09-12 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-36726:
-
Priority: Blocker  (was: Major)

> Upgrade Parquet to 1.12.1
> -
>
> Key: SPARK-36726
> URL: https://issues.apache.org/jira/browse/SPARK-36726
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Blocker
>
> Upgrade Apache Parquet to 1.12.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-12 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-36705:
--
Target Version/s: 3.2.0

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Priority: Blocker
>
> Push based shuffle is not compatible with io encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790
> We have to disable push based shuffle if either of these two are true.
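
For context, the two conditions map to existing configuration keys; a minimal PySpark sketch (the config names are the standard ones, but the forced-disable behaviour is only what this ticket proposes):

{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Client-side switch for push-based shuffle.
         .config("spark.shuffle.push.enabled", "true")
         # With IO encryption on (or a non-relocatable serializer), the proposal is
         # that push-based shuffle gets disabled even if the flag above is set.
         .config("spark.io.encryption.enabled", "true")
         .getOrCreate())
{code}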



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36730) Use V2 Filter in V2 file source

2021-09-12 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-36730:
--

 Summary: Use V2 Filter in V2 file source
 Key: SPARK-36730
 URL: https://issues.apache.org/jira/browse/SPARK-36730
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Huaxin Gao


Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36731) BrotliCodec doesn't support Apple Silicon on MacOS

2021-09-12 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-36731:
-

 Summary: BrotliCodec doesn't support Apple Silicon on MacOS
 Key: SPARK-36731
 URL: https://issues.apache.org/jira/browse/SPARK-36731
 Project: Spark
  Issue Type: Sub-task
  Components: Build, SQL
Affects Versions: 3.3.0
Reporter: Dongjoon Hyun


{code}
[info] Caused by: java.lang.UnsatisfiedLinkError: Couldn't load native library 
'brotli'. [LoaderResult: os.name="Mac OS X", os.arch="aarch64", 
os.version="11.5.2", java.vm.name="OpenJDK 64-Bit Server VM", 
java.vm.version="25.302-b08", java.vm.vendor="Azul Systems, Inc.", 
alreadyLoaded="null", loadedFromSystemLibraryPath="false", 
nativeLibName="libbrotli.dylib", 
temporaryLibFile="/Users/dongjoon/APACHE/spark-merge/target/tmp/brotli8243220902047076449/libbrotli.dylib",
 libNameWithinClasspath="/lib/darwin-aarch64/libbrotli.dylib", 
usedThisClassloader="false", usedSystemClassloader="false", 
java.library.path="/Users/dongjoon/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:."]
[info]  at 
org.meteogroup.jbrotli.libloader.BrotliLibraryLoader.loadBrotli(BrotliLibraryLoader.java:35)
[info]  at org.apache.hadoop.io.compress.BrotliCodec.<init>(BrotliCodec.java:40)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36731) BrotliCodec doesn't support Apple Silicon on MacOS

2021-09-12 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413859#comment-17413859
 ] 

Dongjoon Hyun commented on SPARK-36731:
---

Thanks to SPARK-36670, we identified this issue. Thank you, [~viirya]

> BrotliCodec doesn't support Apple Silicon on MacOS
> --
>
> Key: SPARK-36731
> URL: https://issues.apache.org/jira/browse/SPARK-36731
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> {code}
> [info] Caused by: java.lang.UnsatisfiedLinkError: Couldn't load native 
> library 'brotli'. [LoaderResult: os.name="Mac OS X", os.arch="aarch64", 
> os.version="11.5.2", java.vm.name="OpenJDK 64-Bit Server VM", 
> java.vm.version="25.302-b08", java.vm.vendor="Azul Systems, Inc.", 
> alreadyLoaded="null", loadedFromSystemLibraryPath="false", 
> nativeLibName="libbrotli.dylib", 
> temporaryLibFile="/Users/dongjoon/APACHE/spark-merge/target/tmp/brotli8243220902047076449/libbrotli.dylib",
>  libNameWithinClasspath="/lib/darwin-aarch64/libbrotli.dylib", 
> usedThisClassloader="false", usedSystemClassloader="false", 
> java.library.path="/Users/dongjoon/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:."]
> [info]at 
> org.meteogroup.jbrotli.libloader.BrotliLibraryLoader.loadBrotli(BrotliLibraryLoader.java:35)
> [info]at 
> org.apache.hadoop.io.compress.BrotliCodec.<init>(BrotliCodec.java:40)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36670) Add end-to-end codec test cases for main datasources

2021-09-12 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-36670:
---

Assignee: L. C. Hsieh  (was: Apache Spark)

> Add end-to-end codec test cases for main datasources
> 
>
> Key: SPARK-36670
> URL: https://issues.apache.org/jira/browse/SPARK-36670
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We found there are no e2e test cases available for main datasources like 
> Parquet and ORC. This makes it harder for developers to identify possible bugs 
> early. We should add such tests in Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36700) BlockManager re-registration is broken due to deferred removal of BlockManager

2021-09-12 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413865#comment-17413865
 ] 

wuyi commented on SPARK-36700:
--

Reverted by [https://github.com/apache/spark/pull/33942] and backported to 3.2, 
3.1, 3.0.

> BlockManager re-registration is broken due to deferred removal of 
> BlockManager 
> ---
>
> Key: SPARK-36700
> URL: https://issues.apache.org/jira/browse/SPARK-36700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: wuyi
>Priority: Blocker
>
> Due to the deferred removal of BlockManager (introduced in SPARK-35011), an 
> expected BlockManager re-registration could be refused as the inactive 
> BlockManager still exists in the map `blockManagerInfo`:
> https://github.com/apache/spark/blob/9cefde8db373a3433b7e3ce328e4a2ce83b1aca2/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L551



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36700) BlockManager re-registration is broken due to deferred removal of BlockManager

2021-09-12 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-36700.
--
Fix Version/s: 3.3.0
   3.0.4
   3.1.3
   3.2.0
 Assignee: wuyi
   Resolution: Fixed

> BlockManager re-registration is broken due to deferred removal of 
> BlockManager 
> ---
>
> Key: SPARK-36700
> URL: https://issues.apache.org/jira/browse/SPARK-36700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Blocker
> Fix For: 3.2.0, 3.1.3, 3.0.4, 3.3.0
>
>
> Due to the deferred removal of BlockManager (introduced in SPARK-35011), an 
> expected BlockManager re-registration could be refused as the inactive 
> BlockManager still exists in the map `blockManagerInfo`:
> https://github.com/apache/spark/blob/9cefde8db373a3433b7e3ce328e4a2ce83b1aca2/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala#L551



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36732) Upgrade ORC to 1.6.11

2021-09-12 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-36732:
-

 Summary: Upgrade ORC to 1.6.11
 Key: SPARK-36732
 URL: https://issues.apache.org/jira/browse/SPARK-36732
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36732) Upgrade ORC to 1.6.11

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36732:


Assignee: Apache Spark

> Upgrade ORC to 1.6.11
> -
>
> Key: SPARK-36732
> URL: https://issues.apache.org/jira/browse/SPARK-36732
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36732) Upgrade ORC to 1.6.11

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413874#comment-17413874
 ] 

Apache Spark commented on SPARK-36732:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33971

> Upgrade ORC to 1.6.11
> -
>
> Key: SPARK-36732
> URL: https://issues.apache.org/jira/browse/SPARK-36732
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36732) Upgrade ORC to 1.6.11

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36732:


Assignee: (was: Apache Spark)

> Upgrade ORC to 1.6.11
> -
>
> Key: SPARK-36732
> URL: https://issues.apache.org/jira/browse/SPARK-36732
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36733) Perf issue in SchemaPruning when a struct has a million fields

2021-09-12 Thread Kohki Nishio (Jira)
Kohki Nishio created SPARK-36733:


 Summary: Perf issue in SchemaPruning when a struct has a million fields
 Key: SPARK-36733
 URL: https://issues.apache.org/jira/browse/SPARK-36733
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: Kohki Nishio


Seeing a significant performance degradation in query processing when a table 
contains a very large number of fields (>10K).

Here are the stack traces observed while processing a query:
{code:java}
   java.lang.Thread.State: RUNNABLE
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
	at scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.map(TraversableLike.scala:285)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
	at org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108)
	at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70)
	at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70)
	at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown Source)
	at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303)
	at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown Source)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296)
	at scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:394)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:394)
	at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198)
	at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70)
	at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75)
	at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown Source)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36733) Perf issue in SchemaPruning when a struct has million fields

2021-09-12 Thread Kohki Nishio (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413882#comment-17413882
 ] 

Kohki Nishio commented on SPARK-36733:
--

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69]

Often (at least as far as I have observed), the left struct and the right struct are the 
same one, and every call to {{StructType.fieldNames}} runs {{fields.map(_.name)}}.

This computation is quite expensive for 10K fields:

{{val filteredRightFieldNames = rightStruct.fieldNames}}
{{  .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))}}
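
As an illustration only (a minimal sketch of the kind of change suggested above, not the actual Spark patch), hoisting the {{fieldNames}} computation out of the filter makes it run once per struct instead of once per right-side field:

{code:scala}
import org.apache.spark.sql.types.StructType

// Sketch: compute the left-side field names once, O(#fields), instead of
// recomputing them via StructType.fieldNames inside every filter callback.
def filterRightByLeft(
    leftStruct: StructType,
    rightStruct: StructType,
    resolver: (String, String) => Boolean): Array[String] = {
  val leftFieldNames = leftStruct.fieldNames // fields.map(_.name), done once
  rightStruct.fieldNames.filter(name => leftFieldNames.exists(resolver(_, name)))
}
{code}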

 

> Perf issue in SchemaPruning when a struct has million fields
> 
>
> Key: SPARK-36733
> URL: https://issues.apache.org/jira/browse/SPARK-36733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
>
> Seeing a significant performance degradation in query processing when a table 
> contains a significantly large number of fields (>10K).
> Here's the stacktraces while processing a query
> {code:java}
>    java.lang.Thread.State: RUNNABLE   java.lang.Thread.State: RUNNABLE at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) at 
> scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source) 
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) 
> at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:285) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:278) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108) at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown
>  Source) at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303)
>  at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown 
> Source) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302) at 
> scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296) at 
> scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.filter(TraversableLike.scala:394) at 
> scala.collection.TraversableLike.filter$(TraversableLike.scala:394) at 
> scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198) at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown
>  Source) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-36733) Perf issue in SchemaPruning when a struct has million fields

2021-09-12 Thread Kohki Nishio (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413882#comment-17413882
 ] 

Kohki Nishio edited comment on SPARK-36733 at 9/13/21, 3:23 AM:


[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69]

Often (at least as far as I have observed), the left struct and the right struct are the 
same one, and every call to {{StructType.fieldNames}} runs {{fields.map(_.name)}}.

This computation is quite expensive for 10K fields:

{{val filteredRightFieldNames = rightStruct.fieldNames}}
{{  .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))}}

 


was (Author: taroplus):
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69]

often time (as long as I observed), left struct and the right struct are the 
same one. And every call to {{StructType.fieldNames}}  runs\{{ 
fields.map(_.name). }}

this computation is quite expensive for 10K fields.

{{ val filteredRightFieldNames = rightStruct.fieldNames}}
{{    .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))}}{{ }}

 

> Perf issue in SchemaPruning when a struct has million fields
> 
>
> Key: SPARK-36733
> URL: https://issues.apache.org/jira/browse/SPARK-36733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
>
> Seeing a significant performance degradation in query processing when a table 
> contains a significantly large number of fields (>10K).
> Here's the stacktraces while processing a query
> {code:java}
>    java.lang.Thread.State: RUNNABLE   java.lang.Thread.State: RUNNABLE at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) at 
> scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source) 
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) 
> at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:285) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:278) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108) at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown
>  Source) at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303)
>  at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown 
> Source) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302) at 
> scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296) at 
> scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.filter(TraversableLike.scala:394) at 
> scala.collection.TraversableLike.filter$(TraversableLike.scala:394) at 
> scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198) at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown
>  Source) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36733) Perf issue in SchemaPruning when a struct has many fields

2021-09-12 Thread Kohki Nishio (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kohki Nishio updated SPARK-36733:
-
Summary: Perf issue in SchemaPruning when a struct has many fields  (was: 
Perf issue in SchemaPruning when a struct has million fields)

> Perf issue in SchemaPruning when a struct has many fields
> -
>
> Key: SPARK-36733
> URL: https://issues.apache.org/jira/browse/SPARK-36733
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Kohki Nishio
>Priority: Major
>
> Seeing a significant performance degradation in query processing when a table 
> contains a significantly large number of fields (>10K).
> Here's the stacktraces while processing a query
> {code:java}
>    java.lang.Thread.State: RUNNABLE   java.lang.Thread.State: RUNNABLE at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285) at 
> scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source) 
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) 
> at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:285) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:278) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) at 
> org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108) at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown
>  Source) at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303)
>  at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown 
> Source) at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302) at 
> scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296) at 
> scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198) at 
> scala.collection.TraversableLike.filter(TraversableLike.scala:394) at 
> scala.collection.TraversableLike.filter$(TraversableLike.scala:394) at 
> scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198) at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75)
>  at 
> org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown
>  Source) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36734) Upgrade ORC to 1.5.13

2021-09-12 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-36734:
-

 Summary: Upgrade ORC to 1.5.13
 Key: SPARK-36734
 URL: https://issues.apache.org/jira/browse/SPARK-36734
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 3.1.2
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34183) DataSource V2: Support required distribution and ordering in SS

2021-09-12 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413895#comment-17413895
 ] 

Anton Okolnychyi commented on SPARK-34183:
--

Sorry for the delay, [~hyukjin.kwon] [~Gengliang.Wang]. I am planning to update 
the PR this week. I am not sure this should be a real blocker, however. I was 
asked to create this issue during the review.

> DataSource V2: Support required distribution and ordering in SS
> ---
>
> Key: SPARK-34183
> URL: https://issues.apache.org/jira/browse/SPARK-34183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We need to support a required distribution and ordering for SS. See the 
> discussion 
> [here|https://github.com/apache/spark/pull/31083#issuecomment-763214597].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36734) Upgrade ORC to 1.5.13

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36734:


Assignee: (was: Apache Spark)

> Upgrade ORC to 1.5.13
> -
>
> Key: SPARK-36734
> URL: https://issues.apache.org/jira/browse/SPARK-36734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.1.2
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36734) Upgrade ORC to 1.5.13

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36734:


Assignee: Apache Spark

> Upgrade ORC to 1.5.13
> -
>
> Key: SPARK-36734
> URL: https://issues.apache.org/jira/browse/SPARK-36734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.1.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36734) Upgrade ORC to 1.5.13

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36734:


Assignee: Apache Spark

> Upgrade ORC to 1.5.13
> -
>
> Key: SPARK-36734
> URL: https://issues.apache.org/jira/browse/SPARK-36734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.1.2
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36734) Upgrade ORC to 1.5.13

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413896#comment-17413896
 ] 

Apache Spark commented on SPARK-36734:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/33972

> Upgrade ORC to 1.5.13
> -
>
> Key: SPARK-36734
> URL: https://issues.apache.org/jira/browse/SPARK-36734
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.1.2
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34183) DataSource V2: Support required distribution and ordering in SS

2021-09-12 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413897#comment-17413897
 ] 

Jungtaek Lim commented on SPARK-34183:
--

Sorry, I totally missed this issue. My bad.

I asked to file this as a blocker because the behavior of a streaming query is uncertain 
when a data source requires distribution and ordering, and I wanted to see it addressed 
before releasing.

For example, in the worst case, data sources requiring distribution/ordering may no 
longer work with SS at all if the Spark implementation requires a "sort" across 
micro-batches. Would a data source be able to (and be expected to) indicate the type of 
the query (batch vs. streaming) and optionally provide its distribution/ordering 
requirements? I'm not sure. Even if it is technically possible, we should guide data 
source implementers to do so. Otherwise it's going to be another surprise in Spark.
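
For reference, a rough sketch of how a DSv2 write declares these requirements for batch writes today (interface and method names as I understand the 3.2 connector API; treat the details as approximate):

{code:scala}
import org.apache.spark.sql.connector.distributions.{Distribution, Distributions}
import org.apache.spark.sql.connector.expressions.{Expression, Expressions, SortOrder}
import org.apache.spark.sql.connector.write.{RequiresDistributionAndOrdering, Write}

// Sketch: a write that asks Spark to cluster incoming rows by the "key" column.
// The open question for this issue is what Spark should do with such a request
// when the write runs as part of a streaming (micro-batch) query.
class ClusteredWrite extends Write with RequiresDistributionAndOrdering {
  override def requiredDistribution(): Distribution =
    Distributions.clustered(Array[Expression](Expressions.column("key")))

  // no within-partition ordering requested in this sketch
  override def requiredOrdering(): Array[SortOrder] = Array.empty
}
{code}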

> DataSource V2: Support required distribution and ordering in SS
> ---
>
> Key: SPARK-34183
> URL: https://issues.apache.org/jira/browse/SPARK-34183
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Anton Okolnychyi
>Priority: Blocker
>
> We need to support a required distribution and ordering for SS. See the 
> discussion 
> [here|https://github.com/apache/spark/pull/31083#issuecomment-763214597].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36735) Adjust overhead of cached relation for DPP

2021-09-12 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-36735:
---

 Summary: Adjust overhead of cached relation for DPP
 Key: SPARK-36735
 URL: https://issues.apache.org/jira/browse/SPARK-36735
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: L. C. Hsieh


Currently we decide whether pruning with DPP is beneficial by simply summing up the 
sizes of all scan relations as the overhead. However, for cached relations the overhead 
should be different from that of a non-cached relation. This proposes using an adjusted 
overhead for cached relations with DPP.
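
As a purely illustrative sketch of the idea (the helper name and weighting are assumptions, not the actual patch), a cached relation could be charged only a fraction of its scan size when estimating the pruning overhead:

{code:scala}
// Illustration only: charge cached relations less than plain scans when deciding
// whether DPP pruning is worth it, since cached data is not re-read from the source.
def pruningOverhead(
    scanSizeInBytes: BigInt,
    isCachedRelation: Boolean,
    cachedWeight: Double = 0.2): BigInt = { // 0.2 is an arbitrary example weight
  if (isCachedRelation) (BigDecimal(scanSizeInBytes) * BigDecimal(cachedWeight)).toBigInt
  else scanSizeInBytes
}
{code}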



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36716) Join estimation support LeftExistence join type

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36716:


Assignee: Apache Spark

> Join estimation support LeftExistence join type
> ---
>
> Key: SPARK-36716
> URL: https://issues.apache.org/jira/browse/SPARK-36716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
> Attachments: image-2021-09-10-17-17-33-082.png
>
>
> Join estimation support LeftExistence join type. This can benefit tpcds q10.
> Before:
> {noformat}
> TakeOrderedAndProject (51)
> +- * HashAggregate (50)
>+- Exchange (49)
>   +- * HashAggregate (48)
>  +- * Project (47)
> +- * SortMergeJoin Inner (46)
>:- * Sort (40)
>:  +- Exchange (39)
>: +- * Project (38)
>:+- * BroadcastHashJoin Inner BuildRight (37)
>:   :- * Project (31)
>:   :  +- * Filter (30)
>:   : +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:   ::- SortMergeJoin ExistenceJoin(exists#2) 
> (21)
>:   ::  :- * SortMergeJoin LeftSemi (13)
>:   ::  :  :- * Sort (5)
>:   ::  :  :  +- Exchange (4)
>:   ::  :  : +- * Filter (3)
>:   ::  :  :+- * ColumnarToRow (2)
>:   ::  :  :   +- Scan parquet 
> default.customer (1)
>:   ::  :  +- * Sort (12)
>:   ::  : +- Exchange (11)
>:   ::  :+- * Project (10)
>:   ::  :   +- * BroadcastHashJoin 
> Inner BuildRight (9)
>:   ::  :  :- * ColumnarToRow (7)
>:   ::  :  :  +- Scan parquet 
> default.store_sales (6)
>:   ::  :  +- ReusedExchange (8)
>:   ::  +- * Sort (20)
>:   :: +- Exchange (19)
>:   ::+- * Project (18)
>:   ::   +- * BroadcastHashJoin Inner 
> BuildRight (17)
>:   ::  :- * ColumnarToRow (15)
>:   ::  :  +- Scan parquet 
> default.web_sales (14)
>:   ::  +- ReusedExchange (16)
>:   :+- * Sort (28)
>:   :   +- Exchange (27)
>:   :  +- * Project (26)
>:   : +- * BroadcastHashJoin Inner 
> BuildRight (25)
>:   ::- * ColumnarToRow (23)
>:   ::  +- Scan parquet 
> default.catalog_sales (22)
>:   :+- ReusedExchange (24)
>:   +- BroadcastExchange (36)
>:  +- * Project (35)
>: +- * Filter (34)
>:+- * ColumnarToRow (33)
>:   +- Scan parquet 
> default.customer_address (32)
>+- * Sort (45)
>   +- Exchange (44)
>  +- * Filter (43)
> +- * ColumnarToRow (42)
>+- Scan parquet default.customer_demographics (41)
> {noformat}
> After:
> {noformat}
> TakeOrderedAndProject (48)
> +- * HashAggregate (47)
>+- Exchange (46)
>   +- * HashAggregate (45)
>  +- * Project (44)
> +- * BroadcastHashJoin Inner BuildLeft (43)
>:- BroadcastExchange (39)
>:  +- * Project (38)
>: +- * BroadcastHashJoin Inner BuildRight (37)
>::- * Project (31)
>::  +- * Filter (30)
>:: +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:::- SortMergeJoin ExistenceJoin(exists#2) (21)
>:::  :- * SortMergeJoin LeftSemi (13)
>:::  :  :- * Sort (5)
>:::  :  :  +- Exchange (4)
>:::  :  : +- * Filter (3)
>:::  :  :+- * ColumnarToRow (2)
>: 

[jira] [Commented] (SPARK-36735) Adjust overhead of cached relation for DPP

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413905#comment-17413905
 ] 

Apache Spark commented on SPARK-36735:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33975

> Adjust overhead of cached relation for DPP
> --
>
> Key: SPARK-36735
> URL: https://issues.apache.org/jira/browse/SPARK-36735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently we calculate if there is benefit of pruning with DPP by simply 
> summing up the size of all scan relations as the overhead. However, for 
> cached relations, the overhead should be different than a non-cached 
> relation. This proposes to use adjusted overhead for cached relation with DPP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36730) Use V2 Filter in V2 file source

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413904#comment-17413904
 ] 

Apache Spark commented on SPARK-36730:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/33973

> Use V2 Filter in V2 file source
> ---
>
> Key: SPARK-36730
> URL: https://issues.apache.org/jira/browse/SPARK-36730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36716) Join estimation support LeftExistence join type

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413903#comment-17413903
 ] 

Apache Spark commented on SPARK-36716:
--

User '007akuan' has created a pull request for this issue:
https://github.com/apache/spark/pull/33974

> Join estimation support LeftExistence join type
> ---
>
> Key: SPARK-36716
> URL: https://issues.apache.org/jira/browse/SPARK-36716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2021-09-10-17-17-33-082.png
>
>
> Join estimation support LeftExistence join type. This can benefit tpcds q10.
> Before:
> {noformat}
> TakeOrderedAndProject (51)
> +- * HashAggregate (50)
>+- Exchange (49)
>   +- * HashAggregate (48)
>  +- * Project (47)
> +- * SortMergeJoin Inner (46)
>:- * Sort (40)
>:  +- Exchange (39)
>: +- * Project (38)
>:+- * BroadcastHashJoin Inner BuildRight (37)
>:   :- * Project (31)
>:   :  +- * Filter (30)
>:   : +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:   ::- SortMergeJoin ExistenceJoin(exists#2) 
> (21)
>:   ::  :- * SortMergeJoin LeftSemi (13)
>:   ::  :  :- * Sort (5)
>:   ::  :  :  +- Exchange (4)
>:   ::  :  : +- * Filter (3)
>:   ::  :  :+- * ColumnarToRow (2)
>:   ::  :  :   +- Scan parquet 
> default.customer (1)
>:   ::  :  +- * Sort (12)
>:   ::  : +- Exchange (11)
>:   ::  :+- * Project (10)
>:   ::  :   +- * BroadcastHashJoin 
> Inner BuildRight (9)
>:   ::  :  :- * ColumnarToRow (7)
>:   ::  :  :  +- Scan parquet 
> default.store_sales (6)
>:   ::  :  +- ReusedExchange (8)
>:   ::  +- * Sort (20)
>:   :: +- Exchange (19)
>:   ::+- * Project (18)
>:   ::   +- * BroadcastHashJoin Inner 
> BuildRight (17)
>:   ::  :- * ColumnarToRow (15)
>:   ::  :  +- Scan parquet 
> default.web_sales (14)
>:   ::  +- ReusedExchange (16)
>:   :+- * Sort (28)
>:   :   +- Exchange (27)
>:   :  +- * Project (26)
>:   : +- * BroadcastHashJoin Inner 
> BuildRight (25)
>:   ::- * ColumnarToRow (23)
>:   ::  +- Scan parquet 
> default.catalog_sales (22)
>:   :+- ReusedExchange (24)
>:   +- BroadcastExchange (36)
>:  +- * Project (35)
>: +- * Filter (34)
>:+- * ColumnarToRow (33)
>:   +- Scan parquet 
> default.customer_address (32)
>+- * Sort (45)
>   +- Exchange (44)
>  +- * Filter (43)
> +- * ColumnarToRow (42)
>+- Scan parquet default.customer_demographics (41)
> {noformat}
> After:
> {noformat}
> TakeOrderedAndProject (48)
> +- * HashAggregate (47)
>+- Exchange (46)
>   +- * HashAggregate (45)
>  +- * Project (44)
> +- * BroadcastHashJoin Inner BuildLeft (43)
>:- BroadcastExchange (39)
>:  +- * Project (38)
>: +- * BroadcastHashJoin Inner BuildRight (37)
>::- * Project (31)
>::  +- * Filter (30)
>:: +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:::- SortMergeJoin ExistenceJoin(exists#2) (21)
>:::  :- * SortMergeJoin LeftSemi (13)
>:::  :  :- * Sort (5)
>:::  :  :  +- Exchange (4)
>:::  :  : +- * Filter (3)
> 

[jira] [Assigned] (SPARK-36730) Use V2 Filter in V2 file source

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36730:


Assignee: (was: Apache Spark)

> Use V2 Filter in V2 file source
> ---
>
> Key: SPARK-36730
> URL: https://issues.apache.org/jira/browse/SPARK-36730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36735) Adjust overhead of cached relation for DPP

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36735:


Assignee: (was: Apache Spark)

> Adjust overhead of cached relation for DPP
> --
>
> Key: SPARK-36735
> URL: https://issues.apache.org/jira/browse/SPARK-36735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently we calculate if there is benefit of pruning with DPP by simply 
> summing up the size of all scan relations as the overhead. However, for 
> cached relations, the overhead should be different than a non-cached 
> relation. This proposes to use adjusted overhead for cached relation with DPP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36730) Use V2 Filter in V2 file source

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36730:


Assignee: Apache Spark

> Use V2 Filter in V2 file source
> ---
>
> Key: SPARK-36730
> URL: https://issues.apache.org/jira/browse/SPARK-36730
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Use V2 Filters in V2 file source, e.g. FileScan, FileScanBuilder



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36716) Join estimation support LeftExistence join type

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36716:


Assignee: (was: Apache Spark)

> Join estimation support LeftExistence join type
> ---
>
> Key: SPARK-36716
> URL: https://issues.apache.org/jira/browse/SPARK-36716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: image-2021-09-10-17-17-33-082.png
>
>
> Join estimation support LeftExistence join type. This can benefit tpcds q10.
> Before:
> {noformat}
> TakeOrderedAndProject (51)
> +- * HashAggregate (50)
>+- Exchange (49)
>   +- * HashAggregate (48)
>  +- * Project (47)
> +- * SortMergeJoin Inner (46)
>:- * Sort (40)
>:  +- Exchange (39)
>: +- * Project (38)
>:+- * BroadcastHashJoin Inner BuildRight (37)
>:   :- * Project (31)
>:   :  +- * Filter (30)
>:   : +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:   ::- SortMergeJoin ExistenceJoin(exists#2) 
> (21)
>:   ::  :- * SortMergeJoin LeftSemi (13)
>:   ::  :  :- * Sort (5)
>:   ::  :  :  +- Exchange (4)
>:   ::  :  : +- * Filter (3)
>:   ::  :  :+- * ColumnarToRow (2)
>:   ::  :  :   +- Scan parquet 
> default.customer (1)
>:   ::  :  +- * Sort (12)
>:   ::  : +- Exchange (11)
>:   ::  :+- * Project (10)
>:   ::  :   +- * BroadcastHashJoin 
> Inner BuildRight (9)
>:   ::  :  :- * ColumnarToRow (7)
>:   ::  :  :  +- Scan parquet 
> default.store_sales (6)
>:   ::  :  +- ReusedExchange (8)
>:   ::  +- * Sort (20)
>:   :: +- Exchange (19)
>:   ::+- * Project (18)
>:   ::   +- * BroadcastHashJoin Inner 
> BuildRight (17)
>:   ::  :- * ColumnarToRow (15)
>:   ::  :  +- Scan parquet 
> default.web_sales (14)
>:   ::  +- ReusedExchange (16)
>:   :+- * Sort (28)
>:   :   +- Exchange (27)
>:   :  +- * Project (26)
>:   : +- * BroadcastHashJoin Inner 
> BuildRight (25)
>:   ::- * ColumnarToRow (23)
>:   ::  +- Scan parquet 
> default.catalog_sales (22)
>:   :+- ReusedExchange (24)
>:   +- BroadcastExchange (36)
>:  +- * Project (35)
>: +- * Filter (34)
>:+- * ColumnarToRow (33)
>:   +- Scan parquet 
> default.customer_address (32)
>+- * Sort (45)
>   +- Exchange (44)
>  +- * Filter (43)
> +- * ColumnarToRow (42)
>+- Scan parquet default.customer_demographics (41)
> {noformat}
> After:
> {noformat}
> TakeOrderedAndProject (48)
> +- * HashAggregate (47)
>+- Exchange (46)
>   +- * HashAggregate (45)
>  +- * Project (44)
> +- * BroadcastHashJoin Inner BuildLeft (43)
>:- BroadcastExchange (39)
>:  +- * Project (38)
>: +- * BroadcastHashJoin Inner BuildRight (37)
>::- * Project (31)
>::  +- * Filter (30)
>:: +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:::- SortMergeJoin ExistenceJoin(exists#2) (21)
>:::  :- * SortMergeJoin LeftSemi (13)
>:::  :  :- * Sort (5)
>:::  :  :  +- Exchange (4)
>:::  :  : +- * Filter (3)
>:::  :  :+- * ColumnarToRow (2)
>:::  :  :  

[jira] [Assigned] (SPARK-36735) Adjust overhead of cached relation for DPP

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36735:


Assignee: Apache Spark

> Adjust overhead of cached relation for DPP
> --
>
> Key: SPARK-36735
> URL: https://issues.apache.org/jira/browse/SPARK-36735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Currently we calculate if there is benefit of pruning with DPP by simply 
> summing up the size of all scan relations as the overhead. However, for 
> cached relations, the overhead should be different than a non-cached 
> relation. This proposes to use adjusted overhead for cached relation with DPP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36735) Adjust overhead of cached relation for DPP

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413906#comment-17413906
 ] 

Apache Spark commented on SPARK-36735:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/33975

> Adjust overhead of cached relation for DPP
> --
>
> Key: SPARK-36735
> URL: https://issues.apache.org/jira/browse/SPARK-36735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently we calculate if there is benefit of pruning with DPP by simply 
> summing up the size of all scan relations as the overhead. However, for 
> cached relations, the overhead should be different than a non-cached 
> relation. This proposes to use adjusted overhead for cached relation with DPP.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36716) Join estimation support LeftExistence join type

2021-09-12 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-36716:

Attachment: (was: image-2021-09-10-17-17-33-082.png)

> Join estimation support LeftExistence join type
> ---
>
> Key: SPARK-36716
> URL: https://issues.apache.org/jira/browse/SPARK-36716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yuming Wang
>Priority: Major
>
> Join estimation support LeftExistence join type. This can benefit tpcds q10.
> Before:
> {noformat}
> TakeOrderedAndProject (51)
> +- * HashAggregate (50)
>+- Exchange (49)
>   +- * HashAggregate (48)
>  +- * Project (47)
> +- * SortMergeJoin Inner (46)
>:- * Sort (40)
>:  +- Exchange (39)
>: +- * Project (38)
>:+- * BroadcastHashJoin Inner BuildRight (37)
>:   :- * Project (31)
>:   :  +- * Filter (30)
>:   : +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:   ::- SortMergeJoin ExistenceJoin(exists#2) 
> (21)
>:   ::  :- * SortMergeJoin LeftSemi (13)
>:   ::  :  :- * Sort (5)
>:   ::  :  :  +- Exchange (4)
>:   ::  :  : +- * Filter (3)
>:   ::  :  :+- * ColumnarToRow (2)
>:   ::  :  :   +- Scan parquet 
> default.customer (1)
>:   ::  :  +- * Sort (12)
>:   ::  : +- Exchange (11)
>:   ::  :+- * Project (10)
>:   ::  :   +- * BroadcastHashJoin 
> Inner BuildRight (9)
>:   ::  :  :- * ColumnarToRow (7)
>:   ::  :  :  +- Scan parquet 
> default.store_sales (6)
>:   ::  :  +- ReusedExchange (8)
>:   ::  +- * Sort (20)
>:   :: +- Exchange (19)
>:   ::+- * Project (18)
>:   ::   +- * BroadcastHashJoin Inner 
> BuildRight (17)
>:   ::  :- * ColumnarToRow (15)
>:   ::  :  +- Scan parquet 
> default.web_sales (14)
>:   ::  +- ReusedExchange (16)
>:   :+- * Sort (28)
>:   :   +- Exchange (27)
>:   :  +- * Project (26)
>:   : +- * BroadcastHashJoin Inner 
> BuildRight (25)
>:   ::- * ColumnarToRow (23)
>:   ::  +- Scan parquet 
> default.catalog_sales (22)
>:   :+- ReusedExchange (24)
>:   +- BroadcastExchange (36)
>:  +- * Project (35)
>: +- * Filter (34)
>:+- * ColumnarToRow (33)
>:   +- Scan parquet 
> default.customer_address (32)
>+- * Sort (45)
>   +- Exchange (44)
>  +- * Filter (43)
> +- * ColumnarToRow (42)
>+- Scan parquet default.customer_demographics (41)
> {noformat}
> After:
> {noformat}
> TakeOrderedAndProject (48)
> +- * HashAggregate (47)
>+- Exchange (46)
>   +- * HashAggregate (45)
>  +- * Project (44)
> +- * BroadcastHashJoin Inner BuildLeft (43)
>:- BroadcastExchange (39)
>:  +- * Project (38)
>: +- * BroadcastHashJoin Inner BuildRight (37)
>::- * Project (31)
>::  +- * Filter (30)
>:: +- SortMergeJoin ExistenceJoin(exists#1) (29)
>:::- SortMergeJoin ExistenceJoin(exists#2) (21)
>:::  :- * SortMergeJoin LeftSemi (13)
>:::  :  :- * Sort (5)
>:::  :  :  +- Exchange (4)
>:::  :  : +- * Filter (3)
>:::  :  :+- * ColumnarToRow (2)
>:::  :  :   +- Scan parquet 
> default.customer 

[jira] [Commented] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-12 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413929#comment-17413929
 ] 

Apache Spark commented on SPARK-36705:
--

User 'rmcyang' has created a pull request for this issue:
https://github.com/apache/spark/pull/33976

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Priority: Blocker
>
> Push based shuffle is not compatible with io encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790
> We have to disable push based shuffle if either of these two are true.
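
For context, a hedged sketch of the guard this implies (the helper name is illustrative; the config keys are existing Spark settings, and "serializerRelocatable" stands in for the serializer's internal supportsRelocationOfSerializedObjects flag):

{code:scala}
import org.apache.spark.SparkConf

// Sketch only: push-based shuffle should stay enabled only when IO encryption is
// off and the serializer supports relocation of serialized objects.
def pushBasedShuffleAllowed(conf: SparkConf, serializerRelocatable: Boolean): Boolean = {
  val pushEnabled = conf.getBoolean("spark.shuffle.push.enabled", defaultValue = false)
  val ioEncryption = conf.getBoolean("spark.io.encryption.enabled", defaultValue = false)
  pushEnabled && !ioEncryption && serializerRelocatable
}
{code}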



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36705:


Assignee: (was: Apache Spark)

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Priority: Blocker
>
> Push based shuffle is not compatible with io encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790
> We have to disable push based shuffle if either of these two are true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36705) Disable push based shuffle when IO encryption is enabled or serializer is not relocatable

2021-09-12 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36705:


Assignee: Apache Spark

> Disable push based shuffle when IO encryption is enabled or serializer is not 
> relocatable
> -
>
> Key: SPARK-36705
> URL: https://issues.apache.org/jira/browse/SPARK-36705
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle
>Affects Versions: 3.2.0
>Reporter: Mridul Muralidharan
>Assignee: Apache Spark
>Priority: Blocker
>
> Push based shuffle is not compatible with io encryption or non-relocatable 
> serialization.
> This is similar to SPARK-34790
> We have to disable push based shuffle if either of these two are true.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org