[jira] [Assigned] (SPARK-35661) Allow deserialized off-heap memory entry

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35661:


Assignee: Apache Spark

> Allow deserialized off-heap memory entry
> 
>
> Key: SPARK-35661
> URL: https://issues.apache.org/jira/browse/SPARK-35661
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-35661) Allow deserialized off-heap memory entry

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358362#comment-17358362
 ] 

Apache Spark commented on SPARK-35661:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32800

> Allow deserialized off-heap memory entry
> 
>
> Key: SPARK-35661
> URL: https://issues.apache.org/jira/browse/SPARK-35661
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-35661) Allow deserialized off-heap memory entry

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35661:


Assignee: (was: Apache Spark)

> Allow deserialized off-heap memory entry
> 
>
> Key: SPARK-35661
> URL: https://issues.apache.org/jira/browse/SPARK-35661
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>







[jira] [Created] (SPARK-35661) Allow deserialized off-heap memory entry

2021-06-06 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-35661:
---

 Summary: Allow deserialized off-heap memory entry
 Key: SPARK-35661
 URL: https://issues.apache.org/jira/browse/SPARK-35661
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1

2021-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35660:
-

Assignee: Dongjoon Hyun

> Upgrade Kubernetes-client to 5.4.1
> --
>
> Key: SPARK-35660
> URL: https://issues.apache.org/jira/browse/SPARK-35660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Resolved] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1

2021-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35660.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32798
[https://github.com/apache/spark/pull/32798]

> Upgrade Kubernetes-client to 5.4.1
> --
>
> Key: SPARK-35660
> URL: https://issues.apache.org/jira/browse/SPARK-35660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>







[jira] [Commented] (SPARK-35635) concurrent insert statements from multiple beeline fail with job aborted exception

2021-06-06 Thread Chetan Bhat (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358356#comment-17358356
 ] 

Chetan Bhat commented on SPARK-35635:
-

Yes, that's the issue. It has to be handled by the system during concurrent query 
execution.

> concurrent insert statements from multiple beeline fail with job aborted 
> exception
> --
>
> Key: SPARK-35635
> URL: https://issues.apache.org/jira/browse/SPARK-35635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
>Reporter: Chetan Bhat
>Priority: Minor
>
> Create tables - 
> CREATE TABLE J1_TBL (
>  i integer,
>  j integer,
>  t string
> ) USING parquet;
> CREATE TABLE J2_TBL (
>  i integer,
>  k integer
> ) USING parquet;
> From 4 concurrent beeline sessions, execute the following insert queries - 
> INSERT INTO J1_TBL VALUES (1, 4, 'one');
> INSERT INTO J1_TBL VALUES (2, 3, 'two');
> INSERT INTO J1_TBL VALUES (3, 2, 'three');
> INSERT INTO J1_TBL VALUES (4, 1, 'four');
> INSERT INTO J1_TBL VALUES (5, 0, 'five');
> INSERT INTO J1_TBL VALUES (6, 6, 'six');
> INSERT INTO J1_TBL VALUES (7, 7, 'seven');
> INSERT INTO J1_TBL VALUES (8, 8, 'eight');
> INSERT INTO J1_TBL VALUES (0, NULL, 'zero');
> INSERT INTO J1_TBL VALUES (NULL, NULL, 'null');
> INSERT INTO J1_TBL VALUES (NULL, 0, 'zero');
> INSERT INTO J2_TBL VALUES (1, -1);
> INSERT INTO J2_TBL VALUES (2, 2);
> INSERT INTO J2_TBL VALUES (3, -3);
> INSERT INTO J2_TBL VALUES (2, 4);
> INSERT INTO J2_TBL VALUES (5, -5);
> INSERT INTO J2_TBL VALUES (5, -5);
> INSERT INTO J2_TBL VALUES (0, NULL);
> INSERT INTO J2_TBL VALUES (NULL, NULL);
> INSERT INTO J2_TBL VALUES (NULL, 0);
>  
> Issue : concurrent insert statements from multiple beeline fail with job 
> aborted exception.
> 0: jdbc:hive2://10.19.89.222:23040/> INSERT INTO J1_TBL VALUES (8, 8, 
> 'eight');
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted.
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:366)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3$$Lambda$1781/750578465.apply$mcV$sp(Unknown
>  Source)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:45)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Job aborted.
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
>  at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:109)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:107)
>  at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:121)
>  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
>  at org.apache.spark.sql.Dataset$$Lambda$1650/1168893915.apply(Unknown Source)
>  at 

[jira] [Assigned] (SPARK-35646) Merge contents and remove obsolete pages in API reference section

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35646:


Assignee: (was: Apache Spark)

> Merge contents and remove obsolete pages in API reference section
> -
>
> Key: SPARK-35646
> URL: https://issues.apache.org/jira/browse/SPARK-35646
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> The Koalas documentation is now part of the PySpark documentation. We should 
> probably remove obsolete pages such as the blog posts and talks, and refine and 
> merge the contents properly.






[jira] [Assigned] (SPARK-35646) Merge contents and remove obsolete pages in API reference section

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35646:


Assignee: Apache Spark

> Merge contents and remove obsolete pages in API reference section
> -
>
> Key: SPARK-35646
> URL: https://issues.apache.org/jira/browse/SPARK-35646
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> The Koalas documentation is now part of the PySpark documentation. We should 
> probably remove obsolete pages such as the blog posts and talks, and refine and 
> merge the contents properly.






[jira] [Commented] (SPARK-35646) Merge contents and remove obsolete pages in API reference section

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358355#comment-17358355
 ] 

Apache Spark commented on SPARK-35646:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/32799

> Merge contents and remove obsolete pages in API reference section
> -
>
> Key: SPARK-35646
> URL: https://issues.apache.org/jira/browse/SPARK-35646
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> The Koalas documentation is now part of the PySpark documentation. We should 
> probably remove obsolete pages such as the blog posts and talks, and refine and 
> merge the contents properly.






[jira] [Assigned] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35660:


Assignee: Apache Spark

> Upgrade Kubernetes-client to 5.4.1
> --
>
> Key: SPARK-35660
> URL: https://issues.apache.org/jira/browse/SPARK-35660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35660:


Assignee: (was: Apache Spark)

> Upgrade Kubernetes-client to 5.4.1
> --
>
> Key: SPARK-35660
> URL: https://issues.apache.org/jira/browse/SPARK-35660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358337#comment-17358337
 ] 

Apache Spark commented on SPARK-35660:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32798

> Upgrade Kubernetes-client to 5.4.1
> --
>
> Key: SPARK-35660
> URL: https://issues.apache.org/jira/browse/SPARK-35660
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Created] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1

2021-06-06 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-35660:
-

 Summary: Upgrade Kubernetes-client to 5.4.1
 Key: SPARK-35660
 URL: https://issues.apache.org/jira/browse/SPARK-35660
 Project: Spark
  Issue Type: Improvement
  Components: Build, Kubernetes
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun









[jira] [Assigned] (SPARK-35603) Add data source options link for R API documentation.

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35603:


Assignee: (was: Apache Spark)

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should add the data source options link for the R documentation as well, as 
> we did in https://issues.apache.org/jira/browse/SPARK-34491 .






[jira] [Assigned] (SPARK-35603) Add data source options link for R API documentation.

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35603:


Assignee: Apache Spark

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> We should add the data source options link for the R documentation as well, as 
> we did in https://issues.apache.org/jira/browse/SPARK-34491 .






[jira] [Commented] (SPARK-35603) Add data source options link for R API documentation.

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358333#comment-17358333
 ] 

Apache Spark commented on SPARK-35603:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32797

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should add the data source options link for the R documentation as well, as 
> we did in https://issues.apache.org/jira/browse/SPARK-34491 .






[jira] [Commented] (SPARK-35603) Add data source options link for R API documentation.

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358332#comment-17358332
 ] 

Apache Spark commented on SPARK-35603:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32797

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should add the data source options link for the R documentation as well, as 
> we did in https://issues.apache.org/jira/browse/SPARK-34491 .






[jira] [Resolved] (SPARK-34765) Linear Models standardization optimization

2021-06-06 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-34765.
--
Resolution: Resolved

> Linear Models standardization optimization
> --
>
> Key: SPARK-34765
> URL: https://issues.apache.org/jira/browse/SPARK-34765
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.1, 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> The existing impl of standardization in linear models does *NOT* center the 
> vectors by removing the means, in order to keep the dataset sparse.
> However, this causes feature values with small variance to be scaled to large 
> values, and an underlying solver like LBFGS cannot handle this case efficiently. 
> See SPARK-34448 for details.
> If the internal vectors are centered (as in other well-known impls, i.e. 
> GLMNET/Scikit-Learn), the convergence rate will be better. In the case in 
> SPARK-34448, the number of iterations to convergence is reduced from 93 
> to 6. Moreover, the final solution is much closer to the one from GLMNET.
> Luckily, we found a new way to 'virtually' center the vectors without 
> densifying the dataset, iff:
> 1, fitIntercept is true;
>  2, there is no penalty on the intercept (this seems to always be true in existing 
> impls);
>  3, there are no bounds on the intercept.
>  
> We will also need to check whether this new method works in all other linear 
> models (i.e. mlor/svc/lir/aft, etc.) as expected, and introduce it into 
> those models if possible.
>  
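A minimal numpy sketch (not Spark's actual implementation) of the identity that makes 
this 'virtual' centering possible when a free, un-penalized intercept is fitted: 
centering only shifts the intercept term, so the sparse, un-centered vectors never 
need to be densified.

{code:python}
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5, 3))                 # stand-in for (possibly sparse) feature rows
w = rng.random(3)                      # coefficients on the standardized features
b = 0.5                                # intercept
mu, sigma = X.mean(axis=0), X.std(axis=0)

# Standardization with centering (would densify sparse data):
pred_centered = ((X - mu) / sigma) @ w + b
# Scaling only, with the centering folded into the intercept ("virtual" centering):
pred_virtual = (X / sigma) @ w + (b - (mu / sigma) @ w)

assert np.allclose(pred_centered, pred_virtual)
{code}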






[jira] [Resolved] (SPARK-35619) Refactor LinearRegression - make huber support virtual centering

2021-06-06 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-35619.
--
Resolution: Resolved

> Refactor LinearRegression - make huber support virtual centering
> 
>
> Key: SPARK-35619
> URL: https://issues.apache.org/jira/browse/SPARK-35619
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> 1, make huber regression support virtual centering
> 2, as for {{LeastSquares}}, it always computes without the intercept and estimates 
> the intercept after optimizing the linear part, so just re-organize the 
> LeastSquares part
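A small numpy sketch (an illustration of the idea above, not Spark's LeastSquares 
code) of optimizing only the linear part and estimating the intercept afterwards 
from the feature and label means:

{code:python}
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + 0.01 * rng.standard_normal(50)

x_mean, y_mean = X.mean(axis=0), y.mean()
w, *_ = np.linalg.lstsq(X - x_mean, y - y_mean, rcond=None)  # linear part only
b = y_mean - x_mean @ w                                      # intercept estimated afterwards

# Same solution as fitting with an explicit intercept column:
w_ref, *_ = np.linalg.lstsq(np.column_stack([X, np.ones(len(X))]), y, rcond=None)
assert np.allclose(w, w_ref[:-1]) and np.isclose(b, w_ref[-1])
{code}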






[jira] [Commented] (SPARK-31241) Support Hive on DataSourceV2

2021-06-06 Thread Dabao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358331#comment-17358331
 ] 

Dabao commented on SPARK-31241:
---

Hi, [~Jackey Lee] 

We’re now working on a project that uses DataSourceV2 to provide multiple-source 
support. 

Is there any new progress on this issue? And could you provide a doc for the current 
design, so that we can discuss and improve it in detail?

> Support Hive on DataSourceV2
> 
>
> Key: SPARK-31241
> URL: https://issues.apache.org/jira/browse/SPARK-31241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> There are 3 reasons why we need to support Hive on DataSourceV2.
> 1. Hive itself is one of Spark's data sources.
> 2. HiveTable is essentially a FileTable with its own input and output
> formats, and it works fine with FileTable.
> 3. HiveTable should be stateless, and users can freely read or write Hive
> using batch or microbatch.






[jira] [Updated] (SPARK-35603) Add data source options link for R API documentation.

2021-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35603:

Summary: Add data source options link for R API documentation.  (was: Move 
data source options from R into a single page.)

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should consolidate the data source options from the R documentation as well, 
> as we did in https://issues.apache.org/jira/browse/SPARK-34491 .






[jira] [Updated] (SPARK-35603) Add data source options link for R API documentation.

2021-06-06 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35603:

Description: We should add the data source options link for R documentation 
as well like we did at https://issues.apache.org/jira/browse/SPARK-34491 .  
(was: We should consolidate the data source options from R documentation as 
well like we did at https://issues.apache.org/jira/browse/SPARK-34491 .)

> Add data source options link for R API documentation.
> -
>
> Key: SPARK-35603
> URL: https://issues.apache.org/jira/browse/SPARK-35603
> Project: Spark
>  Issue Type: Documentation
>  Components: docs, R
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> We should add the data source options link for the R documentation as well, as 
> we did in https://issues.apache.org/jira/browse/SPARK-34491 .






[jira] [Issue Comment Deleted] (SPARK-31241) Support Hive on DataSourceV2

2021-06-06 Thread Dabao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dabao updated SPARK-31241:
--
Comment: was deleted

(was: Hi,[~Jackey Lee]

We’re now working on a project using DataSourceV2 to provide multiple source
 support. Is there any new progress in the current issue?  And could you 
provide any doc for current design, so that we can discuss and improve it in 
detail ?)

> Support Hive on DataSourceV2
> 
>
> Key: SPARK-31241
> URL: https://issues.apache.org/jira/browse/SPARK-31241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> There are 3 reasons why we need to support Hive on DataSourceV2.
> 1. Hive itself is one of Spark's data sources.
> 2. HiveTable is essentially a FileTable with its own input and output
> formats, and it works fine with FileTable.
> 3. HiveTable should be stateless, and users can freely read or write Hive
> using batch or microbatch.






[jira] [Resolved] (SPARK-35657) createDataFrame fails while to_spark works.

2021-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35657.
--
Resolution: Won't Fix

> createDataFrame fails while to_spark works.
> ---
>
> Key: SPARK-35657
> URL: https://issues.apache.org/jira/browse/SPARK-35657
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: * Apache Spark 3.2.0-SNAPSHOT (build from source)
>  * Python 3.8.10
>  * OpenJDK 11.0
>  * pandas 1.2.4
>  * pyarrow 4.0.1
>Reporter: Yosi Pramajaya
>Priority: Major
>
> Sample code:
> {code:python}
> kdf = ks.DataFrame({
>     'a': [1, 2, 3],
>     'b': [2., 3., 4.],
>     'c': ['string1', 'string2', 'string3'],
>     'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
>     'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
> })
> df = kdf.to_spark()              # WORKS
> df = spark.createDataFrame(kdf)  # FAILED
> {code}
> Error:
> {{TypeError: Can not infer schema for type: }}






[jira] [Commented] (SPARK-35657) createDataFrame fails while to_spark works.

2021-06-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358302#comment-17358302
 ] 

Hyukjin Kwon commented on SPARK-35657:
--

Yeah, let's stick to to_spark for now.
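For reference, a minimal sketch of that recommended path, based on the sample code in 
the issue: convert the Koalas frame with to_spark() rather than passing it to 
spark.createDataFrame() directly.

{code:python}
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': [2., 3., 4.]})
sdf = kdf.to_spark()   # works, unlike spark.createDataFrame(kdf)
sdf.show()
{code}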

> createDataFrame fails while to_spark works.
> ---
>
> Key: SPARK-35657
> URL: https://issues.apache.org/jira/browse/SPARK-35657
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: * Apache Spark 3.2.0-SNAPSHOT (build from source)
>  * Python 3.8.10
>  * OpenJDK 11.0
>  * pandas 1.2.4
>  * pyarrow 4.0.1
>Reporter: Yosi Pramajaya
>Priority: Major
>
> Sample code:
> {code:python}
> kdf = ks.DataFrame({
>     'a': [1, 2, 3],
>     'b': [2., 3., 4.],
>     'c': ['string1', 'string2', 'string3'],
>     'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
>     'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
> })
> df = kdf.to_spark()              # WORKS
> df = spark.createDataFrame(kdf)  # FAILED
> {code}
> Error:
> {{TypeError: Can not infer schema for type: }}






[jira] [Updated] (SPARK-35658) Document Parquet encryption feature in Spark

2021-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-35658:
-
Target Version/s:   (was: 3.2.0)

> Document Parquet encryption feature in Spark
> 
>
> Key: SPARK-35658
> URL: https://issues.apache.org/jira/browse/SPARK-35658
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.0
>Reporter: Gidon Gershinsky
>Priority: Major
>
> Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the 
> encryption feature that can be called from Spark SQL. The aim of this Jira 
> is to document the use of Parquet encryption in Spark.
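Until that documentation exists, here is a rough sketch of what calling the 
parquet-mr encryption feature from Spark SQL might look like. The Hadoop property 
names, the PropertiesDrivenCryptoFactory class and the InMemoryKMS mock used below 
are assumptions to be verified against the parquet-mr 1.12 documentation, not 
confirmed Spark APIs.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hc = spark.sparkContext._jsc.hadoopConfiguration()
# Assumed parquet-mr 1.12 properties; names to be confirmed in the documentation.
hc.set("parquet.crypto.factory.class",
       "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hc.set("parquet.encryption.kms.client.class",
       "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")  # test-only mock KMS
hc.set("parquet.encryption.key.list",
       "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")

df = spark.createDataFrame([(1, "alice")], ["id", "name"])
(df.write
   .option("parquet.encryption.footer.key", "keyA")        # assumed write option
   .option("parquet.encryption.column.keys", "keyB:name")  # assumed write option
   .parquet("/tmp/encrypted_table"))
{code}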






[jira] [Resolved] (SPARK-35599) Introduce a way to compare series of array for older pandas

2021-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35599.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32772
[https://github.com/apache/spark/pull/32772]

> Introduce a way to compare series of array for older pandas
> ---
>
> Key: SPARK-35599
> URL: https://issues.apache.org/jira/browse/SPARK-35599
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> The PySpark test ComplexOpsTest.test_add failed with older pandas (e.g. v1.0.1) 
> with "ValueError: The truth value of an array with more than one element is 
> ambiguous. Use a.any() or a.all()".
> We need to introduce a way to check equality when the data are arrays.
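A short, self-contained sketch of one way to do such a comparison that works across 
pandas versions (an illustration of the idea, not necessarily the helper the PR 
adds): compare the series element by element with numpy instead of relying on a 
vectorized == followed by .all().

{code:python}
import numpy as np
import pandas as pd

left = pd.Series([np.array([1, 2]), np.array([3, 4])])
right = pd.Series([np.array([1, 2]), np.array([3, 4])])

# A naive (left == right).all() can raise the ambiguous-truth-value ValueError on
# some pandas versions, because each element-wise comparison yields an array.
equal = len(left) == len(right) and all(
    np.array_equal(l, r) for l, r in zip(left, right)
)
print(equal)  # True
{code}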






[jira] [Assigned] (SPARK-35599) Introduce a way to compare series of array for older pandas

2021-06-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35599:


Assignee: Xinrong Meng

> Introduce a way to compare series of array for older pandas
> ---
>
> Key: SPARK-35599
> URL: https://issues.apache.org/jira/browse/SPARK-35599
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> The PySpark test ComplexOpsTest.test_add failed with older pandas (e.g. v1.0.1) 
> with "ValueError: The truth value of an array with more than one element is 
> ambiguous. Use a.any() or a.all()".
> We need to introduce a way to check equality when the data are arrays.






[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358286#comment-17358286
 ] 

Adam Binford commented on SPARK-35564:
--

Is that documented somewhere? I know Boolean expressions aren't guaranteed to 
short-circuit, but I think most Spark users would assume multiple when clauses 
would short-circuit.
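A self-contained pyspark sketch of that expectation (the UDF below is just an 
illustrative stand-in for an expensive or failing expression): with two when clauses, 
users generally assume the second condition is skipped for rows the first clause 
already matched.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
expensive = F.udf(lambda x: x, "long")   # stand-in for an expensive/failing UDF

df = spark.range(2).select(
    F.when(F.col("id") >= 0, F.lit(1))                # true for every row here...
     .when(expensive(F.col("id")) > 0, F.lit(2))      # ...so users assume this never runs
     .alias("value")
)
df.show()
{code}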

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.






[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358285#comment-17358285
 ] 

L. C. Hsieh commented on SPARK-35564:
-

If you mean a common expr in the tail conditions other than the first one, it is 
similar to the coalesce example above, as I think it assumes all conditions can be 
executed without problems. It is still a performance consideration here.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.






[jira] [Assigned] (SPARK-35499) Apply black to pandas API on Spark codes.

2021-06-06 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-35499:
---

Assignee: Haejoon Lee

> Apply black to pandas API on Spark codes.
> -
>
> Key: SPARK-35499
> URL: https://issues.apache.org/jira/browse/SPARK-35499
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> To make static analysis easier and more efficient, we should apply `black` to 
> the pandas API on Spark.
> The Koalas project uses black in its [reformatting 
> script|https://github.com/databricks/koalas/blob/master/dev/reformat].
>  
>  






[jira] [Resolved] (SPARK-35499) Apply black to pandas API on Spark codes.

2021-06-06 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-35499.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32779
[https://github.com/apache/spark/pull/32779]

> Apply black to pandas API on Spark codes.
> -
>
> Key: SPARK-35499
> URL: https://issues.apache.org/jira/browse/SPARK-35499
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> To make static analysis easier and more efficient, we should apply `black` to 
> the pandas API on Spark.
> The Koalas project uses black in its [reformatting 
> script|https://github.com/databricks/koalas/blob/master/dev/reformat].
>  
>  






[jira] [Commented] (SPARK-35659) Avoid write null to StateStore

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358282#comment-17358282
 ] 

Apache Spark commented on SPARK-35659:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32796

> Avoid write null to StateStore
> --
>
> Key: SPARK-35659
> URL: https://issues.apache.org/jira/browse/SPARK-35659
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> According to the {{get}} method doc in the StateStore API, it returns a non-null 
> row if the key exists, so basically we should avoid writing null to the StateStore. 
> You cannot distinguish whether a returned null row means the key doesn't exist 
> or the value is actually null. And due to the defined behavior of {{get}}, it 
> is quite easy to cause an NPE if the caller doesn't expect to get a null 
> when the caller believes the key exists.
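A tiny Python analogy of the ambiguity described above (a plain dict, not Spark's 
actual StateStore API): once a null value has been written, a null result from get 
no longer tells the caller whether the key is missing or the stored value is null.

{code:python}
store = {}
store["k1"] = None            # writing a "null" value

print(store.get("k1"))        # None -> the value is actually null
print(store.get("missing"))   # None -> the key does not exist
# The two cases are indistinguishable to the caller, which is why writing null
# into the store should be avoided.
{code}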






[jira] [Commented] (SPARK-35659) Avoid write null to StateStore

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358281#comment-17358281
 ] 

Apache Spark commented on SPARK-35659:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32796

> Avoid write null to StateStore
> --
>
> Key: SPARK-35659
> URL: https://issues.apache.org/jira/browse/SPARK-35659
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> According to the {{get}} method doc in the StateStore API, it returns a non-null 
> row if the key exists, so basically we should avoid writing null to the StateStore. 
> You cannot distinguish whether a returned null row means the key doesn't exist 
> or the value is actually null. And due to the defined behavior of {{get}}, it 
> is quite easy to cause an NPE if the caller doesn't expect to get a null 
> when the caller believes the key exists.






[jira] [Assigned] (SPARK-35659) Avoid write null to StateStore

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35659:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Avoid write null to StateStore
> --
>
> Key: SPARK-35659
> URL: https://issues.apache.org/jira/browse/SPARK-35659
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> According to the {{get}} method doc in the StateStore API, it returns a non-null 
> row if the key exists, so basically we should avoid writing null to the StateStore. 
> You cannot distinguish whether a returned null row means the key doesn't exist 
> or the value is actually null. And due to the defined behavior of {{get}}, it 
> is quite easy to cause an NPE if the caller doesn't expect to get a null 
> when the caller believes the key exists.






[jira] [Assigned] (SPARK-35659) Avoid write null to StateStore

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35659:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Avoid write null to StateStore
> --
>
> Key: SPARK-35659
> URL: https://issues.apache.org/jira/browse/SPARK-35659
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> According to the {{get}} method doc in the StateStore API, it returns a non-null 
> row if the key exists, so basically we should avoid writing null to the StateStore. 
> You cannot distinguish whether a returned null row means the key doesn't exist 
> or the value is actually null. And due to the defined behavior of {{get}}, it 
> is quite easy to cause an NPE if the caller doesn't expect to get a null 
> when the caller believes the key exists.






[jira] [Updated] (SPARK-35659) Avoid write null to StateStore

2021-06-06 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35659:

Description: According to {{get}} method doc in StateStore API, it returns 
non-null row if the key exists. So basically we should avoid write null to 
StateStore. You cannot distinguish if the returned null row is because the key 
doesn't exist, or the value is actually null. And due to the defined behavior 
of {{get}}, it is quite easy to cause NPE error if the caller doesn't expect to 
get a null if the caller believes the key exists.  (was: According to {{get}} 
metho doc in StateStore API, it returns non-null row if the key exists. So 
basically we should avoid write null to StateStore. You cannot distinguish if 
the returned null row is because the key doesn't exist, or the value is 
actually null. And due to the defined behavior of {{get}}, it is quite easy to 
cause NPE error if the caller doesn't expect to get a null if the caller 
believes the key exists.)

> Avoid write null to StateStore
> --
>
> Key: SPARK-35659
> URL: https://issues.apache.org/jira/browse/SPARK-35659
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.2, 3.1.2, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> According to the {{get}} method doc in the StateStore API, it returns a non-null 
> row if the key exists, so basically we should avoid writing null to the StateStore. 
> You cannot distinguish whether a returned null row means the key doesn't exist 
> or the value is actually null. And due to the defined behavior of {{get}}, it 
> is quite easy to cause an NPE if the caller doesn't expect to get a null 
> when the caller believes the key exists.






[jira] [Created] (SPARK-35659) Avoid write null to StateStore

2021-06-06 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-35659:
---

 Summary: Avoid write null to StateStore
 Key: SPARK-35659
 URL: https://issues.apache.org/jira/browse/SPARK-35659
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.2, 3.0.2, 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


According to the {{get}} method doc in the StateStore API, it returns a non-null 
row if the key exists, so basically we should avoid writing null to the StateStore. 
You cannot distinguish whether a returned null row means the key doesn't exist, 
or the value is actually null. And due to the defined behavior of {{get}}, it 
is quite easy to cause an NPE if the caller doesn't expect to get a null when 
the caller believes the key exists.






[jira] [Comment Edited] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358275#comment-17358275
 ] 

Adam Binford edited comment on SPARK-35564 at 6/6/21, 10:50 PM:


No the values are fine, it's the tail conditions that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), 
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never 
should be evaluated.


was (Author: kimahriman):
No the values are fine, it's the condition that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), 
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never 
should be evaluated.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.






[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358275#comment-17358275
 ] 

Adam Binford commented on SPARK-35564:
--

No the values are fine, it's the condition that cause the issue.
{code:java}
spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), 
when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code}
Here myUdf($"id") gets pulled out as a subexpression even though it never 
should be evaluated.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed, 
> otherwise we want it to be null. 
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade off is potentially another subexpression function call (for split 
> subexpressions) if the second evaluation doesn't happen, but this seems like 
> it would be worth it for when it is evaluated the second time.






[jira] [Comment Edited] (SPARK-35657) createDataFrame fails while to_spark works.

2021-06-06 Thread Kevin Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358228#comment-17358228
 ] 

Kevin Su edited comment on SPARK-35657 at 6/6/21, 10:33 PM:


*spark.createDataFrame* doesn't support creating a DataFrame from databricks.koalas.

It can only create a DataFrame from an RDD, a list or a pandas.DataFrame.

[https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html]


was (Author: pingsutw):
*spark.createDataFrame,* it doesn't support create from databricks.koalas.

It only can create a DataFrame from an RDD, a list or a pandas.DataFrame.

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html

> createDataFrame fails while to_spark works.
> ---
>
> Key: SPARK-35657
> URL: https://issues.apache.org/jira/browse/SPARK-35657
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: * Apache Spark 3.2.0-SNAPSHOT (build from source)
>  * Python 3.8.10
>  * OpenJDK 11.0
>  * pandas 1.2.4
>  * pyarrow 4.0.1
>Reporter: Yosi Pramajaya
>Priority: Major
>
> Sample code:
> {code:python}
> kdf = ks.DataFrame({
>     'a': [1, 2, 3],
>     'b': [2., 3., 4.],
>     'c': ['string1', 'string2', 'string3'],
>     'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
>     'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
> })
> df = kdf.to_spark()              # WORKS
> df = spark.createDataFrame(kdf)  # FAILED
> {code}
> Error:
> {{TypeError: Can not infer schema for type: }}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35657) createDataFrame fails while to_spark works.

2021-06-06 Thread Kevin Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358228#comment-17358228
 ] 

Kevin Su commented on SPARK-35657:
--

*spark.createDataFrame* doesn't support creating a DataFrame from databricks.koalas.

It can only create a DataFrame from an RDD, a list, or a pandas.DataFrame.

https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html
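
For comparison, a rough sketch of the conversion paths that do work today 
(assuming a SparkSession named spark and the databricks.koalas package 
installed):

{code:python}
import databricks.koalas as ks

kdf = ks.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# Supported: let Koalas produce the Spark DataFrame directly.
sdf = kdf.to_spark()

# Also works, at the cost of collecting the data to the driver first:
# createDataFrame accepts a pandas.DataFrame.
sdf2 = spark.createDataFrame(kdf.to_pandas())
{code}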

> createDataFrame fails while to_spark works.
> ---
>
> Key: SPARK-35657
> URL: https://issues.apache.org/jira/browse/SPARK-35657
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
> Environment: * Apache Spark 3.2.0-SNAPSHOT (build from source)
>  * Python 3.8.10
>  * OpenJDK 11.0
>  * pandas 1.2.4
>  * pyarrow 4.0.1
>Reporter: Yosi Pramajaya
>Priority: Major
>
> Sample code:
> {code:python}
> from datetime import date, datetime
> import databricks.koalas as ks
> 
> kdf = ks.DataFrame({
>     'a': [1, 2, 3],
>     'b': [2., 3., 4.],
>     'c': ['string1', 'string2', 'string3'],
>     'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
>     'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
> })
> 
> df = kdf.to_spark()              # WORKS
> df = spark.createDataFrame(kdf)  # FAILED
> {code}
> Error:
> {{TypeError: Can not infer schema for type: }}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358215#comment-17358215
 ] 

L. C. Hsieh commented on SPARK-35564:
-

Do you mean {{CaseWhen(($"id", myUdf($"id")) :: ($"id" + 1, myUdf($"id")) :: Nil, 
Some(myUdf($"id")))}}?

{{myUdf($"id")}} always runs for all rows, no?


> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed; 
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially an extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358210#comment-17358210
 ] 

Adam Binford commented on SPARK-35564:
--

You can construct a similar CaseWhen that could lead to a similar problem; the 
coalesce was just simpler to demonstrate.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed; 
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially an extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358209#comment-17358209
 ] 

L. C. Hsieh commented on SPARK-35564:
-

For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), 
coalesce($"id" + 1, myUdf($"id"))).show()}}, it looks like pulling out a 
subexpr that might not be executed for a row could be a performance issue, but 
not a bug. But different from the else value in when, coalesce is not a 
conditional expression; it assumes all of its arguments can be evaluated 
without problems.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed; 
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially an extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358209#comment-17358209
 ] 

L. C. Hsieh edited comment on SPARK-35564 at 6/6/21, 7:49 PM:
--

For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), 
coalesce($"id" + 1, myUdf($"id"))).show()}}, it looks like pulling out a 
subexpr that might not be executed for a row could be a performance issue, but 
not a bug. Different from the else value in when, coalesce is not a 
conditional expression; it assumes all of its arguments can be evaluated 
without problems.


was (Author: viirya):
For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), 
coalesce($"id" + 1, myUdf($"id"))).show()}}, looks like it can possibly be 
performance issue by pulling a subexpr that might not be executed for a row but 
not a bug. But different to elsevalue in when, coalesce is not a condition 
expression, it supposes all arguments can be executed without problem.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed; 
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially an extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358206#comment-17358206
 ] 

Adam Binford commented on SPARK-35564:
--

Yes, that was an example of "will run at least once and maybe more than once" 
that I'm proposing to add more support for in this issue.

An example of current behavior that would be considered a bug is:
{code:java}
spark.range(2).select(coalesce($"id", myUdf($"id")), coalesce($"id" + 1, 
myUdf($"id"))).show()
{code}
myUdf will be pulled out into a subexpression even though it is never executed.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed; 
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially an extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358170#comment-17358170
 ] 

L. C. Hsieh commented on SPARK-35564:
-

{{select(myUdf($"id"), coalesce($"id", myUdf($"id")))}} => Doesn't 
{{myUdf($"id")}} always run at least once?

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed; 
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially an extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35654) Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop

2021-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35654.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32784
[https://github.com/apache/spark/pull/32784]

> Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop
> --
>
> Key: SPARK-35654
> URL: https://issues.apache.org/jira/browse/SPARK-35654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35654) Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop

2021-06-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35654:
-

Assignee: Dongjoon Hyun

> Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop
> --
>
> Key: SPARK-35654
> URL: https://issues.apache.org/jira/browse/SPARK-35654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31241) Support Hive on DataSourceV2

2021-06-06 Thread Dabao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358148#comment-17358148
 ] 

Dabao edited comment on SPARK-31241 at 6/6/21, 3:33 PM:


Hi [~Jackey Lee],

We’re now working on a project that uses DataSourceV2 to provide multi-source 
support. Is there any new progress on this issue? And could you provide a doc 
for the current design, so that we can discuss and improve it in detail?


was (Author: dabao):
Hi, Jacky

We’re now working on a project using DataSourceV2 to provide multiple source
support. Is there any new progress in the current issue?  And could you provide 
any doc for current design, so that we can discuss and improve it in detail ?

> Support Hive on DataSourceV2
> 
>
> Key: SPARK-31241
> URL: https://issues.apache.org/jira/browse/SPARK-31241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> There are 3 reasons why we need to support Hive on DataSourceV2.
> 1. Hive itself is one of Spark data sources.
> 2. HiveTable is essentially a FileTable with its own input and output
> formats; it works fine with FileTable.
> 3. HiveTable should be stateless, and users can freely read or write Hive
> using batch or microbatch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31241) Support Hive on DataSourceV2

2021-06-06 Thread Dabao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358148#comment-17358148
 ] 

Dabao commented on SPARK-31241:
---

Hi Jackey,

We’re now working on a project that uses DataSourceV2 to provide multi-source 
support. Is there any new progress on this issue? And could you provide a doc 
for the current design, so that we can discuss and improve it in detail?

> Support Hive on DataSourceV2
> 
>
> Key: SPARK-31241
> URL: https://issues.apache.org/jira/browse/SPARK-31241
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jackey Lee
>Priority: Major
>
> There are 3 reasons why we need to support Hive on DataSourceV2.
> 1. Hive itself is one of Spark data sources.
> 2. HiveTable is essentially a FileTable with its own input and output
> formats; it works fine with FileTable.
> 3. HiveTable should be stateless, and users can freely read or write Hive
> using batch or microbatch.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions

2021-06-06 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358117#comment-17358117
 ] 

Adam Binford commented on SPARK-35564:
--

It turns out this is already happening for certain when and coalesce 
expressions. For example:
{code:java}
spark.range(2).select(myUdf($"id"), coalesce($"id", myUdf($"id")))
{code}
myUdf gets pulled out as a subexpression even though it might only be executed 
once per row. This can be a correctness issue for very specific edge cases 
similar to https://issues.apache.org/jira/browse/SPARK-35449, where myUdf could 
get executed for a row even though that row doesn't pass certain conditional 
checks.

> Support subexpression elimination for non-common branches of conditional 
> expressions
> 
>
> Key: SPARK-35564
> URL: https://issues.apache.org/jira/browse/SPARK-35564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Adam Binford
>Priority: Major
>
> https://issues.apache.org/jira/browse/SPARK-7 added support for pulling 
> subexpressions out of branches of conditional expressions for expressions 
> present in all branches. We should be able to take this a step further and 
> pull out subexpressions for any branch, as long as that expression will 
> definitely be evaluated at least once.
> Consider a common data validation example:
> {code:java}
> from pyspark.sql.functions import *
> df = spark.createDataFrame([['word'], ['1234']])
> col = regexp_replace('_1', r'\d', '')
> df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code}
> We only want to keep the value if it's non-empty with numbers removed; 
> otherwise we want it to be null.
> Because we have no otherwise value, `col` is not a candidate for 
> subexpression elimination (you can see two regular expression replacements in 
> the codegen). But whenever the length is greater than 0, we will have to 
> execute the regular expression replacement twice. Since we know we will 
> always calculate `col` at least once, it makes sense to consider that as a 
> subexpression since we might need it again in the branch value. So we can 
> update the logic from:
> Create a subexpression if an expression will always be evaluated at least 
> twice
> To:
> Create a subexpression if an expression will always be evaluated at least 
> once AND will either always or conditionally be evaluated at least twice.
> The trade-off is potentially an extra subexpression function call (for split 
> subexpressions) when the second evaluation doesn't happen, but this seems 
> worth it for the cases where the expression is evaluated a second time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35658) Document Parquet encryption feature in Spark

2021-06-06 Thread Gidon Gershinsky (Jira)
Gidon Gershinsky created SPARK-35658:


 Summary: Document Parquet encryption feature in Spark
 Key: SPARK-35658
 URL: https://issues.apache.org/jira/browse/SPARK-35658
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.2.0
Reporter: Gidon Gershinsky


Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the 
encryption feature that can be used from Spark SQL. The aim of this Jira is to 
document the use of Parquet encryption in Spark.
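
As a rough starting point for that documentation, a sketch of what writing an 
encrypted Parquet file could look like from PySpark. The property names and the 
InMemoryKMS class below are assumptions taken from parquet-mr's 
PropertiesDrivenCryptoFactory and its test mock; a real deployment would 
configure an actual KMS client class and real master keys.

{code:python}
# Assumed parquet-mr encryption properties, set on the Hadoop configuration.
hc = spark.sparkContext._jsc.hadoopConfiguration()
hc.set("parquet.crypto.factory.class",
       "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")
hc.set("parquet.encryption.kms.client.class",
       "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")  # test-only mock KMS
hc.set("parquet.encryption.key.list",
       "keyA:AAECAwQFBgcICQoLDA0ODw==, keyB:AAECAAECAAECAAECAAECAA==")

df = spark.createDataFrame([(1, "123-45-6789")], ["id", "ssn"])

# Encrypt the sensitive 'ssn' column with keyA and the file footer with keyB.
(df.write
   .option("parquet.encryption.column.keys", "keyA:ssn")
   .option("parquet.encryption.footer.key", "keyB")
   .parquet("/tmp/table.parquet.encrypted"))
{code}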



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35588) Merge Binder integration and quickstart notebook

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358035#comment-17358035
 ] 

Apache Spark commented on SPARK-35588:
--

User 'yos1p' has created a pull request for this issue:
https://github.com/apache/spark/pull/32795

> Merge Binder integration and quickstart notebook
> 
>
> Key: SPARK-35588
> URL: https://issues.apache.org/jira/browse/SPARK-35588
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should merge:
> https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
> https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35588) Merge Binder integration and quickstart notebook

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35588:


Assignee: Apache Spark

> Merge Binder integration and quickstart notebook
> 
>
> Key: SPARK-35588
> URL: https://issues.apache.org/jira/browse/SPARK-35588
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> We should merge:
> https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
> https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35588) Merge Binder integration and quickstart notebook

2021-06-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358034#comment-17358034
 ] 

Apache Spark commented on SPARK-35588:
--

User 'yos1p' has created a pull request for this issue:
https://github.com/apache/spark/pull/32795

> Merge Binder integration and quickstart notebook
> 
>
> Key: SPARK-35588
> URL: https://issues.apache.org/jira/browse/SPARK-35588
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should merge:
> https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
> https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35588) Merge Binder integration and quickstart notebook

2021-06-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35588:


Assignee: (was: Apache Spark)

> Merge Binder integration and quickstart notebook
> 
>
> Key: SPARK-35588
> URL: https://issues.apache.org/jira/browse/SPARK-35588
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> We should merge:
> https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb
> https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35657) createDataFrame fails while to_spark works.

2021-06-06 Thread Yosi Pramajaya (Jira)
Yosi Pramajaya created SPARK-35657:
--

 Summary: createDataFrame fails while to_spark works.
 Key: SPARK-35657
 URL: https://issues.apache.org/jira/browse/SPARK-35657
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.2.0
 Environment: * Apache Spark 3.2.0-SNAPSHOT (build from source)
 * Python 3.8.10
 * OpenJDK 11.0
 * pandas 1.2.4
 * pyarrow 4.0.1
Reporter: Yosi Pramajaya


Sample code:

{code:python}
from datetime import date, datetime
import databricks.koalas as ks

kdf = ks.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})

df = kdf.to_spark()              # WORKS
df = spark.createDataFrame(kdf)  # FAILED
{code}

Error:

{{TypeError: Can not infer schema for type: }}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org