[jira] [Assigned] (SPARK-35661) Allow deserialized off-heap memory entry
[ https://issues.apache.org/jira/browse/SPARK-35661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35661: Assignee: Apache Spark > Allow deserialized off-heap memory entry > > > Key: SPARK-35661 > URL: https://issues.apache.org/jira/browse/SPARK-35661 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35661) Allow deserialized off-heap memory entry
[ https://issues.apache.org/jira/browse/SPARK-35661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358362#comment-17358362 ] Apache Spark commented on SPARK-35661: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/32800 > Allow deserialized off-heap memory entry > > > Key: SPARK-35661 > URL: https://issues.apache.org/jira/browse/SPARK-35661 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Assigned] (SPARK-35661) Allow deserialized off-heap memory entry
[ https://issues.apache.org/jira/browse/SPARK-35661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35661: Assignee: (was: Apache Spark) > Allow deserialized off-heap memory entry > > > Key: SPARK-35661 > URL: https://issues.apache.org/jira/browse/SPARK-35661 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major >
[jira] [Created] (SPARK-35661) Allow deserialized off-heap memory entry
Wenchen Fan created SPARK-35661: --- Summary: Allow deserialized off-heap memory entry Key: SPARK-35661 URL: https://issues.apache.org/jira/browse/SPARK-35661 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1
[ https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35660: - Assignee: Dongjoon Hyun > Upgrade Kubernetes-client to 5.4.1 > -- > > Key: SPARK-35660 > URL: https://issues.apache.org/jira/browse/SPARK-35660 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major >
[jira] [Resolved] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1
[ https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35660. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32798 [https://github.com/apache/spark/pull/32798] > Upgrade Kubernetes-client to 5.4.1 > -- > > Key: SPARK-35660 > URL: https://issues.apache.org/jira/browse/SPARK-35660 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > >
[jira] [Commented] (SPARK-35635) concurrent insert statements from multiple beeline fail with job aborted exception
[ https://issues.apache.org/jira/browse/SPARK-35635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358356#comment-17358356 ] Chetan Bhat commented on SPARK-35635: - Yes, that's the issue. That has to be taken care of by the system during concurrent query execution. > concurrent insert statements from multiple beeline fail with job aborted > exception > -- > > Key: SPARK-35635 > URL: https://issues.apache.org/jira/browse/SPARK-35635 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 >Reporter: Chetan Bhat >Priority: Minor > > Create tables - > CREATE TABLE J1_TBL ( > i integer, > j integer, > t string > ) USING parquet; > CREATE TABLE J2_TBL ( > i integer, > k integer > ) USING parquet; > From 4 concurrent beeline sessions execute the insert into select queries - > INSERT INTO J1_TBL VALUES (1, 4, 'one'); > INSERT INTO J1_TBL VALUES (2, 3, 'two'); > INSERT INTO J1_TBL VALUES (3, 2, 'three'); > INSERT INTO J1_TBL VALUES (4, 1, 'four'); > INSERT INTO J1_TBL VALUES (5, 0, 'five'); > INSERT INTO J1_TBL VALUES (6, 6, 'six'); > INSERT INTO J1_TBL VALUES (7, 7, 'seven'); > INSERT INTO J1_TBL VALUES (8, 8, 'eight'); > INSERT INTO J1_TBL VALUES (0, NULL, 'zero'); > INSERT INTO J1_TBL VALUES (NULL, NULL, 'null'); > INSERT INTO J1_TBL VALUES (NULL, 0, 'zero'); > INSERT INTO J2_TBL VALUES (1, -1); > INSERT INTO J2_TBL VALUES (2, 2); > INSERT INTO J2_TBL VALUES (3, -3); > INSERT INTO J2_TBL VALUES (2, 4); > INSERT INTO J2_TBL VALUES (5, -5); > INSERT INTO J2_TBL VALUES (5, -5); > INSERT INTO J2_TBL VALUES (0, NULL); > INSERT INTO J2_TBL VALUES (NULL, NULL); > INSERT INTO J2_TBL VALUES (NULL, 0); > > Issue: concurrent insert statements from multiple beeline sessions fail with job > aborted exception. > 0: jdbc:hive2://10.19.89.222:23040/> INSERT INTO J1_TBL VALUES (8, 8, > 'eight'); > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.SparkException: Job aborted.
> at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:366) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:263) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3$$Lambda$1781/750578465.apply$mcV$sp(Unknown > Source) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:78) > at > org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:62) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:45) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:263) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:258) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1729) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:272) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.SparkException: Job aborted. 
> at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:188) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:109) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:107) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:121) > at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228) > at org.apache.spark.sql.Dataset$$Lambda$1650/1168893915.apply(Unknown Source) > at
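Until concurrent writes to the same table are coordinated on the server side, a common client-side mitigation is to retry the failed statement, since the abort is transient. A minimal sketch of that pattern in plain Python (the names `flaky_insert` and `run_with_retry` are invented for illustration; this is not Spark or Hive code):

```python
# Hypothetical client-side mitigation sketch (plain Python, not Spark or
# Hive code): retry an insert that aborts transiently when several
# sessions write to the same table concurrently.
import threading
import time

results, lock, failed_once = [], threading.Lock(), set()

def flaky_insert(value):
    """Stand-in for a beeline INSERT: aborts on its first attempt per value."""
    with lock:
        if value not in failed_once:
            failed_once.add(value)
            raise RuntimeError("Job aborted.")  # simulated transient failure
        results.append(value)

def run_with_retry(op, attempts=5, backoff=0.01):
    """Retry `op` a bounded number of times with a small linear backoff."""
    for i in range(attempts):
        try:
            return op()
        except RuntimeError:
            if i == attempts - 1:
                raise
            time.sleep(backoff * (i + 1))

threads = [
    threading.Thread(target=run_with_retry, args=(lambda v=v: flaky_insert(v),))
    for v in range(4)  # four concurrent "beeline sessions"
]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert sorted(results) == [0, 1, 2, 3]  # every insert eventually succeeded
```

A bounded retry with backoff usually resolves this kind of contention, though it does not fix the underlying server-side race the comment above points to.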
[jira] [Assigned] (SPARK-35646) Merge contents and remove obsolete pages in API reference section
[ https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35646: Assignee: (was: Apache Spark) > Merge contents and remove obsolete pages in API reference section > - > > Key: SPARK-35646 > URL: https://issues.apache.org/jira/browse/SPARK-35646 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Now that the Koalas documentation is in the PySpark documentation, we should > remove obsolete pages such as blog posts and talks. Also, we should refine and > merge the contents properly.
[jira] [Assigned] (SPARK-35646) Merge contents and remove obsolete pages in API reference section
[ https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35646: Assignee: Apache Spark > Merge contents and remove obsolete pages in API reference section > - > > Key: SPARK-35646 > URL: https://issues.apache.org/jira/browse/SPARK-35646 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Now that the Koalas documentation is in the PySpark documentation, we should > remove obsolete pages such as blog posts and talks. Also, we should refine and > merge the contents properly.
[jira] [Commented] (SPARK-35646) Merge contents and remove obsolete pages in API reference section
[ https://issues.apache.org/jira/browse/SPARK-35646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358355#comment-17358355 ] Apache Spark commented on SPARK-35646: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/32799 > Merge contents and remove obsolete pages in API reference section > - > > Key: SPARK-35646 > URL: https://issues.apache.org/jira/browse/SPARK-35646 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > Now that the Koalas documentation is in the PySpark documentation, we should > remove obsolete pages such as blog posts and talks. Also, we should refine and > merge the contents properly.
[jira] [Assigned] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1
[ https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35660: Assignee: Apache Spark > Upgrade Kubernetes-client to 5.4.1 > -- > > Key: SPARK-35660 > URL: https://issues.apache.org/jira/browse/SPARK-35660 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1
[ https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35660: Assignee: (was: Apache Spark) > Upgrade Kubernetes-client to 5.4.1 > -- > > Key: SPARK-35660 > URL: https://issues.apache.org/jira/browse/SPARK-35660 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Commented] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1
[ https://issues.apache.org/jira/browse/SPARK-35660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358337#comment-17358337 ] Apache Spark commented on SPARK-35660: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/32798 > Upgrade Kubernetes-client to 5.4.1 > -- > > Key: SPARK-35660 > URL: https://issues.apache.org/jira/browse/SPARK-35660 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Created] (SPARK-35660) Upgrade Kubernetes-client to 5.4.1
Dongjoon Hyun created SPARK-35660: - Summary: Upgrade Kubernetes-client to 5.4.1 Key: SPARK-35660 URL: https://issues.apache.org/jira/browse/SPARK-35660 Project: Spark Issue Type: Improvement Components: Build, Kubernetes Affects Versions: 3.2.0 Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-35603) Add data source options link for R API documentation.
[ https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35603: Assignee: (was: Apache Spark) > Add data source options link for R API documentation. > - > > Key: SPARK-35603 > URL: https://issues.apache.org/jira/browse/SPARK-35603 > Project: Spark > Issue Type: Documentation > Components: docs, R >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We should add the data source options link for R documentation as well like > we did at https://issues.apache.org/jira/browse/SPARK-34491 .
[jira] [Assigned] (SPARK-35603) Add data source options link for R API documentation.
[ https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35603: Assignee: Apache Spark > Add data source options link for R API documentation. > - > > Key: SPARK-35603 > URL: https://issues.apache.org/jira/browse/SPARK-35603 > Project: Spark > Issue Type: Documentation > Components: docs, R >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > We should add the data source options link for R documentation as well like > we did at https://issues.apache.org/jira/browse/SPARK-34491 .
[jira] [Commented] (SPARK-35603) Add data source options link for R API documentation.
[ https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358333#comment-17358333 ] Apache Spark commented on SPARK-35603: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/32797 > Add data source options link for R API documentation. > - > > Key: SPARK-35603 > URL: https://issues.apache.org/jira/browse/SPARK-35603 > Project: Spark > Issue Type: Documentation > Components: docs, R >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We should add the data source options link for R documentation as well like > we did at https://issues.apache.org/jira/browse/SPARK-34491 .
[jira] [Commented] (SPARK-35603) Add data source options link for R API documentation.
[ https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358332#comment-17358332 ] Apache Spark commented on SPARK-35603: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/32797 > Add data source options link for R API documentation. > - > > Key: SPARK-35603 > URL: https://issues.apache.org/jira/browse/SPARK-35603 > Project: Spark > Issue Type: Documentation > Components: docs, R >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We should add the data source options link for R documentation as well like > we did at https://issues.apache.org/jira/browse/SPARK-34491 .
[jira] [Resolved] (SPARK-34765) Linear Models standardization optimization
[ https://issues.apache.org/jira/browse/SPARK-34765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-34765. -- Resolution: Resolved > Linear Models standardization optimization > -- > > Key: SPARK-34765 > URL: https://issues.apache.org/jira/browse/SPARK-34765 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.1, 3.2.0 >Reporter: zhengruifeng >Priority: Major > > The existing impl of standardization in linear models does *NOT* center the > vectors by removing the means, in order to keep the dataset sparse. > However, this causes features with small variance to be scaled to large > values, and an underlying solver like LBFGS cannot efficiently handle this > case; see SPARK-34448 for details. > If the internal vectors are centered (as in other well-known impls, e.g. > GLMNET/Scikit-Learn), convergence will be better. In the case in > SPARK-34448, the number of iterations to convergence is reduced from 93 > to 6. Moreover, the final solution is much closer to the one in GLMNET. > Luckily, we found a new way to 'virtually' center the vectors without > densifying the dataset, iff: > 1, fitIntercept is true; > 2, there is no penalty on the intercept, which seems to always be true in existing > impls; > 3, there are no bounds on the intercept. > > We will also need to check whether this new method works in all other linear > models (i.e. mlor/svc/lir/aft, etc.) as expected, and introduce it into > those models if possible. >
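The 'virtual' centering trick above can be illustrated outside Spark: subtracting the feature means mu before applying the coefficients densifies a sparse vector, but the same prediction is obtained by folding w·mu into the intercept. A small plain-Python sketch under that assumption (hypothetical names, not the MLlib implementation):

```python
# Hypothetical illustration (plain Python, not the MLlib impl): centering
# features by their means mu before applying coefficients w would densify
# sparse vectors, but the prediction is unchanged if w . mu is folded into
# the intercept instead:
#   w . (x - mu) + b  ==  w . x + (b - w . mu)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def predict_centered(w, b, x, mu):
    xc = [xi - mi for xi, mi in zip(x, mu)]  # densifies a sparse x
    return dot(w, xc) + b

def predict_virtual(w, b, x, mu):
    return dot(w, x) + (b - dot(w, mu))  # x can stay sparse

w, b = [0.5, -2.0, 3.0], 1.25
mu = [10.0, 0.5, -4.0]
x = [0.0, 0.0, 7.0]  # "sparse": a single non-zero feature
assert abs(predict_centered(w, b, x, mu) - predict_virtual(w, b, x, mu)) < 1e-12
```

This also suggests why the three conditions matter: the w·mu correction must be absorbable by a free intercept, which requires fitIntercept to be true and the intercept to be unpenalized and unbounded.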
[jira] [Resolved] (SPARK-35619) Refactor LinearRegression - make huber support virtual centering
[ https://issues.apache.org/jira/browse/SPARK-35619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-35619. -- Resolution: Resolved > Refactor LinearRegression - make huber support virtual centering > > > Key: SPARK-35619 > URL: https://issues.apache.org/jira/browse/SPARK-35619 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Priority: Major > > 1, make huber regression support virtual centering > 2, as for \{LeastSquares}, it always computes without the intercept, and estimates > the intercept after optimizing the linear part. So just reorganize the > LeastSquares part
[jira] [Commented] (SPARK-31241) Support Hive on DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358331#comment-17358331 ] Dabao commented on SPARK-31241: --- Hi, [~Jackey Lee] We're now working on a project that uses DataSourceV2 to provide multiple-source support. Is there any new progress on this issue? And could you provide a doc for the current design, so that we can discuss and improve it in detail? > Support Hive on DataSourceV2 > > > Key: SPARK-31241 > URL: https://issues.apache.org/jira/browse/SPARK-31241 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Priority: Major > > There are 3 reasons why we need to support Hive on DataSourceV2. > 1. Hive itself is one of Spark's data sources. > 2. HiveTable is essentially a FileTable with its own input and output > formats, so it works fine as a FileTable. > 3. HiveTable should be stateless, and users can freely read or write Hive > using batch or microbatch.
[jira] [Updated] (SPARK-35603) Add data source options link for R API documentation.
[ https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35603: Summary: Add data source options link for R API documentation. (was: Move data source options from R into a single page.) > Add data source options link for R API documentation. > - > > Key: SPARK-35603 > URL: https://issues.apache.org/jira/browse/SPARK-35603 > Project: Spark > Issue Type: Documentation > Components: docs, R >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We should consolidate the data source options from R documentation as well > like we did at https://issues.apache.org/jira/browse/SPARK-34491 .
[jira] [Updated] (SPARK-35603) Add data source options link for R API documentation.
[ https://issues.apache.org/jira/browse/SPARK-35603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35603: Description: We should add the data source options link for R documentation as well like we did at https://issues.apache.org/jira/browse/SPARK-34491 . (was: We should consolidate the data source options from R documentation as well like we did at https://issues.apache.org/jira/browse/SPARK-34491 .) > Add data source options link for R API documentation. > - > > Key: SPARK-35603 > URL: https://issues.apache.org/jira/browse/SPARK-35603 > Project: Spark > Issue Type: Documentation > Components: docs, R >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > We should add the data source options link for R documentation as well like > we did at https://issues.apache.org/jira/browse/SPARK-34491 .
[jira] [Issue Comment Deleted] (SPARK-31241) Support Hive on DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dabao updated SPARK-31241: -- Comment: was deleted (was: Hi,[~Jackey Lee] We’re now working on a project using DataSourceV2 to provide multiple source support. Is there any new progress in the current issue? And could you provide any doc for current design, so that we can discuss and improve it in detail ?) > Support Hive on DataSourceV2 > > > Key: SPARK-31241 > URL: https://issues.apache.org/jira/browse/SPARK-31241 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Priority: Major > > There are 3 reasons why we need to support Hive on DataSourceV2. > 1. Hive itself is one of Spark data sources. > 2. HiveTable is essentially a FileTable with its own input and output > formats, it works fine with FileTable. > 3. HiveTable should be stateless, and users can freely read or write Hive > using batch or microbatch.
[jira] [Resolved] (SPARK-35657) createDataFrame fails while to_spark works.
[ https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35657. -- Resolution: Won't Fix > createDataFrame fails while to_spark works. > --- > > Key: SPARK-35657 > URL: https://issues.apache.org/jira/browse/SPARK-35657 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: * Apache Spark 3.2.0-SNAPSHOT (build from source) > * Python 3.8.10 > * OpenJDK 11.0 > * pandas 1.2.4 > * pyarrow 4.0.1 >Reporter: Yosi Pramajaya >Priority: Major > > Sample code: > {{kdf = ks.DataFrame({}} > {{ 'a': [1, 2, 3],}} > {{ 'b': [2., 3., 4.],}} > {{ 'c': ['string1', 'string2', 'string3'],}} > {{ 'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],}} > {{ 'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), > datetime(2000, 1, 3, 12, 0)]}} > {{ })}}{{df = kdf.to_spark() # WORKS}} > {{ df = spark.createDataFrame(kdf) # FAILED}} > Error: > {{TypeError: Can not infer schema for type: }}
[jira] [Commented] (SPARK-35657) createDataFrame fails while to_spark works.
[ https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358302#comment-17358302 ] Hyukjin Kwon commented on SPARK-35657: -- Yeah, let's stick to to_spark for now. > createDataFrame fails while to_spark works. > --- > > Key: SPARK-35657 > URL: https://issues.apache.org/jira/browse/SPARK-35657 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.0 > Environment: * Apache Spark 3.2.0-SNAPSHOT (build from source) > * Python 3.8.10 > * OpenJDK 11.0 > * pandas 1.2.4 > * pyarrow 4.0.1 >Reporter: Yosi Pramajaya >Priority: Major > > Sample code: > {{kdf = ks.DataFrame({}} > {{ 'a': [1, 2, 3],}} > {{ 'b': [2., 3., 4.],}} > {{ 'c': ['string1', 'string2', 'string3'],}} > {{ 'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],}} > {{ 'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), > datetime(2000, 1, 3, 12, 0)]}} > {{ })}}{{df = kdf.to_spark() # WORKS}} > {{ df = spark.createDataFrame(kdf) # FAILED}} > Error: > {{TypeError: Can not infer schema for type: }}
[jira] [Updated] (SPARK-35658) Document Parquet encryption feature in Spark
[ https://issues.apache.org/jira/browse/SPARK-35658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-35658: - Target Version/s: (was: 3.2.0) > Document Parquet encryption feature in Spark > > > Key: SPARK-35658 > URL: https://issues.apache.org/jira/browse/SPARK-35658 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.0 >Reporter: Gidon Gershinsky >Priority: Major > > Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the > encryption feature that can be called from Spark SQL. The aim of this Jira > is to document the use of Parquet encryption in Spark.
[jira] [Resolved] (SPARK-35599) Introduce a way to compare series of array for older pandas
[ https://issues.apache.org/jira/browse/SPARK-35599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-35599. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32772 [https://github.com/apache/spark/pull/32772] > Introduce a way to compare series of array for older pandas > --- > > Key: SPARK-35599 > URL: https://issues.apache.org/jira/browse/SPARK-35599 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.2.0 > > > The PySpark test ComplexOpsTest.test_add failed with older pandas, e.g. v1.0.1, > with the ValueError: The truth value of an array with more than one element is > ambiguous. Use a.any() or a.all(). > We need to introduce a way to check equality when the data are arrays.
[jira] [Assigned] (SPARK-35599) Introduce a way to compare series of array for older pandas
[ https://issues.apache.org/jira/browse/SPARK-35599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-35599: Assignee: Xinrong Meng > Introduce a way to compare series of array for older pandas > --- > > Key: SPARK-35599 > URL: https://issues.apache.org/jira/browse/SPARK-35599 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > The PySpark test ComplexOpsTest.test_add failed with older pandas, e.g. v1.0.1, > with the ValueError: The truth value of an array with more than one element is > ambiguous. Use a.any() or a.all(). > We need to introduce a way to check equality when the data are arrays.
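The failure mode described above, and one possible equality helper, can be sketched in plain numpy (the helper name `series_of_arrays_equal` is invented here for illustration; it is not the utility introduced by the actual PR):

```python
# Sketch of the reported failure mode plus a possible equality helper
# (the name `series_of_arrays_equal` is hypothetical, not the PR's code).
import numpy as np

# `==` on arrays is elementwise, so its truth value is ambiguous:
try:
    bool(np.array([1, 2]) == np.array([1, 2]))
    ambiguous = False
except ValueError:  # "The truth value of an array ... is ambiguous"
    ambiguous = True
assert ambiguous

def series_of_arrays_equal(left, right):
    """Compare two sequences whose elements may be arrays or scalars."""
    return len(left) == len(right) and all(
        np.array_equal(l, r) for l, r in zip(left, right)
    )

a = [np.array([1, 2]), np.array([3])]
assert series_of_arrays_equal(a, [np.array([1, 2]), np.array([3])])
assert not series_of_arrays_equal(a, [np.array([1, 2]), np.array([4])])
```

Comparing element by element with `np.array_equal` avoids forcing a multi-element boolean array into a single truth value, which is exactly what the older-pandas test path tripped over.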
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358286#comment-17358286 ] Adam Binford commented on SPARK-35564: -- Is that documented somewhere? I know Boolean expressions aren't guaranteed to short circuit, but I think most spark users would assume multiple when clauses would short circuit > Support subexpression elimination for non-common branches of conditional > expressions > > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-7 added support for pulling > subexpressions out of branches of conditional expressions for expressions > present in all branches. We should be able to take this a step further and > pull out subexpressions for any branch, as long as that expression will > definitely be evaluated at least once. > Consider a common data validation example: > {code:java} > from pyspark.sql.functions import * > df = spark.createDataFrame([['word'], ['1234']]) > col = regexp_replace('_1', r'\d', '') > df = df.withColumn('numbers_removed', when(length(col) > 0, col)){code} > We only want to keep the value if it's non-empty with numbers removed, > otherwise we want it to be null. > Because we have no otherwise value, `col` is not a candidate for > subexpression elimination (you can see two regular expression replacements in > the codegen). But whenever the length is greater than 0, we will have to > execute the regular expression replacement twice. Since we know we will > always calculate `col` at least once, it makes sense to consider that as a > subexpression since we might need it again in the branch value. 
So we can > update the logic from: > Create a subexpression if an expression will always be evaluated at least > twice > to: > Create a subexpression if an expression will always be evaluated at least > once AND will either always or conditionally be evaluated at least twice. > The trade-off is potentially another subexpression function call (for split > subexpressions) if the second evaluation doesn't happen, but this seems > worth it for the cases where it is evaluated a second time.
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358285#comment-17358285 ] L. C. Hsieh commented on SPARK-35564: - If you mean a common expression in the tail conditions (other than the first one), it is similar to the coalesce example above, since it assumes all conditions can be executed without problems. It is still a performance consideration here. > Support subexpression elimination for non-common branches of conditional expressions
[jira] [Assigned] (SPARK-35499) Apply black to pandas API on Spark codes.
[ https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh reassigned SPARK-35499: --- Assignee: Haejoon Lee > Apply black to pandas API on Spark codes. > > Key: SPARK-35499 > URL: https://issues.apache.org/jira/browse/SPARK-35499 > Project: Spark > Issue Type: Sub-task > Components: PySpark > Affects Versions: 3.2.0 > Reporter: Haejoon Lee > Assignee: Haejoon Lee > Priority: Major > > To make static analysis easier and more efficient, we should apply > `black` to the pandas API on Spark. > The Koalas project uses black in its [reformatting > script|https://github.com/databricks/koalas/blob/master/dev/reformat].
[jira] [Resolved] (SPARK-35499) Apply black to pandas API on Spark codes.
[ https://issues.apache.org/jira/browse/SPARK-35499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh resolved SPARK-35499. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32779 [https://github.com/apache/spark/pull/32779] > Apply black to pandas API on Spark codes.
[jira] [Commented] (SPARK-35659) Avoid write null to StateStore
[ https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358282#comment-17358282 ] Apache Spark commented on SPARK-35659: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32796 > Avoid write null to StateStore > > Key: SPARK-35659 > URL: https://issues.apache.org/jira/browse/SPARK-35659 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 3.0.2, 3.1.2, 3.2.0 > Reporter: L. C. Hsieh > Assignee: L. C. Hsieh > Priority: Major > > According to the {{get}} method doc in the StateStore API, it returns a non-null > row if the key exists, so we should avoid writing null to the StateStore. A > returned null row cannot be distinguished between a missing key and a stored > null value, and given the defined behavior of {{get}}, it is easy to cause an > NPE when a caller believes the key exists and does not expect a null.
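The ambiguity is the same one a plain dictionary has: {{get}} returns the same null for a missing key and for a stored null. A minimal sketch of the fix (a dict stands in for the state store; the put helper is illustrative, not the StateStore API):

```python
store = {}
store["seen"] = None  # writing null makes lookups ambiguous

# get() returns None for both a stored null and a missing key:
assert store.get("seen") is None and store.get("missing") is None

def put(store, key, value):
    # Avoid writing null: remove the key instead, so a None from get()
    # always means "key does not exist".
    if value is None:
        store.pop(key, None)
    else:
        store[key] = value

put(store, "seen", None)
assert "seen" not in store   # membership is meaningful again
put(store, "count", 3)
assert store.get("count") == 3
```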
[jira] [Commented] (SPARK-35659) Avoid write null to StateStore
[ https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358281#comment-17358281 ] Apache Spark commented on SPARK-35659: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32796 > Avoid write null to StateStore
[jira] [Assigned] (SPARK-35659) Avoid write null to StateStore
[ https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35659: Assignee: L. C. Hsieh (was: Apache Spark) > Avoid write null to StateStore
[jira] [Assigned] (SPARK-35659) Avoid write null to StateStore
[ https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35659: Assignee: Apache Spark (was: L. C. Hsieh) > Avoid write null to StateStore
[jira] [Updated] (SPARK-35659) Avoid write null to StateStore
[ https://issues.apache.org/jira/browse/SPARK-35659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35659: Description: According to {{get}} method doc in StateStore API, it returns non-null row if the key exists. So basically we should avoid write null to StateStore. You cannot distinguish if the returned null row is because the key doesn't exist, or the value is actually null. And due to the defined behavior of {{get}}, it is quite easy to cause NPE error if the caller doesn't expect to get a null if the caller believes the key exists. (was: According to {{get}} metho doc in StateStore API, it returns non-null row if the key exists. So basically we should avoid write null to StateStore. You cannot distinguish if the returned null row is because the key doesn't exist, or the value is actually null. And due to the defined behavior of {{get}}, it is quite easy to cause NPE error if the caller doesn't expect to get a null if the caller believes the key exists.) > Avoid write null to StateStore
[jira] [Created] (SPARK-35659) Avoid write null to StateStore
L. C. Hsieh created SPARK-35659: --- Summary: Avoid write null to StateStore Key: SPARK-35659 URL: https://issues.apache.org/jira/browse/SPARK-35659 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.1.2, 3.0.2, 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh According to the {{get}} method doc in the StateStore API, it returns a non-null row if the key exists. So basically we should avoid writing null to the StateStore. You cannot distinguish whether a returned null row means the key doesn't exist or the value is actually null. And due to the defined behavior of {{get}}, it is quite easy to cause an NPE error when the caller believes the key exists and doesn't expect a null.
[jira] [Comment Edited] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358275#comment-17358275 ] Adam Binford edited comment on SPARK-35564 at 6/6/21, 10:50 PM: No, the values are fine; it's the tail conditions that cause the issue. {code:java} spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code} Here myUdf($"id") gets pulled out as a subexpression even though it should never be evaluated. was (Author: kimahriman): No, the values are fine; it's the condition that causes the issue. {code:java} spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code} Here myUdf($"id") gets pulled out as a subexpression even though it should never be evaluated. > Support subexpression elimination for non-common branches of conditional expressions
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358275#comment-17358275 ] Adam Binford commented on SPARK-35564: -- No, the values are fine; it's the condition that causes the issue. {code:java} spark.range(2).select(when($"id" >= 0, lit(1)).when(myUdf($"id") > 0, lit(2)), when($"id" > -1, lit(1)).when(myUdf($"id") > 0, lit(2))).show(){code} Here myUdf($"id") gets pulled out as a subexpression even though it should never be evaluated. > Support subexpression elimination for non-common branches of conditional expressions
[jira] [Comment Edited] (SPARK-35657) createDataFrame fails while to_spark works.
[ https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358228#comment-17358228 ] Kevin Su edited comment on SPARK-35657 at 6/6/21, 10:33 PM: *spark.createDataFrame* doesn't support creating from databricks.koalas. It can only create a DataFrame from an RDD, a list, or a pandas.DataFrame. [https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html] was (Author: pingsutw): *spark.createDataFrame,* it doesn't support create from databricks.koalas. It only can create a DataFrame from an RDD, a list or a pandas.DataFrame. https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html > createDataFrame fails while to_spark works. > > Key: SPARK-35657 > URL: https://issues.apache.org/jira/browse/SPARK-35657 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 3.2.0 > Environment: * Apache Spark 3.2.0-SNAPSHOT (built from source) > * Python 3.8.10 > * OpenJDK 11.0 > * pandas 1.2.4 > * pyarrow 4.0.1 > Reporter: Yosi Pramajaya > Priority: Major > > Sample code: > {{kdf = ks.DataFrame({}} > {{ 'a': [1, 2, 3],}} > {{ 'b': [2., 3., 4.],}} > {{ 'c': ['string1', 'string2', 'string3'],}} > {{ 'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],}} > {{ 'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), > datetime(2000, 1, 3, 12, 0)]}} > {{ })}}{{df = kdf.to_spark() # WORKS}} > {{ df = spark.createDataFrame(kdf) # FAILED}} > Error: > {{TypeError: Can not infer schema for type: }}
[jira] [Commented] (SPARK-35657) createDataFrame fails while to_spark works.
[ https://issues.apache.org/jira/browse/SPARK-35657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358228#comment-17358228 ] Kevin Su commented on SPARK-35657: -- *spark.createDataFrame* doesn't support creating from databricks.koalas. It can only create a DataFrame from an RDD, a list, or a pandas.DataFrame. https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.SparkSession.createDataFrame.html > createDataFrame fails while to_spark works.
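The reported TypeError is what a schema-inference entry point raises when handed a container type it doesn't recognize. A hedged pure-Python sketch of the dispatch (create_dataframe, the to_spark hook, and FakeKoalasFrame are illustrative, not Spark's internals):

```python
def create_dataframe(data):
    # Accept the input types this entry point understands (a local list of
    # rows stands in here for an RDD / list / pandas.DataFrame)...
    if isinstance(data, (list, tuple)):
        return list(data)
    # ...or let Koalas-style objects convert themselves, mirroring the
    # kdf.to_spark() workaround from the report.
    to_spark = getattr(data, "to_spark", None)
    if callable(to_spark):
        return to_spark()
    raise TypeError(f"Can not infer schema for type: {type(data)}")

class FakeKoalasFrame:
    def to_spark(self):
        return ["row1", "row2"]

print(create_dataframe([("a", 1)]))         # built from a local collection
print(create_dataframe(FakeKoalasFrame()))  # converted via to_spark()
```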
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358215#comment-17358215 ] L. C. Hsieh commented on SPARK-35564: - Do you mean {{CaseWhen(($"id", myUdf($"id")) :: ($"id" + 1, myUdf($"id")) :: Nil, Some(myUdf($"id")))}}? {{myUdf($"id")}} always runs for all rows, no? > Support subexpression elimination for non-common branches of conditional expressions
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358210#comment-17358210 ] Adam Binford commented on SPARK-35564: -- You can construct a similar CaseWhen that could lead to a similar problem; the coalesce was just simpler to demonstrate. > Support subexpression elimination for non-common branches of conditional expressions
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358209#comment-17358209 ] L. C. Hsieh commented on SPARK-35564: - For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), coalesce($"id" + 1, myUdf($"id"))).show()}}, it looks like pulling out a subexpression that might not be executed for a row can be a performance issue, but not a bug. But unlike the else value in a when, coalesce is not a conditional expression; it assumes all its arguments can be executed without problems. > Support subexpression elimination for non-common branches of conditional expressions
[jira] [Comment Edited] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358209#comment-17358209 ] L. C. Hsieh edited comment on SPARK-35564 at 6/6/21, 7:49 PM: -- For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), coalesce($"id" + 1, myUdf($"id"))).show()}}, looks like it can possibly be performance issue by pulling a subexpr that might not be executed for a row but not a bug. Different to elsevalue in when, coalesce is not a condition expression, it supposes all arguments can be executed without problem. was (Author: viirya): For the case {{spark.range(2).select(coalesce($"id", myUdf($"id")), coalesce($"id" + 1, myUdf($"id"))).show()}}, looks like it can possibly be performance issue by pulling a subexpr that might not be executed for a row but not a bug. But different to elsevalue in when, coalesce is not a condition expression, it supposes all arguments can be executed without problem. > Support subexpression elimination for non-common branches of conditional expressions
> Because we have no otherwise value, `col` is not a candidate for > subexpression elimination (you can see two regular expression replacements in > the codegen). But whenever the length is greater than 0, we will have to > execute the regular expression replacement twice. Since we know we will > always calculate `col` at least once, it makes sense to consider that as a > subexpression since we might need it again in the branch value. So we can > update the logic from: > Create a subexpression if an expression will always be evaluated at least > twice > To: > Create a subexpression if an expression will always be evaluated at least > once AND will either always or conditionally be evaluated at least twice. > The trade off is potentially another subexpression function call (for split > subexpressions) if the second evaluation doesn't happen, but this seems like > it would be worth it for when it is evaluated the second time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
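The rule change proposed in the issue above can be illustrated outside of Spark. The following is a minimal pure-Python sketch (all function names are invented for illustration, and `numbers_removed` is a simplified stand-in for the `regexp_replace` call, not Spark's codegen): it counts how many times the shared expression is evaluated with and without the proposed elimination.

```python
def numbers_removed(s, counter):
    """Stand-in for regexp_replace('_1', r'\\d', ''); counts evaluations."""
    counter["evals"] += 1
    return "".join(ch for ch in s if not ch.isdigit())

def without_elimination(s, counter):
    # Current behavior: the condition and the branch value each
    # re-evaluate the expression, so the happy path costs two evaluations.
    if len(numbers_removed(s, counter)) > 0:
        return numbers_removed(s, counter)
    return None

def with_elimination(s, counter):
    # Proposed rule: the expression is always evaluated at least once
    # (in the condition), so cache it and reuse it in the branch value.
    col = numbers_removed(s, counter)
    return col if len(col) > 0 else None

c1, c2 = {"evals": 0}, {"evals": 0}
assert without_elimination("word", c1) == "word" and c1["evals"] == 2
assert with_elimination("word", c2) == "word" and c2["evals"] == 1
```

For the `'1234'` row both variants evaluate the expression exactly once, which is the trade-off the issue mentions: the caching only pays off on rows where the second evaluation actually happens.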
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358206#comment-17358206 ] Adam Binford commented on SPARK-35564: -- Yes, that was an example of "will run at least once and maybe more than once" that I'm proposing to add more support for in this issue. An example of current behavior that would be considered a bug is: {code:java} spark.range(2).select(coalesce($"id", myUdf($"id")), coalesce($"id" + 1, myUdf($"id"))).show() {code} myUdf will be pulled out into a subexpression even though it is never executed. > Support subexpression elimination for non-common branches of conditional > expressions > > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Priority: Major >
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358170#comment-17358170 ] L. C. Hsieh commented on SPARK-35564: - {{select(myUdf($"id"), coalesce($"id", myUdf($"id")))}} => Doesn't {{myUdf($"id")}} always run at least once? > Support subexpression elimination for non-common branches of conditional > expressions > > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Priority: Major >
[jira] [Resolved] (SPARK-35654) Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop
[ https://issues.apache.org/jira/browse/SPARK-35654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35654. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32784 [https://github.com/apache/spark/pull/32784] > Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop > -- > > Key: SPARK-35654 > URL: https://issues.apache.org/jira/browse/SPARK-35654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > >
[jira] [Assigned] (SPARK-35654) Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop
[ https://issues.apache.org/jira/browse/SPARK-35654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-35654: - Assignee: Dongjoon Hyun > Allow ShuffleDataIO control DiskBlockManager.deleteFilesOnStop > -- > > Key: SPARK-35654 > URL: https://issues.apache.org/jira/browse/SPARK-35654 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major >
[jira] [Comment Edited] (SPARK-31241) Support Hive on DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358148#comment-17358148 ] Dabao edited comment on SPARK-31241 at 6/6/21, 3:33 PM: Hi, [~Jackey Lee] We're now working on a project that uses DataSourceV2 to provide multi-source support. Is there any new progress on this issue? Could you share a doc for the current design, so that we can discuss and improve it in detail? > Support Hive on DataSourceV2 > > > Key: SPARK-31241 > URL: https://issues.apache.org/jira/browse/SPARK-31241 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Priority: Major > > There are 3 reasons why we need to support Hive on DataSourceV2. > 1. Hive itself is one of Spark's data sources. > 2. HiveTable is essentially a FileTable with its own input and output > formats, it works fine with FileTable. > 3. HiveTable should be stateless, and users can freely read or write Hive > using batch or microbatch.
[jira] [Commented] (SPARK-31241) Support Hive on DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-31241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358148#comment-17358148 ] Dabao commented on SPARK-31241: --- Hi, Jacky We're now working on a project that uses DataSourceV2 to provide multi-source support. Is there any new progress on this issue? Could you share a doc for the current design, so that we can discuss and improve it in detail? > Support Hive on DataSourceV2 > > > Key: SPARK-31241 > URL: https://issues.apache.org/jira/browse/SPARK-31241 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jackey Lee >Priority: Major >
[jira] [Commented] (SPARK-35564) Support subexpression elimination for non-common branches of conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-35564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358117#comment-17358117 ] Adam Binford commented on SPARK-35564: -- It turns out this is already happening for certain when and coalesce expressions. For example: {code:java} spark.range(2).select(myUdf($"id"), coalesce($"id", myUdf($"id"))) {code} myUdf gets pulled out as a subexpression even though it might only be executed once per row. This can be a correctness issue for very specific edge cases, similar to https://issues.apache.org/jira/browse/SPARK-35449, where myUdf could get executed for a row even though it doesn't pass certain conditional checks. > Support subexpression elimination for non-common branches of conditional > expressions > > > Key: SPARK-35564 > URL: https://issues.apache.org/jira/browse/SPARK-35564 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: Adam Binford >Priority: Major >
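The correctness hazard discussed in the comments above can be sketched in plain Python (hypothetical names; this models the codegen behavior rather than using Spark itself): a guarded expression such as when($"x" =!= 0, divUdf($"x")) must not evaluate the UDF for rows that fail the guard, but hoisting the UDF call out as a common subexpression does exactly that.

```python
def div_udf(x):
    # Stand-in for a UDF that is only safe when its guard held (x != 0).
    return 100 // x

def guarded(x):
    # Intended semantics: the UDF runs only for rows passing the condition.
    return div_udf(x) if x != 0 else None

def guarded_hoisted(x):
    # Buggy elimination: the UDF call is pulled out as a subexpression
    # and evaluated before the guard is checked.
    sub = div_udf(x)
    return sub if x != 0 else None

assert guarded(0) is None  # guard short-circuits; no error
try:
    guarded_hoisted(0)     # hoisted form divides by zero
    hoisted_failed = False
except ZeroDivisionError:
    hoisted_failed = True
assert hoisted_failed
```

For rows that pass the guard, both forms agree; the difference only shows up on rows the guard was supposed to protect, which is why the issue calls it a very specific edge case.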
[jira] [Created] (SPARK-35658) Document Parquet encryption feature in Spark
Gidon Gershinsky created SPARK-35658: Summary: Document Parquet encryption feature in Spark Key: SPARK-35658 URL: https://issues.apache.org/jira/browse/SPARK-35658 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.2.0 Reporter: Gidon Gershinsky Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the encryption feature that can be called from Spark SQL. The aim of this Jira is to document the use of Parquet encryption in Spark.
[jira] [Commented] (SPARK-35588) Merge Binder integration and quickstart notebook
[ https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358035#comment-17358035 ] Apache Spark commented on SPARK-35588: -- User 'yos1p' has created a pull request for this issue: https://github.com/apache/spark/pull/32795 > Merge Binder integration and quickstart notebook > > > Key: SPARK-35588 > URL: https://issues.apache.org/jira/browse/SPARK-35588 > Project: Spark > Issue Type: Sub-task > Components: docs, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > We should merge: > https://github.com/apache/spark/blob/master/python/docs/source/getting_started/quickstart.ipynb > https://github.com/databricks/koalas/blob/master/docs/source/getting_started/10min.ipynb
[jira] [Assigned] (SPARK-35588) Merge Binder integration and quickstart notebook
[ https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35588: Assignee: Apache Spark > Merge Binder integration and quickstart notebook > > > Key: SPARK-35588 > URL: https://issues.apache.org/jira/browse/SPARK-35588 > Project: Spark > Issue Type: Sub-task > Components: docs, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-35588) Merge Binder integration and quickstart notebook
[ https://issues.apache.org/jira/browse/SPARK-35588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35588: Assignee: (was: Apache Spark) > Merge Binder integration and quickstart notebook > > > Key: SPARK-35588 > URL: https://issues.apache.org/jira/browse/SPARK-35588 > Project: Spark > Issue Type: Sub-task > Components: docs, PySpark >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major >
[jira] [Created] (SPARK-35657) createDataFrame fails while to_spark works.
Yosi Pramajaya created SPARK-35657: -- Summary: createDataFrame fails while to_spark works. Key: SPARK-35657 URL: https://issues.apache.org/jira/browse/SPARK-35657 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Environment: * Apache Spark 3.2.0-SNAPSHOT (built from source) * Python 3.8.10 * OpenJDK 11.0 * pandas 1.2.4 * pyarrow 4.0.1 Reporter: Yosi Pramajaya Sample code:
{code:python}
kdf = ks.DataFrame({
    'a': [1, 2, 3],
    'b': [2., 3., 4.],
    'c': ['string1', 'string2', 'string3'],
    'd': [date(2000, 1, 1), date(2000, 2, 1), date(2000, 3, 1)],
    'e': [datetime(2000, 1, 1, 12, 0), datetime(2000, 1, 2, 12, 0), datetime(2000, 1, 3, 12, 0)]
})

df = kdf.to_spark()              # WORKS
df = spark.createDataFrame(kdf)  # FAILED
{code}
Error: {{TypeError: Can not infer schema for type: }}