[jira] [Commented] (SPARK-34762) Many PR's Scala 2.13 build action failed

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303892#comment-17303892
 ] 

Apache Spark commented on SPARK-34762:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31880

> Many PR's Scala 2.13 build action failed
> 
>
> Key: SPARK-34762
> URL: https://issues.apache.org/jira/browse/SPARK-34762
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> PRs with Scala 2.13 build failures include:
>  * [https://github.com/apache/spark/pull/31849]
>  * [https://github.com/apache/spark/pull/31848]
>  * [https://github.com/apache/spark/pull/31844]
>  * [https://github.com/apache/spark/pull/31843]
>  * https://github.com/apache/spark/pull/31841
> {code:java}
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:26:1:  error: package org.apache.commons.cli does not exist
> [error] import org.apache.commons.cli.GnuParser;
> [error]  ^
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:176:1:  error: cannot find symbol
> [error] private final Options options = new Options();
> [error]   ^  symbol:   class Options
> [error]   location: class ServerOptionsProcessor
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:177:1:  error: package org.apache.commons.cli does not exist
> [error] private org.apache.commons.cli.CommandLine commandLine;
> [error]   ^
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:255:1:  error: cannot find symbol
> [error] HelpOptionExecutor(String serverName, Options options) {
> [error]   ^  symbol:   class Options
> [error]   location: class HelpOptionExecutor
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:176:1:  error: cannot find symbol
> [error] private final Options options = new Options();
> [error] ^  symbol:   class Options
> [error]   location: class ServerOptionsProcessor
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:185:1:  error: cannot find symbol
> [error]   options.addOption(OptionBuilder
> [error] ^  symbol:   variable OptionBuilder
> [error]   location: class ServerOptionsProcessor
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:192:1:  error: cannot find symbol
> [error]   options.addOption(new Option("H", "help", false, "Print help information"));
> [error] ^  symbol:   class Option
> [error]   location: class ServerOptionsProcessor
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:197:1:  error: cannot find symbol
> [error] commandLine = new GnuParser().parse(options, argv);
> [error]   ^  symbol:   class GnuParser
> [error]   location: class ServerOptionsProcessor
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:211:1:  error: cannot find symbol
> [error]   } catch (ParseException e) {
> [error]    ^  symbol:   class ParseException
> [error]   location: class ServerOptionsProcessor
> [error] /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:262:1:  error: cannot find symbol
> [error]   new HelpFormatter().printHelp(serverName, options);
> [error]   ^  symbol:   class HelpFormatter
> [error]   location: class HelpOptionExecutor
> [error] Note: Some input files use or override a deprecated API.
> [error] Note: Recompile with -Xlint:deprecation for details.
> [error] 16 errors
> {code}
>  




[jira] [Commented] (SPARK-34766) Do not capture maven config for views

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303893#comment-17303893
 ] 

Apache Spark commented on SPARK-34766:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/31879

> Do not capture maven config for views
> -
>
> Key: SPARK-34766
> URL: https://issues.apache.org/jira/browse/SPARK-34766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> Due to an unreliable network, we always use a third-party Maven repository to run tests, e.g.:
> {code:java}
> build/sbt "test:testOnly *SQLQueryTestSuite" 
> -Dspark.sql.maven.additionalRemoteRepositories=x
> {code}
>  
> The test then fails with an error message like:
> ```
> [info] - show-tblproperties.sql *** FAILED *** (128 milliseconds)
> [info] show-tblproperties.sql
> [info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][
> [info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories x]" 
> Result did not match for query #6
> [info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464)
> ```
> It is not necessary to capture the Maven config in the view, since it is a session-level config.
>  
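For illustration, a minimal sketch (hypothetical view name and repository URL, not the actual SQLQueryTestSuite test) of how a session-level config can leak into a view's stored SQL configs:

{code:scala}
// Hypothetical repro sketch: the session-level Maven config should not be captured
// as a view property. Before the fix, SHOW TBLPROPERTIES may list a
// 'view.sqlConfig.spark.sql.maven.additionalRemoteRepositories' entry.
spark.conf.set("spark.sql.maven.additionalRemoteRepositories", "https://example.org/repo")
spark.sql("CREATE OR REPLACE VIEW v AS SELECT 1 AS id")
spark.sql("SHOW TBLPROPERTIES v").show(truncate = false)
{code}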






[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303895#comment-17303895
 ] 

Apache Spark commented on SPARK-34087:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/31881

> a memory leak occurs when we clone the spark session
> 
>
> Key: SPARK-34087
> URL: https://issues.apache.org/jira/browse/SPARK-34087
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Fu Chen
>Priority: Major
> Attachments: 1610451044690.jpg
>
>
> In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, because a new ExecutionListenerBus instance is added to the AsyncEventQueue every time a session is cloned.
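A schematic sketch of the leak pattern described above (simplified names, not Spark's actual classes): each clone registers one more listener on the long-lived shared queue and never removes it.

{code:scala}
// Schematic only: stands in for ExecutionListenerBus instances accumulating on the
// shared AsyncEventQueue across cloned sessions.
import scala.collection.mutable.ArrayBuffer

class SharedQueue { val listeners = ArrayBuffer.empty[AnyRef] }

class Session(queue: SharedQueue) {
  queue.listeners += new AnyRef          // one more listener per session, never removed
  def cloneSession(): Session = new Session(queue)
}

val queue = new SharedQueue
var s = new Session(queue)
(1 to 1000).foreach(_ => s = s.cloneSession())
println(queue.listeners.size)            // 1001: grows with every clone, i.e. the leak
{code}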






[jira] [Resolved] (SPARK-32061) potential regression if use memoryUsage instead of numRows

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-32061.
--
Resolution: Resolved

> potential regression if use memoryUsage instead of numRows
> --
>
> Key: SPARK-32061
> URL: https://issues.apache.org/jira/browse/SPARK-32061
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> 1. `memoryUsage` may be improperly set, for example too small to store a single instance;
> 2. The blockified GMM reuses two matrices whose shapes are tied to the current blockSize:
> {code:java}
> @transient private lazy val auxiliaryProbMat = DenseMatrix.zeros(blockSize, k)
> @transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, 
> numFeatures) {code}
> When implementing the blockified GMM, I found that if I do not pre-allocate those matrices, there is a serious regression (maybe 3~4x slower; I forgot the exact numbers);
> 3. In MLP, three pre-allocated objects are also related to numRows:
> {code:java}
> if (ones == null || ones.length != delta.cols) ones = 
> BDV.ones[Double](delta.cols)
> // TODO: allocate outputs as one big array and then create BDMs from it
> if (outputs == null || outputs(0).cols != currentBatchSize) {
> ...
> // TODO: allocate deltas as one big array and then create BDMs from it
> if (deltas == null || deltas(0).cols != currentBatchSize) {
>   deltas = new Array[BDM[Double]](layerModels.length)
> ... {code}
> I am not very familiar with the implementation of MLP and could not find related documentation about this pre-allocation. But I guess there may be a regression if we disable this pre-allocation, since those objects look relatively big.
>  
>  






[jira] [Created] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true (the default value). Error: Parquet column cannot be converted in file.

2021-03-18 Thread jobit mathew (Jira)
jobit mathew created SPARK-34785:


 Summary: SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true (the default value). Error: Parquet column cannot be converted in file.
 Key: SPARK-34785
 URL: https://issues.apache.org/jira/browse/SPARK-34785
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: jobit mathew


The SPARK-34212 issue is not fixed when spark.sql.parquet.enableVectorizedReader=true, which is the default value.

If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, but it reduces performance.

In Hive, 
{code:java}
create table test_decimal(amt decimal(18,2)) stored as parquet; 
insert into test_decimal select 100;
alter table test_decimal change amt amt decimal(19,3);
{code}
In Spark,
{code:java}
select * from test_decimal;
{code}
{code:java}
+---------+
|   amt   |
+---------+
| 100.000 |
+---------+
{code}
But if spark.sql.parquet.enableVectorizedReader=true, the following error occurs:
{code:java}
: jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
going to print operations logs
printed operations logs
going to print operations logs
printed operations logs
Getting log thread is interrupted, since query is done!
Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 
4) (vm2 executor 2): org.apache.spark.sql.execution.QueryExecutionException: 
Parquet column cannot be converted in file 
hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: 
org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:312)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181)
at 
org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
... 20 more

Driver stacktrace:
at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteS
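A minimal workaround sketch, grounded in the observation above that the non-vectorized reader passes (at the cost of scan performance): disable the vectorized Parquet reader for the affected query.

{code:scala}
// Workaround sketch only: fall back to the non-vectorized Parquet reader so the
// widened decimal(19,3) column can be read without SchemaColumnConvertNotSupportedException.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM test_decimal").show()
{code}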

[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true (the default value). Error: Parquet column cannot be converted in file.

2021-03-18 Thread jobit mathew (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303904#comment-17303904
 ] 

jobit mathew commented on SPARK-34785:
--

[~dongjoon] can you check it once.

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true (the default value). Error: Parquet column cannot be converted in file.
> --
>
> Key: SPARK-34785
> URL: https://issues.apache.org/jira/browse/SPARK-34785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Major
>
> The SPARK-34212 issue is not fixed when spark.sql.parquet.enableVectorizedReader=true, which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, but it reduces performance.
> In Hive, 
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet; 
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark,
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> +---------+
> |   amt   |
> +---------+
> | 100.000 |
> +---------+
> {code}
> But if spark.sql.parquet.enableVectorizedReader=true, the following error occurs:
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4) (vm2 executor 2): 
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
> Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:312)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
> at 
> org.apa

[jira] [Resolved] (SPARK-34741) MergeIntoTable should avoid ambiguous reference

2021-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34741.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31835
[https://github.com/apache/spark/pull/31835]

> MergeIntoTable should avoid ambiguous reference
> ---
>
> Key: SPARK-34741
> URL: https://issues.apache.org/jira/browse/SPARK-34741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.2.0
>
>
> When resolving the {{UpdateAction}}, which could reference attributes from 
> both target and source tables, Spark should know clearly where the attribute 
> comes from when there're conflicting attributes instead of picking up a 
> random one.






[jira] [Assigned] (SPARK-34741) MergeIntoTable should avoid ambiguous reference

2021-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-34741:
---

Assignee: wuyi

> MergeIntoTable should avoid ambiguous reference
> ---
>
> Key: SPARK-34741
> URL: https://issues.apache.org/jira/browse/SPARK-34741
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> When resolving the {{UpdateAction}}, which could reference attributes from 
> both target and source tables, Spark should know clearly where the attribute 
> comes from when there're conflicting attributes instead of picking up a 
> random one.






[jira] [Updated] (SPARK-34771) Support UDT for Pandas with Arrow Optimization

2021-03-18 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34771:
---
Description: 
{code:python}
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.testing.sqlutils  import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
2)])})
df = spark.createDataFrame(pdf)
df.toPandas()
{code}

With `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without warnings.

With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, but with warnings: because of the unsupported type in the conversion, the Arrow optimization is actually turned off.


Detailed steps to reproduce:

{code:python}
$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
  /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17 23:13:31 WARN SQLConf: The SQL config 
'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may 
be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' 
instead of it.
>>> from pyspark.testing.sqlutils  import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
>>> 2)])})
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: 
UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python 
value type when inferring an Arrow data type
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
>>>
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+

>>> df.schema
StructType(List(StructField(point,ExamplePointUDT,true)))
>>> df.toPandas()
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: 
UserWarning: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Unsupported type in conversion to Arrow: ExamplePointUDT
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
   point
0  (0.0,0.0)
1  (0.0,0.0)

{code}


Added a traceback:
{code:python}
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:329: 
UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python 
value type when inferring an Arrow data type
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", 
line 319, in createDataFrame
return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", 
line 451, in _create_from_pandas_with_arrow
arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
  File "pyarrow/types.pxi", line 1317, in pyarrow.lib.Schema.from_pandas
  File 
"/Users/da/opt/miniconda3/envs/spark/lib/python3.8/site-packages/pyarrow/pandas_compat.py",
 line 529, in dataframe_to_types
type_ = pa.array(c, from_pandas=True).type
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert (1,1) with type ExamplePoint: did 
not recognize Python value type when inferring an Arrow data type
{code}





  was:
{code:python}
spark.conf.set("spark.sql.execution.a

[jira] [Updated] (SPARK-34771) Support UDT for Pandas with Arrow Optimization

2021-03-18 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34771:
---
Description: 
{code:python}
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.testing.sqlutils  import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
2)])})
df = spark.createDataFrame(pdf)
df.toPandas()
{code}

With `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without warnings.

With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, but with warnings: because of the unsupported type in the conversion, the Arrow optimization is actually turned off.


Detailed steps to reproduce:

{code:python}
$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
  /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17 23:13:31 WARN SQLConf: The SQL config 
'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may 
be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' 
instead of it.
>>> from pyspark.testing.sqlutils  import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
>>> 2)])})
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: 
UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python 
value type when inferring an Arrow data type
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
>>>
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+

>>> df.schema
StructType(List(StructField(point,ExamplePointUDT,true)))
>>> df.toPandas()
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: 
UserWarning: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Unsupported type in conversion to Arrow: ExamplePointUDT
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
   point
0  (0.0,0.0)
1  (0.0,0.0)

{code}


Added a traceback after the display of the warning message:
{code:python}
import traceback
traceback.print_exc() 
{code}

{code:python}
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:329: 
UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python 
value type when inferring an Arrow data type
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
Traceback (most recent call last):
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", 
line 319, in createDataFrame
return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", 
line 451, in _create_from_pandas_with_arrow
arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
  File "pyarrow/types.pxi", line 1317, in pyarrow.lib.Schema.from_pandas
  File 
"/Users/da/opt/miniconda3/envs/spark/lib/python3.8/site-packages/pyarrow/pandas_compat.py",
 line 529, in dataframe_to_types
type_ = pa.array(c, from_pandas=True).type
  File "pyarrow/array.pxi", line 292, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert (1,1) with type ExamplePoint

[jira] [Updated] (SPARK-34771) Support UDT for Pandas with Arrow Optimization

2021-03-18 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34771:
---
Description: 
{code:python}
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.testing.sqlutils  import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
2)])})
df = spark.createDataFrame(pdf)
df.toPandas()
{code}

With `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without warnings.

With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, but with warnings: because of the unsupported type in the conversion, the Arrow optimization is actually turned off.


Detailed steps to reproduce:

{code:python}
$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
  /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17 23:13:31 WARN SQLConf: The SQL config 
'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may 
be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' 
instead of it.
>>> from pyspark.testing.sqlutils  import ExamplePoint
>>> import pandas as pd
>>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
>>> 2)])})
>>> df = spark.createDataFrame(pdf)
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: 
UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Could not convert (1,1) with type ExamplePoint: did not recognize Python 
value type when inferring an Arrow data type
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
>>>
>>> df.show()
+----------+
|     point|
+----------+
|(0.0, 0.0)|
|(0.0, 0.0)|
+----------+

>>> df.schema
StructType(List(StructField(point,ExamplePointUDT,true)))
>>> df.toPandas()
/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: 
UserWarning: toPandas attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Unsupported type in conversion to Arrow: ExamplePointUDT
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
   point
0  (0.0,0.0)
1  (0.0,0.0)

{code}






  was:
{code:python}
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
from pyspark.testing.sqlutils  import ExamplePoint
import pandas as pd
pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 
2)])})
df = spark.createDataFrame(pdf)
df.toPandas()
{code}

With `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without warnings.

With `spark.sql.execution.arrow.enabled` = true, the above snippet still works, but with warnings: because of the unsupported type in the conversion, the Arrow optimization is actually turned off.


Detailed steps to reproduce:

{code:python}
$ bin/pyspark
Python 3.8.8 (default, Feb 24 2021, 13:46:16)
[Clang 10.0.0 ] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.2.0-SNAPSHOT
  /_/

Using Python version 3.8.8 (default, Feb 24 2021 13:46:16)
Spark context Web UI available at http://172.30.0.226:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1615994008526).
SparkSession available as 'spark'.
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
21/03/17

[jira] [Created] (SPARK-34786) read parquet uint64 as decimal

2021-03-18 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-34786:
---

 Summary: read parquet uint64 as decimal
 Key: SPARK-34786
 URL: https://issues.apache.org/jira/browse/SPARK-34786
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan


Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
type. We can read uint64 as decimal as a workaround.
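A small illustration of the proposed mapping (a sketch, not the final design): the full unsigned 64-bit range fits in a 20-digit decimal, so DecimalType(20, 0) is a natural target type.

{code:scala}
import org.apache.spark.sql.types.{DataType, DecimalType}

// Sketch: max uint64 = 18446744073709551615 (20 digits), which does not fit in LongType
// but does fit in DecimalType(20, 0).
val uint64ReadType: DataType = DecimalType(20, 0)
{code}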






[jira] [Resolved] (SPARK-34774) The `change-scala-version.sh` script does not replace the scala.version property correctly

2021-03-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34774.
--
Fix Version/s: 3.0.3
   3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 31865
[https://github.com/apache/spark/pull/31865]

> The `change-scala-version.sh` script does not replace the scala.version property correctly
> ---
>
> Key: SPARK-34774
> URL: https://issues.apache.org/jira/browse/SPARK-34774
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Execute the following commands in order:
>  # dev/change-scala-version.sh 2.13
>  # dev/change-scala-version.sh 2.12
>  # git status
> The following git diff is then generated:
> {code:java}
> diff --git a/pom.xml b/pom.xml
> index ddc4ce2f68..f43d8c8f78 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -162,7 +162,7 @@
>      3.4.1
>      
>      3.2.2
> -    2.12.10
> +    2.13.5
>      2.12
>      2.0.0
>      --test
> {code}
> It seems the 'scala.version' property was not replaced correctly.






[jira] [Assigned] (SPARK-34774) The `change-scala-version.sh` script does not replace the scala.version property correctly

2021-03-18 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34774:


Assignee: Yang Jie

> The `change-scala-version.sh` script does not replace the scala.version property correctly
> ---
>
> Key: SPARK-34774
> URL: https://issues.apache.org/jira/browse/SPARK-34774
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Execute the following commands in order:
>  # dev/change-scala-version.sh 2.13
>  # dev/change-scala-version.sh 2.12
>  # git status
> The following git diff is then generated:
> {code:java}
> diff --git a/pom.xml b/pom.xml
> index ddc4ce2f68..f43d8c8f78 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -162,7 +162,7 @@
>      3.4.1
>      
>      3.2.2
> -    2.12.10
> +    2.13.5
>      2.12
>      2.0.0
>      --test
> {code}
> It seems the 'scala.version' property was not replaced correctly.






[jira] [Updated] (SPARK-34766) Do not capture maven config for views

2021-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-34766:

Fix Version/s: 3.1.2

> Do not capture maven config for views
> -
>
> Key: SPARK-34766
> URL: https://issues.apache.org/jira/browse/SPARK-34766
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0, 3.1.2
>
>
> Due to an unreliable network, we always use a third-party Maven repository to run tests, e.g.:
> {code:java}
> build/sbt "test:testOnly *SQLQueryTestSuite" 
> -Dspark.sql.maven.additionalRemoteRepositories=x
> {code}
>  
> The test then fails with an error message like:
> ```
> [info] - show-tblproperties.sql *** FAILED *** (128 milliseconds)
> [info] show-tblproperties.sql
> [info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][
> [info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories x]" 
> Result did not match for query #6
> [info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464)
> ```
> It is not necessary to capture the Maven config in the view, since it is a session-level config.
>  






[jira] [Commented] (SPARK-34786) read parquet uint64 as decimal

2021-03-18 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304136#comment-17304136
 ] 

Kent Yao commented on SPARK-34786:
--

I'd like to do this.

> read parquet uint64 as decimal
> --
>
> Key: SPARK-34786
> URL: https://issues.apache.org/jira/browse/SPARK-34786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>
> Currently Spark can't read parquet uint64 as it doesn't fit the Spark long 
> type. We can read uint64 as decimal as a workaround.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

2021-03-18 Thread akiyamaneko (Jira)
akiyamaneko created SPARK-34787:
---

 Summary: Option variable in Spark historyServer log should be 
displayed as actual value instead of Some(XX)
 Key: SPARK-34787
 URL: https://issues.apache.org/jira/browse/SPARK-34787
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.1
Reporter: akiyamaneko


Option variable in Spark historyServer log should be displayed as actual value 
instead of Some(XX):
{code:html}
21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt 
application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO 
FsHistoryProvider: Parsing 
hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421
 for listing data...
{code}
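A minimal Scala illustration of the difference (hypothetical logging statements, not the actual ApplicationCache/FsHistoryProvider code):

{code:scala}
val appId = "application_1613641231234_0421"
val attemptId: Option[String] = Some("1")

// Interpolating the Option directly prints the wrapper:
println(s"Failed to load application attempt $appId/$attemptId")                      // .../Some(1)
// Unwrapping it prints the actual value:
println(s"Failed to load application attempt $appId/${attemptId.getOrElse("none")}")  // .../1
{code}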






[jira] [Commented] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304145#comment-17304145
 ] 

Apache Spark commented on SPARK-34787:
--

User 'kyoty' has created a pull request for this issue:
https://github.com/apache/spark/pull/31882

> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX)
> --
>
> Key: SPARK-34787
> URL: https://issues.apache.org/jira/browse/SPARK-34787
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Minor
>
> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX):
> {code:html}
> 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt 
> application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO 
> FsHistoryProvider: Parsing 
> hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421
>  for listing data...
> {code}






[jira] [Assigned] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34787:


Assignee: Apache Spark

> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX)
> --
>
> Key: SPARK-34787
> URL: https://issues.apache.org/jira/browse/SPARK-34787
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: Apache Spark
>Priority: Minor
>
> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX):
> {code:html}
> 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt 
> application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO 
> FsHistoryProvider: Parsing 
> hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421
>  for listing data...
> {code}






[jira] [Assigned] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34787:


Assignee: (was: Apache Spark)

> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX)
> --
>
> Key: SPARK-34787
> URL: https://issues.apache.org/jira/browse/SPARK-34787
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Priority: Minor
>
> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX):
> {code:html}
> 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt 
> application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO 
> FsHistoryProvider: Parsing 
> hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421
>  for listing data...
> {code}






[jira] [Created] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread wuyi (Jira)
wuyi created SPARK-34788:


 Summary: Spark throws FileNotFoundException instead of IOException 
when disk is full
 Key: SPARK-34788
 URL: https://issues.apache.org/jira/browse/SPARK-34788
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0, 3.0.0, 2.4.7
Reporter: wuyi


When the disk is full, Spark throws a FileNotFoundException instead of an IOException with a disk-full hint. This is quite a confusing error for users.

 

And there is probably a way to detect a full disk: when we get a `FileNotFoundException`, we try the approach in [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] to see if a SyncFailedException is thrown. If it is, we throw an IOException with a disk-full hint.
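A hedged sketch of that detection idea (not Spark's actual code; the helper names are made up): probe the target directory with a tiny write plus fsync after catching the FileNotFoundException, and rethrow with a disk-full hint if the probe also fails.

{code:scala}
import java.io.{File, FileNotFoundException, FileOutputStream, IOException}

// Returns true if a tiny write + fsync in `dir` fails, which suggests the disk is full
// (SyncFailedException is an IOException, so it is covered by the catch below).
def diskMayBeFull(dir: File): Boolean = {
  try {
    val probe = File.createTempFile("disk-full-probe", ".tmp", dir)
    try {
      val out = new FileOutputStream(probe)
      try { out.write(1); out.getFD.sync(); false } finally out.close()
    } finally probe.delete()
  } catch {
    case _: IOException => true
  }
}

// Wraps a write; converts a bare FileNotFoundException into an IOException with a hint.
def withDiskFullHint[T](dir: File)(block: => T): T = {
  try block
  catch {
    case e: FileNotFoundException if diskMayBeFull(dir) =>
      throw new IOException(s"Failed to write under $dir; the disk may be full", e)
  }
}
{code}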






[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304185#comment-17304185
 ] 

wuyi commented on SPARK-34788:
--

[~wuxiaofei] Could you take a look at this?

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws a FileNotFoundException instead of an IOException with a disk-full hint. This is quite a confusing error for users.
>  
> And there is probably a way to detect a full disk: when we get a `FileNotFoundException`, we try the approach in [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] to see if a SyncFailedException is thrown. If it is, we throw an IOException with a disk-full hint.






[jira] [Updated] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-34788:
-
Component/s: Shuffle

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws a FileNotFoundException instead of an IOException with a disk-full hint. This is quite a confusing error for users.
>  
> And there is probably a way to detect a full disk: when we get a `FileNotFoundException`, we try the approach in [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] to see if a SyncFailedException is thrown. If it is, we throw an IOException with a disk-full hint.






[jira] [Assigned] (SPARK-34760) run JavaSQLDataSourceExample failed with Exception in runBasicDataSourceExample().

2021-03-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-34760:


Assignee: zengrui

> run JavaSQLDataSourceExample failed with Exception in 
> runBasicDataSourceExample().
> --
>
> Key: SPARK-34760
> URL: https://issues.apache.org/jira/browse/SPARK-34760
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 3.0.1, 3.1.1
>Reporter: zengrui
>Assignee: zengrui
>Priority: Minor
>
> Running JavaSQLDataSourceExample fails with an exception in runBasicDataSourceExample().
> When executing 'peopleDF.write().partitionBy("favorite_color").bucketBy(42,"name").saveAsTable("people_partitioned_bucketed");'
> it throws: 'Exception in thread "main" org.apache.spark.sql.AnalysisException: partition column favorite_color is not defined in table people_partitioned_bucketed, defined table columns are: age, name;'






[jira] [Resolved] (SPARK-34760) run JavaSQLDataSourceExample failed with Exception in runBasicDataSourceExample().

2021-03-18 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-34760.
--
Fix Version/s: 3.0.3
   3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 31851
[https://github.com/apache/spark/pull/31851]

> run JavaSQLDataSourceExample failed with Exception in 
> runBasicDataSourceExample().
> --
>
> Key: SPARK-34760
> URL: https://issues.apache.org/jira/browse/SPARK-34760
> Project: Spark
>  Issue Type: Bug
>  Components: Examples
>Affects Versions: 3.0.1, 3.1.1
>Reporter: zengrui
>Assignee: zengrui
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Running JavaSQLDataSourceExample fails with an exception in runBasicDataSourceExample().
> When executing 'peopleDF.write().partitionBy("favorite_color").bucketBy(42,"name").saveAsTable("people_partitioned_bucketed");'
> it throws: 'Exception in thread "main" org.apache.spark.sql.AnalysisException: partition column favorite_color is not defined in table people_partitioned_bucketed, defined table columns are: age, name;'
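For comparison, a Scala sketch of a variant that works (assuming the standard example datasets, where users.parquet contains name, favorite_color and favorite_numbers while people.json only has age and name): partitionBy() can only use a column that exists in the DataFrame being written.

{code:scala}
// Sketch: write a DataFrame that actually contains the partition column.
val usersDF = spark.read.parquet("examples/src/main/resources/users.parquet")
usersDF.write
  .partitionBy("favorite_color")
  .bucketBy(42, "name")
  .saveAsTable("users_partitioned_bucketed")
{code}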






[jira] [Created] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Attila Zsolt Piros (Jira)
Attila Zsolt Piros created SPARK-34789:
--

 Summary: Introduce Jetty based construct for integration tests 
where HTTP(S) is used
 Key: SPARK-34789
 URL: https://issues.apache.org/jira/browse/SPARK-34789
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 3.2.0
Reporter: Attila Zsolt Piros


This came up during 
https://github.com/apache/spark/pull/31877#discussion_r596831803.

Short summary: we have some tests where HTTP(S) is used to access files. The current solution uses GitHub URLs like "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
This couples two Spark versions in an unhealthy way: the "master" branch, which is a moving target, is tied to the committed test code, which is fixed (and might even have been released).
So a test running against an earlier version of Spark expects something (filename, content, path) from a later release, and, what is worse, when the moving version changes, the earlier test will break.

 






[jira] [Commented] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304245#comment-17304245
 ] 

Attila Zsolt Piros commented on SPARK-34789:


I am working on this


> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The current solution uses GitHub URLs like "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This couples two Spark versions in an unhealthy way: the "master" branch, which is a moving target, is tied to the committed test code, which is fixed (and might even have been released).
> So a test running against an earlier version of Spark expects something (filename, content, path) from a later release, and, what is worse, when the moving version changes, the earlier test will break.
>  






[jira] [Created] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread hezuojiao (Jira)
hezuojiao created SPARK-34790:
-

 Summary: Fail in fetch shuffle blocks in batch when i/o encryption 
is enabled.
 Key: SPARK-34790
 URL: https://issues.apache.org/jira/browse/SPARK-34790
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.1
Reporter: hezuojiao


When spark.io.encryption.enabled=true is set, many test cases in 
AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is incompatible 
with I/O encryption.
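
A minimal sketch of the failing combination and a possible workaround (config 
names as of Spark 3.1; the workaround is an assumption until the fix lands):

{code:java}
import org.apache.spark.sql.SparkSession

// With spark.io.encryption.enabled=true, the batched shuffle block fetch path fails;
// disabling spark.sql.adaptive.fetchShuffleBlocksInBatch is a possible workaround.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.io.encryption.enabled", "true")
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
  .getOrCreate()
{code}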



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Attila Zsolt Piros (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Attila Zsolt Piros updated SPARK-34789:
---
Description: 
This came up during 
https://github.com/apache/spark/pull/31877#discussion_r596831803.

Short summary: we have some tests where HTTP(S) is used to access files. The 
current solution uses GitHub URLs like 
"https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
This couples two Spark versions in an unhealthy way: the "master" branch, which 
is a moving target, is tied to the committed test code, which is fixed (it might 
even be part of a release).
So a test running for an earlier version of Spark expects something (filename, 
content, path) from a later release and, what is worse, when the moving version 
changes, the earlier test will break. 

The idea is to introduce a method like:

{noformat}
withHttpServer(files) {
}
{noformat}

This helper would use a Jetty ResourceHandler to serve the listed files (or 
directories, or just the root directory it is started from) and stop the server 
in a finally block.
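
For illustration, a minimal sketch of such a helper (the name, signature and 
loan-pattern style are assumptions, not the final API; assumes Jetty 9's Server 
and ResourceHandler):

{code:java}
import java.io.File
import java.net.URL

import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

// Serve `root` over HTTP on a random free port, pass the base URL to `body`,
// and always stop the server in the finally block.
def withHttpServer(root: File)(body: URL => Unit): Unit = {
  val server = new Server(0)
  val handler = new ResourceHandler()
  handler.setResourceBase(root.getAbsolutePath)
  server.setHandler(handler)
  server.start()
  try {
    body(server.getURI.toURL)
  } finally {
    server.stop()
  }
}
{code}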

 

  was:
This came up during 
https://github.com/apache/spark/pull/31877#discussion_r596831803.

Short summary: we have some tests where HTTP(S) is used to access files. The 
current solution uses GitHub URLs like 
"https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
This couples two Spark versions in an unhealthy way: the "master" branch, which 
is a moving target, is tied to the committed test code, which is fixed (it might 
even be part of a release).
So a test running for an earlier version of Spark expects something (filename, 
content, path) from a later release and, what is worse, when the moving version 
changes, the earlier test will break. 

 


> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The 
> current solution uses GitHub URLs like 
> "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This couples two Spark versions in an unhealthy way: the "master" branch, which 
> is a moving target, is tied to the committed test code, which is fixed (it 
> might even be part of a release).
> So a test running for an earlier version of Spark expects something (filename, 
> content, path) from a later release and, what is worse, when the moving 
> version changes, the earlier test will break. 
> The idea is to introduce a method like:
> {noformat}
> withHttpServer(files) {
> }
> {noformat}
> This helper would use a Jetty ResourceHandler to serve the listed files (or 
> directories, or just the root directory it is started from) and stop the 
> server in a finally block.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34693) Support null in conversions to and from Arrow

2021-03-18 Thread Laurens (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304266#comment-17304266
 ] 

Laurens commented on SPARK-34693:
-

{code:java}
# Imports assumed by this snippet (not shown in the original): pandas, toposort,
# collections.deque and functools.reduce.
from collections import deque
from functools import reduce as _reduce

import pandas as pd
from toposort import toposort, CircularDependencyError


# The return type annotation below follows the Koalas convention for typed apply functions.
def tps(df) -> pd.DataFrame["workflow_id": int, "task_id": int, "task_slack": int]:
    df.set_index("id", inplace=True)

    graph = dict()
    forward_dict = dict()
    finish_times = dict()
    task_runtimes = dict()
    task_arrival_times = dict()
    workflow_id = None
    # Record layout: 0: task id, 1: wf id, 2: children, 3: parents, 4: ts submit, 5: runtime
    for row in df.to_records():
        graph[row[0]] = set(row[3].flatten())
        forward_dict[row[0]] = set(row[2].flatten())
        task_runtimes[row[0]] = row[5]
        task_arrival_times[row[0]] = row[4]
        workflow_id = row[1]

    del df
    del row

    try:
        groups = list(toposort(graph))
    except CircularDependencyError:
        del forward_dict
        del finish_times
        del task_runtimes
        del task_arrival_times
        del workflow_id
        return pd.DataFrame(columns=["workflow_id", "task_id", "task_slack"])

    del graph

    if len(groups) < 2:
        del forward_dict
        del finish_times
        del task_runtimes
        del task_arrival_times
        del workflow_id
        del groups
        return pd.DataFrame(columns=["workflow_id", "task_id", "task_slack"])

    # Compute all full paths
    max_for_task = dict()

    q = deque()
    for i in groups[0]:
        max_for_task[i] = task_arrival_times[i] + task_runtimes[i]
        q.append((max_for_task[i], [i]))

    paths = []
    max_path = -1

    while len(q) > 0:
        time, path = q.popleft()  # we are at time t[0] having travelled path t[1]
        if len(forward_dict[path[-1]]) == 0:  # End task
            if time < max_path: continue  # smaller path, cannot be a CP
            if time > max_path:  # If we have a new max, clear the list and set it to the max
                max_path = time
                paths.clear()
            paths.append(path)  # Add new and identical length paths
        else:
            for c in forward_dict[path[-1]]:  # Loop over all children of the last task in the path
                if c in task_runtimes:
                    # Special case: we find a child that arrives later than the current path.
                    # Start a new path from this child onwards and do not mark the previous
                    # nodes on the critical path, as they can actually delay.
                    if time < task_arrival_times[c]:
                        if c not in max_for_task:
                            max_for_task[c] = task_arrival_times[c] + task_runtimes[c]
                            q.append((max_for_task[c], [c]))
                    else:
                        # If the finish of one of the children causes the same or a new maximum,
                        # add a path to the queue to explore from there on at that time.
                        child_finish = time + task_runtimes[c]
                        if c in max_for_task and child_finish < max_for_task[c]: continue
                        max_for_task[c] = child_finish
                        l = path.copy()
                        l.append(c)
                        q.append((child_finish, l))

    del time
    del max_for_task
    del forward_dict
    del max_path
    del q
    del path

    if len(paths) == 1:
        critical_path_tasks = set(paths[0])
    else:
        critical_path_tasks = _reduce(set.union, [set(cp) for cp in paths])
    del paths

    rows = deque()
    for i in range(len(groups)):
        max_in_group = -1
        for task_id in groups[i]:  # Compute max finish time in the group
            if task_id not in finish_times:
                continue  # Task was not in snapshot of trace
            task_finish = finish_times[task_id]
            if task_finish > max_in_group:
                max_in_group = task_finish

        for task_id in groups[i]:
            if task_id in critical_path_tasks:  # Skip if task is on a critical path
                rows.append([workflow_id, task_id, 0])
            else:
                if task_id not in finish_times: continue  # Task was not in snapshot of trace
                rows.append([workflow_id, task_id, max_in_group - finish_times[task_id]])

    del groups
    del critical_path_tasks
    del finish_times
    return pd.DataFrame(rows, columns=["workflow_id", "task_id", "task_slack"])
{code}
and the reading part:
{code:java}
kdf = ks.read_parquet(os.path.join(hdfs_path, folder, "tasks", "schema-1.0"),
 columns=[
 "workflow_id", "id", "task_id", "children", "parents", "ts_submit", "runtime"
 ], pandas_metadata=False, engine='pyarrow')
 
kdf = kdf[((kdf['children'].map(len) > 0) | (kdf['parents'].map(len) > 0))]
kdf = kdf.astype({"

[jira] [Created] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Jeet (Jira)
Jeet created SPARK-34791:


 Summary: SparkR throws node stack overflow
 Key: SPARK-34791
 URL: https://issues.apache.org/jira/browse/SPARK-34791
 Project: Spark
  Issue Type: Question
  Components: SparkR
Affects Versions: 3.0.1
Reporter: Jeet


SparkR throws "node stack overflow" error upon running code (sample below) on 
R-4.0.2 with Spark 3.0.1.

Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)

{code:java}
source('sample.R')
myclsr = myclosure_func()
myclsr$get_some_date('2021-01-01')

## spark.lapply throws node stack overflow
result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
source('sample.R')
another_closure = myclosure_func()
return(another_closure$get_some_date(rdate))
})

{code}

Sample.R

{code:java}
## util function, which calls itself
getPreviousBusinessDate <- function(asofdate) {
asdt <- asofdate;
asdt <- as.Date(asofdate)-1;

wd <- format(as.Date(asdt),"%A")
if(wd == "Saturday" | wd == "Sunday") {
return (getPreviousBusinessDate(asdt));
}

return (asdt);
}

## closure which calls util function
myclosure_func = function() {

myclosure = list()

get_some_date = function (random_date) {
return (getPreviousBusinessDate(random_date))
}
myclosure$get_some_date = get_some_date

return(myclosure)
}

{code}

This seems to have been caused by sourcing sample.R twice: once before invoking 
the Spark session and again within the Spark session.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Jeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeet updated SPARK-34791:
-
Description: 
SparkR throws "node stack overflow" error upon running code (sample below) on 
R-4.0.2 with Spark 3.0.1.

Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)

{code:java}
source('sample.R')
myclsr = myclosure_func()
myclsr$get_some_date('2021-01-01')

## spark.lapply throws node stack overflow
result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
source('sample.R')
another_closure = myclosure_func()
return(another_closure$get_some_date(rdate))
})

{code}

Sample.R

{code:java}
## util function, which calls itself
getPreviousBusinessDate <- function(asofdate) {
asdt <- asofdate;
asdt <- as.Date(asofdate)-1;

wd <- format(as.Date(asdt),"%A")
if(wd == "Saturday" | wd == "Sunday") {
return (getPreviousBusinessDate(asdt));
}

return (asdt);
}

## closure which calls util function
myclosure_func = function() {

myclosure = list()

get_some_date = function (random_date) {
return (getPreviousBusinessDate(random_date))
}
myclosure$get_some_date = get_some_date

return(myclosure)
}

{code}
This seems to have been caused by sourcing sample.R twice: once before invoking 
the Spark session and again within the Spark session.

 

  was:
SparkR throws "node stack overflow" error upon running code (sample below) on 
R-4.0.2 with Spark 3.0.1.

Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)

{code:java}
source('sample.R')
myclsr = myclosure_func()
myclsr$get_some_date('2021-01-01')

## spark.lapply throws node stack overflow
result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
source('sample.R')
another_closure = myclosure_func()
return(another_closure$get_some_date(rdate))
})

{code}

Sample.R

{code:java}
## util function, which calls itself
getPreviousBusinessDate <- function(asofdate) {
asdt <- asofdate;
asdt <- as.Date(asofdate)-1;

wd <- format(as.Date(asdt),"%A")
if(wd == "Saturday" | wd == "Sunday") {
return (getPreviousBusinessDate(asdt));
}

return (asdt);
}

## closure which calls util function
myclosure_func = function() {

myclosure = list()

get_some_date = function (random_date) {
return (getPreviousBusinessDate(random_date))
}
myclosure$get_some_date = get_some_date

return(myclosure)
}

{code}

This seems to have been caused by sourcing sample.R twice: once before invoking 
the Spark session and again within the Spark session.



> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: Jeet
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before 
> invoking the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Jeet (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeet updated SPARK-34791:
-
Description: 
SparkR throws "node stack overflow" error upon running code (sample below) on 
R-4.0.2 with Spark 3.0.1.

Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
{code:java}
source('sample.R')
myclsr = myclosure_func()
myclsr$get_some_date('2021-01-01')

## spark.lapply throws node stack overflow
result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
source('sample.R')
another_closure = myclosure_func()
return(another_closure$get_some_date(rdate))
})

{code}
Sample.R
{code:java}
## util function, which calls itself
getPreviousBusinessDate <- function(asofdate) {
asdt <- asofdate;
asdt <- as.Date(asofdate)-1;

wd <- format(as.Date(asdt),"%A")
if(wd == "Saturday" | wd == "Sunday") {
return (getPreviousBusinessDate(asdt));
}

return (asdt);
}

## closure which calls util function
myclosure_func = function() {

myclosure = list()

get_some_date = function (random_date) {
return (getPreviousBusinessDate(random_date))
}
myclosure$get_some_date = get_some_date

return(myclosure)
}

{code}
This seems to have been caused by sourcing sample.R twice: once before invoking 
the Spark session and again within the Spark session.

 

  was:
SparkR throws "node stack overflow" error upon running code (sample below) on 
R-4.0.2 with Spark 3.0.1.

Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)

{code:java}
source('sample.R')
myclsr = myclosure_func()
myclsr$get_some_date('2021-01-01')

## spark.lapply throws node stack overflow
result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
source('sample.R')
another_closure = myclosure_func()
return(another_closure$get_some_date(rdate))
})

{code}

Sample.R

{code:java}
## util function, which calls itself
getPreviousBusinessDate <- function(asofdate) {
asdt <- asofdate;
asdt <- as.Date(asofdate)-1;

wd <- format(as.Date(asdt),"%A")
if(wd == "Saturday" | wd == "Sunday") {
return (getPreviousBusinessDate(asdt));
}

return (asdt);
}

## closure which calls util function
myclosure_func = function() {

myclosure = list()

get_some_date = function (random_date) {
return (getPreviousBusinessDate(random_date))
}
myclosure$get_some_date = get_some_date

return(myclosure)
}

{code}
This seems to have been caused by sourcing sample.R twice: once before invoking 
the Spark session and again within the Spark session.

 


> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: Jeet
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before 
> invoking the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-18 Thread Michael Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chen updated SPARK-34780:
-
Affects Version/s: 3.1.1

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358). I suspect this also 
> happens in other versions of Spark but haven't tested yet.
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-18 Thread Michael Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Chen updated SPARK-34780:
-
Description: 
When a dataframe is cached, the logical plan can contain copies of the spark 
session meaning the SQLConfs are stored. Then if a different dataframe can 
replace parts of its logical plan with a cached logical plan, the cached 
SQLConfs will be used for the evaluation of the cached logical plan. This is 
because HadoopFsRelation ignores sparkSession for equality checks (introduced 
in https://issues.apache.org/jira/browse/SPARK-17358).
{code:java}
test("cache uses old SQLConf") {
  import testImplicits._
  withTempDir { dir =>
val tableDir = dir.getAbsoluteFile + "/table"
val df = Seq("a").toDF("key")
df.write.parquet(tableDir)
SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
val compression1Stats = spark.read.parquet(tableDir).select("key").
  queryExecution.optimizedPlan.collect {
  case l: LogicalRelation => l
  case m: InMemoryRelation => m
}.map(_.computeStats())

SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
val df2 = spark.read.parquet(tableDir).select("key")
df2.cache()
val compression10Stats = df2.queryExecution.optimizedPlan.collect {
  case l: LogicalRelation => l
  case m: InMemoryRelation => m
}.map(_.computeStats())

SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
val compression1StatsWithCache = spark.read.parquet(tableDir).select("key").
  queryExecution.optimizedPlan.collect {
  case l: LogicalRelation => l
  case m: InMemoryRelation => m
}.map(_.computeStats())

// I expect these stats to be the same because file compression factor is the same
assert(compression1Stats == compression1StatsWithCache)
// Instead, we can see the file compression factor is being cached and used along with
// the logical plan
assert(compression10Stats == compression1StatsWithCache)
  }
}{code}
 

  was:
When a dataframe is cached, the logical plan can contain copies of the spark 
session meaning the SQLConfs are stored. Then if a different dataframe can 
replace parts of its logical plan with a cached logical plan, the cached 
SQLConfs will be used for the evaluation of the cached logical plan. This is 
because HadoopFsRelation ignores sparkSession for equality checks (introduced 
in https://issues.apache.org/jira/browse/SPARK-17358). I suspect this also 
happens in other versions of Spark but haven't tested yet.
{code:java}
test("cache uses old SQLConf") {
  import testImplicits._
  withTempDir { dir =>
val tableDir = dir.getAbsoluteFile + "/table"
val df = Seq("a").toDF("key")
df.write.parquet(tableDir)
SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
val compression1Stats = spark.read.parquet(tableDir).select("key").
  queryExecution.optimizedPlan.collect {
  case l: LogicalRelation => l
  case m: InMemoryRelation => m
}.map(_.computeStats())

SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
val df2 = spark.read.parquet(tableDir).select("key")
df2.cache()
val compression10Stats = df2.queryExecution.optimizedPlan.collect {
  case l: LogicalRelation => l
  case m: InMemoryRelation => m
}.map(_.computeStats())

SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
val compression1StatsWithCache = spark.read.parquet(tableDir).select("key").
  queryExecution.optimizedPlan.collect {
  case l: LogicalRelation => l
  case m: InMemoryRelation => m
}.map(_.computeStats())

// I expect these stats to be the same because file compression factor is the same
assert(compression1Stats == compression1StatsWithCache)
// Instead, we can see the file compression factor is being cached and used along with
// the logical plan
assert(compression10Stats == compression1StatsWithCache)
  }
}{code}
 


> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import test

[jira] [Commented] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304315#comment-17304315
 ] 

Apache Spark commented on SPARK-33482:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31848

> V2 Datasources that extend FileScan preclude exchange reuse
> ---
>
> Key: SPARK-33482
> URL: https://issues.apache.org/jira/browse/SPARK-33482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Bruce Robbins
>Priority: Major
>
> Sample query:
> {noformat}
> spark.read.parquet("tbl").createOrReplaceTempView("tbl")
> spark.read.parquet("lookup").createOrReplaceTempView("lookup")
> sql("""
>select tbl.col1, fk1, fk2
>from tbl, lookup l1, lookup l2
>where fk1 = l1.key
>and fk2 = l2.key
> """).explain
> {noformat}
> Test files can be created as so:
> {noformat}
> import scala.util.Random
> val rand = Random
> val tbl = spark.range(1, 1).map { x =>
>   (rand.nextLong.abs % 20,
>rand.nextLong.abs % 20,
>x)
> }.toDF("fk1", "fk2", "col1")
> tbl.write.mode("overwrite").parquet("tbl")
> val lookup = spark.range(0, 20).map { x =>
>   (x + 1, x * 1, (x + 1) * 1)
> }.toDF("key", "col1", "col2")
> lookup.write.mode("overwrite").parquet("lookup")
> {noformat}
> Output with V1 Parquet reader:
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, 
> DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- FileScan parquet [key#6L] Batched: true, DataFilters: 
> [isnotnull(key#6L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
>+- ReusedExchange [key#12L], BroadcastExchange 
> HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
> {noformat}
> With V1 Parquet reader, the exchange for lookup is reused (see last line).
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""):
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: 
> [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct, PushedFilters: 
> [IsNotNull(fk1), IsNotNull(fk2)]
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- BatchScan[key#6L] ParquetScan DataFilters: 
> [isnotnull(key#6L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [id=#83]
>   +- *(2) Filter isnotnull(key#12L)
>  +- *(2) ColumnarToRow
> +- BatchScan[key#12L] ParquetScan DataFilters: 
> [isnotnull(key#12L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
> {noformat}
> With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 
> lines).
> You can see the same issue with the Orc reader (and I assume any other 
> datasource that extends FileScan)

[jira] [Commented] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304316#comment-17304316
 ] 

Apache Spark commented on SPARK-33482:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/31848

> V2 Datasources that extend FileScan preclude exchange reuse
> ---
>
> Key: SPARK-33482
> URL: https://issues.apache.org/jira/browse/SPARK-33482
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: Bruce Robbins
>Priority: Major
>
> Sample query:
> {noformat}
> spark.read.parquet("tbl").createOrReplaceTempView("tbl")
> spark.read.parquet("lookup").createOrReplaceTempView("lookup")
> sql("""
>select tbl.col1, fk1, fk2
>from tbl, lookup l1, lookup l2
>where fk1 = l1.key
>and fk2 = l2.key
> """).explain
> {noformat}
> Test files can be created as so:
> {noformat}
> import scala.util.Random
> val rand = Random
> val tbl = spark.range(1, 1).map { x =>
>   (rand.nextLong.abs % 20,
>rand.nextLong.abs % 20,
>x)
> }.toDF("fk1", "fk2", "col1")
> tbl.write.mode("overwrite").parquet("tbl")
> val lookup = spark.range(0, 20).map { x =>
>   (x + 1, x * 1, (x + 1) * 1)
> }.toDF("key", "col1", "col2")
> lookup.write.mode("overwrite").parquet("lookup")
> {noformat}
> Output with V1 Parquet reader:
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, 
> DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, 
> Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- FileScan parquet [key#6L] Batched: true, DataFilters: 
> [isnotnull(key#6L)], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: 
> struct
>+- ReusedExchange [key#12L], BroadcastExchange 
> HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
> {noformat}
> With V1 Parquet reader, the exchange for lookup is reused (see last line).
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""):
> {noformat}
>  == Physical Plan ==
> *(3) Project [col1#2L, fk1#0L, fk2#1L]
> +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false
>:- *(3) Project [fk1#0L, fk2#1L, col1#2L]
>:  +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false
>: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L))
>: :  +- *(3) ColumnarToRow
>: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: 
> [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], 
> PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], 
> ReadSchema: struct, PushedFilters: 
> [IsNotNull(fk1), IsNotNull(fk2)]
>: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, 
> bigint, false]),false), [id=#75]
>:+- *(1) Filter isnotnull(key#6L)
>:   +- *(1) ColumnarToRow
>:  +- BatchScan[key#6L] ParquetScan DataFilters: 
> [isnotnull(key#6L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, 
> false]),false), [id=#83]
>   +- *(2) Filter isnotnull(key#12L)
>  +- *(2) ColumnarToRow
> +- BatchScan[key#12L] ParquetScan DataFilters: 
> [isnotnull(key#12L)], Format: parquet, Location: 
> InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], 
> PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: 
> struct, PushedFilters: [IsNotNull(key)]
> {noformat}
> With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 
> lines).
> You can see the same issue with the Orc reader (and I assume any other 
> datasource that extends FileScan)

[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2021-03-18 Thread Erik Erlandson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304397#comment-17304397
 ] 

Erik Erlandson commented on SPARK-24432:


[~dongjoon] should this be closed, now that spark 3.1 is available (per 
[above|https://issues.apache.org/jira/browse/SPARK-24432?focusedCommentId=17224905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17224905])
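
(For reference, the Spark 3.1 capability referred to above is dynamic allocation 
based on shuffle tracking, which does not need an external shuffle service; a 
minimal sketch, with config names taken from Spark 3.1:)

{code:java}
import org.apache.spark.sql.SparkSession

// Dynamic allocation without an external shuffle service, relying on shuffle tracking.
val spark = SparkSession.builder()
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  .config("spark.dynamicAllocation.maxExecutors", "10")
  .getOrCreate()
{code}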

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3

2021-03-18 Thread kondziolka9ld (Jira)
kondziolka9ld created SPARK-34792:
-

 Summary: Restore previous behaviour of randomSplit from 
spark-2.4.7 in spark-3
 Key: SPARK-34792
 URL: https://issues.apache.org/jira/browse/SPARK-34792
 Project: Spark
  Issue Type: Question
  Components: Spark Core, SQL
Affects Versions: 3.0.1
Reporter: kondziolka9ld


Hi, 

Please consider the following difference, even with the same seed.

Is it possible to restore the same behaviour?
{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val Array(f, s) =  Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 
0.7), 42)
f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
scala> f.show
+-+
|value|
+-+
|4|
+-+
scala> s.show
+-+
|value|
+-+
|1|
|2|
|3|
|5|
|6|
|7|
|8|
|9|
|   10|
+-+
{code}
whereas on spark-3
{code:java}
scala> val Array(f, s) =  Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 
0.7), 42)
f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
scala> f.show
+-+
|value|
+-+
|5|
|   10|
+-+
scala> s.show
+-+
|value|
+-+
|1|
|2|
|3|
|4|
|6|
|7|
|8|
|9|
+-+
{code}
I guess the implementation of the `sample` method changed.

Is it possible to restore the previous behaviour?

Thanks in advance!
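
For what it is worth, a workaround sketch (not a Spark API, just an 
illustration): when splits must be reproducible across Spark versions, a 
deterministic hash-based split does not depend on the internals of sample():

{code:java}
import org.apache.spark.sql.functions.{abs, hash}

// Deterministic ~30%/70% split keyed on a hash of the value column; stable across
// Spark versions. The column name "value" is taken from the example above.
val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")  // toDF via spark.implicits._ in the shell
val bucket = abs(hash(df("value"))) % 100
val f = df.where(bucket < 30)
val s = df.where(bucket >= 30)
{code}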



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3

2021-03-18 Thread kondziolka9ld (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kondziolka9ld updated SPARK-34792:
--
Description: 
Hi, 

Please consider the following difference in the `randomSplit` method, even with 
the same seed.

 
{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val Array(f, s) =  Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 
0.7), 42)
f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
scala> f.show
+-+
|value|
+-+
|4|
+-+
scala> s.show
+-+
|value|
+-+
|1|
|2|
|3|
|5|
|6|
|7|
|8|
|9|
|   10|
+-+
{code}
whereas on spark-3
{code:java}
scala> val Array(f, s) =  Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 
0.7), 42)
f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
scala> f.show
+-+
|value|
+-+
|5|
|   10|
+-+
scala> s.show
+-+
|value|
+-+
|1|
|2|
|3|
|4|
|6|
|7|
|8|
|9|
+-+
{code}
I guess the implementation of the `sample` method changed.

Is it possible to restore the previous behaviour?

Thanks in advance!

  was:
Hi, 

Please consider the following difference, even with the same seed.

Is it possible to restore the same behaviour?
{code:java}
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
  /_/
 
Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val Array(f, s) =  Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 
0.7), 42)
f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
scala> f.show
+-+
|value|
+-+
|4|
+-+
scala> s.show
+-+
|value|
+-+
|1|
|2|
|3|
|5|
|6|
|7|
|8|
|9|
|   10|
+-+
{code}
whereas on spark-3
{code:java}
scala> val Array(f, s) =  Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 
0.7), 42)
f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
scala> f.show
+-+
|value|
+-+
|5|
|   10|
+-+
scala> s.show
+-+
|value|
+-+
|1|
|2|
|3|
|4|
|6|
|7|
|8|
|9|
+-+
{code}
I guess the implementation of the `sample` method changed.

Is it possible to restore the previous behaviour?

Thanks in advance!


> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> -
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.0.1
>Reporter: kondziolka9ld
>Priority: Major
>
> Hi, 
> Please consider the following difference in the `randomSplit` method, even 
> with the same seed.
>  
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |4|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |5|
> |6|
> |7|
> |8|
> |9|
> |   10|
> +-+
> {code}
> whereas on spark-3
> {code:java}
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |5|
> |   10|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |4|
> |6|
> |7|
> |8|
> |9|
> +-+
> {code}
> I guess the implementation of the `sample` method changed.
> Is it possible to restore the previous behaviour?

[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304404#comment-17304404
 ] 

Dongjoon Hyun commented on SPARK-24432:
---

No, [~eje]. This is not closed because of the JIRA description:
>  This requires a Kubernetes-specific external shuffle service. The feature is 
> available in our fork at github.com/apache-spark-on-k8s/spark.

This issue explicitly claims an external shuffle service requirement.

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304406#comment-17304406
 ] 

Dongjoon Hyun commented on SPARK-24432:
---

BTW, ping [~liyinan926] since he is the reporter of this issue. If the scope is 
changed, we can close this issue.

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-18 Thread Max Gekk (Jira)
Max Gekk created SPARK-34793:


 Summary: Prohibit saving of day-time and year-month intervals
 Key: SPARK-34793
 URL: https://issues.apache.org/jira/browse/SPARK-34793
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk
Assignee: Max Gekk


Temporarily prohibit saving of year-month and day-time intervals to datasources 
until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34793:


Assignee: Max Gekk  (was: Apache Spark)

> Prohibit saving of day-time and year-month intervals
> 
>
> Key: SPARK-34793
> URL: https://issues.apache.org/jira/browse/SPARK-34793
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Temporarily prohibit saving of year-month and day-time intervals to 
> datasources until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304425#comment-17304425
 ] 

Apache Spark commented on SPARK-34793:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31884

> Prohibit saving of day-time and year-month intervals
> 
>
> Key: SPARK-34793
> URL: https://issues.apache.org/jira/browse/SPARK-34793
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Temporarily prohibit saving of year-month and day-time intervals to 
> datasources until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34793:


Assignee: Apache Spark  (was: Max Gekk)

> Prohibit saving of day-time and year-month intervals
> 
>
> Key: SPARK-34793
> URL: https://issues.apache.org/jira/browse/SPARK-34793
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Temporarily prohibit saving of year-month and day-time intervals to 
> datasources until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304427#comment-17304427
 ] 

Apache Spark commented on SPARK-34793:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31884

> Prohibit saving of day-time and year-month intervals
> 
>
> Key: SPARK-34793
> URL: https://issues.apache.org/jira/browse/SPARK-34793
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Temporarily prohibit saving of year-month and day-time intervals to 
> datasources until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304433#comment-17304433
 ] 

Apache Spark commented on SPARK-34636:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31885

> sql method in UnresolvedAttribute, AttributeReference and Alias don't quote 
> qualified names properly.
> -
>
> Key: SPARK-34636
> URL: https://issues.apache.org/jira/browse/SPARK-34636
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> UnresolvedAttribute, AttributeReference and Alias which take a qualifier don't 
> quote qualified names properly.
> One instance is reported in SPARK-34626.
> Other instances are as follows.
> {code}
> UnresolvedAttribute("a`b"::"c.d"::Nil).sql
> a`b.`c.d` // expected: `a``b`.`c.d`
> AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
> c.d.`a.b` // expected: `c.d`.`a.b`
> Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = 
> "d.e"::Nil).sql
> `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304434#comment-17304434
 ] 

Apache Spark commented on SPARK-34636:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31885

> sql method in UnresolvedAttribute, AttributeReference and Alias don't quote 
> qualified names properly.
> -
>
> Key: SPARK-34636
> URL: https://issues.apache.org/jira/browse/SPARK-34636
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> UnresolvedAttribute, AttributeReference and Alias which take a qualifier don't 
> quote qualified names properly.
> One instance is reported in SPARK-34626.
> Other instances are as follows.
> {code}
> UnresolvedAttribute("a`b"::"c.d"::Nil).sql
> a`b.`c.d` // expected: `a``b`.`c.d`
> AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
> c.d.`a.b` // expected: `c.d`.`a.b`
> Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = 
> "d.e"::Nil).sql
> `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types

2021-03-18 Thread Matthew Powers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304443#comment-17304443
 ] 

Matthew Powers commented on SPARK-27790:


[~maxgekk] - how will DateTime addition work with the new intervals?  Something 
like this?
 * Add three months: col("some_date") + make_year_month(???)
 * Add 2 years and 10 seconds: col("some_time") + make_year_month(???) + 
make_day_second(???)

Thanks for the great description of the problem in this JIRA ticket.  

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month day line. That is there 
> is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day” since not 
> all months have the same number of days.
> I propose here to introduce the two flavours of INTERVAL as described in the 
> ANSI SQL Standard and deprecate Spark's interval type.
> * ANSI describes two non overlapping “classes”: 
> ** YEAR-MONTH, 
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Supports datetime arithmetic
> * Can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* --  Spark Interval equivalency (   The new interval types meet 
> or exceed all function of the existing SQL Interval):
> * Add two new DataType implementations for interval year-month and 
> day-second. Includes the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-seconds fields and return legacy interval values)
> * Interval arithmetic(interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day second), 
> Datetime +/- interval
> * Cast to and from the new two interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304453#comment-17304453
 ] 

Dongjoon Hyun commented on SPARK-34785:
---

SPARK-34212 is complete by itself because it's designed to fix the correctness 
issue. You will not get incorrect values.
For this specific `Schema evolution` requirement in the PR description, I don't 
have the bandwidth, [~jobitmathew].
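
For anyone hitting this, the workaround mentioned in the description (at a 
performance cost) is simply:

{code:java}
// Disable the vectorized Parquet reader for this session (slower, but tolerates the
// decimal precision/scale change); config name as in the description above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("select * from test_decimal").show()
{code}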

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true 
> which is default value. Error Parquet column cannot be converted in file.
> --
>
> Key: SPARK-34785
> URL: https://issues.apache.org/jira/browse/SPARK-34785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Major
>
> The SPARK-34212 issue is not fixed if spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, 
> but it will reduce performance.
> In Hive, 
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet; 
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark,
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> ++
> |amt |
> ++
> | 100.000 |{code}
> but if spark.sql.parquet.enableVectorizedReader=true below error
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4) (vm2 executor 2): 
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
> Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch

[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304454#comment-17304454
 ] 

Dongjoon Hyun commented on SPARK-34785:
---

FYI, each data source has different schema evolution capabilities, and Parquet is 
not the best built-in data source in this regard. We are tracking this with the 
following test suite:
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true 
> which is default value. Error Parquet column cannot be converted in file.
> --
>
> Key: SPARK-34785
> URL: https://issues.apache.org/jira/browse/SPARK-34785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Major
>
> The SPARK-34212 issue is not fixed if spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, 
> but it will reduce performance.
> In Hive, 
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet; 
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark,
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> ++
> |amt |
> ++
> | 100.000 |{code}
> but if spark.sql.parquet.enableVectorizedReader=true below error
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4) (vm2 executor 2): 
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
> Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
> at 

[jira] [Comment Edited] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304454#comment-17304454
 ] 

Dongjoon Hyun edited comment on SPARK-34785 at 3/18/21, 9:14 PM:
-

FYI, each data source has different schema evolution capabilities, and Parquet is 
not the best built-in data source in this regard. We are tracking this with the 
following test suite:
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

The recommendation is to use the MR code path if you are in that situation.
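
For anyone hitting this in the meantime, a minimal sketch of that fallback (only 
the configuration name comes from this report; `spark` is assumed to be an active 
SparkSession, and test_decimal is the reporter's example table):

{code:java}
// Disable the vectorized Parquet reader so the non-vectorized (parquet-mr)
// code path is used. This trades performance for the schema-evolution
// behaviour described in the report above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM test_decimal").show()
{code}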


was (Author: dongjoon):
FYI, each data source has different schema evolution capabilities, and Parquet is 
not the best built-in data source in this regard. We are tracking this with the 
following test suite:
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true 
> which is default value. Error Parquet column cannot be converted in file.
> --
>
> Key: SPARK-34785
> URL: https://issues.apache.org/jira/browse/SPARK-34785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Major
>
> The SPARK-34212 issue is not fixed if spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, 
> but it will reduce performance.
> In Hive, 
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet; 
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark,
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> ++
> |amt |
> ++
> | 100.000 |{code}
> but if spark.sql.parquet.enableVectorizedReader=true below error
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4) (vm2 executor 2): 
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
> Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   

[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304457#comment-17304457
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

Thank you for reporting, [~Kotlov].
1. Could you file a new JIRA, because it might be related to the K8s client version?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.
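
For reference, a minimal sketch of the expected application shape (a generic 
SparkSession-based main, not specific to this report; the stop() call in the 
finally block is the point being made above):

{code:java}
import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("my-app").getOrCreate()
    try {
      spark.range(10).count() // application logic goes here
    } finally {
      // Stop the session so that no non-daemon threads keep the driver JVM alive.
      spark.stop()
    }
  }
}
{code}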

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Henry Yu
>Assignee: Igor Calabria
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304459#comment-17304459
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

Did your example work with any previous Spark version? If so, what is the latest 
Spark version where it worked?

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Henry Yu
>Assignee: Igor Calabria
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304470#comment-17304470
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

I found your JIRA, SPARK-34674. Please comment there, because this issue seems 
technically unrelated to yours.

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Henry Yu
>Assignee: Igor Calabria
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> I tried spark-submit to K8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-34674:
---

I am reopening this because the affected version is not the same. The following is 
a copy of my comment on the other JIRA.

1. Did your example work with any previous Spark version? If so, what is the 
latest Spark version where it worked?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call the method sparkContext.stop() 
> explicitly, the Spark driver process doesn't terminate even after its main 
> method has completed. This behaviour is different from Spark on YARN, where 
> manually stopping the sparkContext is not required.
>  It looks like the problem is the use of non-daemon threads, which prevent the 
> driver JVM process from terminating.
>  At least I see two non-daemon threads, if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official release of spark-3.1.1 hadoop3.2 is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304472#comment-17304472
 ] 

Dongjoon Hyun edited comment on SPARK-34674 at 3/18/21, 9:39 PM:
-

I am reopening this because the affected version is not the same, according to 
[~Kotlov]. The following is a copy of my comment on the other JIRA.

1. Did your example work with any previous Spark version? If so, what is the 
latest Spark version where it worked?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.


was (Author: dongjoon):
I am reopening this because the affected version is not the same. The following is 
a copy of my comment on the other JIRA.

1. Did your example work with any previous Spark version? If so, what is the 
latest Spark version where it worked?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call the method sparkContext.stop() 
> explicitly, the Spark driver process doesn't terminate even after its main 
> method has completed. This behaviour is different from Spark on YARN, where 
> manually stopping the sparkContext is not required.
>  It looks like the problem is the use of non-daemon threads, which prevent the 
> driver JVM process from terminating.
>  At least I see two non-daemon threads, if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official release of spark-3.1.1 hadoop3.2 is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Daniel Solow (Jira)
Daniel Solow created SPARK-34794:


 Summary: Nested higher-order functions broken in DSL
 Key: SPARK-34794
 URL: https://issues.apache.org/jira/browse/SPARK-34794
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
 Environment: 3.1.1
Reporter: Daniel Solow


In Spark 3, if I have:

{code:java}
val df = Seq(
(Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")
{code}
and I want to take the cross product of these two arrays, I can do the 
following in SQL:

{code:java}
df.selectExpr("""
FLATTEN(
TRANSFORM(
numbers,
number -> TRANSFORM(
letters,
letter -> (number AS number, letter AS letter)
)
)
) AS zipped
""").show(false)
++
|zipped  |
++
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
++
{code}

This works fine. But when I try the equivalent using the scala DSL, the result 
is wrong:

{code:java}
df.select(
f.flatten(
f.transform(
$"numbers",
(number: Column) => { f.transform(
$"letters",
(letter: Column) => { f.struct(
number.as("number"),
letter.as("letter")
) }
) }
)
).as("zipped")
).show(10, false)
++
|zipped  |
++
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
++
{code}

Note that the numbers are not included in the output. The explain for this 
second version is:

{code:java}
== Parsed Logical Plan ==
'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Analyzed Logical Plan ==
zipped: array<struct<number:int,letter:string>>
Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
x#446, false)), lambda x#445, false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Optimized Logical Plan ==
LocalRelation [zipped#444]

== Physical Plan ==
LocalTableScan [zipped#444]
{code}

Seems like variable name x is hardcoded. And sure enough: 
https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647
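
Until that is fixed, a possible workaround sketch is to route the nested lambdas 
through expr(), which goes through the SQL parser shown above to produce the 
correct result (assuming `f` is org.apache.spark.sql.functions, as in the snippets 
above):

{code:java}
import org.apache.spark.sql.{functions => f}

val zipped = df.select(
  f.flatten(
    // Build the nested TRANSFORM as a SQL expression instead of with nested
    // Scala lambdas, sidestepping the hardcoded lambda variable name.
    f.expr("TRANSFORM(numbers, number -> TRANSFORM(letters, letter -> (number AS number, letter AS letter)))")
  ).as("zipped")
)
zipped.show(false)
{code}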






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34792.
---
Resolution: Not A Problem

Hi, [~kondziolka9ld]. 
This doesn't look like a bug. The result can differ for many reasons: not only 
Spark's own code, but also the differences between Scala 2.11 and Scala 2.12. 
BTW, please send your question to d...@spark.apache.org next time.
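
(Outside the scope of this resolution, but for reference: if a split that is 
reproducible across Spark and Scala versions is needed, one common approach is to 
make the assignment a pure function of the data, e.g. a hash-based bucket. This is 
only a sketch and is not equivalent to the 2.4.7 randomSplit output.)

{code:java}
// Assuming an active SparkSession named `spark`, e.g. in spark-shell.
import spark.implicits._
import org.apache.spark.sql.functions.{abs, col, hash, lit, pmod}

val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value")

// Bucket each row by a hash of its value; the assignment depends only on the
// data, not on partitioning or on the sampler implementation.
val bucket = pmod(abs(hash(col("value"))), lit(10))
val first  = df.where(bucket < 3)   // roughly 30% of the rows
val second = df.where(bucket >= 3)  // roughly 70% of the rows
{code}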

> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> -
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.0.1
>Reporter: kondziolka9ld
>Priority: Major
>
> Hi, 
> Please consider the following difference in the `randomSplit` method, even 
> though the same seed is used.
>  
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |4|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |5|
> |6|
> |7|
> |8|
> |9|
> |   10|
> +-+
> {code}
> while as on spark-3
> {code:java}
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-+
> |value|
> +-+
> |5|
> |   10|
> +-+
> scala> s.show
> +-+
> |value|
> +-+
> |1|
> |2|
> |3|
> |4|
> |6|
> |7|
> |8|
> |9|
> +-+
> {code}
> I guess that the implementation of the `sample` method changed.
> Is it possible to restore the previous behaviour?
> Thanks in advance!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304504#comment-17304504
 ] 

Dongjoon Hyun commented on SPARK-34791:
---

Thank you for reporting, [~jeetendray]. Could you try to use the latest 
version, Apache Spark 3.1.1?
Our CI is using R 4.0.3.
{code}
root@e3c1c7d8cbf8:/# R

R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
{code}

> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws a "node stack overflow" error upon running the code (sample below) 
> on R-4.0.2 with Spark 3.0.1.
> The same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5).
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before invoking 
> the Spark session and once within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304507#comment-17304507
 ] 

Dongjoon Hyun commented on SPARK-34790:
---

Thank you for reporting, [~hezuojiao]. However, it may not be a bug. Sometimes a 
test suite like AdaptiveQueryExecSuite makes certain assumptions. Could you be more 
specific by posting the failure output you saw?
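
For context, a minimal sketch of the configuration combination being described 
(spark.sql.adaptive.fetchShuffleBlocksInBatch is the batch-fetch switch; whether 
turning it off avoids the failures is what the report implies, not something 
verified here):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("io-encryption-batch-fetch")
  .config("spark.io.encryption.enabled", "true")                  // I/O encryption on
  .config("spark.sql.adaptive.enabled", "true")                   // AQE on
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "true") // batch fetch on
  .getOrCreate()

// A shuffle-heavy query of the kind AdaptiveQueryExecSuite exercises.
spark.range(0, 1000).groupBy(col("id") % 10).count().show()
{code}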


> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, lots of test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34790:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, lots of test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304508#comment-17304508
 ] 

Dongjoon Hyun commented on SPARK-34790:
---

According to the issue title, I convert this into a subtask of SPARK-33828.

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, lots of test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34789:
--
Issue Type: Improvement  (was: Task)

> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The 
> current solution uses GitHub URLs like 
> "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This connects two Spark versions in an unhealthy way: the "master" branch, which 
> is a moving part, is tied to the committed test code, which is non-moving (it 
> might even be already released).
> So a test running for an earlier version of Spark expects something (filename, 
> content, path) from a later release, and, what is worse, when the moving version 
> changes, the earlier test will break. 
> The idea is to introduce a method like:
> {noformat}
> withHttpServer(files) {
> }
> {noformat}
> Which uses a Jetty ResourceHandler to serve the listed files (or directories 
> / or just the root where it is started from) and stops the server in the 
> finally.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304509#comment-17304509
 ] 

Dongjoon Hyun commented on SPARK-34789:
---

Thank you for working on this, [~attilapiros]. This will be useful.
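
For reference, a rough sketch of what the proposed helper (quoted below) might look 
like, assuming Jetty is available on the test classpath; the name and signature are 
illustrative, not a final API:

{code:java}
import java.net.URI
import org.eclipse.jetty.server.{Server, ServerConnector}
import org.eclipse.jetty.server.handler.ResourceHandler

// Serve `resourceBase` over HTTP on an ephemeral port, run `body` against the
// base URI, and always stop the server afterwards.
def withHttpServer[T](resourceBase: String)(body: URI => T): T = {
  val server = new Server(0) // 0 = pick a free port
  val handler = new ResourceHandler()
  handler.setResourceBase(resourceBase)
  server.setHandler(handler)
  server.start()
  try {
    val port = server.getConnectors.head.asInstanceOf[ServerConnector].getLocalPort
    body(new URI(s"http://localhost:$port/"))
  } finally {
    server.stop()
  }
}
{code}

A test could then serve its own data directory and build URLs from the returned 
base URI instead of pointing at raw.githubusercontent.com.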

> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The 
> current solution uses GitHub URLs like 
> "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This connects two Spark versions in an unhealthy way: the "master" branch, which 
> is a moving part, is tied to the committed test code, which is non-moving (it 
> might even be already released).
> So a test running for an earlier version of Spark expects something (filename, 
> content, path) from a later release, and, what is worse, when the moving version 
> changes, the earlier test will break. 
> The idea is to introduce a method like:
> {noformat}
> withHttpServer(files) {
> }
> {noformat}
> Which uses a Jetty ResourceHandler to serve the listed files (or directories 
> / or just the root where it is started from) and stops the server in the 
> finally.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304511#comment-17304511
 ] 

Dongjoon Hyun commented on SPARK-34788:
---

Could you be more specific, [~Ngone51]? When the disk is full, many code paths are 
involved. In which code path do you see `FileNotFoundException` instead of 
`IOException`?

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws a FileNotFoundException instead of an 
> IOException with a hint. It's quite a confusing error for users.
>  
> And there's probably a way to detect a full disk: when we get a 
> `FileNotFoundException`, we try 
> [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] 
> to see if a SyncFailedException is thrown. If it is, then we 
> throw an IOException with the disk-full hint.
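
For illustration only, one possible shape of that detection (a sketch, not Spark's 
actual write path; the helper name and the probe-file approach are assumptions):

{code:java}
import java.io.{File, FileNotFoundException, FileOutputStream, IOException, SyncFailedException}

// After a suspicious FileNotFoundException, probe the parent directory by writing
// and fsync-ing a tiny file; if that also fails, surface an IOException with a
// clearer disk-full hint instead of the original FileNotFoundException.
def rethrowWithDiskFullHint(dir: File, cause: FileNotFoundException): Nothing = {
  val probe = new File(dir, s".disk-probe-${System.nanoTime()}")
  val diskLooksFull =
    try {
      val out = new FileOutputStream(probe)
      try { out.write(1); out.getFD.sync(); false }
      finally { out.close(); probe.delete() }
    } catch {
      case _: SyncFailedException | _: IOException => true
    }
  if (diskLooksFull) {
    throw new IOException(s"Failed to write under $dir; the disk may be full", cause)
  } else {
    throw cause
  }
}
{code}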



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34788:
--
Affects Version/s: (was: 2.4.7)
   (was: 3.1.0)
   (was: 3.0.0)
   3.2.0

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws a FileNotFoundException instead of an 
> IOException with a hint. It's quite a confusing error for users.
>  
> And there's probably a way to detect a full disk: when we get a 
> `FileNotFoundException`, we try 
> [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] 
> to see if a SyncFailedException is thrown. If it is, then we 
> throw an IOException with the disk-full hint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34787:
--
Affects Version/s: (was: 3.0.1)
   3.2.0

> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX)
> --
>
> Key: SPARK-34787
> URL: https://issues.apache.org/jira/browse/SPARK-34787
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: akiyamaneko
>Priority: Minor
>
> An Option variable in the Spark historyServer log should be displayed as its 
> actual value instead of Some(XX):
> {code:html}
> 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt 
> application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO 
> FsHistoryProvider: Parsing 
> hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421
>  for listing data...
> {code}
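
For reference, a tiny sketch of the kind of change implied (variable names are 
hypothetical; the point is just to unwrap the Option before interpolating it):

{code:java}
val appId = "application_1613641231234_0421"
val attemptId: Option[String] = Some("1")

// Before: interpolating the Option directly prints ".../Some(1)".
val before = s"Failed to load application attempt $appId/$attemptId"

// After: unwrap the Option so the log shows the actual value (or a placeholder).
val after = s"Failed to load application attempt $appId/${attemptId.getOrElse("none")}"
{code}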



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-34795:


 Summary: Adds a new job in GitHub Actions to check the output of 
TPC-DS queries
 Key: SPARK-34795
 URL: https://issues.apache.org/jira/browse/SPARK-34795
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Takeshi Yamamuro


This ticket aims at adding a new job in GitHub Actions to check the output of 
TPC-DS queries. There are some cases where we noticed runtime-related bugs far 
after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34795:
-
Description: This ticket aims at adding a new job in GitHub Actions to 
check the output of TPC-DS queries. There are some cases where we noticed 
runtime-related bugs after merging commits (e.g. SPARK-33822). Therefore, I 
think it is worth adding a new job in GitHub Actions to check query output of 
TPC-DS (sf=1).  (was: This ticket aims at adding a new job in GitHub Actions to 
check the output of TPC-DS queries. There are some cases where we noticed 
runtime-related bugs far after merging commits (e.g. SPARK-33822). Therefore, 
I think it is worth adding a new job in GitHub Actions to check query output of 
TPC-DS (sf=1).)

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304521#comment-17304521
 ] 

Apache Spark commented on SPARK-34795:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31886

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34795:


Assignee: Apache Spark

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34795:


Assignee: (was: Apache Spark)

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types

2021-03-18 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304526#comment-17304526
 ] 

Simeon Simeonov commented on SPARK-27790:
-

Maxim, this is good stuff. 

Does ANSI SQL allow operations on dates using the YEAR-MONTH interval type? I 
didn't see that mentioned in Milestone 1.

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month/day line. That is, there 
> is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day”, since not 
> all months have the same number of days.
> I propose here to introduce the two flavours of INTERVAL described in the 
> ANSI SQL Standard and to deprecate Spark's interval type.
> * ANSI describes two non-overlapping “classes”: 
> ** YEAR-MONTH, 
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Supports datetime arithmetic
> * Can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* -- Spark Interval equivalency (the new interval types meet 
> or exceed all functions of the existing SQL Interval):
> * Add two new DataType implementations for interval year-month and 
> day-second. Includes the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-seconds fields and return legacy interval values)
> * Interval arithmetic (interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day second), 
> Datetime +/- interval
> * Cast to and from the new two interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34794:


Assignee: Apache Spark

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Assignee: Apache Spark
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34794:


Assignee: (was: Apache Spark)

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304532#comment-17304532
 ] 

Apache Spark commented on SPARK-34794:
--

User 'dmsolow' has created a pull request for this issue:
https://github.com/apache/spark/pull/31887

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:string,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Daniel Solow (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304533#comment-17304533
 ] 

Daniel Solow commented on SPARK-34794:
--

PR is at https://github.com/apache/spark/pull/31887

Thanks to [~rspitzer] for suggesting AtomicInteger
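
For readers following along, here is a minimal, hypothetical sketch (not the code in the PR above) of how an AtomicInteger can hand out unique lambda variable names so that nested higher-order functions in the DSL no longer collide on a hardcoded "x":

{code:java}
import java.util.concurrent.atomic.AtomicInteger

// Illustration only: a process-wide counter that suffixes every generated
// lambda variable, so an outer transform gets e.g. "x_0" and a nested
// transform gets "x_1", and struct(number, letter) can no longer bind both
// fields to the same variable.
object LambdaNames {
  private val counter = new AtomicInteger(0)

  def fresh(prefix: String = "x"): String =
    s"${prefix}_${counter.getAndIncrement()}"
}
{code}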

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:string,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304538#comment-17304538
 ] 

Apache Spark commented on SPARK-34596:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31888

> NewInstance.doGenCode should not throw malformed class name error
> -
>
> Key: SPARK-34596
> URL: https://issues.apache.org/jira/browse/SPARK-34596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.1.0
>Reporter: Kris Mok
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Similar to SPARK-32238 and SPARK-32999, the use of 
> {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic 
> because Scala classes may trigger {{java.lang.InternalError: Malformed class 
> name}}.
> This happens more often when using nested classes in Scala (or declaring 
> classes in Scala REPL which implies class nesting).
> Note that on newer versions of JDK the underlying malformed class name no 
> longer reproduces (fixed in the JDK by 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less 
> of an issue there. But on JDK8u this problem still exists so we still have to 
> fix it.
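
As a hedged illustration of the failure mode (this is not the patch in the PR above), a defensive wrapper around getSimpleName might look like the following; the helper name and fallback choice are assumptions for the example only:

{code:java}
def safeSimpleName(cls: Class[_]): String = {
  try {
    cls.getSimpleName
  } catch {
    // On JDK 8, deeply nested Scala classes (including classes declared in
    // the REPL) can make getSimpleName throw
    // java.lang.InternalError("Malformed class name"); fall back to the last
    // segment of the fully qualified name instead of failing codegen.
    case _: InternalError => cls.getName.split('.').last.stripSuffix("$")
  }
}
{code}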



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304539#comment-17304539
 ] 

Apache Spark commented on SPARK-34596:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31888

> NewInstance.doGenCode should not throw malformed class name error
> -
>
> Key: SPARK-34596
> URL: https://issues.apache.org/jira/browse/SPARK-34596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.1.0
>Reporter: Kris Mok
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Similar to SPARK-32238 and SPARK-32999, the use of 
> {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic 
> because Scala classes may trigger {{java.lang.InternalError: Malformed class 
> name}}.
> This happens more often when using nested classes in Scala (or declaring 
> classes in Scala REPL which implies class nesting).
> Note that on newer versions of JDK the underlying malformed class name no 
> longer reproduces (fixed in the JDK by 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less 
> of an issue there. But on JDK8u this problem still exists so we still have to 
> fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34781) Eliminate LEFT SEMI/ANTI join to its left child side with AQE

2021-03-18 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34781.
--
Fix Version/s: 3.2.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31873

> Eliminate LEFT SEMI/ANTI join to its left child side with AQE
> -
>
> Key: SPARK-34781
> URL: https://issues.apache.org/jira/browse/SPARK-34781
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.2.0
>
>
> In `EliminateJoinToEmptyRelation.scala`, we can extend the rule to cover more 
> cases for LEFT SEMI and LEFT ANTI joins (see the sketch after this list):
>  # The join is a left semi join, its right side is non-empty, and the join 
> condition is empty. Eliminate the join to its left side.
>  # The join is a left anti join and its right side is empty. Eliminate the 
> join to its left side.
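
A hedged sketch of the two cases (the table names t1, t2 and column k are placeholders, not from the ticket); at runtime, once AQE knows the right side's size, both queries could be reduced to a plain scan of t1:

{code:java}
// Case 1: LEFT SEMI join with a non-empty right side and no join condition.
spark.sql("SELECT * FROM t1 LEFT SEMI JOIN t2")

// Case 2: LEFT ANTI join whose right side turns out to be empty at runtime.
spark.sql("SELECT * FROM t1 LEFT ANTI JOIN t2 ON t1.k = t2.k")
{code}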



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)
Cheng Su created SPARK-34796:


 Summary: Codegen compilation error for query with LIMIT operator 
and without AQE
 Key: SPARK-34796
 URL: https://issues.apache.org/jira/browse/SPARK-34796
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.1.0, 3.2.0
Reporter: Cheng Su


Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")

  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:
{code:java}
17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
17:46:01.540 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an 
rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
 at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) 
at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
 at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) 
at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) 
at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) 
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) 
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392)
 at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:384)
 at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1445) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:384) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1312)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:833) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:410) at 
org.codehaus.janino.UnitCompi

[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304550#comment-17304550
 ] 

Cheng Su commented on SPARK-34796:
--

FYI I am working on a PR to fix it now.

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Blocker
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
> {code:java}
> 17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable 
> to load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> 17:46:01.540 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an 
> rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
> Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
>  at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) at 
> org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
>  at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
> org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
> org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
> org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
> org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
>  at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
>  at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
> org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
> org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
> org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
>  at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
>  at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
> org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
> org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
> org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501)
>  at 
> org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490)
>  at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
> org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
> org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362)
>  at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335)
>  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
> org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
> org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392)
>  at 
> org.codehaus.janin

[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-34796:
-
Description: 
Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
 
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:
{code:java}
17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
17:46:01.540 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an 
rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
 at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) 
at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
 at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) 
at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) 
at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) 
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) 
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392)
 at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:384)
 at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1445) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:384) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1312)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:833) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:410) at 
org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:389)
 at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDecl

[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-34796:
-
Description: 
Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
 
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:

 

https://gist.github.com/c21/ea760c75b546d903247582be656d9d66

  was:
Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
 
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:
{code:java}
17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
17:46:01.540 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an 
rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
 at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) 
at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
 at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) 
at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) 
at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) 
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) 
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at

[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304554#comment-17304554
 ] 

Hyukjin Kwon commented on SPARK-34796:
--

[~chengsu] I am lowering the priority because AQE is enabled by default now, so 
fewer users are affected.
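
A minimal sketch of what that means in practice (assuming a running SparkSession named spark): leaving adaptive execution at its default keeps the failing non-AQE codegen path from being taken, while explicitly disabling it reproduces the error in the description.

{code:java}
// Default in Spark 3.1+: AQE on, the LIMIT-over-empty-join plan compiles fine.
spark.conf.set("spark.sql.adaptive.enabled", "true")

// Setting it to "false" (as the test below does via
// SQLConf.ADAPTIVE_EXECUTION_ENABLED) is what triggers the codegen failure.
{code}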

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Blocker
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34796:
-
Priority: Critical  (was: Blocker)

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Critical
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304555#comment-17304555
 ] 

Hyukjin Kwon commented on SPARK-34791:
--

It's possibly a regression from SPARK-29777 ... yeah, it would be great if we 
could confirm whether it works with the latest versions.


> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before 
> invoking the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304558#comment-17304558
 ] 

Hyukjin Kwon commented on SPARK-34791:
--

cc [~falaki] FYI

> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before 
> invoking the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304556#comment-17304556
 ] 

Hyukjin Kwon commented on SPARK-34791:
--

See also https://github.com/apache/spark/pull/26429#discussion_r346103050

> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before 
> invoking the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304559#comment-17304559
 ] 

Cheng Su commented on SPARK-34796:
--

[~hyukjin.kwon] - yeah, I agree. Thanks for correcting the priority here.

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Critical
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31661) Document usage of blockSize

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31661:


Assignee: zhengruifeng

> Document usage of blockSize
> ---
>
> Key: SPARK-31661
> URL: https://issues.apache.org/jira/browse/SPARK-31661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31976) use MemoryUsage to control the size of block

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31976:


Assignee: zhengruifeng

> use MemoryUsage to control the size of block
> 
>
> Key: SPARK-31976
> URL: https://issues.apache.org/jira/browse/SPARK-31976
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz (number of non-zero entries) of a block.
> So it may be reasonable to control the size of a block by memory usage 
> instead of by number of rows.
>  
> note1: param blockSize is already used in ALS and MLP to stack vectors 
> (expected to be dense);
> note2: we may refer to {{Strategy.maxMemoryInMB}} in tree models;
>  
> There may be two ways to implement this (see the sketch after this list):
> 1. compute the sparsity of the input vectors ahead of training (this can be 
> done together with other statistics computation, possibly without an extra 
> pass), and infer a reasonable number of vectors to stack;
> 2. stack the input vectors adaptively, by monitoring the memory usage of a 
> block.
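
A minimal, hypothetical sketch of option 2 (adaptive stacking); the per-vector size estimate is an assumption chosen only for illustration:

{code:java}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.ml.linalg.Vector

// Group an iterator of vectors into blocks whose estimated footprint stays
// under maxBytes, instead of using a fixed number of rows per block.
def stackByMemory(vectors: Iterator[Vector], maxBytes: Long): Iterator[Array[Vector]] =
  new Iterator[Array[Vector]] {
    override def hasNext: Boolean = vectors.hasNext
    override def next(): Array[Vector] = {
      val buf = ArrayBuffer.empty[Vector]
      var bytes = 0L
      while (vectors.hasNext && bytes < maxBytes) {
        val v = vectors.next()
        // rough estimate: ~12 bytes per active entry (8-byte value + 4-byte index)
        bytes += v.numActives * 12L
        buf += v
      }
      buf.toArray
    }
  }
{code}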



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304561#comment-17304561
 ] 

Hyukjin Kwon commented on SPARK-34780:
--

cc [~sunchao] [~maxgekk] FYI

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a DataFrame is cached, its logical plan can contain copies of the Spark 
> session, meaning the SQLConfs in effect at caching time are stored with it. If 
> a different DataFrame later replaces parts of its logical plan with that 
> cached logical plan, the cached SQLConfs are used when evaluating the cached 
> plan. This is because HadoopFsRelation ignores the sparkSession for equality 
> checks (introduced in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  
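
A hedged workaround sketch (not from the ticket; df2 and tableDir refer to the test above): dropping the cached plan before changing the conf makes the relation get re-planned with the new value.

{code:java}
// Invalidate the cached plan first, then change the conf and re-read.
df2.unpersist()                       // or spark.catalog.clearCache()
SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
val fresh = spark.read.parquet(tableDir).select("key")
{code}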



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34782) Not equal functions (!=, <>) are missing in SQL doc

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304560#comment-17304560
 ] 

Hyukjin Kwon commented on SPARK-34782:
--

Duplicate of SPARK-34747?

> Not equal functions (!=, <>) are missing in SQL doc 
> 
>
> Key: SPARK-34782
> URL: https://issues.apache.org/jira/browse/SPARK-34782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> https://spark.apache.org/docs/latest/api/sql/search.html?q=%3C%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34747) Add virtual operators to the built-in function document.

2021-03-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34747.
--
Fix Version/s: 3.0.3
   3.1.2
   3.2.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/31841

> Add virtual operators to the built-in function document.
> 
>
> Key: SPARK-34747
> URL: https://issues.apache.org/jira/browse/SPARK-34747
> Project: Spark
>  Issue Type: Bug
>  Components: docs, SQL
>Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> After SPARK-34697, DESCRIBE FUNCTION and SHOW FUNCTIONS can describe/show 
> built-in operators including the following virtual operators.
> * !=
> * <>
> * between
> * case
> * ||
> But they are still absent from the built-in functions document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34754) sparksql 'add jar' not support hdfs ha mode in k8s

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304568#comment-17304568
 ] 

Hyukjin Kwon commented on SPARK-34754:
--

[~lithiumlee-_-] can you test whether it works in newer versions? K8S support 
just became GA in Spark 3.1.

> sparksql  'add jar' not support  hdfs ha mode in k8s  
> --
>
> Key: SPARK-34754
> URL: https://issues.apache.org/jira/browse/SPARK-34754
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.7
>Reporter: lithiumlee-_-
>Priority: Major
>
> When the app is submitted to K8S, the executors hit the exception 
> "java.net.UnknownHostException: xx". 
> The UDF jar URI uses the HDFS HA style, but the exception stack shows 
> "...*createNonHAProxy*..." (a workaround sketch follows the stack trace below).
>  
> hql: 
> {code:java}
> // code placeholder
> add jar hdfs://xx/test.jar;
> create temporary function test_udf as 'com.xxx.xxx';
> create table test.test_udf as 
> select test_udf('1') name_1;
>  {code}
>  
>  
> exception:
> {code:java}
> // code placeholder
>  TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.89.44, executor 
> 1): java.lang.IllegalArgumentException: java.net.UnknownHostException: xx
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:439)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:321)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:696)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:636)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2796)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2830)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2812)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
> at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1866)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:721)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:816)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:808)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:808)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: xx
> ... 28 more
> {code}
>  
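
A hedged workaround sketch (the nameservice "xx" and all host names below are placeholders): shipping the client-side HDFS HA settings to the executors as spark.hadoop.* confs makes the logical nameservice resolvable without a mounted hdfs-site.xml.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.dfs.nameservices", "xx")
  .config("spark.hadoop.dfs.ha.namenodes.xx", "nn1,nn2")
  .config("spark.hadoop.dfs.namenode.rpc-address.xx.nn1", "namenode1:8020")
  .config("spark.hadoop.dfs.namenode.rpc-address.xx.nn2", "namenode2:8020")
  .config("spark.hadoop.dfs.client.failover.proxy.provider.xx",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
  .getOrCreate()
{code}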



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34600) Support user defined types in Pandas UDF

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304569#comment-17304569
 ] 

Hyukjin Kwon commented on SPARK-34600:
--

[~eddyxu] if this ticket is an umbrella ticket, let's create another JIRA and 
move your PR (https://github.com/apache/spark/pull/31735) to that JIRA. An 
umbrella JIRA doesn't usually have a PR against it.

> Support user defined types in Pandas UDF
> 
>
> Key: SPARK-34600
> URL: https://issues.apache.org/jira/browse/SPARK-34600
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Lei (Eddy) Xu
>Priority: Major
>  Labels: pandas, sql, udf
>
> Because Pandas UDFs use pyarrow to pass data, they do not currently 
> support UserDefinedTypes the way normal Python UDFs do.
> For example:
> {code:python}
> class BoxType(UserDefinedType):
> @classmethod
> def sqlType(cls) -> StructType:
> return StructType(
> fields=[
> StructField("xmin", DoubleType(), False),
> StructField("ymin", DoubleType(), False),
> StructField("xmax", DoubleType(), False),
> StructField("ymax", DoubleType(), False),
> ]
> )
> @pandas_udf(
>  returnType=StructType([StructField("boxes", ArrayType(Box()))]
> )
> def pandas_pf(s: pd.DataFrame) -> pd.DataFrame:
>yield s
> {code}
> The logs show
> {code}
> try:
> to_arrow_type(self._returnType_placeholder)
> except TypeError:
> >   raise NotImplementedError(
> "Invalid return type with scalar Pandas UDFs: %s is "
> E   NotImplementedError: Invalid return type with scalar 
> Pandas UDFs: StructType(List(StructField(boxes,ArrayType(Box,true),true))) is 
> not supported
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34776) Catalyst error on on certain struct operation (Couldn't find _gen_alias_)

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304571#comment-17304571
 ] 

Hyukjin Kwon commented on SPARK-34776:
--

cc [~viirya] FYI

> Catalyst error on on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
>
