[jira] [Commented] (SPARK-34762) Many PR's Scala 2.13 build action failed
[ https://issues.apache.org/jira/browse/SPARK-34762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303892#comment-17303892 ] Apache Spark commented on SPARK-34762: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31880 > Many PR's Scala 2.13 build action failed > > > Key: SPARK-34762 > URL: https://issues.apache.org/jira/browse/SPARK-34762 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0, 3.1.2 > > > PR with Scala 2.13 build failure includes > * [https://github.com/apache/spark/pull/31849] > * [https://github.com/apache/spark/pull/31848] > * [https://github.com/apache/spark/pull/31844] > * [https://github.com/apache/spark/pull/31843] > * https://github.com/apache/spark/pull/31841 > {code:java} > [error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:26:1: > error: package org.apache.commons.cli does not exist > 1278[error] import org.apache.commons.cli.GnuParser; > 1279[error] ^ > 1280[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:176:1: > error: cannot find symbol > 1281[error] private final Options options = new Options(); > 1282[error] ^ symbol: class Options > 1283[error] location: class ServerOptionsProcessor > 1284[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:177:1: > error: package org.apache.commons.cli does not exist > 1285[error] private org.apache.commons.cli.CommandLine commandLine; > 1286[error] ^ > 1287[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:255:1: > error: cannot find symbol > 1288[error] HelpOptionExecutor(String serverName, Options options) { > 1289[error] ^ symbol: class > Options > 1290[error] location: class HelpOptionExecutor > 1291[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:176:1: > error: cannot find symbol > 1292[error] private final Options options = new Options(); > 1293[error] ^ symbol: class Options > 1294[error] location: class ServerOptionsProcessor > 1295[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:185:1: > error: cannot find symbol > 1296[error] options.addOption(OptionBuilder > 1297[error] ^ symbol: variable OptionBuilder > 1298[error] location: class ServerOptionsProcessor > 1299[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:192:1: > error: cannot find symbol > 1300[error] options.addOption(new Option("H", "help", false, "Print > help information")); > 1301[error] ^ symbol: class Option > 1302[error] location: class ServerOptionsProcessor > 1303[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:197:1: > error: cannot find symbol > 1304[error] commandLine = new GnuParser().parse(options, argv); > 1305[error] ^ symbol: class GnuParser > 1306[error] location: class ServerOptionsProcessor > 1307[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:211:1: > error: cannot find 
symbol > 1308[error] } catch (ParseException e) { > 1309[error]^ symbol: class ParseException > 1310[error] location: class ServerOptionsProcessor > 1311[error] > /home/runner/work/spark/spark/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/HiveServer2.java:262:1: > error: cannot find symbol > 1312[error] new HelpFormatter().printHelp(serverName, options); > 1313[error] ^ symbol: class HelpFormatter > 1314[error] location: class HelpOptionExecutor > 1315[error] Note: Some input files use or override a deprecated API. > 1316[error] Note: Recompile with -Xlint:deprecation for details. > 1317[error] 16 errors > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Commented] (SPARK-34766) Do not capture maven config for views
[ https://issues.apache.org/jira/browse/SPARK-34766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303893#comment-17303893 ] Apache Spark commented on SPARK-34766: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/31879 > Do not capture maven config for views > - > > Key: SPARK-34766 > URL: https://issues.apache.org/jira/browse/SPARK-34766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0 > > > Due to the bad network, we always use the thirdparty maven repo to run test. > e.g., > {code:java} > build/sbt "test:testOnly *SQLQueryTestSuite" > -Dspark.sql.maven.additionalRemoteRepositories=x > {code} > > It's failed with such error msg > ``` > [info] - show-tblproperties.sql *** FAILED *** (128 milliseconds) > [info] show-tblproperties.sql > [info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][ > [info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories x]" > Result did not match for query #6 > [info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464) > ``` > It's not necessary to capture the maven config to view since it's a session > level config. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34087) a memory leak occurs when we clone the spark session
[ https://issues.apache.org/jira/browse/SPARK-34087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303895#comment-17303895 ] Apache Spark commented on SPARK-34087: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/31881 > a memory leak occurs when we clone the spark session > > > Key: SPARK-34087 > URL: https://issues.apache.org/jira/browse/SPARK-34087 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: Fu Chen >Priority: Major > Attachments: 1610451044690.jpg > > > In Spark 3.0.1, a memory leak occurs when we keep cloning the Spark session, > because a new ExecutionListenerBus instance is added to the AsyncEventQueue each > time a session is cloned. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32061) potential regression if use memoryUsage instead of numRows
[ https://issues.apache.org/jira/browse/SPARK-32061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-32061. -- Resolution: Resolved > potential regression if use memoryUsage instead of numRows > -- > > Key: SPARK-32061 > URL: https://issues.apache.org/jira/browse/SPARK-32061 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > 1, if the `memoryUsage` is improperly set, for example, too small to store a > instance; > 2, the blockify+GMM reuse two matrices whose shape is related to current > blockSize: > {code:java} > @transient private lazy val auxiliaryProbMat = DenseMatrix.zeros(blockSize, k) > @transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, > numFeatures) {code} > When implementing blockify+GMM, I found that if I do not pre-allocate those > matrices, there will be seriously regression (maybe 3~4 slower, I fogot the > detailed numbers); > 3, in MLP, three pre-allocated objects are also related to numRows: > {code:java} > if (ones == null || ones.length != delta.cols) ones = > BDV.ones[Double](delta.cols) > // TODO: allocate outputs as one big array and then create BDMs from it > if (outputs == null || outputs(0).cols != currentBatchSize) { > ... > // TODO: allocate deltas as one big array and then create BDMs from it > if (deltas == null || deltas(0).cols != currentBatchSize) { > deltas = new Array[BDM[Double]](layerModels.length) > ... {code} > I am not very familiar with the impl of MLP and failed to find some related > document about this pro-allocation. But I guess there maybe regression if we > disable this pro-allocation, since those objects look relatively big. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
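A minimal sketch of the blockSize-based pre-allocation pattern discussed in SPARK-32061 above (class and method names are illustrative, not Spark's actual GMM/MLP internals; it only assumes the public org.apache.spark.ml.linalg.DenseMatrix API): one scratch matrix sized for the largest block is allocated once and reused for every block, instead of being allocated per block.

{code:scala}
import org.apache.spark.ml.linalg.DenseMatrix

// Illustrative holder for a reusable scratch matrix (not Spark's actual code).
class BlockScratch(blockSize: Int, k: Int) {
  // Allocated lazily once and reused for every block, avoiding the
  // per-block allocation cost discussed above.
  @transient private lazy val probMat: DenseMatrix = DenseMatrix.zeros(blockSize, k)

  def scratchFor(numRows: Int): DenseMatrix = {
    require(numRows <= blockSize, s"block has $numRows rows but capacity is $blockSize")
    probMat
  }
}
{code}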
[jira] [Created] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
jobit mathew created SPARK-34785: Summary: SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file. Key: SPARK-34785 URL: https://issues.apache.org/jira/browse/SPARK-34785 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: jobit mathew SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. IF spark.sql.parquet.enableVectorizedReader=false below scenario pass but it will reduce the performance. In Hive, {code:java} create table test_decimal(amt decimal(18,2)) stored as parquet; insert into test_decimal select 100; alter table test_decimal change amt amt decimal(19,3); {code} In Spark, {code:java} select * from test_decimal; {code} {code:java} ++ |amt | ++ | 100.000 |{code} but if spark.sql.parquet.enableVectorizedReader=true below error {code:java} : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal; going to print operations logs printed operations logs going to print operations logs printed operations logs Getting log thread is interrupted, since query is done! Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (vm2 executor 2): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:312) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:181) at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) ... 20 more Driver stacktrace: at org.apache.spark.sql.hive.thriftserver.SparkExecuteS
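As the description above notes, the query succeeds once the vectorized reader is turned off; a minimal spark-shell sketch of that documented workaround (with the performance caveat already mentioned) is:

{code:scala}
// Workaround from the description: disable the vectorized Parquet reader so the
// widened decimal(19,3) schema can be read (slower, but the query succeeds).
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM test_decimal").show()
// +-------+
// |    amt|
// +-------+
// |100.000|
// +-------+
{code}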
[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303904#comment-17303904 ] jobit mathew commented on SPARK-34785: -- [~dongjoon] can you check it once. > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. Error Parquet column cannot be converted in file. > -- > > Key: SPARK-34785 > URL: https://issues.apache.org/jira/browse/SPARK-34785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: jobit mathew >Priority: Major > > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. > IF spark.sql.parquet.enableVectorizedReader=false below scenario pass but it > will reduce the performance. > In Hive, > {code:java} > create table test_decimal(amt decimal(18,2)) stored as parquet; > insert into test_decimal select 100; > alter table test_decimal change amt amt decimal(19,3); > {code} > In Spark, > {code:java} > select * from test_decimal; > {code} > {code:java} > ++ > |amt | > ++ > | 100.000 |{code} > but if spark.sql.parquet.enableVectorizedReader=true below error > {code:java} > : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal; > going to print operations logs > printed operations logs > going to print operations logs > printed operations logs > Getting log thread is interrupted, since query is done! > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4) (vm2 executor 2): > org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot > be converted in file > hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. 
Column: [amt], > Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: > org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:312) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283) > at > org.apa
[jira] [Resolved] (SPARK-34741) MergeIntoTable should avoid ambiguous reference
[ https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34741. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31835 [https://github.com/apache/spark/pull/31835] > MergeIntoTable should avoid ambiguous reference > --- > > Key: SPARK-34741 > URL: https://issues.apache.org/jira/browse/SPARK-34741 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.2.0 > > > When resolving the {{UpdateAction}}, which could reference attributes from > both target and source tables, Spark should know clearly where the attribute > comes from when there're conflicting attributes instead of picking up a > random one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34741) MergeIntoTable should avoid ambiguous reference
[ https://issues.apache.org/jira/browse/SPARK-34741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34741: --- Assignee: wuyi > MergeIntoTable should avoid ambiguous reference > --- > > Key: SPARK-34741 > URL: https://issues.apache.org/jira/browse/SPARK-34741 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > When resolving the {{UpdateAction}}, which could reference attributes from > both target and source tables, Spark should know clearly where the attribute > comes from when there're conflicting attributes instead of picking up a > random one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
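For context, a hedged illustration of the kind of statement involved (table and column names are made up, not taken from this ticket): when the UPDATE action references a column name that exists in both the target and the source, the reference should be resolved against the explicitly qualified side rather than an arbitrary one.

{code:scala}
// Illustrative only: qualifying t.value / s.value removes the ambiguity
// when both tables expose a column named `value`.
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.value = s.value
  WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""")
{code}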
[jira] [Updated] (SPARK-34771) Support UDT for Pandas with Arrow Optimization
[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darcy Shen updated SPARK-34771: --- Description: {code:python} spark.conf.set("spark.sql.execution.arrow.enabled", "true") from pyspark.testing.sqlutils import ExamplePoint import pandas as pd pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])}) df = spark.createDataFrame(pdf) df.toPandas() {code} with `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without WARNINGS. with `spark.sql.execution.arrow.enabled` = true, the above snippet works fine with WARNINGS. Because of Unsupported type in conversion, the Arrow optimization is actually turned off. Detailed steps to reproduce: {code:python} $ bin/pyspark Python 3.8.8 (default, Feb 24 2021, 13:46:16) [Clang 10.0.0 ] :: Anaconda, Inc. on darwin Type "help", "copyright", "credits" or "license" for more information. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.2.0-SNAPSHOT /_/ Using Python version 3.8.8 (default, Feb 24 2021 13:46:16) Spark context Web UI available at http://172.30.0.226:4040 Spark context available as 'sc' (master = local[*], app id = local-1615994008526). SparkSession available as 'spark'. >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") 21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it. >>> from pyspark.testing.sqlutils import ExamplePoint >>> import pandas as pd >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, >>> 2)])}) >>> df = spark.createDataFrame(pdf) /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. warnings.warn(msg) >>> >>> df.show() +--+ | point| +--+ |(0.0, 0.0)| |(0.0, 0.0)| +--+ >>> df.schema StructType(List(StructField(point,ExamplePointUDT,true))) >>> df.toPandas() /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Unsupported type in conversion to Arrow: ExamplePointUDT Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. 
warnings.warn(msg) point 0 (0.0,0.0) 1 (0.0,0.0) {code} Added a traceback: {code:python} >>> df = spark.createDataFrame(pdf) /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. warnings.warn(msg) Traceback (most recent call last): File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 319, in createDataFrame return self._create_from_pandas_with_arrow(data, schema, timezone) File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 451, in _create_from_pandas_with_arrow arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False) File "pyarrow/types.pxi", line 1317, in pyarrow.lib.Schema.from_pandas File "/Users/da/opt/miniconda3/envs/spark/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 529, in dataframe_to_types type_ = pa.array(c, from_pandas=True).type File "pyarrow/array.pxi", line 292, in pyarrow.lib.array File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type {code} was: {code:python} spark.conf.set("spark.sql.execution.a
[jira] [Updated] (SPARK-34771) Support UDT for Pandas with Arrow Optimization
[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darcy Shen updated SPARK-34771: --- Description: {code:python} spark.conf.set("spark.sql.execution.arrow.enabled", "true") from pyspark.testing.sqlutils import ExamplePoint import pandas as pd pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])}) df = spark.createDataFrame(pdf) df.toPandas() {code} with `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without WARNINGS. with `spark.sql.execution.arrow.enabled` = true, the above snippet works fine with WARNINGS. Because of Unsupported type in conversion, the Arrow optimization is actually turned off. Detailed steps to reproduce: {code:python} $ bin/pyspark Python 3.8.8 (default, Feb 24 2021, 13:46:16) [Clang 10.0.0 ] :: Anaconda, Inc. on darwin Type "help", "copyright", "credits" or "license" for more information. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.2.0-SNAPSHOT /_/ Using Python version 3.8.8 (default, Feb 24 2021 13:46:16) Spark context Web UI available at http://172.30.0.226:4040 Spark context available as 'sc' (master = local[*], app id = local-1615994008526). SparkSession available as 'spark'. >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") 21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it. >>> from pyspark.testing.sqlutils import ExamplePoint >>> import pandas as pd >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, >>> 2)])}) >>> df = spark.createDataFrame(pdf) /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. warnings.warn(msg) >>> >>> df.show() +--+ | point| +--+ |(0.0, 0.0)| |(0.0, 0.0)| +--+ >>> df.schema StructType(List(StructField(point,ExamplePointUDT,true))) >>> df.toPandas() /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Unsupported type in conversion to Arrow: ExamplePointUDT Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. 
warnings.warn(msg) point 0 (0.0,0.0) 1 (0.0,0.0) {code} Added a traceback after the display of the warning message: {code:python} import traceback traceback.print_exc() {code} {code:python} >>> df = spark.createDataFrame(pdf) /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:329: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. warnings.warn(msg) Traceback (most recent call last): File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 319, in createDataFrame return self._create_from_pandas_with_arrow(data, schema, timezone) File "/Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py", line 451, in _create_from_pandas_with_arrow arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False) File "pyarrow/types.pxi", line 1317, in pyarrow.lib.Schema.from_pandas File "/Users/da/opt/miniconda3/envs/spark/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 529, in dataframe_to_types type_ = pa.array(c, from_pandas=True).type File "pyarrow/array.pxi", line 292, in pyarrow.lib.array File "pyarrow/array.pxi", line 83, in pyarrow.lib._ndarray_to_array File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status pyarrow.lib.ArrowInvalid: Could not convert (1,1) with type ExamplePoint
[jira] [Updated] (SPARK-34771) Support UDT for Pandas with Arrow Optimization
[ https://issues.apache.org/jira/browse/SPARK-34771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Darcy Shen updated SPARK-34771: --- Description: {code:python} spark.conf.set("spark.sql.execution.arrow.enabled", "true") from pyspark.testing.sqlutils import ExamplePoint import pandas as pd pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])}) df = spark.createDataFrame(pdf) df.toPandas() {code} with `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without WARNINGS. with `spark.sql.execution.arrow.enabled` = true, the above snippet works fine with WARNINGS. Because of Unsupported type in conversion, the Arrow optimization is actually turned off. Detailed steps to reproduce: {code:python} $ bin/pyspark Python 3.8.8 (default, Feb 24 2021, 13:46:16) [Clang 10.0.0 ] :: Anaconda, Inc. on darwin Type "help", "copyright", "credits" or "license" for more information. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.2.0-SNAPSHOT /_/ Using Python version 3.8.8 (default, Feb 24 2021 13:46:16) Spark context Web UI available at http://172.30.0.226:4040 Spark context available as 'sc' (master = local[*], app id = local-1615994008526). SparkSession available as 'spark'. >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") 21/03/17 23:13:31 WARN SQLConf: The SQL config 'spark.sql.execution.arrow.enabled' has been deprecated in Spark v3.0 and may be removed in the future. Use 'spark.sql.execution.arrow.pyspark.enabled' instead of it. >>> from pyspark.testing.sqlutils import ExamplePoint >>> import pandas as pd >>> pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, >>> 2)])}) >>> df = spark.createDataFrame(pdf) /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:332: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Could not convert (1,1) with type ExamplePoint: did not recognize Python value type when inferring an Arrow data type Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. warnings.warn(msg) >>> >>> df.show() +--+ | point| +--+ |(0.0, 0.0)| |(0.0, 0.0)| +--+ >>> df.schema StructType(List(StructField(point,ExamplePointUDT,true))) >>> df.toPandas() /Users/da/github/apache/spark/python/pyspark/sql/pandas/conversion.py:87: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Unsupported type in conversion to Arrow: ExamplePointUDT Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true. 
warnings.warn(msg) point 0 (0.0,0.0) 1 (0.0,0.0) {code} was: {code:python} spark.conf.set("spark.sql.execution.arrow.enabled", "true") from pyspark.testing.sqlutils import ExamplePoint import pandas as pd pdf = pd.DataFrame({'point': pd.Series([ExamplePoint(1, 1), ExamplePoint(2, 2)])}) df = spark.createDataFrame(pdf) df.toPandas() {code} with `spark.sql.execution.arrow.enabled` = false, the above snippet works fine without WARNINGS. with `spark.sql.execution.arrow.enabled` = true, the above snippet works fine with WARNINGS. Because of Unsupported type in conversion, the Arrow optimization is actually turned off. Detailed steps to reproduce: {code:python} $ bin/pyspark Python 3.8.8 (default, Feb 24 2021, 13:46:16) [Clang 10.0.0 ] :: Anaconda, Inc. on darwin Type "help", "copyright", "credits" or "license" for more information. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). 21/03/17 23:13:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /__ / .__/\_,_/_/ /_/\_\ version 3.2.0-SNAPSHOT /_/ Using Python version 3.8.8 (default, Feb 24 2021 13:46:16) Spark context Web UI available at http://172.30.0.226:4040 Spark context available as 'sc' (master = local[*], app id = local-1615994008526). SparkSession available as 'spark'. >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") 21/03/17
[jira] [Created] (SPARK-34786) read parquet uint64 as decimal
Wenchen Fan created SPARK-34786: --- Summary: read parquet uint64 as decimal Key: SPARK-34786 URL: https://issues.apache.org/jira/browse/SPARK-34786 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan Currently Spark can't read parquet uint64 as it doesn't fit the Spark long type. We can read uint64 as decimal as a workaround. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
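A minimal sketch of the conversion idea, assuming the raw unsigned 64-bit value has already been read into a signed Long (this is plain JDK arithmetic, not the actual Parquet reader change):

{code:scala}
import java.math.BigDecimal

// Reinterpret the 64-bit pattern as unsigned and widen it to a decimal,
// e.g. the bit pattern 0xFFFFFFFFFFFFFFFFL becomes 18446744073709551615.
def uint64ToDecimal(raw: Long): BigDecimal =
  new BigDecimal(java.lang.Long.toUnsignedString(raw))

// uint64ToDecimal(-1L) == new BigDecimal("18446744073709551615")
{code}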
[jira] [Resolved] (SPARK-34774) The `change-scala- version.sh` script not replaced scala.version property correctly
[ https://issues.apache.org/jira/browse/SPARK-34774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-34774. -- Fix Version/s: 3.0.3 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 31865 [https://github.com/apache/spark/pull/31865] > The `change-scala- version.sh` script not replaced scala.version property > correctly > --- > > Key: SPARK-34774 > URL: https://issues.apache.org/jira/browse/SPARK-34774 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > Execute the following commands in order > # dev/change-scala-version.sh 2.13 > # dev/change-scala-version.sh 2.12 > # git status > there will generate git diff as follow: > {code:java} > diff --git a/pom.xml b/pom.xml > index ddc4ce2f68..f43d8c8f78 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -162,7 +162,7 @@ > 3.4.1 > > 3.2.2 > - 2.12.10 > + 2.13.5 > 2.12 > 2.0.0 > --test > {code} > seem 'scala.version' property was not replaced correctly -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34774) The `change-scala- version.sh` script not replaced scala.version property correctly
[ https://issues.apache.org/jira/browse/SPARK-34774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-34774: Assignee: Yang Jie > The `change-scala- version.sh` script not replaced scala.version property > correctly > --- > > Key: SPARK-34774 > URL: https://issues.apache.org/jira/browse/SPARK-34774 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > Execute the following commands in order > # dev/change-scala-version.sh 2.13 > # dev/change-scala-version.sh 2.12 > # git status > there will generate git diff as follow: > {code:java} > diff --git a/pom.xml b/pom.xml > index ddc4ce2f68..f43d8c8f78 100644 > --- a/pom.xml > +++ b/pom.xml > @@ -162,7 +162,7 @@ > 3.4.1 > > 3.2.2 > - 2.12.10 > + 2.13.5 > 2.12 > 2.0.0 > --test > {code} > seem 'scala.version' property was not replaced correctly -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34766) Do not capture maven config for views
[ https://issues.apache.org/jira/browse/SPARK-34766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-34766: Fix Version/s: 3.1.2 > Do not capture maven config for views > - > > Key: SPARK-34766 > URL: https://issues.apache.org/jira/browse/SPARK-34766 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: ulysses you >Assignee: ulysses you >Priority: Minor > Fix For: 3.2.0, 3.1.2 > > > Due to the bad network, we always use the thirdparty maven repo to run test. > e.g., > {code:java} > build/sbt "test:testOnly *SQLQueryTestSuite" > -Dspark.sql.maven.additionalRemoteRepositories=x > {code} > > It's failed with such error msg > ``` > [info] - show-tblproperties.sql *** FAILED *** (128 milliseconds) > [info] show-tblproperties.sql > [info] Expected "...rredTempViewNames [][]", but got "...rredTempViewNames [][ > [info] view.sqlConfig.spark.sql.maven.additionalRemoteRepositories x]" > Result did not match for query #6 > [info] SHOW TBLPROPERTIES view (SQLQueryTestSuite.scala:464) > ``` > It's not necessary to capture the maven config to view since it's a session > level config. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34786) read parquet uint64 as decimal
[ https://issues.apache.org/jira/browse/SPARK-34786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304136#comment-17304136 ] Kent Yao commented on SPARK-34786: -- I'd like to do this. > read parquet uint64 as decimal > -- > > Key: SPARK-34786 > URL: https://issues.apache.org/jira/browse/SPARK-34786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > > Currently Spark can't read parquet uint64 as it doesn't fit the Spark long > type. We can read uint64 as decimal as a workaround. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)
akiyamaneko created SPARK-34787: --- Summary: Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX) Key: SPARK-34787 URL: https://issues.apache.org/jira/browse/SPARK-34787 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.1 Reporter: akiyamaneko Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX): {code:html} 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO FsHistoryProvider: Parsing hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421 for listing data... {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
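A small sketch of the general fix pattern (variable names are illustrative, not the actual ApplicationCache code): unwrap the Option before interpolating it into the log line.

{code:scala}
val appId = "application_1613641231234_0421"
val attemptId: Option[String] = Some("1")

// Current behaviour: the Option's toString leaks into the message.
println(s"Failed to load application attempt $appId/$attemptId")
// ... application_1613641231234_0421/Some(1)

// Preferred: render the contained value, and nothing when it is None.
println(s"Failed to load application attempt $appId${attemptId.map("/" + _).getOrElse("")}")
// ... application_1613641231234_0421/1
{code}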
[jira] [Commented] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)
[ https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304145#comment-17304145 ] Apache Spark commented on SPARK-34787: -- User 'kyoty' has created a pull request for this issue: https://github.com/apache/spark/pull/31882 > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX) > -- > > Key: SPARK-34787 > URL: https://issues.apache.org/jira/browse/SPARK-34787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX): > {code:html} > 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt > application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO > FsHistoryProvider: Parsing > hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421 > for listing data... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)
[ https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34787: Assignee: Apache Spark > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX) > -- > > Key: SPARK-34787 > URL: https://issues.apache.org/jira/browse/SPARK-34787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Assignee: Apache Spark >Priority: Minor > > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX): > {code:html} > 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt > application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO > FsHistoryProvider: Parsing > hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421 > for listing data... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)
[ https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34787: Assignee: (was: Apache Spark) > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX) > -- > > Key: SPARK-34787 > URL: https://issues.apache.org/jira/browse/SPARK-34787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.1 >Reporter: akiyamaneko >Priority: Minor > > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX): > {code:html} > 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt > application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO > FsHistoryProvider: Parsing > hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421 > for listing data... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full
wuyi created SPARK-34788: Summary: Spark throws FileNotFoundException instead of IOException when disk is full Key: SPARK-34788 URL: https://issues.apache.org/jira/browse/SPARK-34788 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0, 3.0.0, 2.4.7 Reporter: wuyi When the disk is full, Spark throws FileNotFoundException instead of an IOException with a disk-full hint, which is quite confusing to users. There is probably a way to detect a full disk: when we get `FileNotFoundException`, we can try the technique described at [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] to see whether SyncFailedException is thrown. If it is, we throw an IOException with the disk-full hint. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
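A rough sketch of the detection idea described above, under the assumption that a tiny write plus fsync in the same directory is an acceptable probe (the helper names are made up; this is not the actual Spark patch):

{code:scala}
import java.io.{File, FileNotFoundException, FileOutputStream, IOException, SyncFailedException}

// Returns true when a tiny write + fsync in `dir` fails the way a full device does.
def looksLikeDiskFull(dir: File): Boolean = {
  val probe = new File(dir, s".disk-probe-${System.nanoTime()}")
  try {
    val out = new FileOutputStream(probe)
    try { out.write(1); out.getFD.sync(); false } finally out.close()
  } catch {
    case _: SyncFailedException => true // fsync failed: classic full-device symptom
    case _: IOException => true         // even opening the probe failed, e.g. "No space left on device"
  } finally {
    probe.delete()
  }
}

def writeBlock(file: File, data: Array[Byte]): Unit =
  try {
    val out = new FileOutputStream(file)
    try out.write(data) finally out.close()
  } catch {
    case e: FileNotFoundException if looksLikeDiskFull(file.getParentFile) =>
      // Rethrow with the disk-full hint proposed in the description.
      throw new IOException(s"Failed to write ${file.getName}: the disk may be full", e)
  }
{code}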
[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full
[ https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304185#comment-17304185 ] wuyi commented on SPARK-34788: -- [~wuxiaofei] Could you take a look at this? > Spark throws FileNotFoundException instead of IOException when disk is full > --- > > Key: SPARK-34788 > URL: https://issues.apache.org/jira/browse/SPARK-34788 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.7, 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > When the disk is full, Spark throws FileNotFoundException instead of > IOException with the hint. It's quite a confusing error to users. > > And there's probably a way to detect the disk full: when we get > `FileNotFoundException`, we try > [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] > to see if SyncFailedException throws. If SyncFailedException throws, then we > throw IOException with the disk full hint. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full
[ https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-34788: - Component/s: Shuffle > Spark throws FileNotFoundException instead of IOException when disk is full > --- > > Key: SPARK-34788 > URL: https://issues.apache.org/jira/browse/SPARK-34788 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.7, 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > When the disk is full, Spark throws FileNotFoundException instead of > IOException with the hint. It's quite a confusing error to users. > > And there's probably a way to detect the disk full: when we get > `FileNotFoundException`, we try > [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] > to see if SyncFailedException throws. If SyncFailedException throws, then we > throw IOException with the disk full hint. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34760) run JavaSQLDataSourceExample failed with Exception in runBasicDataSourceExample().
[ https://issues.apache.org/jira/browse/SPARK-34760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-34760: Assignee: zengrui > run JavaSQLDataSourceExample failed with Exception in > runBasicDataSourceExample(). > -- > > Key: SPARK-34760 > URL: https://issues.apache.org/jira/browse/SPARK-34760 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 3.0.1, 3.1.1 >Reporter: zengrui >Assignee: zengrui >Priority: Minor > > run JavaSparkSQLExample failed with Exception in runBasicDataSourceExample(). > when excecute > 'peopleDF.write().partitionBy("favorite_color").bucketBy(42,"name").saveAsTable("people_partitioned_bucketed");' > throws Exception: 'Exception in thread "main" > org.apache.spark.sql.AnalysisException: partition column favorite_color is > not defined in table people_partitioned_bucketed, defined table columns are: > age, name;' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34760) run JavaSQLDataSourceExample failed with Exception in runBasicDataSourceExample().
[ https://issues.apache.org/jira/browse/SPARK-34760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-34760. -- Fix Version/s: 3.0.3 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 31851 [https://github.com/apache/spark/pull/31851] > run JavaSQLDataSourceExample failed with Exception in > runBasicDataSourceExample(). > -- > > Key: SPARK-34760 > URL: https://issues.apache.org/jira/browse/SPARK-34760 > Project: Spark > Issue Type: Bug > Components: Examples >Affects Versions: 3.0.1, 3.1.1 >Reporter: zengrui >Assignee: zengrui >Priority: Minor > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > run JavaSparkSQLExample failed with Exception in runBasicDataSourceExample(). > when excecute > 'peopleDF.write().partitionBy("favorite_color").bucketBy(42,"name").saveAsTable("people_partitioned_bucketed");' > throws Exception: 'Exception in thread "main" > org.apache.spark.sql.AnalysisException: partition column favorite_color is > not defined in table people_partitioned_bucketed, defined table columns are: > age, name;' -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used
Attila Zsolt Piros created SPARK-34789: -- Summary: Introduce Jetty based construct for integration tests where HTTP(S) is used Key: SPARK-34789 URL: https://issues.apache.org/jira/browse/SPARK-34789 Project: Spark Issue Type: Task Components: Tests Affects Versions: 3.2.0 Reporter: Attila Zsolt Piros This came up during https://github.com/apache/spark/pull/31877#discussion_r596831803. Short summary: we have some tests where HTTP(S) is used to access files. The current solution uses GitHub URLs like "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt". This ties two Spark versions together in an unhealthy way: the "master" branch, which is a moving target, is coupled to the committed test code, which is fixed (and may even be released). A test running against an earlier version of Spark therefore expects something (filename, content, path) from a later release, and worse, whenever the moving version changes, the earlier test can break. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used
[ https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304245#comment-17304245 ] Attila Zsolt Piros commented on SPARK-34789: I am working on this > Introduce Jetty based construct for integration tests where HTTP(S) is used > --- > > Key: SPARK-34789 > URL: https://issues.apache.org/jira/browse/SPARK-34789 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up during > https://github.com/apache/spark/pull/31877#discussion_r596831803. > Short summary: we have some tests where HTTP(S) is used to access files. The > current solution uses github urls like > "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt";. > This connects two Spark version in an unhealthy way like connecting the > "master" branch which is moving part with the committed test code which is a > non-moving (as it might be even released). > So this way a test running for an earlier version of Spark expects something > (filename, content, path) from a the latter release and what is worse when > the moving version is changed the earlier test will break. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
hezuojiao created SPARK-34790: - Summary: Fail in fetch shuffle blocks in batch when i/o encryption is enabled. Key: SPARK-34790 URL: https://issues.apache.org/jira/browse/SPARK-34790 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.1 Reporter: hezuojiao When spark.io.encryption.enabled=true is set, lots of test cases in AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is incompatible with I/O encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
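Until the incompatibility is resolved, one possible workaround sketch (an assumption based on the description above, not a confirmed fix) is to keep I/O encryption on but disable batched shuffle-block fetching for adaptive execution:

{code:scala}
import org.apache.spark.sql.SparkSession

// Disabling batch fetch avoids the code path the description reports as
// incompatible with spark.io.encryption.enabled=true.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.io.encryption.enabled", "true")
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
  .getOrCreate()
{code}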
[jira] [Updated] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used
[ https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34789: --- Description: This came up during https://github.com/apache/spark/pull/31877#discussion_r596831803. Short summary: we have some tests where HTTP(S) is used to access files. The current solution uses github urls like "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt";. This connects two Spark version in an unhealthy way like connecting the "master" branch which is moving part with the committed test code which is a non-moving (as it might be even released). So this way a test running for an earlier version of Spark expects something (filename, content, path) from a the latter release and what is worse when the moving version is changed the earlier test will break. The idea is to introduce a method like: {noformat} withHttpServer(files) { } {noformat} Which uses a Jetty ResourceHandler to serve the listed files (or directories / or just the root where it is started from) and stops the server in the finally. was: This came up during https://github.com/apache/spark/pull/31877#discussion_r596831803. Short summary: we have some tests where HTTP(S) is used to access files. The current solution uses github urls like "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt";. This connects two Spark version in an unhealthy way like connecting the "master" branch which is moving part with the committed test code which is a non-moving (as it might be even released). So this way a test running for an earlier version of Spark expects something (filename, content, path) from a the latter release and what is worse when the moving version is changed the earlier test will break. > Introduce Jetty based construct for integration tests where HTTP(S) is used > --- > > Key: SPARK-34789 > URL: https://issues.apache.org/jira/browse/SPARK-34789 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up during > https://github.com/apache/spark/pull/31877#discussion_r596831803. > Short summary: we have some tests where HTTP(S) is used to access files. The > current solution uses github urls like > "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt";. > This connects two Spark version in an unhealthy way like connecting the > "master" branch which is moving part with the committed test code which is a > non-moving (as it might be even released). > So this way a test running for an earlier version of Spark expects something > (filename, content, path) from a the latter release and what is worse when > the moving version is changed the earlier test will break. > The idea is to introduce a method like: > {noformat} > withHttpServer(files) { > } > {noformat} > Which uses a Jetty ResourceHandler to serve the listed files (or directories > / or just the root where it is started from) and stops the server in the > finally. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
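A minimal sketch of the withHttpServer helper proposed in the update above, assuming Jetty 9.x on the test classpath (which Spark already provides); the exact signature is illustrative:

{code:scala}
import java.net.URI
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

// Serves the files under `resourceBase` over HTTP for the duration of `body`,
// then stops the server in the finally block, as described above.
def withHttpServer[T](resourceBase: String)(body: URI => T): T = {
  val handler = new ResourceHandler
  handler.setDirectoriesListed(true)
  handler.setResourceBase(resourceBase) // e.g. the local data/mllib directory
  val server = new Server(0)            // port 0: pick any free port
  server.setHandler(handler)
  server.start()
  try body(server.getURI)               // base URI such as http://localhost:54321/
  finally server.stop()
}
{code}

A test could then request, for example, s"${baseUri}pagerank_data.txt" instead of the raw.githubusercontent.com URL, keeping the served file in the same checkout as the test code.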
[jira] [Commented] (SPARK-34693) Support null in conversions to and from Arrow
[ https://issues.apache.org/jira/browse/SPARK-34693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304266#comment-17304266 ] Laurens commented on SPARK-34693: - {code:java} def tps(df) -> pd.DataFrame["workflow_id": int, "task_id": int, "task_slack": int]: df.set_index("id", inplace=True) graph = dict() forward_dict = dict() finish_times = dict() task_runtimes = dict() task_arrival_times = dict() workflow_id = None for row in df.to_records(): # 0: task id, 1: wf id, 2: children, 3: parents, 4: ts submit, 5: runtime graph[row[0]] = set(row[3].flatten()) forward_dict[row[0]] = set(row[2].flatten()) task_runtimes[row[0]] = row[5] task_arrival_times[row[0]] = row[4] workflow_id = row[1] del df del row try: groups = list(toposort(graph)) except CircularDependencyError: del forward_dict del finish_times del task_runtimes del task_arrival_times del workflow_id return pd.DataFrame(columns=["workflow_id", "task_id", "task_slack"]) del graphif len(groups) < 2: del forward_dict del finish_times del task_runtimes del task_arrival_times del workflow_id del groups return pd.DataFrame(columns=["workflow_id", "task_id", "task_slack"]) # Compute all full paths max_for_task = dict() q = deque() for i in groups[0]: max_for_task[i] = task_arrival_times[i] + task_runtimes[i] q.append((max_for_task[i], [i])) paths = [] max_path = -1 while len(q) > 0: time, path = q.popleft() # get partial path # We are are at time t[0] having travelled path t[1] if len(forward_dict[path[-1]]) == 0: # End task if time < max_path: continue # smaller path, cannot be a CP if time > max_path: # If we have a new max, clear the list and set it to the max max_path = time paths.clear() paths.append(path) # Add new and identical length paths else: for c in forward_dict[path[-1]]: # Loop over all children of the last task in the path we took if c in task_runtimes: # Special case: we find a child that arrives later then the current path. # Start a new path then from this child onwards and do not mark the previous nodes on the critical path # as they can actually delay. if time < task_arrival_times[c]: if c not in max_for_task: max_for_task[c] = task_arrival_times[c] + task_runtimes[c] q.append((max_for_task[c], [c])) else: # If the finishing of one of the children + the runtime causes the same or a new maximum, add a path to the queue # to explore from there on at that time. 
child_finish = time + task_runtimes[c] if c in max_for_task and child_finish < max_for_task[c]: continue max_for_task[c] = child_finish l = path.copy() l.append(c) q.append((child_finish, l)) del time del max_for_task del forward_dict del max_path del q del path if(len(paths) == 1): citical_path_tasks = set(paths[0]) else: citical_path_tasks = _reduce(set.union, [set(cp) for cp in paths]) del paths rows = deque() for i in range(len(groups)): max_in_group = -1 for task_id in groups[i]: # Compute max finish time in the group if task_id not in finish_times: continue # Task was not in snapshot of trace task_finish = finish_times[task_id] if task_finish > max_in_group: max_in_group = task_finish for task_id in groups[i]: if task_id in citical_path_tasks: # Skip if task is on a critical path rows.append([workflow_id, task_id, 0]) else: if task_id not in finish_times: continue # Task was not in snapshot of trace rows.append([workflow_id, task_id, max_in_group - finish_times[task_id]]) del groups del citical_path_tasks del finish_times return pd.DataFrame(rows, columns=["workflow_id", "task_id", "task_slack"]) {code} and the reading part: {code:java} kdf = ks.read_parquet(os.path.join(hdfs_path, folder, "tasks", "schema-1.0"), columns=[ "workflow_id", "id", "task_id", "children", "parents", "ts_submit", "runtime" ], pandas_metadata=False, engine='pyarrow') kdf = kdf[((kdf['children'].map(len) > 0) | (kdf['parents'].map(len) > 0))] kdf = kdf.astype({"
[jira] [Created] (SPARK-34791) SparkR throws node stack overflow
Jeet created SPARK-34791: Summary: SparkR throws node stack overflow Key: SPARK-34791 URL: https://issues.apache.org/jira/browse/SPARK-34791 Project: Spark Issue Type: Question Components: SparkR Affects Versions: 3.0.1 Reporter: Jeet SparkR throws a "node stack overflow" error upon running the code (sample below) on R-4.0.2 with Spark 3.0.1. The same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5). {code:java} source('sample.R') myclsr = myclosure_func() myclsr$get_some_date('2021-01-01') ## spark.lapply throws node stack overflow result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { source('sample.R') another_closure = myclosure_func() return(another_closure$get_some_date(rdate)) }) {code} Sample.R {code:java} ## util function, which calls itself getPreviousBusinessDate <- function(asofdate) { asdt <- asofdate; asdt <- as.Date(asofdate)-1; wd <- format(as.Date(asdt),"%A") if(wd == "Saturday" | wd == "Sunday") { return (getPreviousBusinessDate(asdt)); } return (asdt); } ## closure which calls util function myclosure_func = function() { myclosure = list() get_some_date = function (random_date) { return (getPreviousBusinessDate(random_date)) } myclosure$get_some_date = get_some_date return(myclosure) } {code} This seems to have been caused by sourcing sample.R twice: once before invoking the Spark session and again within the Spark session. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34791) SparkR throws node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeet updated SPARK-34791: - Description: SparkR throws "node stack overflow" error upon running code (sample below) on R-4.0.2 with Spark 3.0.1. Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) {{}} {code:java} source('sample.R') myclsr = myclosure_func() myclsr$get_some_date('2021-01-01') ## spark.lapply throws node stack overflow result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { source('sample.R') another_closure = myclosure_func() return(another_closure$get_some_date(rdate)) }) {code} {{}} Sample.R {{}} {code:java} ## util function, which calls itself getPreviousBusinessDate <- function(asofdate) { asdt <- asofdate; asdt <- as.Date(asofdate)-1; wd <- format(as.Date(asdt),"%A") if(wd == "Saturday" | wd == "Sunday") { return (getPreviousBusinessDate(asdt)); } return (asdt); } ## closure which calls util function myclosure_func = function() { myclosure = list() get_some_date = function (random_date) { return (getPreviousBusinessDate(random_date)) } myclosure$get_some_date = get_some_date return(myclosure) } {code} This seems to have caused by sourcing sample.R twice. Once before invoking Spark session and another within Spark session. was: SparkR throws "node stack overflow" error upon running code (sample below) on R-4.0.2 with Spark 3.0.1. Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) {{}} {code:java} source('sample.R') myclsr = myclosure_func() myclsr$get_some_date('2021-01-01') ## spark.lapply throws node stack overflow result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { source('sample.R') another_closure = myclosure_func() return(another_closure$get_some_date(rdate)) }) {code} {{}} Sample.R {{}} {code:java} ## util function, which calls itself getPreviousBusinessDate <- function(asofdate) { asdt <- asofdate; asdt <- as.Date(asofdate)-1; wd <- format(as.Date(asdt),"%A") if(wd == "Saturday" | wd == "Sunday") { return (getPreviousBusinessDate(asdt)); } return (asdt); } ## closure which calls util function myclosure_func = function() { myclosure = list() get_some_date = function (random_date) { return (getPreviousBusinessDate(random_date)) } myclosure$get_some_date = get_some_date return(myclosure) } {code} {{}} {{}} This seems to have caused by sourcing sample.R twice. Once before invoking Spark session and another within Spark session. {{}} {{}} {{}} > SparkR throws node stack overflow > - > > Key: SPARK-34791 > URL: https://issues.apache.org/jira/browse/SPARK-34791 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 3.0.1 >Reporter: Jeet >Priority: Major > > SparkR throws "node stack overflow" error upon running code (sample below) on > R-4.0.2 with Spark 3.0.1. 
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) > {{}} > {code:java} > source('sample.R') > myclsr = myclosure_func() > myclsr$get_some_date('2021-01-01') > ## spark.lapply throws node stack overflow > result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { > source('sample.R') > another_closure = myclosure_func() > return(another_closure$get_some_date(rdate)) > }) > {code} > {{}} > Sample.R > {{}} > {code:java} > ## util function, which calls itself > getPreviousBusinessDate <- function(asofdate) { > asdt <- asofdate; > asdt <- as.Date(asofdate)-1; > wd <- format(as.Date(asdt),"%A") > if(wd == "Saturday" | wd == "Sunday") { > return (getPreviousBusinessDate(asdt)); > } > return (asdt); > } > ## closure which calls util function > myclosure_func = function() { > myclosure = list() > get_some_date = function (random_date) { > return (getPreviousBusinessDate(random_date)) > } > myclosure$get_some_date = get_some_date > return(myclosure) > } > {code} > This seems to have caused by sourcing sample.R twice. Once before invoking > Spark session and another within Spark session. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34791) SparkR throws node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeet updated SPARK-34791: - Description: SparkR throws "node stack overflow" error upon running code (sample below) on R-4.0.2 with Spark 3.0.1. Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) {code:java} source('sample.R') myclsr = myclosure_func() myclsr$get_some_date('2021-01-01') ## spark.lapply throws node stack overflow result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { source('sample.R') another_closure = myclosure_func() return(another_closure$get_some_date(rdate)) }) {code} Sample.R {code:java} ## util function, which calls itself getPreviousBusinessDate <- function(asofdate) { asdt <- asofdate; asdt <- as.Date(asofdate)-1; wd <- format(as.Date(asdt),"%A") if(wd == "Saturday" | wd == "Sunday") { return (getPreviousBusinessDate(asdt)); } return (asdt); } ## closure which calls util function myclosure_func = function() { myclosure = list() get_some_date = function (random_date) { return (getPreviousBusinessDate(random_date)) } myclosure$get_some_date = get_some_date return(myclosure) } {code} This seems to have caused by sourcing sample.R twice. Once before invoking Spark session and another within Spark session. was: SparkR throws "node stack overflow" error upon running code (sample below) on R-4.0.2 with Spark 3.0.1. Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) {{}} {code:java} source('sample.R') myclsr = myclosure_func() myclsr$get_some_date('2021-01-01') ## spark.lapply throws node stack overflow result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { source('sample.R') another_closure = myclosure_func() return(another_closure$get_some_date(rdate)) }) {code} {{}} Sample.R {{}} {code:java} ## util function, which calls itself getPreviousBusinessDate <- function(asofdate) { asdt <- asofdate; asdt <- as.Date(asofdate)-1; wd <- format(as.Date(asdt),"%A") if(wd == "Saturday" | wd == "Sunday") { return (getPreviousBusinessDate(asdt)); } return (asdt); } ## closure which calls util function myclosure_func = function() { myclosure = list() get_some_date = function (random_date) { return (getPreviousBusinessDate(random_date)) } myclosure$get_some_date = get_some_date return(myclosure) } {code} This seems to have caused by sourcing sample.R twice. Once before invoking Spark session and another within Spark session. > SparkR throws node stack overflow > - > > Key: SPARK-34791 > URL: https://issues.apache.org/jira/browse/SPARK-34791 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 3.0.1 >Reporter: Jeet >Priority: Major > > SparkR throws "node stack overflow" error upon running code (sample below) on > R-4.0.2 with Spark 3.0.1. 
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) > {code:java} > source('sample.R') > myclsr = myclosure_func() > myclsr$get_some_date('2021-01-01') > ## spark.lapply throws node stack overflow > result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { > source('sample.R') > another_closure = myclosure_func() > return(another_closure$get_some_date(rdate)) > }) > {code} > Sample.R > {code:java} > ## util function, which calls itself > getPreviousBusinessDate <- function(asofdate) { > asdt <- asofdate; > asdt <- as.Date(asofdate)-1; > wd <- format(as.Date(asdt),"%A") > if(wd == "Saturday" | wd == "Sunday") { > return (getPreviousBusinessDate(asdt)); > } > return (asdt); > } > ## closure which calls util function > myclosure_func = function() { > myclosure = list() > get_some_date = function (random_date) { > return (getPreviousBusinessDate(random_date)) > } > myclosure$get_some_date = get_some_date > return(myclosure) > } > {code} > This seems to have caused by sourcing sample.R twice. Once before invoking > Spark session and another within Spark session. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Chen updated SPARK-34780: - Affects Version/s: 3.1.1 > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of it's logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). I suspect this also > happens in other versions of Spark but haven't tested yet. > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
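Not part of the report, only a hedged workaround sketch: if the problem is the stale SQLConf captured by the cached plan, dropping the cached entries before re-reading makes the computed stats reflect the currently active compression factor. tableDir and the conf key follow the test above; the choice of clearCache versus unpersist is an assumption.
{code:java}
import org.apache.spark.sql.execution.datasources.LogicalRelation
import org.apache.spark.sql.internal.SQLConf

// Assumed workaround, not the fix: clear cached plans so re-analysis happens
// under the current SQLConf instead of the one captured at cache time.
SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
spark.catalog.clearCache()   // or df2.unpersist()
val freshStats = spark.read.parquet(tableDir).select("key")
  .queryExecution.optimizedPlan.collect {
    case l: LogicalRelation => l
  }.map(_.computeStats())
{code}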
[jira] [Updated] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Chen updated SPARK-34780: - Description: When a dataframe is cached, the logical plan can contain copies of the spark session meaning the SQLConfs are stored. Then if a different dataframe can replace parts of it's logical plan with a cached logical plan, the cached SQLConfs will be used for the evaluation of the cached logical plan. This is because HadoopFsRelation ignores sparkSession for equality checks (introduced in https://issues.apache.org/jira/browse/SPARK-17358). {code:java} test("cache uses old SQLConf") { import testImplicits._ withTempDir { dir => val tableDir = dir.getAbsoluteFile + "/table" val df = Seq("a").toDF("key") df.write.parquet(tableDir) SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") val compression1Stats = spark.read.parquet(tableDir).select("key"). queryExecution.optimizedPlan.collect { case l: LogicalRelation => l case m: InMemoryRelation => m }.map(_.computeStats()) SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") val df2 = spark.read.parquet(tableDir).select("key") df2.cache() val compression10Stats = df2.queryExecution.optimizedPlan.collect { case l: LogicalRelation => l case m: InMemoryRelation => m }.map(_.computeStats()) SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") val compression1StatsWithCache = spark.read.parquet(tableDir).select("key"). queryExecution.optimizedPlan.collect { case l: LogicalRelation => l case m: InMemoryRelation => m }.map(_.computeStats()) // I expect these stats to be the same because file compression factor is the same assert(compression1Stats == compression1StatsWithCache) // Instead, we can see the file compression factor is being cached and used along with // the logical plan assert(compression10Stats == compression1StatsWithCache) } }{code} was: When a dataframe is cached, the logical plan can contain copies of the spark session meaning the SQLConfs are stored. Then if a different dataframe can replace parts of it's logical plan with a cached logical plan, the cached SQLConfs will be used for the evaluation of the cached logical plan. This is because HadoopFsRelation ignores sparkSession for equality checks (introduced in https://issues.apache.org/jira/browse/SPARK-17358). I suspect this also happens in other versions of Spark but haven't tested yet. {code:java} test("cache uses old SQLConf") { import testImplicits._ withTempDir { dir => val tableDir = dir.getAbsoluteFile + "/table" val df = Seq("a").toDF("key") df.write.parquet(tableDir) SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") val compression1Stats = spark.read.parquet(tableDir).select("key"). queryExecution.optimizedPlan.collect { case l: LogicalRelation => l case m: InMemoryRelation => m }.map(_.computeStats()) SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") val df2 = spark.read.parquet(tableDir).select("key") df2.cache() val compression10Stats = df2.queryExecution.optimizedPlan.collect { case l: LogicalRelation => l case m: InMemoryRelation => m }.map(_.computeStats()) SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") val compression1StatsWithCache = spark.read.parquet(tableDir).select("key"). 
queryExecution.optimizedPlan.collect { case l: LogicalRelation => l case m: InMemoryRelation => m }.map(_.computeStats()) // I expect these stats to be the same because file compression factor is the same assert(compression1Stats == compression1StatsWithCache) // Instead, we can see the file compression factor is being cached and used along with // the logical plan assert(compression10Stats == compression1StatsWithCache) } }{code} > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of it's logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import test
[jira] [Commented] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304315#comment-17304315 ] Apache Spark commented on SPARK-33482: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/31848 > V2 Datasources that extend FileScan preclude exchange reuse > --- > > Key: SPARK-33482 > URL: https://issues.apache.org/jira/browse/SPARK-33482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: Bruce Robbins >Priority: Major > > Sample query: > {noformat} > spark.read.parquet("tbl").createOrReplaceTempView("tbl") > spark.read.parquet("lookup").createOrReplaceTempView("lookup") > sql(""" >select tbl.col1, fk1, fk2 >from tbl, lookup l1, lookup l2 >where fk1 = l1.key >and fk2 = l2.key > """).explain > {noformat} > Test files can be created as so: > {noformat} > import scala.util.Random > val rand = Random > val tbl = spark.range(1, 1).map { x => > (rand.nextLong.abs % 20, >rand.nextLong.abs % 20, >x) > }.toDF("fk1", "fk2", "col1") > tbl.write.mode("overwrite").parquet("tbl") > val lookup = spark.range(0, 20).map { x => > (x + 1, x * 1, (x + 1) * 1) > }.toDF("key", "col1", "col2") > lookup.write.mode("overwrite").parquet("lookup") > {noformat} > Output with V1 Parquet reader: > {noformat} > == Physical Plan == > *(3) Project [col1#2L, fk1#0L, fk2#1L] > +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false >:- *(3) Project [fk1#0L, fk2#1L, col1#2L] >: +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false >: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L)) >: : +- *(3) ColumnarToRow >: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, > DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, > Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], > PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], > ReadSchema: struct >: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > bigint, false]),false), [id=#75] >:+- *(1) Filter isnotnull(key#6L) >: +- *(1) ColumnarToRow >: +- FileScan parquet [key#6L] Batched: true, DataFilters: > [isnotnull(key#6L)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: > struct >+- ReusedExchange [key#12L], BroadcastExchange > HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75] > {noformat} > With V1 Parquet reader, the exchange for lookup is reused (see last line). 
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""): > {noformat} > == Physical Plan == > *(3) Project [col1#2L, fk1#0L, fk2#1L] > +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false >:- *(3) Project [fk1#0L, fk2#1L, col1#2L] >: +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false >: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L)) >: : +- *(3) ColumnarToRow >: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: > [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], > PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], > ReadSchema: struct, PushedFilters: > [IsNotNull(fk1), IsNotNull(fk2)] >: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > bigint, false]),false), [id=#75] >:+- *(1) Filter isnotnull(key#6L) >: +- *(1) ColumnarToRow >: +- BatchScan[key#6L] ParquetScan DataFilters: > [isnotnull(key#6L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: > struct, PushedFilters: [IsNotNull(key)] >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false]),false), [id=#83] > +- *(2) Filter isnotnull(key#12L) > +- *(2) ColumnarToRow > +- BatchScan[key#12L] ParquetScan DataFilters: > [isnotnull(key#12L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: > struct, PushedFilters: [IsNotNull(key)] > {noformat} > With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 > lines). > You can see the same issue with the Orc reader (and I assume any other > datasource that extends Filescan)
[jira] [Commented] (SPARK-33482) V2 Datasources that extend FileScan preclude exchange reuse
[ https://issues.apache.org/jira/browse/SPARK-33482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304316#comment-17304316 ] Apache Spark commented on SPARK-33482: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/31848 > V2 Datasources that extend FileScan preclude exchange reuse > --- > > Key: SPARK-33482 > URL: https://issues.apache.org/jira/browse/SPARK-33482 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: Bruce Robbins >Priority: Major > > Sample query: > {noformat} > spark.read.parquet("tbl").createOrReplaceTempView("tbl") > spark.read.parquet("lookup").createOrReplaceTempView("lookup") > sql(""" >select tbl.col1, fk1, fk2 >from tbl, lookup l1, lookup l2 >where fk1 = l1.key >and fk2 = l2.key > """).explain > {noformat} > Test files can be created as so: > {noformat} > import scala.util.Random > val rand = Random > val tbl = spark.range(1, 1).map { x => > (rand.nextLong.abs % 20, >rand.nextLong.abs % 20, >x) > }.toDF("fk1", "fk2", "col1") > tbl.write.mode("overwrite").parquet("tbl") > val lookup = spark.range(0, 20).map { x => > (x + 1, x * 1, (x + 1) * 1) > }.toDF("key", "col1", "col2") > lookup.write.mode("overwrite").parquet("lookup") > {noformat} > Output with V1 Parquet reader: > {noformat} > == Physical Plan == > *(3) Project [col1#2L, fk1#0L, fk2#1L] > +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false >:- *(3) Project [fk1#0L, fk2#1L, col1#2L] >: +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false >: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L)) >: : +- *(3) ColumnarToRow >: : +- FileScan parquet [fk1#0L,fk2#1L,col1#2L] Batched: true, > DataFilters: [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: Parquet, > Location: InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], > PartitionFilters: [], PushedFilters: [IsNotNull(fk1), IsNotNull(fk2)], > ReadSchema: struct >: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > bigint, false]),false), [id=#75] >:+- *(1) Filter isnotnull(key#6L) >: +- *(1) ColumnarToRow >: +- FileScan parquet [key#6L] Batched: true, DataFilters: > [isnotnull(key#6L)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilters: [IsNotNull(key)], ReadSchema: > struct >+- ReusedExchange [key#12L], BroadcastExchange > HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75] > {noformat} > With V1 Parquet reader, the exchange for lookup is reused (see last line). 
> Output with V2 Parquet reader (spark.sql.sources.useV1SourceList=""): > {noformat} > == Physical Plan == > *(3) Project [col1#2L, fk1#0L, fk2#1L] > +- *(3) BroadcastHashJoin [fk2#1L], [key#12L], Inner, BuildRight, false >:- *(3) Project [fk1#0L, fk2#1L, col1#2L] >: +- *(3) BroadcastHashJoin [fk1#0L], [key#6L], Inner, BuildRight, false >: :- *(3) Filter (isnotnull(fk1#0L) AND isnotnull(fk2#1L)) >: : +- *(3) ColumnarToRow >: : +- BatchScan[fk1#0L, fk2#1L, col1#2L] ParquetScan DataFilters: > [isnotnull(fk1#0L), isnotnull(fk2#1L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/tbl], > PartitionFilters: [], PushedFilers: [IsNotNull(fk1), IsNotNull(fk2)], > ReadSchema: struct, PushedFilters: > [IsNotNull(fk1), IsNotNull(fk2)] >: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, > bigint, false]),false), [id=#75] >:+- *(1) Filter isnotnull(key#6L) >: +- *(1) ColumnarToRow >: +- BatchScan[key#6L] ParquetScan DataFilters: > [isnotnull(key#6L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: > struct, PushedFilters: [IsNotNull(key)] >+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, > false]),false), [id=#83] > +- *(2) Filter isnotnull(key#12L) > +- *(2) ColumnarToRow > +- BatchScan[key#12L] ParquetScan DataFilters: > [isnotnull(key#12L)], Format: parquet, Location: > InMemoryFileIndex[file:/Users/bruce/github/spark_upstream/lookup], > PartitionFilters: [], PushedFilers: [IsNotNull(key)], ReadSchema: > struct, PushedFilters: [IsNotNull(key)] > {noformat} > With the V2 Parquet reader, the exchange for lookup is not reused (see last 4 > lines). > You can see the same issue with the Orc reader (and I assume any other > datasource that extends Filescan)
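A hedged way to check the difference programmatically (it assumes the tbl/lookup parquet files created above and that adaptive execution does not rewrite the plan): with the V1 reader the collected list below is non-empty, while with the V2 FileScan-based reader it comes back empty.
{code:java}
import org.apache.spark.sql.execution.exchange.ReusedExchangeExec

val plan = sql("""
  select tbl.col1, fk1, fk2
  from tbl, lookup l1, lookup l2
  where fk1 = l1.key
  and fk2 = l2.key
""").queryExecution.executedPlan

// Exchange reuse shows up as a ReusedExchangeExec node in the physical plan.
val reused = plan.collect { case r: ReusedExchangeExec => r }
println(s"reused exchanges: ${reused.size}")
{code}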
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304397#comment-17304397 ] Erik Erlandson commented on SPARK-24432: [~dongjoon] should this be closed, now that spark 3.1 is available (per [above|https://issues.apache.org/jira/browse/SPARK-24432?focusedCommentId=17224905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17224905]) > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
kondziolka9ld created SPARK-34792: - Summary: Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3 Key: SPARK-34792 URL: https://issues.apache.org/jira/browse/SPARK-34792 Project: Spark Issue Type: Question Components: Spark Core, SQL Affects Versions: 3.0.1 Reporter: kondziolka9ld Hi, Please consider a following difference even despite of the same seed. Is it possible to restore the same {code:java} __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.7 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282) Type in expressions to have them evaluated. Type :help for more information. scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] scala> f.show +-+ |value| +-+ |4| +-+ scala> s.show +-+ |value| +-+ |1| |2| |3| |5| |6| |7| |8| |9| | 10| +-+ {code} while as on spark-3 {code:java} scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] scala> f.show +-+ |value| +-+ |5| | 10| +-+ scala> s.show +-+ |value| +-+ |1| |2| |3| |4| |6| |7| |8| |9| +-+ {code} I guess that implementation of `sample` method changed. Is it possible to restore previous behaviour? Thanks in advance! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
[ https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kondziolka9ld updated SPARK-34792: -- Description: Hi, Please consider a following difference of `randomSplit` method even despite of the same seed. {code:java} __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.7 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282) Type in expressions to have them evaluated. Type :help for more information. scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] scala> f.show +-+ |value| +-+ |4| +-+ scala> s.show +-+ |value| +-+ |1| |2| |3| |5| |6| |7| |8| |9| | 10| +-+ {code} while as on spark-3 {code:java} scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] scala> f.show +-+ |value| +-+ |5| | 10| +-+ scala> s.show +-+ |value| +-+ |1| |2| |3| |4| |6| |7| |8| |9| +-+ {code} I guess that implementation of `sample` method changed. Is it possible to restore previous behaviour? Thanks in advance! was: Hi, Please consider a following difference even despite of the same seed. Is it possible to restore the same {code:java} __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.7 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282) Type in expressions to have them evaluated. Type :help for more information. scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] scala> f.show +-+ |value| +-+ |4| +-+ scala> s.show +-+ |value| +-+ |1| |2| |3| |5| |6| |7| |8| |9| | 10| +-+ {code} while as on spark-3 {code:java} scala> val Array(f, s) = Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] scala> f.show +-+ |value| +-+ |5| | 10| +-+ scala> s.show +-+ |value| +-+ |1| |2| |3| |4| |6| |7| |8| |9| +-+ {code} I guess that implementation of `sample` method changed. Is it possible to restore previous behaviour? Thanks in advance! > Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3 > - > > Key: SPARK-34792 > URL: https://issues.apache.org/jira/browse/SPARK-34792 > Project: Spark > Issue Type: Question > Components: Spark Core, SQL >Affects Versions: 3.0.1 >Reporter: kondziolka9ld >Priority: Major > > Hi, > Please consider a following difference of `randomSplit` method even despite > of the same seed. > > {code:java} > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.7 > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> val Array(f, s) = > Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) > f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > scala> f.show > +-+ > |value| > +-+ > |4| > +-+ > scala> s.show > +-+ > |value| > +-+ > |1| > |2| > |3| > |5| > |6| > |7| > |8| > |9| > | 10| > +-+ > {code} > while as on spark-3 > {code:java} > scala> val Array(f, s) = > Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) > f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > scala> f.show > +-+ > |value| > +-+ > |5| > | 10| > +-+ > scala> s.show > +-+ > |value| > +-+ > |1| > |2| > |3| > |4| > |6| > |7| > |8| > |9| > +-+ > {code} > I guess that implementation of `sample` method changed. > Is it possible to restore previous behaviour?
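Not an answer from the ticket, only a hedged sketch of a version-stable alternative: materialize an explicit random column with a fixed seed and split on it, so the assignment of rows to splits no longer depends on how randomSplit samples internally. The split fractions are approximate, and the rand values themselves are only reproducible for a given Spark version and partitioning.
{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.rand

// Deterministic-per-run split driven by an explicit random column.
val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value").withColumn("r", rand(42))
val f = df.filter("r < 0.3").drop("r")
val s = df.filter("r >= 0.3").drop("r")
{code}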
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304404#comment-17304404 ] Dongjoon Hyun commented on SPARK-24432: --- No, [~eje]. This is not closed due to the JIRA description. > This requires a Kubernetes-specific external shuffle service. The feature is > available in our fork at github.com/apache-spark-on-k8s/spark. This issue claims external shuffle service requirement. > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation
[ https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304406#comment-17304406 ] Dongjoon Hyun commented on SPARK-24432: --- BTW, ping [~liyinan926] since he is the reporter of this issue. If the scope is changed, we can close this issue. > Add support for dynamic resource allocation > --- > > Key: SPARK-24432 > URL: https://issues.apache.org/jira/browse/SPARK-24432 > Project: Spark > Issue Type: New Feature > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Yinan Li >Priority: Major > > This is an umbrella ticket for work on adding support for dynamic resource > allocation into the Kubernetes mode. This requires a Kubernetes-specific > external shuffle service. The feature is available in our fork at > github.com/apache-spark-on-k8s/spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34793) Prohibit saving of day-time and year-month intervals
Max Gekk created SPARK-34793: Summary: Prohibit saving of day-time and year-month intervals Key: SPARK-34793 URL: https://issues.apache.org/jira/browse/SPARK-34793 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk Assignee: Max Gekk Temporary prohibit saving of year-month and day-time intervals to datasources till they are supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
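A hedged illustration (my sketch, not from the ticket) of the intended behavior once the prohibition is in place, assuming Spark 3.2's mapping of java.time.Period to the year-month interval type: writing such a column to a datasource should fail analysis instead of producing a file that cannot be read back.
{code:java}
import spark.implicits._
import java.time.Period
import org.apache.spark.sql.AnalysisException

// Year-month interval column (assumed 3.2 mapping of java.time.Period).
val df = Seq(Period.ofMonths(13)).toDF("ym")
try {
  df.write.mode("overwrite").parquet("/tmp/ym_interval")
} catch {
  case e: AnalysisException => println(s"rejected as expected: ${e.getMessage}")
}
{code}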
[jira] [Assigned] (SPARK-34793) Prohibit saving of day-time and year-month intervals
[ https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34793: Assignee: Max Gekk (was: Apache Spark) > Prohibit saving of day-time and year-month intervals > > > Key: SPARK-34793 > URL: https://issues.apache.org/jira/browse/SPARK-34793 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Temporary prohibit saving of year-month and day-time intervals to datasources > till they are supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34793) Prohibit saving of day-time and year-month intervals
[ https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304425#comment-17304425 ] Apache Spark commented on SPARK-34793: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31884 > Prohibit saving of day-time and year-month intervals > > > Key: SPARK-34793 > URL: https://issues.apache.org/jira/browse/SPARK-34793 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Temporary prohibit saving of year-month and day-time intervals to datasources > till they are supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34793) Prohibit saving of day-time and year-month intervals
[ https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34793: Assignee: Apache Spark (was: Max Gekk) > Prohibit saving of day-time and year-month intervals > > > Key: SPARK-34793 > URL: https://issues.apache.org/jira/browse/SPARK-34793 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Temporary prohibit saving of year-month and day-time intervals to datasources > till they are supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34793) Prohibit saving of day-time and year-month intervals
[ https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304427#comment-17304427 ] Apache Spark commented on SPARK-34793: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31884 > Prohibit saving of day-time and year-month intervals > > > Key: SPARK-34793 > URL: https://issues.apache.org/jira/browse/SPARK-34793 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Temporary prohibit saving of year-month and day-time intervals to datasources > till they are supported. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.
[ https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304433#comment-17304433 ] Apache Spark commented on SPARK-34636: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31885 > sql method in UnresolvedAttribute, AttributeReference and Alias don't quote > qualified names properly. > - > > Key: SPARK-34636 > URL: https://issues.apache.org/jira/browse/SPARK-34636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > UnresolvedReference, AttributeReference and Alias which take qualifier don't > quote qualified names properly. > One instance is reported in SPARK-34626. > Other instances are like as follows. > {code} > UnresolvedAttribute("a`b"::"c.d"::Nil).sql > a`b.`c.d` // expected: `a``b`.`c.d` > AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql > c.d.`a.b` // expected: `c.d`.`a.b` > Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = > "d.e"::Nil).sql > `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c` > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.
[ https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304434#comment-17304434 ] Apache Spark commented on SPARK-34636: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/31885 > sql method in UnresolvedAttribute, AttributeReference and Alias don't quote > qualified names properly. > - > > Key: SPARK-34636 > URL: https://issues.apache.org/jira/browse/SPARK-34636 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > UnresolvedReference, AttributeReference and Alias which take qualifier don't > quote qualified names properly. > One instance is reported in SPARK-34626. > Other instances are like as follows. > {code} > UnresolvedAttribute("a`b"::"c.d"::Nil).sql > a`b.`c.d` // expected: `a``b`.`c.d` > AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql > c.d.`a.b` // expected: `c.d`.`a.b` > Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = > "d.e"::Nil).sql > `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c` > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types
[ https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304443#comment-17304443 ] Matthew Powers commented on SPARK-27790: [~maxgekk] - how will DateTime addition work with the new intervals? Something like this? * Add three months: col("some_date") + make_year_month(???) * Add 2 years and 10 seconds: col("some_time") + make_year_month(???) + make_day_second(???) Thanks for the great description of the problem in this JIRA ticket. > Support ANSI SQL INTERVAL types > --- > > Key: SPARK-27790 > URL: https://issues.apache.org/jira/browse/SPARK-27790 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Spark has an INTERVAL data type, but it is “broken”: > # It cannot be persisted > # It is not comparable because it crosses the month day line. That is there > is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day” since not > all months have the same number of days. > I propose here to introduce the two flavours of INTERVAL as described in the > ANSI SQL Standard and deprecate the Sparks interval type. > * ANSI describes two non overlapping “classes”: > ** YEAR-MONTH, > ** DAY-SECOND ranges > * Members within each class can be compared and sorted. > * Supports datetime arithmetic > * Can be persisted. > The old and new flavors of INTERVAL can coexist until Spark INTERVAL is > eventually retired. Also any semantic “breakage” can be controlled via legacy > config settings. > *Milestone 1* -- Spark Interval equivalency ( The new interval types meet > or exceed all function of the existing SQL Interval): > * Add two new DataType implementations for interval year-month and > day-second. Includes the JSON format and DLL string. > * Infra support: check the caller sides of DateType/TimestampType > * Support the two new interval types in Dataset/UDF. > * Interval literals (with a legacy config to still allow mixed year-month > day-seconds fields and return legacy interval values) > * Interval arithmetic(interval * num, interval / num, interval +/- interval) > * Datetime functions/operators: Datetime - Datetime (to days or day second), > Datetime +/- interval > * Cast to and from the new two interval types, cast string to interval, cast > interval to string (pretty printing), with the SQL syntax to specify the types > * Support sorting intervals. > *Milestone 2* -- Persistence: > * Ability to create tables of type interval > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > *Milestone 3* -- Client support > * JDBC support > * Hive Thrift server > *Milestone 4* -- PySpark and Spark R integration > * Python UDF can take and return intervals > * DataFrame support -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
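A hedged sketch of what such arithmetic can already look like with the existing interval support (interval literals and make_interval); a dedicated make_year_month / make_day_second builder, as asked above, is an assumption about a future API, not something defined by this ticket. df, some_date and some_time are placeholder names.
{code:java}
import org.apache.spark.sql.functions.{col, expr}

// Add three months to a date column.
val plusThreeMonths = df.select(col("some_date") + expr("INTERVAL '3' MONTH"))

// Add 2 years and 10 seconds to a timestamp column using
// make_interval(years, months, weeks, days, hours, mins, secs).
val plusMixed = df.select(col("some_time") + expr("make_interval(2, 0, 0, 0, 0, 0, 10)"))
{code}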
[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304453#comment-17304453 ] Dongjoon Hyun commented on SPARK-34785: --- SPARK-34212 is complete by itself because it's designed to fix the correctness issue. You will not get incorrect values. For this specific `Schema evolution` requirement in the PR description, I don't have a bandwidth, [~jobitmathew]. > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. Error Parquet column cannot be converted in file. > -- > > Key: SPARK-34785 > URL: https://issues.apache.org/jira/browse/SPARK-34785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: jobit mathew >Priority: Major > > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. > IF spark.sql.parquet.enableVectorizedReader=false below scenario pass but it > will reduce the performance. > In Hive, > {code:java} > create table test_decimal(amt decimal(18,2)) stored as parquet; > insert into test_decimal select 100; > alter table test_decimal change amt amt decimal(19,3); > {code} > In Spark, > {code:java} > select * from test_decimal; > {code} > {code:java} > ++ > |amt | > ++ > | 100.000 |{code} > but if spark.sql.parquet.enableVectorizedReader=true below error > {code:java} > : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal; > going to print operations logs > printed operations logs > going to print operations logs > printed operations logs > Getting log thread is interrupted, since query is done! > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4) (vm2 executor 2): > org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot > be converted in file > hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. 
Column: [amt], > Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: > org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch
[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304454#comment-17304454 ] Dongjoon Hyun commented on SPARK-34785: --- FYI, each data sources have different schema evolution capabilities. And, Parquet is not the best built-in data source in terms of it. We are tracking it with the following test suite. - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. Error Parquet column cannot be converted in file. > -- > > Key: SPARK-34785 > URL: https://issues.apache.org/jira/browse/SPARK-34785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: jobit mathew >Priority: Major > > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. > IF spark.sql.parquet.enableVectorizedReader=false below scenario pass but it > will reduce the performance. > In Hive, > {code:java} > create table test_decimal(amt decimal(18,2)) stored as parquet; > insert into test_decimal select 100; > alter table test_decimal change amt amt decimal(19,3); > {code} > In Spark, > {code:java} > select * from test_decimal; > {code} > {code:java} > ++ > |amt | > ++ > | 100.000 |{code} > but if spark.sql.parquet.enableVectorizedReader=true below error > {code:java} > : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal; > going to print operations logs > printed operations logs > going to print operations logs > printed operations logs > Getting log thread is interrupted, since query is done! > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4) (vm2 executor 2): > org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot > be converted in file > hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. 
Column: [amt], > Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: > org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735) > at
[jira] [Comment Edited] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304454#comment-17304454 ] Dongjoon Hyun edited comment on SPARK-34785 at 3/18/21, 9:14 PM: - FYI, each data sources have different schema evolution capabilities. And, Parquet is not the best built-in data source in terms of it. We are tracking it with the following test suite. - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala The recommendation is to use MR code path if you are in that situation. was (Author: dongjoon): FYI, each data sources have different schema evolution capabilities. And, Parquet is not the best built-in data source in terms of it. We are tracking it with the following test suite. - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. Error Parquet column cannot be converted in file. > -- > > Key: SPARK-34785 > URL: https://issues.apache.org/jira/browse/SPARK-34785 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: jobit mathew >Priority: Major > > SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true > which is default value. > IF spark.sql.parquet.enableVectorizedReader=false below scenario pass but it > will reduce the performance. > In Hive, > {code:java} > create table test_decimal(amt decimal(18,2)) stored as parquet; > insert into test_decimal select 100; > alter table test_decimal change amt amt decimal(19,3); > {code} > In Spark, > {code:java} > select * from test_decimal; > {code} > {code:java} > ++ > |amt | > ++ > | 100.000 |{code} > but if spark.sql.parquet.enableVectorizedReader=true below error > {code:java} > : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal; > going to print operations logs > printed operations logs > going to print operations logs > printed operations logs > Getting log thread is interrupted, since query is done! > Error: org.apache.hive.service.cli.HiveSQLException: Error running query: > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4) (vm2 executor 2): > org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot > be converted in file > hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. 
Column: [amt], > Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > at org.apache.spark.scheduler.Task.run(Task.scala:131) > at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) >
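For readers hitting the same "Parquet column cannot be converted" error, the comment above recommends the MR (non-vectorized) code path. A minimal workaround sketch, assuming Spark 3.x and the test_decimal table from the report; the config can be set per session rather than cluster-wide:

{code:java}
import org.apache.spark.sql.SparkSession

// Illustrative only: falling back to the parquet-mr reader avoids the
// SchemaColumnConvertNotSupportedException for the widened decimal column,
// at the cost of the vectorized reader's performance.
val spark = SparkSession.builder()
  .appName("parquet-mr-fallback")
  .enableHiveSupport()
  .getOrCreate()

// Disable the vectorized Parquet reader for this session only.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM test_decimal").show()
{code}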
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304457#comment-17304457 ] Dongjoon Hyun commented on SPARK-27812: --- Thank you for reporting, [~Kotlov]. 1. Could you file a new JIRA, because it might be related to the K8s client version? 2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the application. I don't think your use case is a legitimate Spark example, although it might be a behavior change across Spark versions. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.3, 2.4.4 >Reporter: Henry Yu >Assignee: Igor Calabria >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
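Since the comment above says the application itself is expected to call `spark.stop()`, a minimal sketch of that pattern (illustrative, not taken from the JIRA) is:

{code:java}
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("k8s-example").getOrCreate()
    try {
      spark.range(10).count()   // application logic goes here
    } finally {
      // Stopping the session shuts down the SparkContext and its client threads,
      // letting the driver JVM (and therefore the driver pod) exit cleanly.
      spark.stop()
    }
  }
}
{code}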
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304459#comment-17304459 ] Dongjoon Hyun commented on SPARK-27812: --- Does your example work with any Spark versions before? Then, what is the latest Spark version? > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.3, 2.4.4 >Reporter: Henry Yu >Assignee: Igor Calabria >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.
[ https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304470#comment-17304470 ] Dongjoon Hyun commented on SPARK-27812: --- I found your JIRA, SPARK-34674. Please comment there, because this issue seems technically unrelated to yours. > kubernetes client import non-daemon thread which block jvm exit. > > > Key: SPARK-27812 > URL: https://issues.apache.org/jira/browse/SPARK-27812 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.3, 2.4.4 >Reporter: Henry Yu >Assignee: Igor Calabria >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > I try spark-submit to k8s with cluster mode. Driver pod failed to exit with > An Okhttp Websocket Non-Daemon Thread. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-34674: --- I reopen this because the affected version is not the same. The following is a copy of my comment at the other JIRA. 1. Does your example work with any Spark versions before? Then, what is the latest Spark version? 2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by App. I don't think your use case is a legit Spark example although it might be a behavior change across Spark versions. > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey >Priority: Major > > Hello! > I have run into a problem that if I don't call the method > sparkContext.stop() explicitly, then a Spark driver process doesn't terminate > even after its Main method has been completed. This behaviour is different > from spark on yarn, where the manual sparkContext stopping is not required. > It looks like, the problem is in using non-daemon threads, which prevent the > driver jvm process from terminating. > At least I see two non-daemon threads, if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you tell please, if it is possible to solve this problem? > Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304472#comment-17304472 ] Dongjoon Hyun edited comment on SPARK-34674 at 3/18/21, 9:39 PM: - I reopen this because the affected version is not the same according to [~Kotlov]. The following is a copy of my comment at the other JIRA. 1. Does your example work with any Spark versions before? Then, what is the latest Spark version? 2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by App. I don't think your use case is a legit Spark example although it might be a behavior change across Spark versions. was (Author: dongjoon): I reopen this because the affected version is not the same. The following is a copy of my comment at the other JIRA. 1. Does your example work with any Spark versions before? Then, what is the latest Spark version? 2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by App. I don't think your use case is a legit Spark example although it might be a behavior change across Spark versions. > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey >Priority: Major > > Hello! > I have run into a problem that if I don't call the method > sparkContext.stop() explicitly, then a Spark driver process doesn't terminate > even after its Main method has been completed. This behaviour is different > from spark on yarn, where the manual sparkContext stopping is not required. > It looks like, the problem is in using non-daemon threads, which prevent the > driver jvm process from terminating. > At least I see two non-daemon threads, if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you tell please, if it is possible to solve this problem? > Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
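To reproduce the observation in the description (the lingering OkHttp threads after `main` completes), one way to list the live non-daemon threads is a sketch like the following; it is illustrative only and not part of the report:

{code:java}
// Print every live non-daemon thread; anything besides "main" listed here can
// keep the driver JVM (and the K8s driver pod) from terminating.
Thread.getAllStackTraces.keySet.forEach { t =>
  if (!t.isDaemon && t.isAlive) {
    val group = Option(t.getThreadGroup).map(_.getName).getOrElse("-")
    println(s"Thread[${t.getName},${t.getPriority},$group]")
  }
}
{code}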
[jira] [Created] (SPARK-34794) Nested higher-order functions broken in DSL
Daniel Solow created SPARK-34794: Summary: Nested higher-order functions broken in DSL Key: SPARK-34794 URL: https://issues.apache.org/jira/browse/SPARK-34794 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Environment: 3.1.1 Reporter: Daniel Solow In Spark 3, if I have: {code:java} val df = Seq( (Seq(1,2,3), Seq("a", "b", "c")) ).toDF("numbers", "letters") {code} and I want to take the cross product of these two arrays, I can do the following in SQL: {code:java} df.selectExpr(""" FLATTEN( TRANSFORM( numbers, number -> TRANSFORM( letters, letter -> (number AS number, letter AS letter) ) ) ) AS zipped """).show(false) ++ |zipped | ++ |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| ++ {code} This works fine. But when I try the equivalent using the scala DSL, the result is wrong: {code:java} df.select( f.flatten( f.transform( $"numbers", (number: Column) => { f.transform( $"letters", (letter: Column) => { f.struct( number.as("number"), letter.as("letter") ) } ) } ) ).as("zipped") ).show(10, false) ++ |zipped | ++ |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| ++ {code} Note that the numbers are not included in the output. The explain for this second version is: {code:java} == Parsed Logical Plan == 'Project [flatten(transform('numbers, lambdafunction(transform('letters, lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, false))) AS zipped#444] +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] +- LocalRelation [_1#303, _2#304] == Analyzed Logical Plan == zipped: array> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda x#446, false)), lambda x#445, false))) AS zipped#444] +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] +- LocalRelation [_1#303, _2#304] == Optimized Logical Plan == LocalRelation [zipped#444] == Physical Plan == LocalTableScan [zipped#444] {code} Seems like variable name x is hardcoded. And sure enough: https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
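Until the DSL is fixed, one workaround (an illustrative sketch, not from the report) is to express the nested lambdas as a single SQL expression via `expr`, which goes through the same parser as `selectExpr` and therefore allocates a distinct variable per lambda:

{code:java}
import org.apache.spark.sql.functions.expr

// `df` is the two-column DataFrame from the description above. Because the SQL
// parser names each lambda variable separately, the inner lambda no longer
// shadows the outer one as the current DSL does.
val zipped = df.select(
  expr(
    """FLATTEN(
      |  TRANSFORM(numbers,
      |    number -> TRANSFORM(letters,
      |      letter -> (number AS number, letter AS letter))))""".stripMargin
  ).as("zipped"))
zipped.show(false)
{code}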
[jira] [Resolved] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
[ https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34792. --- Resolution: Not A Problem Hi, [~kondziolka9ld]. This doesn't look like a bug. It can be different in many reasons. Not only Spark's code, but also there is difference between Scala 2.11 and Scala 2.12. BTW, please send your question to d...@spark.apache.org next time. > Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3 > - > > Key: SPARK-34792 > URL: https://issues.apache.org/jira/browse/SPARK-34792 > Project: Spark > Issue Type: Question > Components: Spark Core, SQL >Affects Versions: 3.0.1 >Reporter: kondziolka9ld >Priority: Major > > Hi, > Please consider a following difference of `randomSplit` method even despite > of the same seed. > > {code:java} > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.7 > /_/ > > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282) > Type in expressions to have them evaluated. > Type :help for more information. > scala> val Array(f, s) = > Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) > f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > scala> f.show > +-+ > |value| > +-+ > |4| > +-+ > scala> s.show > +-+ > |value| > +-+ > |1| > |2| > |3| > |5| > |6| > |7| > |8| > |9| > | 10| > +-+ > {code} > while as on spark-3 > {code:java} > scala> val Array(f, s) = > Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42) > f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int] > scala> f.show > +-+ > |value| > +-+ > |5| > | 10| > +-+ > scala> s.show > +-+ > |value| > +-+ > |1| > |2| > |3| > |4| > |6| > |7| > |8| > |9| > +-+ > {code} > I guess that implementation of `sample` method changed. > Is it possible to restore previous behaviour? > Thanks in advance! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
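If splits that stay stable across Spark and Scala versions are required, a hedged alternative to `randomSplit` (not from the JIRA; the column name, bucket count, and threshold are made up for illustration) is to derive the split from a deterministic hash of the row's data:

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("stable-split").getOrCreate()
import spark.implicits._

val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")

// The bucket is a pure function of the row's data, so a given row lands in the
// same split regardless of the Spark or Scala version running the job.
val bucketed = df.withColumn("bucket", abs(hash(col("value")) % 10))
val first  = bucketed.filter(col("bucket") < 3).drop("bucket")   // roughly 30%
val second = bucketed.filter(col("bucket") >= 3).drop("bucket")  // roughly 70%
{code}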
[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304504#comment-17304504 ] Dongjoon Hyun commented on SPARK-34791: --- Thank you for reporting, [~jeetendray]. Could you try to use the latest version, Apache Spark 3.1.1? Our CI is using R 4.0.3. {code} root@e3c1c7d8cbf8:/# R R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out" Copyright (C) 2020 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) {code} > SparkR throws node stack overflow > - > > Key: SPARK-34791 > URL: https://issues.apache.org/jira/browse/SPARK-34791 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 3.0.1 >Reporter: obfuscated_dvlper >Priority: Major > > SparkR throws "node stack overflow" error upon running code (sample below) on > R-4.0.2 with Spark 3.0.1. > Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) > {code:java} > source('sample.R') > myclsr = myclosure_func() > myclsr$get_some_date('2021-01-01') > ## spark.lapply throws node stack overflow > result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { > source('sample.R') > another_closure = myclosure_func() > return(another_closure$get_some_date(rdate)) > }) > {code} > Sample.R > {code:java} > ## util function, which calls itself > getPreviousBusinessDate <- function(asofdate) { > asdt <- asofdate; > asdt <- as.Date(asofdate)-1; > wd <- format(as.Date(asdt),"%A") > if(wd == "Saturday" | wd == "Sunday") { > return (getPreviousBusinessDate(asdt)); > } > return (asdt); > } > ## closure which calls util function > myclosure_func = function() { > myclosure = list() > get_some_date = function (random_date) { > return (getPreviousBusinessDate(random_date)) > } > myclosure$get_some_date = get_some_date > return(myclosure) > } > {code} > This seems to have caused by sourcing sample.R twice. Once before invoking > Spark session and another within Spark session. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
[ https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304507#comment-17304507 ] Dongjoon Hyun commented on SPARK-34790: --- Thank you for reporting, [~hezuojiao]. However, it may not be a bug. Sometimes, a test suite like AdaptiveQueryExecSuite has some assumption. Could you be more specific by posting the failure output you saw? > Fail in fetch shuffle blocks in batch when i/o encryption is enabled. > - > > Key: SPARK-34790 > URL: https://issues.apache.org/jira/browse/SPARK-34790 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: hezuojiao >Priority: Major > > When set spark.io.encryption.enabled=true, lots of test cases in > AdaptiveQueryExecSuite will be failed. Fetching shuffle blocks in batch is > incompatible with io encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
[ https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34790: -- Parent: SPARK-33828 Issue Type: Sub-task (was: Bug) > Fail in fetch shuffle blocks in batch when i/o encryption is enabled. > - > > Key: SPARK-34790 > URL: https://issues.apache.org/jira/browse/SPARK-34790 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: hezuojiao >Priority: Major > > When set spark.io.encryption.enabled=true, lots of test cases in > AdaptiveQueryExecSuite will be failed. Fetching shuffle blocks in batch is > incompatible with io encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
[ https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304508#comment-17304508 ] Dongjoon Hyun commented on SPARK-34790: --- According to the issue title, I convert this into a subtask of SPARK-33828. > Fail in fetch shuffle blocks in batch when i/o encryption is enabled. > - > > Key: SPARK-34790 > URL: https://issues.apache.org/jira/browse/SPARK-34790 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: hezuojiao >Priority: Major > > When set spark.io.encryption.enabled=true, lots of test cases in > AdaptiveQueryExecSuite will be failed. Fetching shuffle blocks in batch is > incompatible with io encryption. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
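Until the incompatibility is resolved, a hedged workaround sketch is to turn off batch fetching of shuffle blocks whenever I/O encryption is on; this assumes the `spark.sql.adaptive.fetchShuffleBlocksInBatch` flag that controls batch fetching (verify the exact key for your Spark version):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-with-io-encryption")
  .config("spark.io.encryption.enabled", "true")
  // Batch fetching assumes contiguous shuffle blocks can be concatenated as-is,
  // which does not hold once each block is individually encrypted.
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
  .getOrCreate()
{code}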
[jira] [Updated] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used
[ https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34789: -- Issue Type: Improvement (was: Task) > Introduce Jetty based construct for integration tests where HTTP(S) is used > --- > > Key: SPARK-34789 > URL: https://issues.apache.org/jira/browse/SPARK-34789 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up during > https://github.com/apache/spark/pull/31877#discussion_r596831803. > Short summary: we have some tests where HTTP(S) is used to access files. The > current solution uses github urls like > "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt";. > This connects two Spark version in an unhealthy way like connecting the > "master" branch which is moving part with the committed test code which is a > non-moving (as it might be even released). > So this way a test running for an earlier version of Spark expects something > (filename, content, path) from a the latter release and what is worse when > the moving version is changed the earlier test will break. > The idea is to introduce a method like: > {noformat} > withHttpServer(files) { > } > {noformat} > Which uses a Jetty ResourceHandler to serve the listed files (or directories > / or just the root where it is started from) and stops the server in the > finally. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used
[ https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304509#comment-17304509 ] Dongjoon Hyun commented on SPARK-34789: --- Thank you for working on this, [~attilapiros]. This will be useful. > Introduce Jetty based construct for integration tests where HTTP(S) is used > --- > > Key: SPARK-34789 > URL: https://issues.apache.org/jira/browse/SPARK-34789 > Project: Spark > Issue Type: Task > Components: Tests >Affects Versions: 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up during > https://github.com/apache/spark/pull/31877#discussion_r596831803. > Short summary: we have some tests where HTTP(S) is used to access files. The > current solution uses github urls like > "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt";. > This connects two Spark version in an unhealthy way like connecting the > "master" branch which is moving part with the committed test code which is a > non-moving (as it might be even released). > So this way a test running for an earlier version of Spark expects something > (filename, content, path) from a the latter release and what is worse when > the moving version is changed the earlier test will break. > The idea is to introduce a method like: > {noformat} > withHttpServer(files) { > } > {noformat} > Which uses a Jetty ResourceHandler to serve the listed files (or directories > / or just the root where it is started from) and stops the server in the > finally. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
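A rough sketch of the helper proposed in the description, assuming Jetty 9's `Server`/`ResourceHandler` API (which Spark already depends on); the names and signature are illustrative, not the final design:

{code:java}
import java.io.File

import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

// Serve `root` over HTTP on an ephemeral port for the duration of `body`,
// handing the base URL to the test body and always stopping the server.
def withHttpServer[T](root: File)(body: String => T): T = {
  val server = new Server(0) // port 0 = let Jetty pick a free port
  val handler = new ResourceHandler()
  handler.setResourceBase(root.getAbsolutePath)
  server.setHandler(handler)
  server.start()
  try {
    body(server.getURI.toString.stripSuffix("/") + "/")
  } finally {
    server.stop()
  }
}

// Usage sketch: serve the repo's own data directory instead of a github URL.
// withHttpServer(new File("data/mllib")) { base =>
//   spark.read.text(base + "pagerank_data.txt").count()
// }
{code}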
[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full
[ https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304511#comment-17304511 ] Dongjoon Hyun commented on SPARK-34788: --- Could you be more specific, [~Ngone51]? In case of disk full, many parts are related. Which code path do you see `FileNotFoundException` instead of `IOException`? > Spark throws FileNotFoundException instead of IOException when disk is full > --- > > Key: SPARK-34788 > URL: https://issues.apache.org/jira/browse/SPARK-34788 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.4.7, 3.0.0, 3.1.0 >Reporter: wuyi >Priority: Major > > When the disk is full, Spark throws FileNotFoundException instead of > IOException with the hint. It's quite a confusing error to users. > > And there's probably a way to detect the disk full: when we get > `FileNotFoundException`, we try > [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] > to see if SyncFailedException throws. If SyncFailedException throws, then we > throw IOException with the disk full hint. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full
[ https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34788: -- Affects Version/s: (was: 2.4.7) (was: 3.1.0) (was: 3.0.0) 3.2.0 > Spark throws FileNotFoundException instead of IOException when disk is full > --- > > Key: SPARK-34788 > URL: https://issues.apache.org/jira/browse/SPARK-34788 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > > When the disk is full, Spark throws FileNotFoundException instead of > IOException with the hint. It's quite a confusing error to users. > > And there's probably a way to detect the disk full: when we get > `FileNotFoundException`, we try > [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] > to see if SyncFailedException throws. If SyncFailedException throws, then we > throw IOException with the disk full hint. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
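A hedged sketch of the detection idea in the description (after catching the `FileNotFoundException`, probe with a small sync'ed write and see whether `SyncFailedException` is thrown); this is entirely illustrative and not Spark's actual code:

{code:java}
import java.io.{File, FileNotFoundException, FileOutputStream, IOException, SyncFailedException}

// If a FileNotFoundException came out of a write under `dir`, try a tiny
// sync'ed probe write: a SyncFailedException (or another IOException) from the
// probe suggests the device is full, so wrap the original with a clearer hint.
def withDiskFullHint(dir: File, original: FileNotFoundException): IOException = {
  val probe = new File(dir, s".disk-probe-${System.nanoTime()}")
  try {
    val out = new FileOutputStream(probe)
    try {
      out.write(0)
      out.getFD.sync()
      original // the probe succeeded, so keep the original exception
    } finally {
      out.close()
      probe.delete()
    }
  } catch {
    case _: SyncFailedException | _: IOException =>
      new IOException(s"Writing under $dir failed, likely because the device is full", original)
  }
}
{code}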
[jira] [Updated] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)
[ https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34787: -- Affects Version/s: (was: 3.0.1) 3.2.0 > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX) > -- > > Key: SPARK-34787 > URL: https://issues.apache.org/jira/browse/SPARK-34787 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: akiyamaneko >Priority: Minor > > Option variable in Spark historyServer log should be displayed as actual > value instead of Some(XX): > {code:html} > 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt > application_1613641231234_0421/Some(1) 21/02/25 10:10:52 INFO > FsHistoryProvider: Parsing > hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421 > for listing data... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
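The fix the ticket asks for is a formatting change only; a minimal sketch (names hypothetical) of rendering the attempt id instead of its `Option` wrapper:

{code:java}
// Before: s"...attempt $appId/$attemptId" prints "Some(1)" when attemptId = Some("1").
// After: unwrap the Option explicitly when building the log message.
def attemptLogLine(appId: String, attemptId: Option[String]): String = {
  val attempt = attemptId.getOrElse("none")
  s"Failed to load application attempt $appId/$attempt"
}

// attemptLogLine("application_1613641231234_0421", Some("1"))
//   => "Failed to load application attempt application_1613641231234_0421/1"
{code}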
[jira] [Created] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries
Takeshi Yamamuro created SPARK-34795: Summary: Adds a new job in GitHub Actions to check the output of TPC-DS queries Key: SPARK-34795 URL: https://issues.apache.org/jira/browse/SPARK-34795 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.2.0 Reporter: Takeshi Yamamuro This ticket aims at adding a new job in GitHub Actions to check the output of TPC-DS queries. There are some cases where we noticed runtime-related bugs long after merging commits (e.g. SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check query output of TPC-DS (sf=1). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34795: - Description: This ticket aims at adding a new job in GitHub Actions to check the output of TPC-DS queries. There are some cases where we noticed runtime-realted bugs after merging commits (e.g. .SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check query output of TPC-DS (sf=1). (was: This ticket aims at adding a new job in GitHub Actions to check the output of TPC-DS queries. There are some cases where we noticed runtime-realted bugs far after merging commits (e.g. .SPARK-33822). Therefore, I think it is worth adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).) > Adds a new job in GitHub Actions to check the output of TPC-DS queries > -- > > Key: SPARK-34795 > URL: https://issues.apache.org/jira/browse/SPARK-34795 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Takeshi Yamamuro >Priority: Major > > This ticket aims at adding a new job in GitHub Actions to check the output of > TPC-DS queries. There are some cases where we noticed runtime-realted bugs > after merging commits (e.g. .SPARK-33822). Therefore, I think it is worth > adding a new job in GitHub Actions to check query output of TPC-DS (sf=1). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304521#comment-17304521 ] Apache Spark commented on SPARK-34795: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/31886 > Adds a new job in GitHub Actions to check the output of TPC-DS queries > -- > > Key: SPARK-34795 > URL: https://issues.apache.org/jira/browse/SPARK-34795 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Takeshi Yamamuro >Priority: Major > > This ticket aims at adding a new job in GitHub Actions to check the output of > TPC-DS queries. There are some cases where we noticed runtime-realted bugs > after merging commits (e.g. .SPARK-33822). Therefore, I think it is worth > adding a new job in GitHub Actions to check query output of TPC-DS (sf=1). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34795: Assignee: Apache Spark > Adds a new job in GitHub Actions to check the output of TPC-DS queries > -- > > Key: SPARK-34795 > URL: https://issues.apache.org/jira/browse/SPARK-34795 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Major > > This ticket aims at adding a new job in GitHub Actions to check the output of > TPC-DS queries. There are some cases where we noticed runtime-realted bugs > after merging commits (e.g. .SPARK-33822). Therefore, I think it is worth > adding a new job in GitHub Actions to check query output of TPC-DS (sf=1). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries
[ https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34795: Assignee: (was: Apache Spark) > Adds a new job in GitHub Actions to check the output of TPC-DS queries > -- > > Key: SPARK-34795 > URL: https://issues.apache.org/jira/browse/SPARK-34795 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Takeshi Yamamuro >Priority: Major > > This ticket aims at adding a new job in GitHub Actions to check the output of > TPC-DS queries. There are some cases where we noticed runtime-realted bugs > after merging commits (e.g. .SPARK-33822). Therefore, I think it is worth > adding a new job in GitHub Actions to check query output of TPC-DS (sf=1). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types
[ https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304526#comment-17304526 ] Simeon Simeonov commented on SPARK-27790: - Maxim, this is good stuff. Does ANSI SQL allow operations on dates using the YEAR-MONTH interval type? I didn't see that mentioned in Milestone 1. > Support ANSI SQL INTERVAL types > --- > > Key: SPARK-27790 > URL: https://issues.apache.org/jira/browse/SPARK-27790 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Spark has an INTERVAL data type, but it is “broken”: > # It cannot be persisted > # It is not comparable because it crosses the month day line. That is there > is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day” since not > all months have the same number of days. > I propose here to introduce the two flavours of INTERVAL as described in the > ANSI SQL Standard and deprecate the Sparks interval type. > * ANSI describes two non overlapping “classes”: > ** YEAR-MONTH, > ** DAY-SECOND ranges > * Members within each class can be compared and sorted. > * Supports datetime arithmetic > * Can be persisted. > The old and new flavors of INTERVAL can coexist until Spark INTERVAL is > eventually retired. Also any semantic “breakage” can be controlled via legacy > config settings. > *Milestone 1* -- Spark Interval equivalency ( The new interval types meet > or exceed all function of the existing SQL Interval): > * Add two new DataType implementations for interval year-month and > day-second. Includes the JSON format and DLL string. > * Infra support: check the caller sides of DateType/TimestampType > * Support the two new interval types in Dataset/UDF. > * Interval literals (with a legacy config to still allow mixed year-month > day-seconds fields and return legacy interval values) > * Interval arithmetic(interval * num, interval / num, interval +/- interval) > * Datetime functions/operators: Datetime - Datetime (to days or day second), > Datetime +/- interval > * Cast to and from the new two interval types, cast string to interval, cast > interval to string (pretty printing), with the SQL syntax to specify the types > * Support sorting intervals. > *Milestone 2* -- Persistence: > * Ability to create tables of type interval > * Ability to write to common file formats such as Parquet and JSON. > * INSERT, SELECT, UPDATE, MERGE > * Discovery > *Milestone 3* -- Client support > * JDBC support > * Hive Thrift server > *Milestone 4* -- PySpark and Spark R integration > * Python UDF can take and return intervals > * DataFrame support -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
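For context on the two interval classes discussed above, a small sketch of the ANSI literal syntax and the datetime arithmetic listed under Milestone 1; this assumes the syntax lands for 3.2.0 as outlined, and the result types may differ until the milestones are complete:

{code:java}
// Assumes an active SparkSession `spark`. Year-month and day-second intervals
// are kept in separate classes so values within a class remain comparable.
spark.sql("SELECT INTERVAL '1-2' YEAR TO MONTH AS ym").show()
spark.sql("SELECT INTERVAL '3 10:30:45.123' DAY TO SECOND AS ds").show()

// Datetime arithmetic from Milestone 1: Datetime +/- interval.
spark.sql("SELECT DATE'2021-03-18' + INTERVAL '1-2' YEAR TO MONTH AS shifted").show()
{code}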
[jira] [Assigned] (SPARK-34794) Nested higher-order functions broken in DSL
[ https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34794: Assignee: Apache Spark > Nested higher-order functions broken in DSL > --- > > Key: SPARK-34794 > URL: https://issues.apache.org/jira/browse/SPARK-34794 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: 3.1.1 >Reporter: Daniel Solow >Assignee: Apache Spark >Priority: Major > > In Spark 3, if I have: > {code:java} > val df = Seq( > (Seq(1,2,3), Seq("a", "b", "c")) > ).toDF("numbers", "letters") > {code} > and I want to take the cross product of these two arrays, I can do the > following in SQL: > {code:java} > df.selectExpr(""" > FLATTEN( > TRANSFORM( > numbers, > number -> TRANSFORM( > letters, > letter -> (number AS number, letter AS letter) > ) > ) > ) AS zipped > """).show(false) > ++ > |zipped | > ++ > |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| > ++ > {code} > This works fine. But when I try the equivalent using the scala DSL, the > result is wrong: > {code:java} > df.select( > f.flatten( > f.transform( > $"numbers", > (number: Column) => { f.transform( > $"letters", > (letter: Column) => { f.struct( > number.as("number"), > letter.as("letter") > ) } > ) } > ) > ).as("zipped") > ).show(10, false) > ++ > |zipped | > ++ > |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| > ++ > {code} > Note that the numbers are not included in the output. The explain for this > second version is: > {code:java} > == Parsed Logical Plan == > 'Project [flatten(transform('numbers, lambdafunction(transform('letters, > lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, > NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, > false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Analyzed Logical Plan == > zipped: array> > Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, > lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda > x#446, false)), lambda x#445, false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Optimized Logical Plan == > LocalRelation [zipped#444] > == Physical Plan == > LocalTableScan [zipped#444] > {code} > Seems like variable name x is hardcoded. And sure enough: > https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34794) Nested higher-order functions broken in DSL
[ https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34794: Assignee: (was: Apache Spark) > Nested higher-order functions broken in DSL > --- > > Key: SPARK-34794 > URL: https://issues.apache.org/jira/browse/SPARK-34794 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: 3.1.1 >Reporter: Daniel Solow >Priority: Major > > In Spark 3, if I have: > {code:java} > val df = Seq( > (Seq(1,2,3), Seq("a", "b", "c")) > ).toDF("numbers", "letters") > {code} > and I want to take the cross product of these two arrays, I can do the > following in SQL: > {code:java} > df.selectExpr(""" > FLATTEN( > TRANSFORM( > numbers, > number -> TRANSFORM( > letters, > letter -> (number AS number, letter AS letter) > ) > ) > ) AS zipped > """).show(false) > ++ > |zipped | > ++ > |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| > ++ > {code} > This works fine. But when I try the equivalent using the scala DSL, the > result is wrong: > {code:java} > df.select( > f.flatten( > f.transform( > $"numbers", > (number: Column) => { f.transform( > $"letters", > (letter: Column) => { f.struct( > number.as("number"), > letter.as("letter") > ) } > ) } > ) > ).as("zipped") > ).show(10, false) > ++ > |zipped | > ++ > |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| > ++ > {code} > Note that the numbers are not included in the output. The explain for this > second version is: > {code:java} > == Parsed Logical Plan == > 'Project [flatten(transform('numbers, lambdafunction(transform('letters, > lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, > NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, > false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Analyzed Logical Plan == > zipped: array> > Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, > lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda > x#446, false)), lambda x#445, false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Optimized Logical Plan == > LocalRelation [zipped#444] > == Physical Plan == > LocalTableScan [zipped#444] > {code} > Seems like variable name x is hardcoded. And sure enough: > https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34794) Nested higher-order functions broken in DSL
[ https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304532#comment-17304532 ] Apache Spark commented on SPARK-34794: -- User 'dmsolow' has created a pull request for this issue: https://github.com/apache/spark/pull/31887 > Nested higher-order functions broken in DSL > --- > > Key: SPARK-34794 > URL: https://issues.apache.org/jira/browse/SPARK-34794 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: 3.1.1 >Reporter: Daniel Solow >Priority: Major > > In Spark 3, if I have: > {code:java} > val df = Seq( > (Seq(1,2,3), Seq("a", "b", "c")) > ).toDF("numbers", "letters") > {code} > and I want to take the cross product of these two arrays, I can do the > following in SQL: > {code:java} > df.selectExpr(""" > FLATTEN( > TRANSFORM( > numbers, > number -> TRANSFORM( > letters, > letter -> (number AS number, letter AS letter) > ) > ) > ) AS zipped > """).show(false) > ++ > |zipped | > ++ > |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| > ++ > {code} > This works fine. But when I try the equivalent using the scala DSL, the > result is wrong: > {code:java} > df.select( > f.flatten( > f.transform( > $"numbers", > (number: Column) => { f.transform( > $"letters", > (letter: Column) => { f.struct( > number.as("number"), > letter.as("letter") > ) } > ) } > ) > ).as("zipped") > ).show(10, false) > ++ > |zipped | > ++ > |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| > ++ > {code} > Note that the numbers are not included in the output. The explain for this > second version is: > {code:java} > == Parsed Logical Plan == > 'Project [flatten(transform('numbers, lambdafunction(transform('letters, > lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, > NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, > false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Analyzed Logical Plan == > zipped: array> > Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, > lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda > x#446, false)), lambda x#445, false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Optimized Logical Plan == > LocalRelation [zipped#444] > == Physical Plan == > LocalTableScan [zipped#444] > {code} > Seems like variable name x is hardcoded. And sure enough: > https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34794) Nested higher-order functions broken in DSL
[ https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304533#comment-17304533 ] Daniel Solow commented on SPARK-34794: -- PR is at https://github.com/apache/spark/pull/31887 Thanks to [~rspitzer] for suggesting AtomicInteger > Nested higher-order functions broken in DSL > --- > > Key: SPARK-34794 > URL: https://issues.apache.org/jira/browse/SPARK-34794 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: 3.1.1 >Reporter: Daniel Solow >Priority: Major > > In Spark 3, if I have: > {code:java} > val df = Seq( > (Seq(1,2,3), Seq("a", "b", "c")) > ).toDF("numbers", "letters") > {code} > and I want to take the cross product of these two arrays, I can do the > following in SQL: > {code:java} > df.selectExpr(""" > FLATTEN( > TRANSFORM( > numbers, > number -> TRANSFORM( > letters, > letter -> (number AS number, letter AS letter) > ) > ) > ) AS zipped > """).show(false) > ++ > |zipped | > ++ > |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]| > ++ > {code} > This works fine. But when I try the equivalent using the scala DSL, the > result is wrong: > {code:java} > df.select( > f.flatten( > f.transform( > $"numbers", > (number: Column) => { f.transform( > $"letters", > (letter: Column) => { f.struct( > number.as("number"), > letter.as("letter") > ) } > ) } > ) > ).as("zipped") > ).show(10, false) > ++ > |zipped | > ++ > |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]| > ++ > {code} > Note that the numbers are not included in the output. The explain for this > second version is: > {code:java} > == Parsed Logical Plan == > 'Project [flatten(transform('numbers, lambdafunction(transform('letters, > lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, > NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, > false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Analyzed Logical Plan == > zipped: array> > Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, > lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda > x#446, false)), lambda x#445, false))) AS zipped#444] > +- Project [_1#303 AS numbers#308, _2#304 AS letters#309] >+- LocalRelation [_1#303, _2#304] > == Optimized Logical Plan == > LocalRelation [zipped#444] > == Physical Plan == > LocalTableScan [zipped#444] > {code} > Seems like variable name x is hardcoded. And sure enough: > https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
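A hedged sketch of the AtomicInteger idea mentioned in the comment above — how the DSL could allocate a fresh variable name per lambda instead of the hardcoded `x`. This is a simplified illustration of the concept only, not the actual patch in the linked PR:

{code:java}
import java.util.concurrent.atomic.AtomicInteger

// A process-wide counter hands out a distinct variable name for every lambda
// the DSL builds, so an inner transform's variable can no longer shadow the
// outer one.
object LambdaNames {
  private val counter = new AtomicInteger(0)
  def fresh(prefix: String = "x"): String = s"${prefix}_${counter.getAndIncrement()}"
}

// LambdaNames.fresh() => "x_0", LambdaNames.fresh() => "x_1", ...
{code}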
[jira] [Commented] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error
[ https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304538#comment-17304538 ] Apache Spark commented on SPARK-34596: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/31888 > NewInstance.doGenCode should not throw malformed class name error > - > > Key: SPARK-34596 > URL: https://issues.apache.org/jira/browse/SPARK-34596 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.2, 3.1.0 >Reporter: Kris Mok >Priority: Major > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > Similar to SPARK-32238 and SPARK-32999, the use of > {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic > because Scala classes may trigger {{java.lang.InternalError: Malformed class > name}}. > This happens more often when using nested classes in Scala (or declaring > classes in Scala REPL which implies class nesting). > Note that on newer versions of JDK the underlying malformed class name no > longer reproduces (fixed in the JDK by > https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less > of an issue there. But on JDK8u this problem still exists so we still have to > fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error
[ https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304539#comment-17304539 ] Apache Spark commented on SPARK-34596: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/31888 > NewInstance.doGenCode should not throw malformed class name error > - > > Key: SPARK-34596 > URL: https://issues.apache.org/jira/browse/SPARK-34596 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7, 3.0.2, 3.1.0 >Reporter: Kris Mok >Priority: Major > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > Similar to SPARK-32238 and SPARK-32999, the use of > {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic > because Scala classes may trigger {{java.lang.InternalError: Malformed class > name}}. > This happens more often when using nested classes in Scala (or declaring > classes in Scala REPL which implies class nesting). > Note that on newer versions of JDK the underlying malformed class name no > longer reproduces (fixed in the JDK by > https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less > of an issue there. But on JDK8u this problem still exists so we still have to > fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34781) Eliminate LEFT SEMI/ANTI join to its left child side with AQE
[ https://issues.apache.org/jira/browse/SPARK-34781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-34781. -- Fix Version/s: 3.2.0 Assignee: Cheng Su Resolution: Fixed Resolved by https://github.com/apache/spark/pull/31873 > Eliminate LEFT SEMI/ANTI join to its left child side with AQE > - > > Key: SPARK-34781 > URL: https://issues.apache.org/jira/browse/SPARK-34781 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Trivial > Fix For: 3.2.0 > > > In `EliminateJoinToEmptyRelation.scala`, we can extend it to cover more cases > for LEFT SEMI and LEFT ANI joins: > # Join is left semi join, join right side is non-empty and condition is > empty. Eliminate join to its left side. > # Join is left anti join, join right side is empty. Eliminate join to its > left side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
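To make the two cases concrete, a small sketch of query shapes that the extended rule can collapse to the left child under AQE; the data and names are illustrative, and whether the rule actually fires depends on AQE materializing the query stages at runtime:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("semi-anti-elimination").getOrCreate()
import spark.implicits._
spark.conf.set("spark.sql.adaptive.enabled", "true")

val left  = Seq(1, 2, 3).toDF("k")
val right = Seq(1, 2).toDF("k")

// Case 1: a LEFT SEMI join with no join condition whose right side turns out
// non-empty at runtime keeps every left row, so it can collapse to `left`.
val semiNoCondition = left.join(right, Nil, "left_semi")

// Case 2: a LEFT ANTI join whose right side turns out empty at runtime also
// keeps every left row.
val antiEmptyRight = left.join(right.where("k > 100"), Seq("k"), "left_anti")

semiNoCondition.collect()
antiEmptyRight.collect()
{code}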
[jira] [Created] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE
Cheng Su created SPARK-34796: Summary: Codegen compilation error for query with LIMIT operator and without AQE Key: SPARK-34796 URL: https://issues.apache.org/jira/browse/SPARK-34796 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.1.0, 3.2.0 Reporter: Cheng Su Example (reproduced in unit test): {code:java} test("failed limit query") { withTable("left_table", "empty_right_table", "output_table") { spark.range(5).toDF("k").write.saveAsTable("left_table") spark.range(0).toDF("k").write.saveAsTable("empty_right_table") withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { spark.sql("CREATE TABLE output_table (k INT) USING parquet") spark.sql( s""" |INSERT INTO TABLE output_table |SELECT t1.k FROM left_table t1 |JOIN empty_right_table t2 |ON t1.k = t2.k |LIMIT 3 |""".stripMargin) } } } {code} Result: {code:java} 17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17:46:01.540 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575) at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) at org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:384) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1445) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:384) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1312) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:833) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:410) at org.codehaus.janino.UnitCompi
[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE
[ https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304550#comment-17304550 ] Cheng Su commented on SPARK-34796: -- FYI I am working on a PR to fix it now. > Codegen compilation error for query with LIMIT operator and without AQE > --- > > Key: SPARK-34796 > URL: https://issues.apache.org/jira/browse/SPARK-34796 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Cheng Su >Priority: Blocker > > Example (reproduced in unit test): > > {code:java} > test("failed limit query") { > withTable("left_table", "empty_right_table", "output_table") { > spark.range(5).toDF("k").write.saveAsTable("left_table") > spark.range(0).toDF("k").write.saveAsTable("empty_right_table") > withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { > spark.sql("CREATE TABLE output_table (k INT) USING parquet") > spark.sql( > s""" > |INSERT INTO TABLE output_table > |SELECT t1.k FROM left_table t1 > |JOIN empty_right_table t2 > |ON t1.k = t2.k > |LIMIT 3 > |""".stripMargin) > } > } > } > {code} > Result: > {code:java} > 17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable > to load native-hadoop library for your platform... using builtin-java classes > where applicable > 17:46:01.540 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an > rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', > Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at > org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at > org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575) > at > org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) at > org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at > org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at > org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at > org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at > org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at > org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at > org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at > org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) > at > org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) > at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at > org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at > org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at > org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at > org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) > at > org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) > at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at > org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at > 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at > org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at > org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) > at > org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) > at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at > org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at > org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) > at > org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at > org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at > org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at > org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392) > at > org.codehaus.janin
[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE
[ https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-34796: - Description: Example (reproduced in unit test): {code:java} test("failed limit query") { withTable("left_table", "empty_right_table", "output_table") { spark.range(5).toDF("k").write.saveAsTable("left_table") spark.range(0).toDF("k").write.saveAsTable("empty_right_table") withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { spark.sql("CREATE TABLE output_table (k INT) USING parquet") spark.sql( s""" |INSERT INTO TABLE output_table |SELECT t1.k FROM left_table t1 |JOIN empty_right_table t2 |ON t1.k = t2.k |LIMIT 3 |""".stripMargin) } } } {code} Result: {code:java} 17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17:46:01.540 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575) at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) at org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392) at org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:384) at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1445) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:384) at org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1312) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:833) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:410) at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:389) at org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDecl
[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE
[ https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Su updated SPARK-34796: - Description: Example (reproduced in unit test): {code:java} test("failed limit query") { withTable("left_table", "empty_right_table", "output_table") { spark.range(5).toDF("k").write.saveAsTable("left_table") spark.range(0).toDF("k").write.saveAsTable("empty_right_table") withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { spark.sql("CREATE TABLE output_table (k INT) USING parquet") spark.sql( s""" |INSERT INTO TABLE output_table |SELECT t1.k FROM left_table t1 |JOIN empty_right_table t2 |ON t1.k = t2.k |LIMIT 3 |""".stripMargin) } } } {code} Result: https://gist.github.com/c21/ea760c75b546d903247582be656d9d66 was: Example (reproduced in unit test): {code:java} test("failed limit query") { withTable("left_table", "empty_right_table", "output_table") { spark.range(5).toDF("k").write.saveAsTable("left_table") spark.range(0).toDF("k").write.saveAsTable("empty_right_table") withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { spark.sql("CREATE TABLE output_table (k INT) USING parquet") spark.sql( s""" |INSERT INTO TABLE output_table |SELECT t1.k FROM left_table t1 |JOIN empty_right_table t2 |ON t1.k = t2.k |LIMIT 3 |""".stripMargin) } } } {code} Result: {code:java} 17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 17:46:01.540 ERROR org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575) at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717) at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008) at org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986) at 
org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) at org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) at org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at
[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE
[ https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304554#comment-17304554 ] Hyukjin Kwon commented on SPARK-34796: -- [~chengsu] I am lowering the priority because we enable AQE by default now and users get affected less. > Codegen compilation error for query with LIMIT operator and without AQE > --- > > Key: SPARK-34796 > URL: https://issues.apache.org/jira/browse/SPARK-34796 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Cheng Su >Priority: Blocker > > Example (reproduced in unit test): > > {code:java} > test("failed limit query") { > withTable("left_table", "empty_right_table", "output_table") { > spark.range(5).toDF("k").write.saveAsTable("left_table") > spark.range(0).toDF("k").write.saveAsTable("empty_right_table") > > withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { > spark.sql("CREATE TABLE output_table (k INT) USING parquet") > spark.sql( > s""" > |INSERT INTO TABLE output_table > |SELECT t1.k FROM left_table t1 > |JOIN empty_right_table t2 > |ON t1.k = t2.k > |LIMIT 3 > |""".stripMargin) > } > } > } > {code} > Result: > > https://gist.github.com/c21/ea760c75b546d903247582be656d9d66 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
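As the comment above notes, the failure only reproduces with adaptive execution disabled, so the practical mitigation while the fix is in flight is simply to leave AQE on (the default since Spark 3.1). A minimal sketch, assuming an active SparkSession {{spark}}:

{code:scala}
// Keep adaptive query execution enabled (the default in Spark 3.1+);
// the reproduction above only fails when it is explicitly set to "false".
spark.conf.set("spark.sql.adaptive.enabled", "true")
{code}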
[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE
[ https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-34796: - Priority: Critical (was: Blocker) > Codegen compilation error for query with LIMIT operator and without AQE > --- > > Key: SPARK-34796 > URL: https://issues.apache.org/jira/browse/SPARK-34796 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Cheng Su >Priority: Critical > > Example (reproduced in unit test): > > {code:java} > test("failed limit query") { > withTable("left_table", "empty_right_table", "output_table") { > spark.range(5).toDF("k").write.saveAsTable("left_table") > spark.range(0).toDF("k").write.saveAsTable("empty_right_table") > > withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { > spark.sql("CREATE TABLE output_table (k INT) USING parquet") > spark.sql( > s""" > |INSERT INTO TABLE output_table > |SELECT t1.k FROM left_table t1 > |JOIN empty_right_table t2 > |ON t1.k = t2.k > |LIMIT 3 > |""".stripMargin) > } > } > } > {code} > Result: > > https://gist.github.com/c21/ea760c75b546d903247582be656d9d66 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304555#comment-17304555 ] Hyukjin Kwon commented on SPARK-34791: -- It's possibly a regression from SPARK-29777 ... yeah it would be great if we can confirm if it works with the latest versions. > SparkR throws node stack overflow > - > > Key: SPARK-34791 > URL: https://issues.apache.org/jira/browse/SPARK-34791 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 3.0.1 >Reporter: obfuscated_dvlper >Priority: Major > > SparkR throws "node stack overflow" error upon running code (sample below) on > R-4.0.2 with Spark 3.0.1. > Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) > {code:java} > source('sample.R') > myclsr = myclosure_func() > myclsr$get_some_date('2021-01-01') > ## spark.lapply throws node stack overflow > result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { > source('sample.R') > another_closure = myclosure_func() > return(another_closure$get_some_date(rdate)) > }) > {code} > Sample.R > {code:java} > ## util function, which calls itself > getPreviousBusinessDate <- function(asofdate) { > asdt <- asofdate; > asdt <- as.Date(asofdate)-1; > wd <- format(as.Date(asdt),"%A") > if(wd == "Saturday" | wd == "Sunday") { > return (getPreviousBusinessDate(asdt)); > } > return (asdt); > } > ## closure which calls util function > myclosure_func = function() { > myclosure = list() > get_some_date = function (random_date) { > return (getPreviousBusinessDate(random_date)) > } > myclosure$get_some_date = get_some_date > return(myclosure) > } > {code} > This seems to have caused by sourcing sample.R twice. Once before invoking > Spark session and another within Spark session. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304558#comment-17304558 ] Hyukjin Kwon commented on SPARK-34791: -- cc [~falaki] FYI > SparkR throws node stack overflow > - > > Key: SPARK-34791 > URL: https://issues.apache.org/jira/browse/SPARK-34791 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 3.0.1 >Reporter: obfuscated_dvlper >Priority: Major > > SparkR throws "node stack overflow" error upon running code (sample below) on > R-4.0.2 with Spark 3.0.1. > Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) > {code:java} > source('sample.R') > myclsr = myclosure_func() > myclsr$get_some_date('2021-01-01') > ## spark.lapply throws node stack overflow > result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { > source('sample.R') > another_closure = myclosure_func() > return(another_closure$get_some_date(rdate)) > }) > {code} > Sample.R > {code:java} > ## util function, which calls itself > getPreviousBusinessDate <- function(asofdate) { > asdt <- asofdate; > asdt <- as.Date(asofdate)-1; > wd <- format(as.Date(asdt),"%A") > if(wd == "Saturday" | wd == "Sunday") { > return (getPreviousBusinessDate(asdt)); > } > return (asdt); > } > ## closure which calls util function > myclosure_func = function() { > myclosure = list() > get_some_date = function (random_date) { > return (getPreviousBusinessDate(random_date)) > } > myclosure$get_some_date = get_some_date > return(myclosure) > } > {code} > This seems to have caused by sourcing sample.R twice. Once before invoking > Spark session and another within Spark session. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow
[ https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304556#comment-17304556 ] Hyukjin Kwon commented on SPARK-34791: -- See also https://github.com/apache/spark/pull/26429#discussion_r346103050 > SparkR throws node stack overflow > - > > Key: SPARK-34791 > URL: https://issues.apache.org/jira/browse/SPARK-34791 > Project: Spark > Issue Type: Question > Components: SparkR >Affects Versions: 3.0.1 >Reporter: obfuscated_dvlper >Priority: Major > > SparkR throws "node stack overflow" error upon running code (sample below) on > R-4.0.2 with Spark 3.0.1. > Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5) > {code:java} > source('sample.R') > myclsr = myclosure_func() > myclsr$get_some_date('2021-01-01') > ## spark.lapply throws node stack overflow > result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) { > source('sample.R') > another_closure = myclosure_func() > return(another_closure$get_some_date(rdate)) > }) > {code} > Sample.R > {code:java} > ## util function, which calls itself > getPreviousBusinessDate <- function(asofdate) { > asdt <- asofdate; > asdt <- as.Date(asofdate)-1; > wd <- format(as.Date(asdt),"%A") > if(wd == "Saturday" | wd == "Sunday") { > return (getPreviousBusinessDate(asdt)); > } > return (asdt); > } > ## closure which calls util function > myclosure_func = function() { > myclosure = list() > get_some_date = function (random_date) { > return (getPreviousBusinessDate(random_date)) > } > myclosure$get_some_date = get_some_date > return(myclosure) > } > {code} > This seems to have caused by sourcing sample.R twice. Once before invoking > Spark session and another within Spark session. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE
[ https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304559#comment-17304559 ] Cheng Su commented on SPARK-34796: -- [~hyukjin.kwon] - yeah I agree with it. Thanks for correcting priority here. > Codegen compilation error for query with LIMIT operator and without AQE > --- > > Key: SPARK-34796 > URL: https://issues.apache.org/jira/browse/SPARK-34796 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Cheng Su >Priority: Critical > > Example (reproduced in unit test): > > {code:java} > test("failed limit query") { > withTable("left_table", "empty_right_table", "output_table") { > spark.range(5).toDF("k").write.saveAsTable("left_table") > spark.range(0).toDF("k").write.saveAsTable("empty_right_table") > > withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") { > spark.sql("CREATE TABLE output_table (k INT) USING parquet") > spark.sql( > s""" > |INSERT INTO TABLE output_table > |SELECT t1.k FROM left_table t1 > |JOIN empty_right_table t2 > |ON t1.k = t2.k > |LIMIT 3 > |""".stripMargin) > } > } > } > {code} > Result: > > https://gist.github.com/c21/ea760c75b546d903247582be656d9d66 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31661) Document usage of blockSize
[ https://issues.apache.org/jira/browse/SPARK-31661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31661: Assignee: zhengruifeng > Document usage of blockSize > --- > > Key: SPARK-31661 > URL: https://issues.apache.org/jira/browse/SPARK-31661 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31976) use MemoryUsage to control the size of block
[ https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31976: Assignee: zhengruifeng > use MemoryUsage to control the size of block > > > Key: SPARK-31976 > URL: https://issues.apache.org/jira/browse/SPARK-31976 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > According to the performance test in > https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is > mainly related to the nnz (number of non-zero entries) of a block. > So it may be more reasonable to control the size of a block by memory usage > instead of by number of rows. > > note1: the param blockSize is already used in ALS and MLP to stack vectors > (expected to be dense); > note2: we may refer to {{Strategy.maxMemoryInMB}} in tree models; > > There are two possible ways to implement this: > 1. compute the sparsity of the input vectors ahead of training (this can be computed > together with other statistics, possibly without an extra pass), and infer a > reasonable number of vectors to stack; > 2. stack the input vectors adaptively, monitoring the memory usage of the block > being built; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
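To illustrate the second approach listed in SPARK-31976 above (stacking input vectors adaptively while monitoring memory usage), a rough sketch; the ~12-bytes-per-active-entry estimate and the helper name {{stackByMemory}} are assumptions, not Spark's implementation.

{code:scala}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.ml.linalg.Vector

// Group vectors into blocks whose estimated in-memory size stays under maxBlockBytes,
// instead of stacking a fixed number of rows per block.
def stackByMemory(vectors: Seq[Vector], maxBlockBytes: Long): Seq[Array[Vector]] = {
  val blocks = ArrayBuffer.empty[Array[Vector]]
  val current = ArrayBuffer.empty[Vector]
  var bytes = 0L
  for (v <- vectors) {
    // Rough estimate: ~12 bytes per active entry (8-byte value + 4-byte index).
    val est = 12L * v.numActives
    if (current.nonEmpty && bytes + est > maxBlockBytes) {
      blocks += current.toArray   // flush the block once the budget would be exceeded
      current.clear()
      bytes = 0L
    }
    current += v
    bytes += est
  }
  if (current.nonEmpty) blocks += current.toArray
  blocks.toSeq
}
{code}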
[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used
[ https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304561#comment-17304561 ] Hyukjin Kwon commented on SPARK-34780: -- cc [~sunchao] [~maxgekk] FYI > Cached Table (parquet) with old Configs Used > > > Key: SPARK-34780 > URL: https://issues.apache.org/jira/browse/SPARK-34780 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.1 >Reporter: Michael Chen >Priority: Major > > When a dataframe is cached, the logical plan can contain copies of the spark > session meaning the SQLConfs are stored. Then if a different dataframe can > replace parts of it's logical plan with a cached logical plan, the cached > SQLConfs will be used for the evaluation of the cached logical plan. This is > because HadoopFsRelation ignores sparkSession for equality checks (introduced > in https://issues.apache.org/jira/browse/SPARK-17358). > {code:java} > test("cache uses old SQLConf") { > import testImplicits._ > withTempDir { dir => > val tableDir = dir.getAbsoluteFile + "/table" > val df = Seq("a").toDF("key") > df.write.parquet(tableDir) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1Stats = spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10") > val df2 = spark.read.parquet(tableDir).select("key") > df2.cache() > val compression10Stats = df2.queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1") > val compression1StatsWithCache = > spark.read.parquet(tableDir).select("key"). > queryExecution.optimizedPlan.collect { > case l: LogicalRelation => l > case m: InMemoryRelation => m > }.map(_.computeStats()) > // I expect these stats to be the same because file compression factor is > the same > assert(compression1Stats == compression1StatsWithCache) > // Instead, we can see the file compression factor is being cached and > used along with > // the logical plan > assert(compression10Stats == compression1StatsWithCache) > } > }{code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34782) Not equal functions (!=, <>) are missing in SQL doc
[ https://issues.apache.org/jira/browse/SPARK-34782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304560#comment-17304560 ] Hyukjin Kwon commented on SPARK-34782: -- Duplicate of SPARK-34747? > Not equal functions (!=, <>) are missing in SQL doc > > > Key: SPARK-34782 > URL: https://issues.apache.org/jira/browse/SPARK-34782 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.2.0 >Reporter: wuyi >Priority: Major > > https://spark.apache.org/docs/latest/api/sql/search.html?q=%3C%3E -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34747) Add virtual operators to the built-in function document.
[ https://issues.apache.org/jira/browse/SPARK-34747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34747. -- Fix Version/s: 3.0.3 3.1.2 3.2.0 Resolution: Fixed Fixed in https://github.com/apache/spark/pull/31841 > Add virtual operators to the built-in function document. > > > Key: SPARK-34747 > URL: https://issues.apache.org/jira/browse/SPARK-34747 > Project: Spark > Issue Type: Bug > Components: docs, SQL >Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0, 3.1.2, 3.0.3 > > > After SPARK-34697, DESCRIBE FUNCTION and SHOW FUNCTIONS can describe/show > built-in operators including the following virtual operators. > * != > * <> > * between > * case > * || > But they are still absent from the built-in functions document. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34754) sparksql 'add jar' not support hdfs ha mode in k8s
[ https://issues.apache.org/jira/browse/SPARK-34754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304568#comment-17304568 ] Hyukjin Kwon commented on SPARK-34754: -- [~lithiumlee-_-] can you test if it works in higher versions? K8S support just became GA from Spark 3.1. > sparksql 'add jar' not support hdfs ha mode in k8s > -- > > Key: SPARK-34754 > URL: https://issues.apache.org/jira/browse/SPARK-34754 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 2.4.7 >Reporter: lithiumlee-_- >Priority: Major > > Submit app to K8S, the executors meet exception > "java.net.UnknownHostException: xx". > The udf jar uri using hdfs ha style, but the exception stack show > "...*createNonHAProxy*..." > > hql: > {code:java} > // code placeholder > add jar hdfs://xx/test.jar; > create temporary function test_udf as 'com.xxx.xxx'; > create table test.test_udf as > select test_udf('1') name_1; > {code} > > > exception: > {code:java} > // code placeholder > TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.89.44, executor > 1): java.lang.IllegalArgumentException: java.net.UnknownHostException: xx > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:439) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:321) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:696) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:636) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2796) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2830) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2812) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390) > at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1866) > at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:721) > at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:816) > at > org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:808) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:130) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:808) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.net.UnknownHostException: xx > ... 
28 more > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34600) Support user defined types in Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-34600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304569#comment-17304569 ] Hyukjin Kwon commented on SPARK-34600: -- [~eddyxu] if this ticket is an umbrella ticket, let's create another JIRA and move your PR (https://github.com/apache/spark/pull/31735) to that JIRA. umbrella JIRA doesn't usually have a PR against it. > Support user defined types in Pandas UDF > > > Key: SPARK-34600 > URL: https://issues.apache.org/jira/browse/SPARK-34600 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 3.0.2, 3.1.1 >Reporter: Lei (Eddy) Xu >Priority: Major > Labels: pandas, sql, udf > > Because Pandas UDF uses pyarrow to passing data, it does not currently > support UserDefinedTypes, as what normal python udf does. > For example: > {code:python} > class BoxType(UserDefinedType): > @classmethod > def sqlType(cls) -> StructType: > return StructType( > fields=[ > StructField("xmin", DoubleType(), False), > StructField("ymin", DoubleType(), False), > StructField("xmax", DoubleType(), False), > StructField("ymax", DoubleType(), False), > ] > ) > @pandas_udf( > returnType=StructType([StructField("boxes", ArrayType(Box()))] > ) > def pandas_pf(s: pd.DataFrame) -> pd.DataFrame: >yield s > {code} > The logs show > {code} > try: > to_arrow_type(self._returnType_placeholder) > except TypeError: > > raise NotImplementedError( > "Invalid return type with scalar Pandas UDFs: %s is " > E NotImplementedError: Invalid return type with scalar > Pandas UDFs: StructType(List(StructField(boxes,ArrayType(Box,true),true))) is > not supported > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)
[ https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304571#comment-17304571 ] Hyukjin Kwon commented on SPARK-34776: -- cc [~viirya] FYI > Catalyst error on on certain struct operation (Couldn't find _gen_alias_) > - > > Key: SPARK-34776 > URL: https://issues.apache.org/jira/browse/SPARK-34776 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: Spark 3.1.1 > Scala 2.12.10 >Reporter: Daniel Solow >Priority: Major > > When I run the following: > {code:java} > import org.apache.spark.sql.{functions => f} > import org.apache.spark.sql.expressions.Window > val df = Seq( > ("t1", "123", "bob"), > ("t1", "456", "bob"), > ("t2", "123", "sam") > ).toDF("type", "value", "name") > val test = df.select( > $"*", > f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", > $"name")).as("count"), $"name").as("name_count") > ).select( > $"*", > f.max($"name_count").over(Window.partitionBy($"type", > $"value")).as("best_name") > ) > test.printSchema > {code} > I get the following schema, which is fine: > {code:java} > root > |-- type: string (nullable = true) > |-- value: string (nullable = true) > |-- name: string (nullable = true) > |-- name_count: struct (nullable = false) > ||-- count: long (nullable = false) > ||-- name: string (nullable = true) > |-- best_name: struct (nullable = true) > ||-- count: long (nullable = false) > ||-- name: string (nullable = true) > {code} > However when I get a subfield of the best_name struct, I get an error: > {code:java} > test.select($"best_name.name").show(10, false) > org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding > attribute, tree: _gen_alias_3458#3458 > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35) > at > org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589) > at > org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at >