[jira] [Commented] (SPARK-31709) Proper base path for location when it is a relative path
[ https://issues.apache.org/jira/browse/SPARK-31709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489528#comment-17489528 ] Apache Spark commented on SPARK-31709: -- User 'bozhang2820' has created a pull request for this issue: https://github.com/apache/spark/pull/35462 > Proper base path for location when it is a relative path > > > Key: SPARK-31709 > URL: https://issues.apache.org/jira/browse/SPARK-31709 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > Currently, the user home directory is used as the base path for database > and table locations when the location is specified with a relative path, > e.g. > {code:sql} > > set spark.sql.warehouse.dir; > spark.sql.warehouse.dir > file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/ > spark-sql> create database loctest location 'loctestdbdir'; > spark-sql> desc database loctest; > Database Name loctest > Comment > Location > file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir > Owner kentyao > spark-sql> create table loctest(id int) location 'loctestdbdir'; > spark-sql> desc formatted loctest; > id int NULL > # Detailed Table Information > Database default > Table loctest > Owner kentyao > Created Time Thu May 14 16:29:05 CST 2020 > Last Access UNKNOWN > Created By Spark 3.1.0-SNAPSHOT > Type EXTERNAL > Provider parquet > Location > file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir > Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe > InputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat > OutputFormat org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat > {code} > The user home directory is not always warehouse-related, cannot be changed at > runtime, and is shared by both databases and tables as the parent directory.
Meanwhile, we use the table path as the parent directory for relative partition locations. The config `spark.sql.warehouse.dir` represents the default location for managed databases and tables. For databases, the case above does not seem to follow its semantics. For tables it is correct, but here I suggest enriching its meaning so that it also covers external tables whose locations are given as relative paths. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
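The resolution rule proposed in this issue can be sketched outside Spark. The following is a hypothetical illustration of the suggested behavior, not Spark's actual implementation: a relative `LOCATION` would resolve against `spark.sql.warehouse.dir` instead of the user home directory.

```python
def resolve_location(location: str, warehouse_dir: str) -> str:
    """Resolve a database/table LOCATION clause against the warehouse
    directory when it is a relative path (sketch of the proposal)."""
    # Fully-qualified URIs and absolute paths are kept as-is.
    if "://" in location or location.startswith(("/", "file:")):
        return location
    # Relative locations are resolved under spark.sql.warehouse.dir.
    return warehouse_dir.rstrip("/") + "/" + location

# A relative path lands under the warehouse, not under the user home:
print(resolve_location("loctestdbdir", "file:/opt/spark/spark-warehouse"))
# file:/opt/spark/spark-warehouse/loctestdbdir
print(resolve_location("hdfs://nn/data/t1", "file:/opt/spark/spark-warehouse"))
# hdfs://nn/data/t1
```

The warehouse path and helper name here are illustrative; the point is only that the base path is a configurable, warehouse-related directory shared consistently by databases and tables.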
[jira] [Resolved] (SPARK-36145) Remove Python 3.6 support in codebase and CI
[ https://issues.apache.org/jira/browse/SPARK-36145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36145. -- Assignee: Maciej Szymkiewicz Resolution: Done > Remove Python 3.6 support in codebase and CI > > > Key: SPARK-36145 > URL: https://issues.apache.org/jira/browse/SPARK-36145 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Critical > > Python 3.6 was deprecated via SPARK-35938 in Apache Spark 3.2. We should > remove it in Spark 3.3. > This JIRA also targets all the changes in CI and development, not only > user-facing changes.
[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489523#comment-17489523 ] Willi Raschkowski commented on SPARK-38166: --- Attaching driver logs: [^driver.log] Notable lines are probably: {code:java} ... INFO [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the previous stage needs to be re-run, or because a different copy of the task has already succeeded). INFO [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) INFO [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead. at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663) ... Caused by: org.apache.spark.ExecutorDeadException: The relative remote executor(Id: 2), which maintains the block data to fetch is dead. at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132) at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141) ... 
INFO [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting ShuffleMapStage 5 (writeAndRead at CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) due to fetch failure INFO [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: Resubmitting failed stages ... {code} > Duplicates after task failure in dropDuplicates and repartition > --- > > Key: SPARK-38166 > URL: https://issues.apache.org/jira/browse/SPARK-38166 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 > Environment: Cluster runs on K8s. AQE is enabled. >Reporter: Willi Raschkowski >Priority: Major > Labels: correctness > Attachments: driver.log > > > We're seeing duplicates after running the following > {code} > def compute_shipments(shipments): > shipments = shipments.dropDuplicates(["ship_trck_num"]) > shipments = shipments.repartition(4) > return shipments > {code} > and observing lost executors (OOMs) and task retries in the repartition stage. > We're seeing this reliably in one of our pipelines. But I haven't managed to > reproduce outside of that pipeline. I'll attach driver logs and the > notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
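The failure mode reported above can be illustrated with a small standalone simulation (illustrative only, not Spark code). `dropDuplicates` keeps an arbitrary row per key, so a recomputed upstream partition can emit rows in a different order; round-robin `repartition` then assigns rows to output partitions by position, so a retried downstream task can re-read rows that other, already-successful tasks wrote, duplicating some rows and dropping others:

```python
def round_robin(rows, num_partitions):
    # Positional round-robin assignment, like repartition(n) without a key.
    parts = [[] for _ in range(num_partitions)]
    for i, row in enumerate(rows):
        parts[i % num_partitions].append(row)
    return parts

first = round_robin([0, 1, 2, 3, 4, 5, 6, 7], 4)   # original attempt
second = round_robin([0, 1, 2, 4, 3, 5, 6, 7], 4)  # recomputed input, order differs

# Partitions 0-2 were committed by the first attempt; only the failed
# partition 3 is re-executed against the recomputed input:
written = first[0] + first[1] + first[2] + second[3]
print(sorted(written))  # [0, 1, 2, 4, 4, 5, 6, 7] -> row 4 duplicated, row 3 lost
```

The partition counts and row orderings are made up for the sketch; the takeaway is that any non-deterministic ordering upstream of a positional repartition can produce duplicates and losses under partial task retries.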
[jira] [Updated] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
[ https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-38166: -- Attachment: driver.log
[jira] [Commented] (SPARK-38139) ml.recommendation.ALS doctests failures
[ https://issues.apache.org/jira/browse/SPARK-38139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489521#comment-17489521 ] zhengruifeng commented on SPARK-38139: -- I think it is ok to adjust the tol in this case > ml.recommendation.ALS doctests failures > --- > > Key: SPARK-38139 > URL: https://issues.apache.org/jira/browse/SPARK-38139 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Priority: Major > > In my dev setups, the ml.recommendation.ALS doctest consistently converges to a value > lower than expected and fails with: > {code:python} > File "/path/to/spark/python/pyspark/ml/recommendation.py", line 322, in > __main__.ALS > Failed example: > predictions[0] > Expected: > Row(user=0, item=2, newPrediction=0.69291...) > Got: > Row(user=0, item=2, newPrediction=0.6929099559783936) > {code} > I can correct for that, but it creates some noise, so if anyone else > experiences this, we could drop a digit from the expected results: > {code} > diff --git a/python/pyspark/ml/recommendation.py > b/python/pyspark/ml/recommendation.py > index f0628fb922..b8e2a6097d 100644 > --- a/python/pyspark/ml/recommendation.py > +++ b/python/pyspark/ml/recommendation.py > @@ -320,7 +320,7 @@ class ALS(JavaEstimator, _ALSParams, JavaMLWritable, > JavaMLReadable): > >>> test = spark.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", > "item"]) > >>> predictions = sorted(model.transform(test).collect(), key=lambda r: > r[0]) > >>> predictions[0] > -Row(user=0, item=2, newPrediction=0.69291...) > +Row(user=0, item=2, newPrediction=0.6929...) > >>> predictions[1] > Row(user=1, item=0, newPrediction=3.47356...) > >>> predictions[2] > {code}
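The proposed change relies on doctest's ELLIPSIS option: dropping one digit from the expected value widens the tolerance of the match. A minimal standalone illustration of why `0.6929...` passes where `0.69291...` fails for a run that converges to 0.69290...:

```python
import doctest

checker = doctest.OutputChecker()
flags = doctest.ELLIPSIS

# The actual value observed in the failing dev setup:
got = "Row(user=0, item=2, newPrediction=0.6929099559783936)\n"

# The stricter expectation does not match (0.69291 is not a prefix of 0.69290...):
print(checker.check_output("Row(user=0, item=2, newPrediction=0.69291...)\n", got, flags))  # False
# Dropping a digit makes the doctest tolerant of last-digit drift:
print(checker.check_output("Row(user=0, item=2, newPrediction=0.6929...)\n", got, flags))   # True
```

This shows the trade-off plainly: one fewer digit of precision in the documented expectation, in exchange for robustness across environments that converge slightly differently.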
[jira] [Commented] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results
[ https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489518#comment-17489518 ] zhengruifeng commented on SPARK-34160: -- You can get a sparse vector by calling {{vector.compressed}} > pyspark.ml.stat.Summarizer should allow sparse vector results > - > > Key: SPARK-34160 > URL: https://issues.apache.org/jira/browse/SPARK-34160 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 3.0.1 >Reporter: Ophir Yoktan >Priority: Major > > Currently pyspark.ml.stat.Summarizer will return a dense vector, even if the > input is sparse. > The Summarizer should either deduce the relevant type from the input, or add > a parameter that forces sparse output. > Code to reproduce: > {code:python} > import pyspark > from pyspark.sql.functions import col > from pyspark.ml.stat import Summarizer > from pyspark.ml.linalg import SparseVector, DenseVector > sc = pyspark.SparkContext.getOrCreate() > sql_context = pyspark.SQLContext(sc) > df = sc.parallelize([(SparseVector(100, {1: 1.0}),)]).toDF(['v']) > print(df.head()) > print(df.select(Summarizer.mean(col('v'))).head()) > {code} > Output: > {code} > Row(v=SparseVector(100, {1: 1.0})) > Row(mean(v)=DenseVector([0.0, 1.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))
> {code}
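For context, `compressed` returns whichever vector representation is estimated to be smaller. A standalone sketch of that selection rule, modeled on MLlib's size heuristic (the real logic lives in `pyspark.ml.linalg.Vector.compressed`, so treat the details here as an approximation):

```python
def compressed(values):
    """Return ('sparse', size, [(idx, val), ...]) or ('dense', [...]),
    choosing the representation estimated to use less storage."""
    size = len(values)
    nnz = sum(1 for v in values if v != 0.0)
    # A sparse entry costs roughly 1.5x a dense slot (index + value),
    # so sparse wins only when the vector is sparse enough.
    if 1.5 * (nnz + 1.0) < size:
        return ("sparse", size, [(i, v) for i, v in enumerate(values) if v != 0.0])
    return ("dense", list(values))

mean = [0.0, 1.0] + [0.0] * 98        # the Summarizer.mean result from above
print(compressed(mean)[0])            # sparse
print(compressed([1.0, 2.0, 0.0])[0]) # dense
```

Applied to the example above, the 100-slot mean with a single non-zero comes back sparse, which is why calling `.compressed` on the Summarizer result resolves the reporter's concern without a new parameter.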
[jira] [Resolved] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results
[ https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-34160. -- Resolution: Not A Problem
[jira] [Commented] (SPARK-34452) OneVsRest with GBTClassifier throws InternalCompilerException in 3.1.0
[ https://issues.apache.org/jira/browse/SPARK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489517#comment-17489517 ] zhengruifeng commented on SPARK-34452: -- I can not reproduce this issue in 3.1.2, could you please provide a simple example? {code:java} scala> import org.apache.spark.ml.classification._ import org.apache.spark.ml.classification._ scala> scala> val df = spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt") 22/02/09 20:38:21 WARN LibSVMFileFormat: 'numFeatures' option not specified, determining the number of features by going though the input. If you know the number in advance, please specify it via 'numFeatures' option to avoid the extra scan. df: org.apache.spark.sql.DataFrame = [label: double, features: vector] scala> scala> val classifier = new GBTClassifier().setMaxIter(2) classifier: org.apache.spark.ml.classification.GBTClassifier = gbtc_e9ae5159908e scala> val ovr = new OneVsRest().setClassifier(classifier) ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_466495ea9392 scala> val ovrm = ovr.fit(df) ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: uid=oneVsRest_466495ea9392, classifier=gbtc_e9ae5159908e, numClasses=3, numFeatures=4 scala> ovrm.transform(df).show 22/02/09 20:38:27 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 22/02/09 20:38:27 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS +-+++--+ |label| features| rawPrediction|prediction| +-+++--+ | 1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 1.0| | 1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 1.0| | 1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 1.0| | 1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 1.0| | 0.0|(4,[0,1,2,3],[0.1...|[1.04768116880884...| 0.0| | 1.0|(4,[0,2,3],[-0.83...|[-1.0476811688088...| 1.0| | 2.0|(4,[0,1,2,3],[-1|[-1.0476811688088...| 2.0| | 
2.0|(4,[0,1,2,3],[-1|[-1.0476811688088...| 2.0| | 1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 1.0| | 0.0|(4,[0,2,3],[0.611...|[1.04768116880884...| 0.0| | 0.0|(4,[0,1,2,3],[0.2...|[1.04768116880884...| 0.0| | 1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 1.0| | 1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 1.0| | 2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 2.0| | 2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 2.0| | 2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 2.0| | 1.0|(4,[0,2,3],[-0.94...|[-1.0476811688088...| 1.0| | 2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 2.0| | 0.0|(4,[0,1,2,3],[0.1...|[1.04768116880884...| 0.0| | 2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...| 2.0| +-+++--+ only showing top 20 rows{code} > OneVsRest with GBTClassifier throws InternalCompilerException in 3.1.0 > -- > > Key: SPARK-34452 > URL: https://issues.apache.org/jira/browse/SPARK-34452 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.1.0 >Reporter: Danijel Zečević >Priority: Major > > It looks like a problem with user defined function from OneVsRestModel. 
> Log: > {code} > 2021-02-17 13:24:17.517 ERROR 711498 --- [818.0 (TID 818)] > o.a.s.s.c.e.codegen.CodeGenerator: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > ~[janino-3.0.8.jar:na] > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > ~[janino-3.0.8.jar:na] > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > ~[janino-3.0.8.jar:na] > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > ~[janino-3.0.8.jar:na] > at > org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > ~[janino-3.0.8.jar:na] > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) > ~[janino-3.0.8.jar:na] > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:82) > ~[commons-compiler-3.1.2.jar:na] > at >
[jira] [Created] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition
Willi Raschkowski created SPARK-38166: - Summary: Duplicates after task failure in dropDuplicates and repartition Key: SPARK-38166 URL: https://issues.apache.org/jira/browse/SPARK-38166 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2 Environment: Cluster runs on K8s. AQE is enabled. Reporter: Willi Raschkowski We're seeing duplicates after running the following {code} def compute_shipments(shipments): shipments = shipments.dropDuplicates(["ship_trck_num"]) shipments = shipments.repartition(4) return shipments {code} and observing lost executors (OOMs) and task retries in the repartition stage. We're seeing this reliably in one of our pipelines. But I haven't managed to reproduce outside of that pipeline. I'll attach driver logs and the notionalized input data - maybe you have ideas. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38165) private classes fail at runtime in scala 2.12.13+
[ https://issues.apache.org/jira/browse/SPARK-38165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Everson updated SPARK-38165: --- Description: h2. reproduction steps {code:java} git clone g...@github.com:everson/spark-codegen-bug.git sbt +test {code} h2. problem Starting with Scala 2.12.13, Spark code (tried 3.1.x and 3.2.x versions) referring to case class members fails at runtime. See the discussion on [https://github.com/scala/bug/issues/12533] for the exact internal details from the Scala contributors, but the gist is that starting with Scala 2.12.13, inner-class visibility rules changed via https://github.com/scala/scala/pull/9131 and it appears that Spark's codegen assumes they are public. In a complex project, the error looks like: {code:java} [error] Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, Column 8: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, Column 8: Private member cannot be accessed from type "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
[error] at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) [error] at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) [error] at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) [error] at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) [error] at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620) [error] at org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362) [error] at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349) [error] at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) [error] at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) [error] at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) [error] at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) [error] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351) [error] at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205) [error] at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39) [error] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277) [error] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274) [error] at org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147) [error] at org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326) [error] at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) [error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) [error] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) [error] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) [error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) [error] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) [error] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) [error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) [error] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) [error] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) [error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) [error] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) [error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) [error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) [error] at org.apache.spark.scheduler.Task.run(Task.scala:131) [error] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) [error] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) [error] at
[jira] [Updated] (SPARK-38165) private classes fail at runtime in scala 2.12.13+
[ https://issues.apache.org/jira/browse/SPARK-38165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Johnny Everson updated SPARK-38165: --- Description: h2. reproduction steps {code:java} git clone g...@github.com:everson/spark-codegen-bug.git sbt +test {code} h2. problem Starting with Scala 2.12.13, Spark code (tried 3.1.x and 3.2.x versions) referring to private members fails at runtime. Also reported in [https://github.com/scala/bug/issues/12533]
[jira] [Assigned] (SPARK-38164) New SQL function: try_subtract and try_multiply
[ https://issues.apache.org/jira/browse/SPARK-38164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38164: Assignee: Apache Spark (was: Gengliang Wang) > New SQL function: try_subtract and try_multiply > --- > > Key: SPARK-38164 > URL: https://issues.apache.org/jira/browse/SPARK-38164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38164) New SQL function: try_subtract and try_multiply
[ https://issues.apache.org/jira/browse/SPARK-38164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489490#comment-17489490 ] Apache Spark commented on SPARK-38164: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/35461
[jira] [Assigned] (SPARK-38164) New SQL function: try_subtract and try_multiply
[ https://issues.apache.org/jira/browse/SPARK-38164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38164: Assignee: Gengliang Wang (was: Apache Spark)
[jira] [Created] (SPARK-38165) private classes fail at runtime in scala 2.12.13+
Johnny Everson created SPARK-38165: -- Summary: private classes fail at runtime in scala 2.12.13+ Key: SPARK-38165 URL: https://issues.apache.org/jira/browse/SPARK-38165 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Environment: Tested using JVM 8 and 11 on Scala versions 2.12.12 (works), 2.12.13 to 2.12.15, and 2.13.7 to 2.13.8 Reporter: Johnny Everson h2. reproduction steps {code} git clone g...@github.com:everson/spark-codegen-bug.git sbt +test {code} h2. problem Starting with Scala 2.12.13, Spark code referring to private members fails at runtime. Also reported in https://github.com/scala/bug/issues/12533 In a complex project, the error looks like: {code} [error] Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, Column 8: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, Column 8: Private member cannot be accessed from type "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
[error] at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) [error] at org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) [error] at org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) [error] at org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) [error] at org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620) [error] at org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362) [error] at org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349) [error] at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) [error] at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) [error] at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) [error] at org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) [error] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351) [error] at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205) [error] at org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39) [error] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277) [error] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274) [error] at org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147) [error] at org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326) [error] at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) [error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) [error] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) [error] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) [error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) [error] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) [error] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) [error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) [error] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) [error] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) [error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) [error] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) [error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) [error] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) [error] at org.apache.spark.scheduler.Task.run(Task.scala:131) [error] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497) [error] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439) [error] at
[jira] [Created] (SPARK-38164) New SQL function: try_subtract and try_multiply
Gengliang Wang created SPARK-38164: -- Summary: New SQL function: try_subtract and try_multiply Key: SPARK-38164 URL: https://issues.apache.org/jira/browse/SPARK-38164 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang
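Like the existing try_add and try_divide, the proposed functions follow ANSI arithmetic but return NULL instead of raising on overflow. As a rough illustration of that semantics for 64-bit longs, here is a plain-Python sketch (an illustration of the intended behavior, not Spark's actual implementation):

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1

def _check(v):
    # Return the value if it fits in a signed 64-bit long, else None (NULL).
    return v if INT64_MIN <= v <= INT64_MAX else None

def try_subtract(a, b):
    # NULL-propagating, overflow-safe subtraction.
    if a is None or b is None:
        return None
    return _check(a - b)

def try_multiply(a, b):
    # NULL-propagating, overflow-safe multiplication.
    if a is None or b is None:
        return None
    return _check(a * b)
```

For example, `try_subtract(INT64_MIN, 1)` yields None (NULL) where plain ANSI-mode subtraction would raise an overflow error.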
[jira] [Created] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing a function builder
Max Gekk created SPARK-38163: Summary: Preserve the error class of `AnalysisException` while constructing a function builder Key: SPARK-38163 URL: https://issues.apache.org/jira/browse/SPARK-38163 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: Max Gekk Assignee: Max Gekk When the cause exception is `AnalysisException` at https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132, Spark loses info about the error class. Need to preserve the info.
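The pattern behind the fix can be sketched as: when wrapping the builder's AnalysisException, carry the original error class over to the new exception instead of discarding it. A plain-Python sketch with simplified, hypothetical class shapes (not Spark's actual classes):

```python
class AnalysisException(Exception):
    # Simplified stand-in for Spark's AnalysisException: carries an
    # optional machine-readable error class alongside the message.
    def __init__(self, message, error_class=None):
        super().__init__(message)
        self.error_class = error_class

def build_function(builder):
    # Invoke a function builder; on failure, wrap the error with context
    # but preserve the original error class instead of dropping it.
    try:
        return builder()
    except AnalysisException as e:
        raise AnalysisException(
            f"Invalid arguments for function: {e}",
            error_class=e.error_class,  # preserved, not lost
        ) from e
```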
[jira] [Commented] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing a function builder
[ https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489482#comment-17489482 ] Max Gekk commented on SPARK-38163: -- I am working on the fix. > Preserve the error class of `AnalysisException` while constructing a > function builder > -- > > Key: SPARK-38163 > URL: https://issues.apache.org/jira/browse/SPARK-38163 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > When the cause exception is `AnalysisException` at > https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132, > Spark loses info about the error class. Need to preserve the info.
[jira] [Assigned] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened
[ https://issues.apache.org/jira/browse/SPARK-38160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38160: Assignee: (was: Apache Spark) > Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed > happened > --- > > Key: SPARK-38160 > URL: https://issues.apache.org/jira/browse/SPARK-38160 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor > > When we do shuffle on indeterminate expressions such as rand, and > ShuffleFetchFailed happened, we may get incorrect results since it only retries > failed map tasks. > We try to fix this by retrying all upstream map tasks in this situation.
[jira] [Commented] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened
[ https://issues.apache.org/jira/browse/SPARK-38160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489433#comment-17489433 ] Apache Spark commented on SPARK-38160: -- User 'WangGuangxin' has created a pull request for this issue: https://github.com/apache/spark/pull/35460 > Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed > happened > --- > > Key: SPARK-38160 > URL: https://issues.apache.org/jira/browse/SPARK-38160 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor > > When we do shuffle on indeterminate expressions such as rand, and > ShuffleFetchFailed happened, we may get incorrect results since it only retries > failed map tasks. > We try to fix this by retrying all upstream map tasks in this situation.
[jira] [Assigned] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened
[ https://issues.apache.org/jira/browse/SPARK-38160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38160: Assignee: Apache Spark > Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed > happened > --- > > Key: SPARK-38160 > URL: https://issues.apache.org/jira/browse/SPARK-38160 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > > When we do shuffle on indeterminate expressions such as rand, and > ShuffleFetchFailed happened, we may get incorrect results since it only retries > failed map tasks. > We try to fix this by retrying all upstream map tasks in this situation.
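The hazard described above can be illustrated outside Spark: a shuffle keyed on rand() assigns rows to reducers using randomness drawn inside each map task, so a retried map task that re-draws its randomness produces a different assignment than the original run, and stitching its output together with the surviving original outputs can duplicate some rows and drop others. A plain-Python sketch of the partitioning step (illustrative only, not Spark's shuffle code):

```python
import random

def map_partition(rows, num_reducers, seed):
    # Assign each row of one map task to a reduce partition by a random
    # draw, as a shuffle keyed on an indeterminate expression would.
    rng = random.Random(seed)
    out = {r: [] for r in range(num_reducers)}
    for row in rows:
        out[rng.randrange(num_reducers)].append(row)
    return out

rows = list(range(10))
first = map_partition(rows, 4, seed=1)
# Only a deterministic re-run (same seed) reproduces the original layout.
rerun_same = map_partition(rows, 4, seed=1)
# A retry that re-draws randomness (different seed) generally assigns rows
# differently, which is why mixing old and retried map outputs is unsafe.
rerun_other = map_partition(rows, 4, seed=2)
```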
[jira] [Created] (SPARK-38162) Remove distinct in aggregate if its child is empty
XiDuo You created SPARK-38162: - Summary: Remove distinct in aggregate if its child is empty Key: SPARK-38162 URL: https://issues.apache.org/jira/browse/SPARK-38162 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: XiDuo You We cannot propagate an empty relation through an aggregate if it does not contain a grouping expression. But for an aggregate that contains a distinct aggregate expression, we can remove the distinct if its child is empty.
[jira] [Assigned] (SPARK-37952) Add missing statements to ALTER TABLE document.
[ https://issues.apache.org/jira/browse/SPARK-37952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37952: --- Assignee: Yuto Akutsu > Add missing statements to ALTER TABLE document. > --- > > Key: SPARK-37952 > URL: https://issues.apache.org/jira/browse/SPARK-37952 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Yuto Akutsu >Assignee: Yuto Akutsu >Priority: Minor > > Add some missing statements to the ALTER TABLE documents (which are mainly > supported with v2 table).
[jira] [Resolved] (SPARK-37952) Add missing statements to ALTER TABLE document.
[ https://issues.apache.org/jira/browse/SPARK-37952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37952. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35239 [https://github.com/apache/spark/pull/35239] > Add missing statements to ALTER TABLE document. > --- > > Key: SPARK-37952 > URL: https://issues.apache.org/jira/browse/SPARK-37952 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Yuto Akutsu >Assignee: Yuto Akutsu >Priority: Minor > Fix For: 3.3.0 > > > Add some missing statements to the ALTER TABLE documents (which are mainly > supported with v2 table).
[jira] [Created] (SPARK-38161) when cleaning data, hope to split one dataframe or dataset into two dataframes
gaokui created SPARK-38161: -- Summary: when cleaning data, hope to split one dataframe or dataset into two dataframes Key: SPARK-38161 URL: https://issues.apache.org/jira/browse/SPARK-38161 Project: Spark Issue Type: New Feature Components: Block Manager Affects Versions: 3.2.1 Reporter: gaokui While cleaning data I hit the following scenario: one column needs to be checked against an empty-or-null condition, so today I write code similar to: df1 = dataframe.filter("column is null") df2 = dataframe.filter("column is not null") and then write df1 and df2 into HDFS parquet files. But when I have thousands of conditions, every job needs more stages. I hope a DataFrame can be filtered by one condition once (not twice) and produce two DataFrames.
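What is being requested amounts to a single-pass partition by one predicate, returning both the matching and non-matching halves at once instead of scanning the input twice. A plain-Python sketch of the idea (Spark's DataFrame API has no such operator; this is just the classic `partition` pattern):

```python
def partition_once(rows, predicate):
    # Split rows into (matching, non_matching) with a single scan,
    # instead of evaluating the predicate twice over the full input.
    matching, non_matching = [], []
    for row in rows:
        (matching if predicate(row) else non_matching).append(row)
    return matching, non_matching

rows = [{"col": None}, {"col": "a"}, {"col": None}, {"col": "b"}]
nulls, non_nulls = partition_once(rows, lambda r: r["col"] is None)
```

In Spark today the practical workaround is to cache/persist the parent DataFrame and apply the two complementary filters, so the source is read only once even though the predicate is evaluated twice.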
[jira] [Assigned] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38150: --- Assignee: angerszhu > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct
[jira] [Resolved] (SPARK-38150) Update comment of RelationConversions
[ https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38150. - Resolution: Fixed Issue resolved by pull request 35453 [https://github.com/apache/spark/pull/35453] > Update comment of RelationConversions > - > > Key: SPARK-38150 > URL: https://issues.apache.org/jira/browse/SPARK-38150 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0, 3.2.1 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current comment of RelationConversions is not correct
[jira] [Assigned] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
[ https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-38120: --- Assignee: Khalid Mammadov > HiveExternalCatalog.listPartitions is failing when partition column name is > upper case and dot in partition value > - > > Key: SPARK-38120 > URL: https://issues.apache.org/jira/browse/SPARK-38120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.1 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > HiveExternalCatalog.listPartitions method call is failing when a partition > column name is upper case and partition value contains dot. It's related to > this change > [https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23] > The test case in that PR does not produce the issue as partition column name > is lower case. > > Below is how to reproduce the issue: > scala> import org.apache.spark.sql.catalyst.TableIdentifier > import org.apache.spark.sql.catalyst.TableIdentifier > scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY > (partCol1 STRING, partCol2 STRING)") > scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = > 'i.j') VALUES (100, 'John')") > scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), > Some(Map("partCol2" -> "i.j"))).foreach(println) > java.util.NoSuchElementException: key not found: partcol2 > at scala.collection.immutable.Map$Map2.apply(Map.scala:227) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202) > at scala.collection.immutable.Map$Map1.forall(Map.scala:196) > at > 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:395) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:395) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251) > ... 47 elided
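The root cause is a case mismatch: the Hive metastore stores partition column names lower-cased (partcol2) while the user-supplied spec keeps the original case (partCol2), so a direct map lookup throws "key not found". The shape of the fix, sketched here in plain Python (a hypothetical helper, not Spark's actual code), is to compare spec keys against stored names case-insensitively:

```python
def is_partial_partition_spec(spec, partition):
    # True if every (column, value) in the requested partial spec matches
    # the stored partition, comparing column names case-insensitively.
    normalized = {k.lower(): v for k, v in partition.items()}
    return all(
        normalized.get(col.lower()) == value
        for col, value in spec.items()
    )

# Hive lower-cases partition column names when storing them.
stored = {"partcol1": "CA", "partcol2": "i.j"}
```

Note that the dot in the partition value ("i.j") is incidental here; the lookup fails on the column name's case either way.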
[jira] [Resolved] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value
[ https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38120. - Fix Version/s: 3.3.0 3.2.2 Resolution: Fixed Issue resolved by pull request 35409 [https://github.com/apache/spark/pull/35409] > HiveExternalCatalog.listPartitions is failing when partition column name is > upper case and dot in partition value > - > > Key: SPARK-38120 > URL: https://issues.apache.org/jira/browse/SPARK-38120 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.1.2, 3.2.1 >Reporter: Khalid Mammadov >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > HiveExternalCatalog.listPartitions method call is failing when a partition > column name is upper case and partition value contains dot. It's related to > this change > [https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23] > The test case in that PR does not produce the issue as partition column name > is lower case. > > Below is how to reproduce the issue: > scala> import org.apache.spark.sql.catalyst.TableIdentifier > import org.apache.spark.sql.catalyst.TableIdentifier > scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY > (partCol1 STRING, partCol2 STRING)") > scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = > 'i.j') VALUES (100, 'John')") > scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), > Some(Map("partCol2" -> "i.j"))).foreach(println) > java.util.NoSuchElementException: key not found: partcol2 > at scala.collection.immutable.Map$Map2.apply(Map.scala:227) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202) > at scala.collection.immutable.Map$Map1.forall(Map.scala:196) > at > 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312) > at > scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303) > at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297) > at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108) > at scala.collection.TraversableLike.filter(TraversableLike.scala:395) > at scala.collection.TraversableLike.filter$(TraversableLike.scala:395) > at scala.collection.AbstractTraversable.filter(Traversable.scala:108) > at > org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251) > ... 47 elided
[jira] [Created] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened
EdisonWang created SPARK-38160: -- Summary: Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened Key: SPARK-38160 URL: https://issues.apache.org/jira/browse/SPARK-38160 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: EdisonWang When we do shuffle on indeterminate expressions such as rand, and ShuffleFetchFailed happened, we may get incorrect results since it only retries failed map tasks. We try to fix this by retrying all upstream map tasks in this situation.
[jira] [Assigned] (SPARK-38159) Minor refactor of MetadataAttribute unapply method
[ https://issues.apache.org/jira/browse/SPARK-38159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38159: Assignee: (was: Apache Spark) > Minor refactor of MetadataAttribute unapply method > -- > > Key: SPARK-38159 > URL: https://issues.apache.org/jira/browse/SPARK-38159 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Minor >
[jira] [Assigned] (SPARK-38159) Minor refactor of MetadataAttribute unapply method
[ https://issues.apache.org/jira/browse/SPARK-38159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38159: Assignee: Apache Spark > Minor refactor of MetadataAttribute unapply method > -- > > Key: SPARK-38159 > URL: https://issues.apache.org/jira/browse/SPARK-38159 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-38159) Minor refactor of MetadataAttribute unapply method
[ https://issues.apache.org/jira/browse/SPARK-38159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489359#comment-17489359 ] Apache Spark commented on SPARK-38159: -- User 'Yaohua628' has created a pull request for this issue: https://github.com/apache/spark/pull/35459 > Minor refactor of MetadataAttribute unapply method > -- > > Key: SPARK-38159 > URL: https://issues.apache.org/jira/browse/SPARK-38159 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.1 >Reporter: Yaohua Zhao >Priority: Minor >
[jira] [Created] (SPARK-38159) Minor refactor of MetadataAttribute unapply method
Yaohua Zhao created SPARK-38159: --- Summary: Minor refactor of MetadataAttribute unapply method Key: SPARK-38159 URL: https://issues.apache.org/jira/browse/SPARK-38159 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.1 Reporter: Yaohua Zhao
[jira] [Resolved] (SPARK-37770) Performance improvements for ColumnVector `putByteArray`
[ https://issues.apache.org/jira/browse/SPARK-37770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yaohua Zhao resolved SPARK-37770. - Resolution: Fixed > Performance improvements for ColumnVector `putByteArray` > > > Key: SPARK-37770 > URL: https://issues.apache.org/jira/browse/SPARK-37770 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yaohua Zhao >Priority: Major >
[jira] [Resolved] (SPARK-38056) Structured streaming not working in history server when using LevelDB
[ https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-38056. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35356 [https://github.com/apache/spark/pull/35356] > Structured streaming not working in history server when using LevelDB > - > > Key: SPARK-38056 > URL: https://issues.apache.org/jira/browse/SPARK-38056 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Web UI >Affects Versions: 3.1.2, 3.2.0 >Reporter: wy >Assignee: wy >Priority: Major > Fix For: 3.3.0 > > Attachments: local-1643373518829 > > > In > [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68], > structured streaming support is added to history server. However this does > not work when spark.history.store.path is set to save app info using LevelDB. > This is because one of the keys of StreamingQueryData, runId, is UUID type, > which is not supported by LevelDB. When replaying event log file in history > server, StreamingQueryStatusListener will throw an exception when writing > info to the store, saying "java.lang.IllegalArgumentException: Type > java.util.UUID not allowed as key.". > Example event log is provided in attachments. When opening it in history > server with spark.history.store.path set to somewhere, no structured > streaming info is available.
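The constraint can be sketched in miniature: the LevelDB-backed store only accepts key types it can serialize, so indexing StreamingQueryData by a java.util.UUID fails, while indexing by the UUID's string form round-trips fine. A plain-Python toy model of that restriction (hypothetical names, not Spark's KVStore API):

```python
import uuid

class StringKeyedStore:
    # Minimal key-value store that only accepts string keys, mimicking
    # the LevelDB-backed store's restriction on key types.
    def __init__(self):
        self._data = {}

    def write(self, key, value):
        if not isinstance(key, str):
            raise TypeError(f"Type {type(key).__name__} not allowed as key.")
        self._data[key] = value

    def read(self, key):
        return self._data[key]

run_id = uuid.uuid4()
store = StringKeyedStore()
# Writing the UUID object directly would raise TypeError; writing its
# string form works, which is the shape of the fix for runId.
store.write(str(run_id), {"runId": str(run_id)})
```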
[jira] [Assigned] (SPARK-38056) Structured streaming not working in history server when using LevelDB
[ https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-38056: Assignee: wy > Structured streaming not working in history server when using LevelDB > - > > Key: SPARK-38056 > URL: https://issues.apache.org/jira/browse/SPARK-38056 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Web UI >Affects Versions: 3.1.2, 3.2.0 >Reporter: wy >Assignee: wy >Priority: Major > Attachments: local-1643373518829 > > > In > [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68], > structured streaming support is added to history server. However this does > not work when spark.history.store.path is set to save app info using LevelDB. > This is because one of the keys of StreamingQueryData, runId, is UUID type, > which is not supported by LevelDB. When replaying event log file in history > server, StreamingQueryStatusListener will throw an exception when writing > info to the store, saying "java.lang.IllegalArgumentException: Type > java.util.UUID not allowed as key.". > Example event log is provided in attachments. When opening it in history > server with spark.history.store.path set to somewhere, no structured > streaming info is available.