[jira] [Commented] (SPARK-31709) Proper base path for location when it is a relative path

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489528#comment-17489528
 ] 

Apache Spark commented on SPARK-31709:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/35462

> Proper base path for location when it is a relative path
> 
>
> Key: SPARK-31709
> URL: https://issues.apache.org/jira/browse/SPARK-31709
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.1.0
>
>
> Currently, the user home directory is used as the base path for the database 
> and table locations when their location is specified with a relative path, 
> e.g.
> {code:sql}
> > set spark.sql.warehouse.dir;
> spark.sql.warehouse.dir   
> file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/
> spark-sql> create database loctest location 'loctestdbdir';
> spark-sql> desc database loctest;
> Database Name loctest
> Comment
> Location  
> file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
> Owner kentyao
> spark-sql> create table loctest(id int) location 'loctestdbdir';
> spark-sql> desc formatted loctest;
> idint NULL
> # Detailed Table Information
> Database  default
> Table loctest
> Owner kentyao
> Created Time  Thu May 14 16:29:05 CST 2020
> Last Access   UNKNOWN
> Created BySpark 3.1.0-SNAPSHOT
> Type  EXTERNAL
> Provider  parquet
> Location  
> file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
> Serde Library org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
> InputFormat   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
> OutputFormat  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
> {code}
> The user home directory is not necessarily warehouse-related, cannot be changed at 
> runtime, and is shared by both databases and tables as the parent directory. 
> Meanwhile, we already use the table path as the parent directory for relative 
> partition locations.
> The config `spark.sql.warehouse.dir` represents the default location for managed 
> databases and tables. For databases, the case above does not follow its semantics. 
> For tables it does, but I suggest enriching its meaning so that it also serves as 
> the base path for external tables whose locations are given as relative paths.
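> For illustration, a hedged sketch of the proposed behaviour (paths hypothetical, 
> not taken from the linked pull request): a relative location would resolve under 
> {{spark.sql.warehouse.dir}} rather than under the user home directory.
> {code:java}
> // Hypothetical illustration of the proposal, run in spark-shell.
> spark.conf.get("spark.sql.warehouse.dir")
> // e.g. file:/path/to/spark-warehouse/
> spark.sql("CREATE DATABASE loctest LOCATION 'loctestdbdir'")
> spark.sql("DESC DATABASE loctest").show(truncate = false)
> // Location: file:/path/to/spark-warehouse/loctestdbdir
> //           (instead of file:/Users/<user>/loctestdbdir)
> {code}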



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36145) Remove Python 3.6 support in codebase and CI

2022-02-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36145.
--
  Assignee: Maciej Szymkiewicz
Resolution: Done

> Remove Python 3.6 support in codebase and CI
> 
>
> Key: SPARK-36145
> URL: https://issues.apache.org/jira/browse/SPARK-36145
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Assignee: Maciej Szymkiewicz
>Priority: Critical
>
> Python 3.6 was deprecated via SPARK-35938 in Apache Spark 3.2. We should 
> remove its support in Spark 3.3.
> This JIRA also targets the changes in CI and development, not only 
> user-facing changes.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition

2022-02-09 Thread Willi Raschkowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489523#comment-17489523
 ] 

Willi Raschkowski commented on SPARK-38166:
---

Attaching driver logs:  [^driver.log]

Notable lines are probably:
{code:java}
...
INFO  [2021-11-11T23:04:13.68737Z] org.apache.spark.scheduler.TaskSetManager: 
Task 1.1 in stage 6.0 (TID 60) failed, but the task will not be re-executed 
(either because the task failed with a shuffle data fetch failure, so the 
previous stage needs to be re-run, or because a different copy of the task has 
already succeeded).
INFO  [2021-11-11T23:04:13.687562Z] org.apache.spark.scheduler.DAGScheduler: 
Marking ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) as 
failed due to a fetch failure from ShuffleMapStage 5 (writeAndRead at 
CustomSaveDatasetCommand.scala:218)
INFO  [2021-11-11T23:04:13.688643Z] org.apache.spark.scheduler.DAGScheduler: 
ResultStage 6 (writeAndRead at CustomSaveDatasetCommand.scala:218) failed in 
1012.545 s due to org.apache.spark.shuffle.FetchFailedException: The relative 
remote executor(Id: 2), which maintains the block data to fetch is dead.
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:748)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:663)
...
Caused by: org.apache.spark.ExecutorDeadException: The relative remote 
executor(Id: 2), which maintains the block data to fetch is dead.
at 
org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:132)
at 
org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
...

INFO  [2021-11-11T23:04:13.690385Z] org.apache.spark.scheduler.DAGScheduler: 
Resubmitting ShuffleMapStage 5 (writeAndRead at 
CustomSaveDatasetCommand.scala:218) and ResultStage 6 (writeAndRead at 
CustomSaveDatasetCommand.scala:218) due to fetch failure
INFO  [2021-11-11T23:04:13.894248Z] org.apache.spark.scheduler.DAGScheduler: 
Resubmitting failed stages
...
{code}

> Duplicates after task failure in dropDuplicates and repartition
> ---
>
> Key: SPARK-38166
> URL: https://issues.apache.org/jira/browse/SPARK-38166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
> Environment: Cluster runs on K8s. AQE is enabled.
>Reporter: Willi Raschkowski
>Priority: Major
>  Labels: correctness
> Attachments: driver.log
>
>
> We're seeing duplicates after running the following:
> {code}
> def compute_shipments(shipments):
> shipments = shipments.dropDuplicates(["ship_trck_num"])
> shipments = shipments.repartition(4)
> return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines, but I haven't managed to 
> reproduce it outside of that pipeline. I'll attach driver logs and the 
> notionalized input data - maybe you have ideas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition

2022-02-09 Thread Willi Raschkowski (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Willi Raschkowski updated SPARK-38166:
--
Attachment: driver.log

> Duplicates after task failure in dropDuplicates and repartition
> ---
>
> Key: SPARK-38166
> URL: https://issues.apache.org/jira/browse/SPARK-38166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2
> Environment: Cluster runs on K8s. AQE is enabled.
>Reporter: Willi Raschkowski
>Priority: Major
>  Labels: correctness
> Attachments: driver.log
>
>
> We're seeing duplicates after running the following:
> {code}
> def compute_shipments(shipments):
> shipments = shipments.dropDuplicates(["ship_trck_num"])
> shipments = shipments.repartition(4)
> return shipments
> {code}
> and observing lost executors (OOMs) and task retries in the repartition stage.
> We're seeing this reliably in one of our pipelines, but I haven't managed to 
> reproduce it outside of that pipeline. I'll attach driver logs and the 
> notionalized input data - maybe you have ideas.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38139) ml.recommendation.ALS doctests failures

2022-02-09 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489521#comment-17489521
 ] 

zhengruifeng commented on SPARK-38139:
--

I think it is ok to adjust the tol in this case

> ml.recommendation.ALS doctests failures
> ---
>
> Key: SPARK-38139
> URL: https://issues.apache.org/jira/browse/SPARK-38139
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> In my dev setups, the ml.recommendation ALS doctest consistently converges to a 
> value lower than expected and fails with:
> {code:python}
> File "/path/to/spark/python/pyspark/ml/recommendation.py", line 322, in 
> __main__.ALS
> Failed example:
> predictions[0]
> Expected:
> Row(user=0, item=2, newPrediction=0.69291...)
> Got:
> Row(user=0, item=2, newPrediction=0.6929099559783936)
> {code}
> I can correct for that, but it creates some noise, so if anyone else 
> experiences this, we could drop a digit from the expected results:
> {code}
> diff --git a/python/pyspark/ml/recommendation.py 
> b/python/pyspark/ml/recommendation.py
> index f0628fb922..b8e2a6097d 100644
> --- a/python/pyspark/ml/recommendation.py
> +++ b/python/pyspark/ml/recommendation.py
> @@ -320,7 +320,7 @@ class ALS(JavaEstimator, _ALSParams, JavaMLWritable, 
> JavaMLReadable):
>  >>> test = spark.createDataFrame([(0, 2), (1, 0), (2, 0)], ["user", 
> "item"])
>  >>> predictions = sorted(model.transform(test).collect(), key=lambda r: 
> r[0])
>  >>> predictions[0]
> -Row(user=0, item=2, newPrediction=0.69291...)
> +Row(user=0, item=2, newPrediction=0.6929...)
>  >>> predictions[1]
>  Row(user=1, item=0, newPrediction=3.47356...)
>  >>> predictions[2]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results

2022-02-09 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489518#comment-17489518
 ] 

zhengruifeng commented on SPARK-34160:
--

You can get a sparse vector by calling {{vector.compressed}}.
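A minimal sketch for illustration, using the Scala API (this ticket is about PySpark, 
where the equivalent call may differ); {{compressed}} returns whichever of the dense 
or sparse representations is smaller:
{code:java}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer

// Same single-row example as in the ticket, built with the Scala API.
// Run in spark-shell (a SparkSession named `spark` is assumed).
val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.sparse(100, Seq((1, 1.0))))
)).toDF("v")

val mean = df.select(Summarizer.mean(df("v"))).head.getAs[Vector](0)

// `compressed` picks the smaller of the dense/sparse forms, so the mostly-zero
// mean comes back as a SparseVector again.
val sparseMean = mean.compressed
{code}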

> pyspark.ml.stat.Summarizer should allow sparse vector results
> -
>
> Key: SPARK-34160
> URL: https://issues.apache.org/jira/browse/SPARK-34160
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.1
>Reporter: Ophir Yoktan
>Priority: Major
>
> Currently, pyspark.ml.stat.Summarizer will return a dense vector, even if the 
> input is sparse.
> The Summarizer should either deduce the relevant type from the input, or add 
> a parameter that forces sparse output.
> Code to reproduce:
> {{import pyspark}}
> {{from pyspark.sql.functions import col}}
> {{from pyspark.ml.stat import Summarizer}}
> {{from pyspark.ml.linalg import SparseVector, DenseVector}}{{sc = 
> pyspark.SparkContext.getOrCreate()}}
> {{sql_context = pyspark.SQLContext(sc)}}{{df = sc.parallelize([ ( 
> SparseVector(100, \{1: 1.0}),)]).toDF(['v'])}}
> {{print(df.head())}}
> {{print(df.select(Summarizer.mean(col('v'))).head())}}
> Output:
> {{Row(v=SparseVector(100, \{1: 1.0})) }}
> {{Row(mean(v)=DenseVector([0.0, 1.0,}}
> {{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34160) pyspark.ml.stat.Summarizer should allow sparse vector results

2022-02-09 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-34160.
--
Resolution: Not A Problem

> pyspark.ml.stat.Summarizer should allow sparse vector results
> -
>
> Key: SPARK-34160
> URL: https://issues.apache.org/jira/browse/SPARK-34160
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.1
>Reporter: Ophir Yoktan
>Priority: Major
>
> Currently, pyspark.ml.stat.Summarizer will return a dense vector, even if the 
> input is sparse.
> The Summarizer should either deduce the relevant type from the input, or add 
> a parameter that forces sparse output.
> Code to reproduce:
> {{import pyspark}}
> {{from pyspark.sql.functions import col}}
> {{from pyspark.ml.stat import Summarizer}}
> {{from pyspark.ml.linalg import SparseVector, DenseVector}}{{sc = 
> pyspark.SparkContext.getOrCreate()}}
> {{sql_context = pyspark.SQLContext(sc)}}{{df = sc.parallelize([ ( 
> SparseVector(100, \{1: 1.0}),)]).toDF(['v'])}}
> {{print(df.head())}}
> {{print(df.select(Summarizer.mean(col('v'))).head())}}
> Output:
> {{Row(v=SparseVector(100, \{1: 1.0})) }}
> {{Row(mean(v)=DenseVector([0.0, 1.0,}}
> {{0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 
> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]))}}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34452) OneVsRest with GBTClassifier throws InternalCompilerException in 3.1.0

2022-02-09 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489517#comment-17489517
 ] 

zhengruifeng commented on SPARK-34452:
--

I cannot reproduce this issue in 3.1.2. Could you please provide a simple 
example?

 
{code:java}
scala> import org.apache.spark.ml.classification._
import org.apache.spark.ml.classification._

scala> val df = 
spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
22/02/09 20:38:21 WARN LibSVMFileFormat: 'numFeatures' option not specified, 
determining the number of features by going though the input. If you know the 
number in advance, please specify it via 'numFeatures' option to avoid the 
extra scan.
df: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val classifier = new GBTClassifier().setMaxIter(2)
classifier: org.apache.spark.ml.classification.GBTClassifier = gbtc_e9ae5159908e

scala> val ovr = new OneVsRest().setClassifier(classifier)
ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_466495ea9392

scala> val ovrm = ovr.fit(df)
ovrm: org.apache.spark.ml.classification.OneVsRestModel = OneVsRestModel: 
uid=oneVsRest_466495ea9392, classifier=gbtc_e9ae5159908e, numClasses=3, 
numFeatures=4

scala> ovrm.transform(df).show
22/02/09 20:38:27 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
22/02/09 20:38:27 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
+-+++--+
|label|            features|       rawPrediction|prediction|
+-+++--+
|  1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       1.0|
|  1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       1.0|
|  1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       1.0|
|  1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       1.0|
|  0.0|(4,[0,1,2,3],[0.1...|[1.04768116880884...|       0.0|
|  1.0|(4,[0,2,3],[-0.83...|[-1.0476811688088...|       1.0|
|  2.0|(4,[0,1,2,3],[-1|[-1.0476811688088...|       2.0|
|  2.0|(4,[0,1,2,3],[-1|[-1.0476811688088...|       2.0|
|  1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       1.0|
|  0.0|(4,[0,2,3],[0.611...|[1.04768116880884...|       0.0|
|  0.0|(4,[0,1,2,3],[0.2...|[1.04768116880884...|       0.0|
|  1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       1.0|
|  1.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       1.0|
|  2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       2.0|
|  2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       2.0|
|  2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       2.0|
|  1.0|(4,[0,2,3],[-0.94...|[-1.0476811688088...|       1.0|
|  2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       2.0|
|  0.0|(4,[0,1,2,3],[0.1...|[1.04768116880884...|       0.0|
|  2.0|(4,[0,1,2,3],[-0|[-1.0476811688088...|       2.0|
+-+++--+
only showing top 20 rows{code}

> OneVsRest with GBTClassifier throws InternalCompilerException in 3.1.0
> --
>
> Key: SPARK-34452
> URL: https://issues.apache.org/jira/browse/SPARK-34452
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Danijel Zečević
>Priority: Major
>
> It looks like a problem with user defined function from OneVsRestModel.
> Log:
> {code}
> 2021-02-17 13:24:17.517 ERROR 711498 --- [818.0 (TID 818)] 
> o.a.s.s.c.e.codegen.CodeGenerator: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) 
> ~[janino-3.0.8.jar:na]
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) 
> ~[janino-3.0.8.jar:na]
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
>  ~[janino-3.0.8.jar:na]
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
>  ~[janino-3.0.8.jar:na]
>   at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) 
> ~[janino-3.0.8.jar:na]
>   at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) 
> ~[janino-3.0.8.jar:na]
>   at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:82) 
> ~[commons-compiler-3.1.2.jar:na]
>   at 
> 

[jira] [Created] (SPARK-38166) Duplicates after task failure in dropDuplicates and repartition

2022-02-09 Thread Willi Raschkowski (Jira)
Willi Raschkowski created SPARK-38166:
-

 Summary: Duplicates after task failure in dropDuplicates and 
repartition
 Key: SPARK-38166
 URL: https://issues.apache.org/jira/browse/SPARK-38166
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.2
 Environment: Cluster runs on K8s. AQE is enabled.
Reporter: Willi Raschkowski


We're seeing duplicates after running the following:

{code}
def compute_shipments(shipments):
shipments = shipments.dropDuplicates(["ship_trck_num"])
shipments = shipments.repartition(4)
return shipments
{code}

and observing lost executors (OOMs) and task retries in the repartition stage.

We're seeing this reliably in one of our pipelines, but I haven't managed to 
reproduce it outside of that pipeline. I'll attach driver logs and the 
notionalized input data - maybe you have ideas.
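
For context, one mitigation that is sometimes suggested for this pattern (not 
confirmed as the fix for this issue) is to make the repartition deterministic by 
partitioning on a column instead of round-robin {{repartition(4)}}, so that a 
retried task places rows the same way as the original attempt. A minimal Scala 
sketch, reusing the column name from the snippet above:

{code}
import org.apache.spark.sql.functions.col

// Hypothetical sketch: dedupe, then hash-partition by the dedup key rather than
// round-robin, so re-executed tasks reproduce the same row-to-partition mapping.
val deduped = shipments.dropDuplicates("ship_trck_num")
val repartitioned = deduped.repartition(4, col("ship_trck_num"))
{code}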



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38165) private classes fail at runtime in scala 2.12.13+

2022-02-09 Thread Johnny Everson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Everson updated SPARK-38165:
---
Description: 
h2. reproduction steps
{code:java}
git clone g...@github.com:everson/spark-codegen-bug.git
sbt +test
{code}
h2. problem

Starting with Scala 2.12.13, Spark code (tried with the 3.1.x and 3.2.x versions) 
referring to case class members fails at runtime.

See the discussion on [https://github.com/scala/bug/issues/12533] for the exact 
internal details from the Scala contributors, but the gist is that starting with Scala 
2.12.13, inner class visibility rules changed via 
https://github.com/scala/scala/pull/9131, and it appears that Spark CodeGen 
assumes they are public.

In a complex project, the error looks like:
{code:java}
[error]
Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent 
failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): 
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, 
Column 8: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 63, Column 8: Private member cannot be accessed 
from type 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
[error] at 
org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
[error] at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
[error] at 
org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
[error] at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
[error] at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274)
[error] at 
org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147)
[error] at 
org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[error] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[error] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:131)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[error] at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[error] at 

[jira] [Updated] (SPARK-38165) private classes fail at runtime in scala 2.12.13+

2022-02-09 Thread Johnny Everson (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Johnny Everson updated SPARK-38165:
---
Description: 
h2. reproduction steps
{code:java}
git clone g...@github.com:everson/spark-codegen-bug.git
sbt +test
{code}
h2. problem

Starting with Scala 2.12.13, Spark code (tried with the 3.1.x and 3.2.x versions) 
referring to private members fails at runtime.

Also reported in [https://github.com/scala/bug/issues/12533]

In a complex project, the error looks like:
{code:java}
[error]
Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent 
failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): 
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, 
Column 8: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 63, Column 8: Private member cannot be accessed 
from type 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
[error] at 
org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
[error] at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
[error] at 
org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
[error] at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
[error] at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274)
[error] at 
org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147)
[error] at 
org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[error] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[error] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:131)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[error] at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
[error] at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] at 

[jira] [Assigned] (SPARK-38164) New SQL function: try_subtract and try_multiply

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38164:


Assignee: Apache Spark  (was: Gengliang Wang)

> New SQL function: try_subtract and try_multiply
> ---
>
> Key: SPARK-38164
> URL: https://issues.apache.org/jira/browse/SPARK-38164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38164) New SQL function: try_subtract and try_multiply

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489490#comment-17489490
 ] 

Apache Spark commented on SPARK-38164:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35461

> New SQL function: try_subtract and try_multiply
> ---
>
> Key: SPARK-38164
> URL: https://issues.apache.org/jira/browse/SPARK-38164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38164) New SQL function: try_subtract and try_multiply

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38164:


Assignee: Gengliang Wang  (was: Apache Spark)

> New SQL function: try_subtract and try_multiply
> ---
>
> Key: SPARK-38164
> URL: https://issues.apache.org/jira/browse/SPARK-38164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38165) private classes fail at runtime in scala 2.12.13+

2022-02-09 Thread Johnny Everson (Jira)
Johnny Everson created SPARK-38165:
--

 Summary: private classes fail at runtime in scala 2.12.13+
 Key: SPARK-38165
 URL: https://issues.apache.org/jira/browse/SPARK-38165
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
 Environment: Tested using JVM 8 and 11 on Scala versions 2.12.12 
(works), 2.12.13 to 2.12.15, and 2.13.7 to 2.13.8
Reporter: Johnny Everson


h2. reproduction steps

{code}
git clone g...@github.com:everson/spark-codegen-bug.git
sbt +test
{code}

h2. problem

Starting with Scala 2.12.13, Spark code referring to private members fails at 
runtime.

Also reported in https://github.com/scala/bug/issues/12533

In a complex project, the error looks like:

{code}
[error]
Success(SparkFailures(NonEmpty[Unknown(org.apache.spark.SparkException: Job 
aborted due to stage failure: Task 1 in stage 2.0 failed 1 times, most recent 
failure: Lost task 1.0 in stage 2.0 (TID 3) (192.168.0.80 executor driver): 
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 63, 
Column 8: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 63, Column 8: Private member cannot be accessed 
from type 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection".
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
[error] at 
org.sparkproject.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
[error] at 
org.sparkproject.guava.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
[error] at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.waitForValue(LocalCache.java:3620)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.waitForLoadingValue(LocalCache.java:2362)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2349)
[error] at 
org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
[error] at 
org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000)
[error] at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
[error] at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1351)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:205)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateSafeProjection$.create(GenerateSafeProjection.scala:39)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1277)
[error] at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1274)
[error] at 
org.apache.spark.sql.execution.ObjectOperator$.deserializeRowToObject(objects.scala:147)
[error] at 
org.apache.spark.sql.execution.AppendColumnsExec.$anonfun$doExecute$12(objects.scala:326)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
[error] at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
[error] at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
[error] at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
[error] at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
[error] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
[error] at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
[error] at org.apache.spark.scheduler.Task.run(Task.scala:131)
[error] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
[error] at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
[error] at 

[jira] [Created] (SPARK-38164) New SQL function: try_subtract and try_multiply

2022-02-09 Thread Gengliang Wang (Jira)
Gengliang Wang created SPARK-38164:
--

 Summary: New SQL function: try_subtract and try_multiply
 Key: SPARK-38164
 URL: https://issues.apache.org/jira/browse/SPARK-38164
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Gengliang Wang
Assignee: Gengliang Wang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Max Gekk (Jira)
Max Gekk created SPARK-38163:


 Summary: Preserve the error class of `AnalysisException` while 
constructing of function builder
 Key: SPARK-38163
 URL: https://issues.apache.org/jira/browse/SPARK-38163
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Max Gekk
Assignee: Max Gekk


When the cause exception is `AnalysisException` at
https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
 Spark loses info about the error class. Need to preserve the info.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38163) Preserve the error class of `AnalysisException` while constructing of function builder

2022-02-09 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489482#comment-17489482
 ] 

Max Gekk commented on SPARK-38163:
--

I am working on the fix.

> Preserve the error class of `AnalysisException` while constructing of 
> function builder
> --
>
> Key: SPARK-38163
> URL: https://issues.apache.org/jira/browse/SPARK-38163
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> When the cause exception is `AnalysisException` at
> https://github.com/apache/spark/blob/9c02dd4035c9412ca03e5a5f4721ee223953c004/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala#L132,
>  Spark loses info about the error class. Need to preserve the info.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38160:


Assignee: (was: Apache Spark)

> Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed 
> happend
> ---
>
> Key: SPARK-38160
> URL: https://issues.apache.org/jira/browse/SPARK-38160
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
>
> When we shuffle on indeterminate expressions such as rand and a 
> ShuffleFetchFailed happens, we may get incorrect results since Spark only retries 
> the failed map tasks.
> We try to fix this by retrying all upstream map tasks in this situation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489433#comment-17489433
 ] 

Apache Spark commented on SPARK-38160:
--

User 'WangGuangxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/35460

> Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed 
> happend
> ---
>
> Key: SPARK-38160
> URL: https://issues.apache.org/jira/browse/SPARK-38160
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
>
> When we shuffle on indeterminate expressions such as rand and a 
> ShuffleFetchFailed happens, we may get incorrect results since Spark only retries 
> the failed map tasks.
> We try to fix this by retrying all upstream map tasks in this situation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happend

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38160:


Assignee: Apache Spark

> Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed 
> happend
> ---
>
> Key: SPARK-38160
> URL: https://issues.apache.org/jira/browse/SPARK-38160
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> When we shuffle on indeterminate expressions such as rand and a 
> ShuffleFetchFailed happens, we may get incorrect results since Spark only retries 
> the failed map tasks.
> We try to fix this by retrying all upstream map tasks in this situation.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38162) Remove distinct in aggregate if its child is empty

2022-02-09 Thread XiDuo You (Jira)
XiDuo You created SPARK-38162:
-

 Summary: Remove distinct in aggregate if its child is empty
 Key: SPARK-38162
 URL: https://issues.apache.org/jira/browse/SPARK-38162
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


We cannot propagate an empty relation through an aggregate if it does not contain a 
grouping expression. But for an aggregate that contains a distinct aggregate 
expression, we can remove the distinct if its child is empty.
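
For illustration, a hedged sketch of the kind of plan this targets (names 
hypothetical, not taken from the ticket; a spark-shell SparkSession is assumed):

{code}
// The aggregate itself cannot be replaced by an empty relation: with no grouping
// expressions it still returns one row (count = 0). But because its child is an
// empty relation, the DISTINCT inside the aggregate expression adds nothing and
// can be removed.
val emptyChild = spark.range(10).where("false")
emptyChild.selectExpr("count(DISTINCT id)").show()   // a single row: 0
{code}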



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37952) Add missing statements to ALTER TABLE document.

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-37952:
---

Assignee: Yuto Akutsu

> Add missing statements to ALTER TABLE document.
> ---
>
> Key: SPARK-37952
> URL: https://issues.apache.org/jira/browse/SPARK-37952
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Yuto Akutsu
>Assignee: Yuto Akutsu
>Priority: Minor
>
> Add some missing statements to the ALTER TABLE documents (which are mainly 
> supported with v2 tables).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37952) Add missing statements to ALTER TABLE document.

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-37952.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35239
[https://github.com/apache/spark/pull/35239]

> Add missing statements to ALTER TABLE document.
> ---
>
> Key: SPARK-37952
> URL: https://issues.apache.org/jira/browse/SPARK-37952
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Yuto Akutsu
>Assignee: Yuto Akutsu
>Priority: Minor
> Fix For: 3.3.0
>
>
> Add some missing statements to the ALTER TABLE documents (which are mainly 
> supported with v2 tables).



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38161) when clean data hope to spilt one dataframe or dataset to two dataframe

2022-02-09 Thread gaokui (Jira)
gaokui created SPARK-38161:
--

 Summary: when clean data hope to spilt one dataframe or dataset  
to two dataframe
 Key: SPARK-38161
 URL: https://issues.apache.org/jira/browse/SPARK-38161
 Project: Spark
  Issue Type: New Feature
  Components: Block Manager
Affects Versions: 3.2.1
Reporter: gaokui


When I am doing data cleaning, I run into the following scenario.

One column needs to be checked against an empty-or-null condition, so right now I do 
something similar to the following:

df1 = dataframe.filter("column IS NULL")

df2 = dataframe.filter("column IS NOT NULL")

and then write df1 and df2 into HDFS parquet files.

But when I have thousands of conditions, every job needs more stages.

I hope a DataFrame can be filtered by one condition once rather than twice, and that 
this can produce two DataFrames (see the sketch below).
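
For what it's worth, a hedged sketch of the usual workaround today (column and output 
names hypothetical): cache the DataFrame once so the two filters do not recompute the 
source; alternatively, write once with partitionBy on a flag column.

{code}
import org.apache.spark.sql.functions.col

// Hypothetical workaround sketch: cache once, filter twice, write both branches.
val cached = dataframe.cache()
val nulls    = cached.filter(col("c").isNull)
val nonNulls = cached.filter(col("c").isNotNull)
nulls.write.parquet("hdfs:///out/nulls")
nonNulls.write.parquet("hdfs:///out/non_nulls")
cached.unpersist()
{code}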

 

 

      



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38150) Update comment of RelationConversions

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38150:
---

Assignee: angerszhu

> Update comment of RelationConversions
> -
>
> Key: SPARK-38150
> URL: https://issues.apache.org/jira/browse/SPARK-38150
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> The current comment of RelationConversions is not correct.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38150) Update comment of RelationConversions

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38150.
-
Resolution: Fixed

Issue resolved by pull request 35453
[https://github.com/apache/spark/pull/35453]

> Update comment of RelationConversions
> -
>
> Key: SPARK-38150
> URL: https://issues.apache.org/jira/browse/SPARK-38150
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0, 3.2.1
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> The current comment of RelationConversions is not correct.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38120:
---

Assignee: Khalid Mammadov

> HiveExternalCatalog.listPartitions is failing when partition column name is 
> upper case and dot in partition value
> -
>
> Key: SPARK-38120
> URL: https://issues.apache.org/jira/browse/SPARK-38120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1
>Reporter: Khalid Mammadov
>Assignee: Khalid Mammadov
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> The HiveExternalCatalog.listPartitions method call fails when a partition 
> column name is upper case and the partition value contains a dot. It's related to 
> this change: 
> [https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23]
> The test case in that PR does not reproduce the issue because the partition column 
> name is lower case.
>  
> Below is how to reproduce the issue:
> scala> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.catalyst.TableIdentifier
> scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY 
> (partCol1 STRING, partCol2 STRING)")
> scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 
> 'i.j') VALUES (100, 'John')")                               
> scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), 
> Some(Map("partCol2" -> "i.j"))).foreach(println)
> java.util.NoSuchElementException: key not found: partcol2
>   at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
>   at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
>   at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
>   ... 47 elided



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38120) HiveExternalCatalog.listPartitions is failing when partition column name is upper case and dot in partition value

2022-02-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38120.
-
Fix Version/s: 3.3.0
   3.2.2
   Resolution: Fixed

Issue resolved by pull request 35409
[https://github.com/apache/spark/pull/35409]

> HiveExternalCatalog.listPartitions is failing when partition column name is 
> upper case and dot in partition value
> -
>
> Key: SPARK-38120
> URL: https://issues.apache.org/jira/browse/SPARK-38120
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.1
>Reporter: Khalid Mammadov
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> The HiveExternalCatalog.listPartitions method call fails when a partition 
> column name is upper case and the partition value contains a dot. It's related to 
> this change: 
> [https://github.com/apache/spark/commit/f18b905f6cace7686ef169fda7de474079d0af23]
> The test case in that PR does not reproduce the issue because the partition column 
> name is lower case.
>  
> Below is how to reproduce the issue:
> scala> import org.apache.spark.sql.catalyst.TableIdentifier
> import org.apache.spark.sql.catalyst.TableIdentifier
> scala> spark.sql("CREATE TABLE customer(id INT, name STRING) PARTITIONED BY 
> (partCol1 STRING, partCol2 STRING)")
> scala> spark.sql("INSERT INTO customer PARTITION (partCol1 = 'CA', partCol2 = 
> 'i.j') VALUES (100, 'John')")                               
> scala> spark.sessionState.catalog.listPartitions(TableIdentifier("customer"), 
> Some(Map("partCol2" -> "i.j"))).foreach(println)
> java.util.NoSuchElementException: key not found: partcol2
>   at scala.collection.immutable.Map$Map2.apply(Map.scala:227)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1(ExternalCatalogUtils.scala:205)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$isPartialPartitionSpec$1$adapted(ExternalCatalogUtils.scala:202)
>   at scala.collection.immutable.Map$Map1.forall(Map.scala:196)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.isPartialPartitionSpec(ExternalCatalogUtils.scala:202)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$6$adapted(HiveExternalCatalog.scala:1312)
>   at 
> scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:304)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:303)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:297)
>   at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:395)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:395)
>   at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$listPartitions$1(HiveExternalCatalog.scala:1312)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClientWrappingException(HiveExternalCatalog.scala:114)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
>   at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1296)
>   at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitions(ExternalCatalogWithListener.scala:254)
>   at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitions(SessionCatalog.scala:1251)
>   ... 47 elided
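
For context, a minimal Scala sketch of a case-insensitive partial-spec match that would avoid the NoSuchElementException above. This is an illustration under assumptions, not the merged fix; the helper name and exact comparison are hypothetical.

{code:scala}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Sketch: the catalog restores partition specs with lower-cased column names
// (e.g. "partcol2"), while the caller's partial spec keeps the original case
// (e.g. "partCol2"). Wrapping the catalog side in a CaseInsensitiveMap makes
// the lookup tolerant of that mismatch.
def matchesPartialSpec(
    partialSpec: Map[String, String],
    partitionSpec: Map[String, String]): Boolean = {
  val spec = CaseInsensitiveMap(partitionSpec)
  partialSpec.forall { case (col, value) => spec.get(col).contains(value) }
}

// matchesPartialSpec(Map("partCol2" -> "i.j"), Map("partcol1" -> "CA", "partcol2" -> "i.j"))
// => true, where a case-sensitive Map.apply would have thrown as shown above.
{code}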



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happens

2022-02-09 Thread EdisonWang (Jira)
EdisonWang created SPARK-38160:
--

 Summary: Shuffle by rand could lead to incorrect answers when 
ShuffleFetchFailed happens
 Key: SPARK-38160
 URL: https://issues.apache.org/jira/browse/SPARK-38160
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: EdisonWang


When we shuffle on indeterminate expressions such as rand and a 
ShuffleFetchFailed happens, we may get incorrect results, since Spark only 
retries the failed map tasks.

We propose to fix this by retrying all upstream map tasks in this situation.
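
For illustration, a minimal spark-shell sketch of the pattern that triggers this. It assumes an interactive session where `spark` is available; the row count and partition count are arbitrary.

{code:scala}
import org.apache.spark.sql.functions.rand

// A shuffle keyed on rand() is indeterminate: if a fetch failure forces Spark
// to re-run only a subset of map tasks, the re-computed rand() values can send
// rows to different reducers than the first attempt did, so downstream results
// may contain duplicated or missing rows.
val df = spark.range(0L, 1000000L)
val shuffled = df.repartition(200, rand())
println(shuffled.count())
{code}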



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38159) Minor refactor of MetadataAttribute unapply method

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38159:


Assignee: (was: Apache Spark)

> Minor refactor of MetadataAttribute unapply method
> --
>
> Key: SPARK-38159
> URL: https://issues.apache.org/jira/browse/SPARK-38159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38159) Minor refactor of MetadataAttribute unapply method

2022-02-09 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38159:


Assignee: Apache Spark

> Minor refactor of MetadataAttribute unapply method
> --
>
> Key: SPARK-38159
> URL: https://issues.apache.org/jira/browse/SPARK-38159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38159) Minor refactor of MetadataAttribute unapply method

2022-02-09 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17489359#comment-17489359
 ] 

Apache Spark commented on SPARK-38159:
--

User 'Yaohua628' has created a pull request for this issue:
https://github.com/apache/spark/pull/35459

> Minor refactor of MetadataAttribute unapply method
> --
>
> Key: SPARK-38159
> URL: https://issues.apache.org/jira/browse/SPARK-38159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: Yaohua Zhao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38159) Minor refactor of MetadataAttribute unapply method

2022-02-09 Thread Yaohua Zhao (Jira)
Yaohua Zhao created SPARK-38159:
---

 Summary: Minor refactor of MetadataAttribute unapply method
 Key: SPARK-38159
 URL: https://issues.apache.org/jira/browse/SPARK-38159
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.1
Reporter: Yaohua Zhao






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37770) Performance improvements for ColumnVector `putByteArray`

2022-02-09 Thread Yaohua Zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yaohua Zhao resolved SPARK-37770.
-
Resolution: Fixed

> Performance improvements for ColumnVector `putByteArray`
> 
>
> Key: SPARK-37770
> URL: https://issues.apache.org/jira/browse/SPARK-37770
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yaohua Zhao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-09 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-38056.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35356
[https://github.com/apache/spark/pull/35356]

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Fix For: 3.3.0
>
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support was added to the history server. However, this 
> does not work when spark.history.store.path is set to save app info using 
> LevelDB.
> This is because one of the keys of StreamingQueryData, runId, is of UUID type, 
> which is not supported by LevelDB. When replaying an event log file in the 
> history server, StreamingQueryStatusListener throws an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> An example event log is provided in the attachments. When opening it in the 
> history server with spark.history.store.path set to a directory, no structured 
> streaming info is available.
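
For illustration, a minimal Scala sketch of the kind of entity shape that sidesteps the LevelDB key restriction, assuming the indexed key is stored as a String. The class and member names are hypothetical, not the change that was merged.

{code:scala}
import java.util.UUID
import org.apache.spark.util.kvstore.KVIndex

// Sketch: the LevelDB-backed KVStore only accepts primitive/String index
// types, so exposing the UUID through a String-typed @KVIndex getter avoids
// "Type java.util.UUID not allowed as key." at write time.
class StreamingQueryRecord(val name: String, runId: UUID) {
  @KVIndex
  def id: String = runId.toString
}
{code}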



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38056) Structured streaming not working in history server when using LevelDB

2022-02-09 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-38056:


Assignee: wy

> Structured streaming not working in history server when using LevelDB
> -
>
> Key: SPARK-38056
> URL: https://issues.apache.org/jira/browse/SPARK-38056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming, Web UI
>Affects Versions: 3.1.2, 3.2.0
>Reporter: wy
>Assignee: wy
>Priority: Major
> Attachments: local-1643373518829
>
>
> In 
> [SPARK-31953|https://github.com/apache/spark/commit/4f9667035886a67e6c9a4e8fad2efa390e87ca68],
>  structured streaming support was added to the history server. However, this 
> does not work when spark.history.store.path is set to save app info using 
> LevelDB.
> This is because one of the keys of StreamingQueryData, runId, is of UUID type, 
> which is not supported by LevelDB. When replaying an event log file in the 
> history server, StreamingQueryStatusListener throws an exception when writing 
> info to the store, saying "java.lang.IllegalArgumentException: Type 
> java.util.UUID not allowed as key.".
> An example event log is provided in the attachments. When opening it in the 
> history server with spark.history.store.path set to a directory, no structured 
> streaming info is available.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


