[jira] [Commented] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184931#comment-17184931
 ] 

Hyukjin Kwon commented on SPARK-32702:
--

[~gemelen], you can take the quickest path to upgrade the MiMa plugin, and 
file another JIRA to follow up, I believe.
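
For reference, a minimal sbt sketch of the two usual ways to handle the errors 
below with sbt-mima-plugin 0.7.x; the exact key and filter class names can 
differ between plugin versions, and the excluded class is purely illustrative:

{code:scala}
import com.typesafe.tools.mima.core._
import com.typesafe.tools.mima.plugin.MimaKeys._

// Projects without a previously published artifact (assembly, examples,
// tools, ...) can opt out of the report entirely:
mimaPreviousArtifacts := Set.empty

// Known and accepted breaks are silenced with explicit filters, the way
// Spark's project/MimaExcludes.scala does (hypothetical rule shown):
mimaBinaryIssueFilters ++= Seq(
  ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.SomeRemovedClass")
)
{code}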

> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
> Attachments: core.txt, graphx.txt, mllib-local.txt, mllib.txt, 
> sql.txt, streaming-kafka-0-10.txt, streaming.txt
>
>
> As a part of the upgrade to SBT 1.x it was found that MiMa (in the form of 
> the sbt-mima plugin) needs to be upgraded too.
> After changing the build file to apply the plugin changes between versions 
> 0.3.0 and 0.7.0, binary incompatibility errors are found during the 
> `mimaReportBinaryIssues` task run.
> In summary, they are of two types:
> unclear whether they should be checked or excluded:
> {code:java}
> [error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
> analyzing binary compatibility 
> [error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
> mimaPreviousArtifacts not set, not analyzing binary compatibility
> {code}
> and found compatibility errors
> {code:java}
> [error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 
> potential problems (filtered 2) 
> [error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
> (filtered 3) 
> [error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
> compatibility check against 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
> problems (filtered 1) 
> [error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
> (filtered 1451) 
> [error] (streaming / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 
> potential problems (filtered 4) 
> [error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
> (filtered 579) 
> [error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
> (filtered 1047)
> {code}
> I could not make a decision on my own in those cases, so help is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32636) AsyncEventQueue: Exception scala.Some cannot be cast to java.lang.String

2020-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184929#comment-17184929
 ] 

Hyukjin Kwon commented on SPARK-32636:
--

I think this is likely from a mismatched Jackson version. How did you 
reproduce this, [~abubakarj]?
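
If it helps to narrow it down, one quick check from spark-shell is to see 
which json4s/Jackson jars are actually loaded on the driver; a user jar 
shipping different versions than Spark's own is a common cause of this kind 
of ClassCastException (a sketch, not a diagnosis):

{code:scala}
// Print the jar each class was loaded from; if these point at a
// user-provided jar rather than Spark's bundled ones, versions are mixed.
println(classOf[org.json4s.JsonDSL]
  .getProtectionDomain.getCodeSource.getLocation)
println(classOf[com.fasterxml.jackson.databind.ObjectMapper]
  .getProtectionDomain.getCodeSource.getLocation)
{code}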

>  AsyncEventQueue: Exception scala.Some cannot be cast to java.lang.String
> -
>
> Key: SPARK-32636
> URL: https://issues.apache.org/jira/browse/SPARK-32636
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.0
>Reporter: Muhammad Abubakar
>Priority: Major
> Attachments: err.log
>
>
> Spark 3.0.0. Hadoop 3.2. Hive 2.3.7 (built-in)
>  
> A Java exception occurs when trying to run a memory-intensive job, although 
> enough resources are available on the machine. The actual exception does not 
> look like it was caused by a memory issue.
> {code:java}
> java.lang.ClassCastException: scala.Some cannot be cast to 
> java.lang.Stringjava.lang.ClassCastException: scala.Some cannot be cast to 
> java.lang.String at org.json4s.JsonDSL.pair2jvalue(JsonDSL.scala:82) at 
> org.json4s.JsonDSL.pair2jvalue$(JsonDSL.scala:82) at 
> org.json4s.JsonDSL$.pair2jvalue(JsonDSL.scala:64) at 
> org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:309) at 
> org.apache.spark.util.JsonProtocol$.taskStartToJson(JsonProtocol.scala:131) 
> at 
> org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:75) 
> at 
> org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:97)
>  at 
> org.apache.spark.scheduler.EventLoggingListener.onTaskStart(EventLoggingListener.scala:114)
>  at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:41)
>  at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115) at 
> org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99) at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at 
> scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at 
> org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:100)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:96)
>  at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319) at 
> org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:96)
> # A fatal error has been detected by the Java Runtime Environment:
> #  SIGSEGV (0xb) at pc=0x7f9d2ec0cc6b, pid=30234, tid=0x7f9174a9e700
> # JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 )
> # Problematic frame:
> # V  [libjvm.so+0x7c9c6b]
> 20/08/11 23:31:38 ERROR AsyncEventQueue: Listener AppStatusListener threw an 
> exception
> java.lang.ClassCastException: scala.Some cannot be cast to java.lang.String at 
> org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:665) 
> at org.apache.spark.status.LiveTask.doUpdate(LiveEntity.scala:209) at 
> org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51) at 
> org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1088)
>  at 
> org.apache.spark.status.AppStatusListener.liveUpdate(AppStatusListener.scala:1101)
>  at 
> org.apache.spark.status.AppStatusListener.onTaskStart(AppStatusListener.scala:512)
>  at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:41)
>  at 
> org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
>  at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115) at 
> org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99) at 
> org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:105)
>  at 
> org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:105)
>  at scala.runtime.java8.JFunction0$mcJ$sp.apply(JFunction0$mcJ$sp.java:23) at 
> scala.util.DynamicVariable.withValue(DynamicVariable.scala:62) at 
> 

[jira] [Commented] (SPARK-32673) Pyspark/cloudpickle.py - no module named 'wfdb'

2020-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184928#comment-17184928
 ] 

Hyukjin Kwon commented on SPARK-32673:
--

Yeah, it does look specific to Databricks. It would be best to use the 
Databricks support channel unless you are able to reproduce it in plain 
Apache Spark 3.0.

> Pyspark/cloudpickle.py - no module named 'wfdb'
> ---
>
> Key: SPARK-32673
> URL: https://issues.apache.org/jira/browse/SPARK-32673
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Sandy Su
>Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Running Spark in a Databricks notebook.
>  
> Ran into this issue when executing a cell:
> (1) Spark Jobs
> {code}
> SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 
> 4 times, most recent failure: Lost task 0.3 in stage 17.0 (TID 68, 
> 10.139.64.5, executor 0): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last): File 
> "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length return self.loads(obj) File 
> "/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
> pickle.loads(obj, encoding=encoding) File 
> "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
> __import__(name) ModuleNotFoundError: No module named 'wfdb' During handling 
> of the above exception, another exception occurred: Traceback (most recent 
> call last): File "/databricks/spark/python/pyspark/worker.py", line 644, in 
> main func, profiler, deserializer, serializer = read_udfs(pickleSer, infile, 
> eval_type) File "/databricks/spark/python/pyspark/worker.py", line 463, in 
> read_udfs udfs.append(read_single_udf(pickleSer, infile, eval_type, 
> runner_conf, udf_index=i)) File "/databricks/spark/python/pyspark/worker.py", 
> line 254, in read_single_udf f, return_type = read_command(pickleSer, infile) 
> File "/databricks/spark/python/pyspark/worker.py", line 74, in read_command 
> command = serializer._read_with_length(file) File 
> "/databricks/spark/python/pyspark/serializers.py", line 180, in 
> _read_with_length raise SerializationError("Caused by " + 
> traceback.format_exc()) pyspark.serializers.SerializationError: Caused by 
> Traceback (most recent call last): File 
> "/databricks/spark/python/pyspark/serializers.py", line 177, in 
> _read_with_length return self.loads(obj) File 
> "/databricks/spark/python/pyspark/serializers.py", line 466, in loads return 
> pickle.loads(obj, encoding=encoding) File 
> "/databricks/spark/python/pyspark/cloudpickle.py", line 1110, in subimport 
> __import__(name) ModuleNotFoundError: No module named 'wfdb'
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31936) Implement ScriptTransform in sql/core

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31936:


Assignee: angerszhu

> Implement ScriptTransform in sql/core
> -
>
> Key: SPARK-31936
> URL: https://issues.apache.org/jira/browse/SPARK-31936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> can implement a native ScriptTransformation in sql/core module to remove the 
> extra Hive dependency here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32692) Support INSERT OVERWRITE DIR cross cluster

2020-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184927#comment-17184927
 ] 

Hyukjin Kwon commented on SPARK-32692:
--

ping [~angerszhuuu]

> Support INSERT OVERWRITE DIR cross cluster
> --
>
> Key: SPARK-32692
> URL: https://issues.apache.org/jira/browse/SPARK-32692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32694) Pushdown cast to data sources

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32694.
--
Resolution: Duplicate

> Pushdown cast to data sources
> -
>
> Key: SPARK-32694
> URL: https://issues.apache.org/jira/browse/SPARK-32694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently we don't support pushing down cast to data source (see 
> [link|http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSql-Casting-of-Predicate-Literals-tp29956p30035.html]
>  for a discussion). For instance, in the following code snippet:
> {code}
> scala> case class Person(name: String, age: Short)
> scala> Seq(Person("John", 32), Person("David", 25), Person("Mike", 
> 18)).toDS().write.parquet("/tmp/person.parquet")
> scala> val personDS = spark.read.parquet("/tmp/person.parquet")
> scala> personDS.createOrReplaceTempView("person")
> scala> spark.sql("SELECT * FROM person where age < 30")
> {code}
> The predicate won't be pushed down to Parquet data source because in 
> {{DataSourceStrategy}}, {{PushableColumnBase}} only handles a few limited 
> cases such as {{Attribute}} and {{GetStructField}}. Potentially we can handle 
> {{Cast}} here as well.
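
For anyone reproducing the snippet above, one way to check whether a predicate 
actually reached the Parquet source is to look at the PushedFilters entry in 
the physical plan (a sketch; the exact plan text varies by Spark version):

{code:scala}
// Per the description, the implicit cast of the short column keeps the
// age predicate out of PushedFilters in the parquet scan node.
spark.sql("SELECT * FROM person WHERE age < 30").explain(true)
{code}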



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32697) Direct Date and timestamp format data insertion fails

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32697.
--
Resolution: Not A Problem

> Direct Date and timestamp format data insertion fails
> -
>
> Key: SPARK-32697
> URL: https://issues.apache.org/jira/browse/SPARK-32697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
>Reporter: Chetan Bhat
>Priority: Minor
>
> Date and timestamp format data is inserted directly, but the insertion fails 
> as shown below.
> spark-sql> create table test(no timestamp) stored as parquet;
> Time taken: 0.561 seconds
> spark-sql> insert into test select '1979-04-27 00:00:00';
> Error in query: Cannot write incompatible data to table '`default`.`test`':
> - Cannot safely cast 'no': string to timestamp;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32697) Direct Date and timestamp format data insertion fails

2020-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184925#comment-17184925
 ] 

Hyukjin Kwon commented on SPARK-32697:
--

You can explicitly cast:

{code}
create table test(no timestamp) stored as parquet;
insert into test select cast('1979-04-27 00:00:00' as timestamp);
{code}
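
The strict check comes from the store assignment policy, which defaults to 
ANSI in 3.0; relaxing it also lets the original INSERT go through, though the 
explicit cast above is the safer fix. A minimal sketch from spark-shell, 
assuming the table from the report:

{code:scala}
// LEGACY restores the pre-3.0 behaviour and disables the ANSI store
// assignment check that rejects the implicit string-to-timestamp cast.
spark.sql("SET spark.sql.storeAssignmentPolicy=LEGACY")
spark.sql("INSERT INTO test SELECT '1979-04-27 00:00:00'")
{code}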


> Direct Date and timestamp format data insertion fails
> -
>
> Key: SPARK-32697
> URL: https://issues.apache.org/jira/browse/SPARK-32697
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
>Reporter: Chetan Bhat
>Priority: Minor
>
> Date and timestamp format data is inserted directly, but the insertion fails 
> as shown below.
> spark-sql> create table test(no timestamp) stored as parquet;
> Time taken: 0.561 seconds
> spark-sql> insert into test select '1979-04-27 00:00:00';
> Error in query: Cannot write incompatible data to table '`default`.`test`':
> - Cannot safely cast 'no': string to timestamp;



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32699) Add percentage of missingness to df.summary()

2020-08-25 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184924#comment-17184924
 ] 

Hyukjin Kwon commented on SPARK-32699:
--

+1 for Sean's comment. Let's not add every detail to it. It's just a summary.
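
If someone does need it, the ratio is easy to compute with the existing API; 
a minimal sketch (the sample DataFrame and column names are made up for 
illustration):

{code:scala}
import org.apache.spark.sql.functions._

// Percentage of non-null values per column, computed without summary().
val df = spark.range(10).selectExpr("id", "IF(id % 3 = 0, NULL, id) AS v")
val nonNullPct = df.select(df.columns.map(c =>
  (count(col(c)) * 100.0 / count(lit(1))).alias(c)): _*)
nonNullPct.show()
{code}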

> Add percentage of missingness to df.summary()
> -
>
> Key: SPARK-32699
> URL: https://issues.apache.org/jira/browse/SPARK-32699
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Chengyin Eng
>Priority: Minor
>
>  
> df.summary() returns counts of non-nulls for each column. It would be really 
> helpful to also have the percentage of non-nulls, since the percentage of 
> missingness is often an indicator for data scientists to discard columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32699) Add percentage of missingness to df.summary()

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32699.
--
Resolution: Won't Fix

> Add percentage of missingness to df.summary()
> -
>
> Key: SPARK-32699
> URL: https://issues.apache.org/jira/browse/SPARK-32699
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Chengyin Eng
>Priority: Minor
>
>  
> df.summary() returns counts of non-nulls for each column. It would be really 
> helpful to also have the percentage of non-nulls, since the percentage of 
> missingness is often an indicator for data scientists to discard columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32695) Add 'build' and 'project/build.properties' into cache key of SBT and Zinc

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32695.
--
Fix Version/s: 3.1.0
   2.4.7
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29536
[https://github.com/apache/spark/pull/29536]

> Add 'build' and 'project/build.properties' into cache key of SBT and Zinc
> -
>
> Key: SPARK-32695
> URL: https://issues.apache.org/jira/browse/SPARK-32695
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.1, 2.4.7, 3.1.0
>
>
> The version information is included in {{build}} and 
> {{project/build.properties}}. We had better not cache them in this case. 
> It looks like caching can cause a build failure, given 
> https://github.com/apache/spark/pull/29286#issuecomment-679368436



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32695) Add 'build' and 'project/build.properties' into cache key of SBT and Zinc

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32695:


Assignee: Hyukjin Kwon

> Add 'build' and 'project/build.properties' into cache key of SBT and Zinc
> -
>
> Key: SPARK-32695
> URL: https://issues.apache.org/jira/browse/SPARK-32695
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> The version information is included in {{build}} and 
> {{project/build.properties}}. We had better not cache them in this case. 
> It looks like caching can cause a build failure, given 
> https://github.com/apache/spark/pull/29286#issuecomment-679368436



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32182) Getting Started - Quickstart

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32182.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29491
[https://github.com/apache/spark/pull/29491]

> Getting Started - Quickstart
> 
>
> Key: SPARK-32182
> URL: https://issues.apache.org/jira/browse/SPARK-32182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Example:
> https://koalas.readthedocs.io/en/latest/getting_started/10min.html
> https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
> https://pandas.pydata.org/docs/getting_started/10min.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32182) Getting Started - Quickstart

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32182:


Assignee: Hyukjin Kwon

> Getting Started - Quickstart
> 
>
> Key: SPARK-32182
> URL: https://issues.apache.org/jira/browse/SPARK-32182
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Example:
> https://koalas.readthedocs.io/en/latest/getting_started/10min.html
> https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html
> https://pandas.pydata.org/docs/getting_started/10min.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32204) Binder Integration

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32204:


Assignee: Hyukjin Kwon

> Binder Integration
> --
>
> Key: SPARK-32204
> URL: https://issues.apache.org/jira/browse/SPARK-32204
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> For example,
> https://github.com/databricks/koalas
> https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32204) Binder Integration

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32204.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29491
[https://github.com/apache/spark/pull/29491]

> Binder Integration
> --
>
> Key: SPARK-32204
> URL: https://issues.apache.org/jira/browse/SPARK-32204
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> For example,
> https://github.com/databricks/koalas
> https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32700) select from table TABLESAMPLE gives wrong resultset.

2020-08-25 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32700.
--
Resolution: Invalid

> select from table TABLESAMPLE gives wrong resultset.
> 
>
> Key: SPARK-32700
> URL: https://issues.apache.org/jira/browse/SPARK-32700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
>Reporter: Chetan Bhat
>Priority: Minor
>
> create table test(id int,name string) stored as parquet;
> insert into test values 
> (5,'Alex'),(8,'Lucy'),(2,'Mary'),(4,'Fred'),(1,'Lisa'),(9,'Eric'),(10,'Adam'),(6,'Mark'),(7,'Lily'),(3,'Evan');
> SELECT * FROM test TABLESAMPLE (50 PERCENT); --> the output gives only 3 rows.
> spark-sql> SELECT * FROM test TABLESAMPLE (50 PERCENT);
> 5 Alex
> 10 Adam
> 4 Fred
>  
> The expected result, as per the linked documentation, is 5 rows 
> -->[https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html]
>  
> Also, the BUCKET parameter for SELECT FROM table TABLESAMPLE gives a wrong 
> result set.
> spark-sql> SELECT * FROM test TABLESAMPLE (BUCKET 4 OUT OF 10);
> 5 Alex
> 8 Lucy
> 9 Eric
> 1 Lisa
> 3 Evan
> The expected result is 4 records.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32700) select from table TABLESAMPLE gives wrong resultset.

2020-08-25 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184883#comment-17184883
 ] 

Takeshi Yamamuro commented on SPARK-32700:
--

+1 on Sean's comment. I'll close this.
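
For reference, TABLESAMPLE (x PERCENT) is a probabilistic (Bernoulli) sample, 
so the row count naturally varies between runs; when an exact number of rows 
is needed, the ROWS form of the same syntax is deterministic. A minimal sketch 
against the table from the report:

{code:scala}
// Returns exactly 5 of the 10 rows, unlike the PERCENT form.
spark.sql("SELECT * FROM test TABLESAMPLE (5 ROWS)").show()
{code}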

> select from table TABLESAMPLE gives wrong resultset.
> 
>
> Key: SPARK-32700
> URL: https://issues.apache.org/jira/browse/SPARK-32700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
>Reporter: Chetan Bhat
>Priority: Minor
>
> create table test(id int,name string) stored as parquet;
> insert into test values 
> (5,'Alex'),(8,'Lucy'),(2,'Mary'),(4,'Fred'),(1,'Lisa'),(9,'Eric'),(10,'Adam'),(6,'Mark'),(7,'Lily'),(3,'Evan');
> SELECT * FROM test TABLESAMPLE (50 PERCENT); --> the output gives only 3 rows.
> spark-sql> SELECT * FROM test TABLESAMPLE (50 PERCENT);
> 5 Alex
> 10 Adam
> 4 Fred
>  
> The expected result, as per the linked documentation, is 5 rows 
> -->[https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html]
>  
> Also, the BUCKET parameter for SELECT FROM table TABLESAMPLE gives a wrong 
> result set.
> spark-sql> SELECT * FROM test TABLESAMPLE (BUCKET 4 OUT OF 10);
> 5 Alex
> 8 Lucy
> 9 Eric
> 1 Lisa
> 3 Evan
> The expected result is 4 records.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32466) Add support to catch SparkPlan regression base on TPC-DS queries

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184856#comment-17184856
 ] 

Apache Spark commented on SPARK-32466:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/29546

> Add support to catch SparkPlan regression base on TPC-DS queries
> 
>
> Key: SPARK-32466
> URL: https://issues.apache.org/jira/browse/SPARK-32466
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> Nowadays, Spark is getting more and more complex. Any change might cause a 
> regression unintentionally. Spark already has some benchmarks to catch 
> performance regressions, but it doesn't yet have a way to detect regressions 
> inside SparkPlan. It would be good if we could find possible regressions 
> early, during the compile phase, before the runtime phase.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32620) Reset the numPartitions metric when DPP is enabled

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-32620:
---

Assignee: Yuming Wang

> Reset the numPartitions metric when DPP is enabled
> --
>
> Key: SPARK-32620
> URL: https://issues.apache.org/jira/browse/SPARK-32620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> This PR resets the {{numPartitions}} metric when DPP is enabled. Otherwise, it 
> is always a [static 
> value|https://github.com/apache/spark/blob/18cac6a9f0bf4a6d449393f1ee84004623b3c893/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L215].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32620) Reset the numPartitions metric when DPP is enabled

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-32620.
-
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29436
[https://github.com/apache/spark/pull/29436]

> Reset the numPartitions metric when DPP is enabled
> --
>
> Key: SPARK-32620
> URL: https://issues.apache.org/jira/browse/SPARK-32620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> This PR resets the {{numPartitions}} metric when DPP is enabled. Otherwise, it 
> is always a [static 
> value|https://github.com/apache/spark/blob/18cac6a9f0bf4a6d449393f1ee84004623b3c893/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L215].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Description: 
DPP has a data issue when pruning on non-atomic types. For example:
{noformat}
 spark.range(1000)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df1");

spark.range(100)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df2")

spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
struct(df2.k) AND df2.id < 2").show
{noformat}


 It should return two records, but it returns empty.

  was:
DPP has a data issue when pruning on non-atomic types. For example:
{noformat}
 spark.range(1000)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df1");

spark.range(100)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df2")

spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
struct(df2.k) AND df2.id < 2").show
{noformat}


 It should return two records, but it returns null.


> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
>
> DPP has a data issue when pruning on non-atomic types. For example:
> {noformat}
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
> spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
> {noformat}
>  It should return two records, but it returns empty.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Labels: correctness  (was: )

> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>  Labels: correctness
>
> DPP has a data issue when pruning on non-atomic types. For example:
> {noformat}
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
> spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
> {noformat}
>  It should return two records, but it returns null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Description: 
DPP has a data issue when pruning on non-atomic types. For example:
{noformat}
 spark.range(1000)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df1");

spark.range(100)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df2")

spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
struct(df2.k) AND df2.id < 2").show
{noformat}


 It should return two records, but it returns null.

  was:
DPP has a data issue when pruning on non-atomic types. For example:
{noformat}
 spark.range(1000)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df1");

spark.range(100)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df2")

spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
 spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
 spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
struct(df2.k) AND df2.id < 2").show
{noformat}


 It should return two records, but it returns null.


> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> DPP has a data issue when pruning on non-atomic types. For example:
> {noformat}
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
> spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
> {noformat}
>  It should return two records, but it returns null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Description: 
DPP has a data issue when pruning on non-atomic types. For example:
{noformat}
 spark.range(1000)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df1");

spark.range(100)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df2")

spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
 spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
 spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
struct(df2.k) AND df2.id < 2").show
{noformat}


 It should return two records, but it returns null.

  was:
Fix a data issue when adding DPP on non-atomic types. For example:
 ```scala
 spark.range(1000)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df1");

spark.range(100)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df2")

spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
 spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
 spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
struct(df2.k) AND df2.id < 2").show
 ```
 It should return two records, but it returns null.


> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> DPP has a data issue when pruning on non-atomic types. For example:
> {noformat}
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
>  spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
>  spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
> {noformat}
>  It should return two records, but it returns null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Description: 
Fix a data issue when adding DPP on non-atomic types. For example:
 ```scala
 spark.range(1000)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df1");

spark.range(100)
 .select(col("id"), col("id").as("k"))
 .write
 .partitionBy("k")
 .format("parquet")
 .mode("overwrite")
 .saveAsTable("df2")

spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
 spark.sql("set 
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
 spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
struct(df2.k) AND df2.id < 2").show
 ```
 It should return two records, but it returns null.

  was:
{{Set}}'s contains function has better performance:
https://www.baeldung.com/java-hashset-arraylist-contains-performance


> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Fix a data issue when adding DPP on non-atomic types. For example:
>  ```scala
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
>  spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
>  spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
>  ```
>  It should return two records, but it returns null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32704) Logging plan changes for execution

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32704:


Assignee: (was: Apache Spark)

> Logging plan changes for execution
> --
>
> Key: SPARK-32704
> URL: https://issues.apache.org/jira/browse/SPARK-32704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> Since we only log plan changes for analyzer/optimizer now, this ticket 
> targets adding code to log plan changes in the preparation phase in 
> QueryExecution for execution.
> {code}
> scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
> scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
> ...
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)
>   +- *(1) Range (0, 10, step=1, splits=4)
>  
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Result of Batch Preparations ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32704) Logging plan changes for execution

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184824#comment-17184824
 ] 

Apache Spark commented on SPARK-32704:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29544

> Logging plan changes for execution
> --
>
> Key: SPARK-32704
> URL: https://issues.apache.org/jira/browse/SPARK-32704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> Since we only log plan changes for analyzer/optimizer now, this ticket 
> targets adding code to log plan changes in the preparation phase in 
> QueryExecution for execution.
> {code}
> scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
> scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
> ...
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)
>   +- *(1) Range (0, 10, step=1, splits=4)
>  
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Result of Batch Preparations ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32704) Logging plan changes for execution

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184823#comment-17184823
 ] 

Apache Spark commented on SPARK-32704:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/29544

> Logging plan changes for execution
> --
>
> Key: SPARK-32704
> URL: https://issues.apache.org/jira/browse/SPARK-32704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> Since we only log plan changes for analyzer/optimizer now, this ticket 
> targets adding code to log plan changes in the preparation phase in 
> QueryExecution for execution.
> {code}
> scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
> scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
> ...
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)
>   +- *(1) Range (0, 10, step=1, splits=4)
>  
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Result of Batch Preparations ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32704) Logging plan changes for execution

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32704:


Assignee: Apache Spark

> Logging plan changes for execution
> --
>
> Key: SPARK-32704
> URL: https://issues.apache.org/jira/browse/SPARK-32704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> Since we only log plan changes for analyzer/optimizer now, this ticket 
> targets adding code to log plan changes in the preparation phase in 
> QueryExecution for execution.
> {code}
> scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
> scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
> ...
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)
>   +- *(1) Range (0, 10, step=1, splits=4)
>  
> 20/08/26 09:32:36 WARN PlanChangeLogger: 
> === Result of Batch Preparations ===
> !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, 
> count#23L])  *(1) HashAggregate(keys=[id#19L], 
> functions=[count(1)], output=[id#19L, count#23L])
> !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], 
> output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
> functions=[partial_count(1)], output=[id#19L, count#27L])
> !   +- Range (0, 10, step=1, splits=4)  
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32704) Logging plan changes for execution

2020-08-25 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-32704:


 Summary: Logging plan changes for execution
 Key: SPARK-32704
 URL: https://issues.apache.org/jira/browse/SPARK-32704
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


Since we only log plan changes for analyzer/optimizer now, this ticket targets 
adding code to log plan changes in the preparation phase in QueryExecution for 
execution.

{code}
scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
...
20/08/26 09:32:36 WARN PlanChangeLogger: 
=== Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) 
 *(1) HashAggregate(keys=[id#19L], functions=[count(1)], 
output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, 
count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)  
+- *(1) Range (0, 10, step=1, splits=4)
 
20/08/26 09:32:36 WARN PlanChangeLogger: 
=== Result of Batch Preparations ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) 
 *(1) HashAggregate(keys=[id#19L], functions=[count(1)], 
output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, 
count#27L])   +- *(1) HashAggregate(keys=[id#19L], 
functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)  
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32516) path option is treated differently for 'format("parquet").load(path)' vs. 'parquet(path)'

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184772#comment-17184772
 ] 

Apache Spark commented on SPARK-32516:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/29543

> path option is treated differently for 'format("parquet").load(path)' vs. 
> 'parquet(path)'
> -
>
> Key: SPARK-32516
> URL: https://issues.apache.org/jira/browse/SPARK-32516
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Minor
> Fix For: 3.1.0
>
>
> When data is read, "path" option is treated differently depending on how 
> dataframe is created:
> {code:java}
> scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")
>   
>   
> scala> spark.read.option("path", 
> "/tmp/test").format("parquet").load("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> +-+
> scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
> +-+
> |value|
> +-+
> |1|
> |1|
> +-+
> {code}
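
As a side note, a minimal sketch of an unambiguous read while this is being fixed (assuming the same /tmp/test path as above) is to pass the path exactly once, either through the reader method or through the "path" option, but not both:

{code:scala}
// Minimal sketch: pass the path exactly once so it cannot be counted twice.
val viaMethod = spark.read.parquet("/tmp/test")
val viaOption = spark.read.format("parquet").option("path", "/tmp/test").load()
{code}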



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32703:


Assignee: (was: Apache Spark)

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32703:


Assignee: Apache Spark

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184755#comment-17184755
 ] 

Apache Spark commented on SPARK-32703:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29542

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32701) mapreduce.fileoutputcommitter.algorithm.version default depends on runtime environment

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184730#comment-17184730
 ] 

Apache Spark commented on SPARK-32701:
--

User 'waleedfateem' has created a pull request for this issue:
https://github.com/apache/spark/pull/29541

> mapreduce.fileoutputcommitter.algorithm.version default depends on runtime 
> environment
> --
>
> Key: SPARK-32701
> URL: https://issues.apache.org/jira/browse/SPARK-32701
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Waleed Fateem
>Priority: Major
>
> When someone reads the documentation in its current state, the assumption is 
> that the default value of 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 and that's 
> not entirely accurate. 
> Spark doesn't explicitly set this configuration and instead is inherited from 
> Hadoop's FileOutputCommitter class. The default value is 1 until Hadoop 3.0 
> where this changed to 2.
> I'm proposing that we clarify that this value's default will depend on the 
> Hadoop version in a user's runtime environment, where:
> 1 for < Hadoop 3.0
> 2 for >= Hadoop 3.0
> There are also plans to revert this default again to v1 so might also be 
> useful to reference this JIRA:
> https://issues.apache.org/jira/browse/MAPREDUCE-7282
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32701) mapreduce.fileoutputcommitter.algorithm.version default depends on runtime environment

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32701:


Assignee: Apache Spark

> mapreduce.fileoutputcommitter.algorithm.version default depends on runtime 
> environment
> --
>
> Key: SPARK-32701
> URL: https://issues.apache.org/jira/browse/SPARK-32701
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Waleed Fateem
>Assignee: Apache Spark
>Priority: Major
>
> When someone reads the documentation in its current state, the assumption is 
> that the default value of 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 and that's 
> not entirely accurate. 
> Spark doesn't explicitly set this configuration and instead is inherited from 
> Hadoop's FileOutputCommitter class. The default value is 1 until Hadoop 3.0 
> where this changed to 2.
> I'm proposing that we clarify that this value's default will depend on the 
> Hadoop version in a user's runtime environment, where:
> 1 for < Hadoop 3.0
> 2 for >= Hadoop 3.0
> There are also plans to revert this default again to v1 so might also be 
> useful to reference this JIRA:
> https://issues.apache.org/jira/browse/MAPREDUCE-7282
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32701) mapreduce.fileoutputcommitter.algorithm.version default depends on runtime environment

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184729#comment-17184729
 ] 

Apache Spark commented on SPARK-32701:
--

User 'waleedfateem' has created a pull request for this issue:
https://github.com/apache/spark/pull/29541

> mapreduce.fileoutputcommitter.algorithm.version default depends on runtime 
> environment
> --
>
> Key: SPARK-32701
> URL: https://issues.apache.org/jira/browse/SPARK-32701
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Waleed Fateem
>Priority: Major
>
> When someone reads the documentation in its current state, the assumption is 
> that the default value of 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 and that's 
> not entirely accurate. 
> Spark doesn't explicitly set this configuration and instead is inherited from 
> Hadoop's FileOutputCommitter class. The default value is 1 until Hadoop 3.0 
> where this changed to 2.
> I'm proposing that we clarify that this value's default will depend on the 
> Hadoop version in a user's runtime environment, where:
> 1 for < Hadoop 3.0
> 2 for >= Hadoop 3.0
> There are also plans to revert this default again to v1 so might also be 
> useful to reference this JIRA:
> https://issues.apache.org/jira/browse/MAPREDUCE-7282
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32701) mapreduce.fileoutputcommitter.algorithm.version default depends on runtime environment

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32701:


Assignee: (was: Apache Spark)

> mapreduce.fileoutputcommitter.algorithm.version default depends on runtime 
> environment
> --
>
> Key: SPARK-32701
> URL: https://issues.apache.org/jira/browse/SPARK-32701
> Project: Spark
>  Issue Type: Bug
>  Components: docs, Documentation
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Waleed Fateem
>Priority: Major
>
> When someone reads the documentation in its current state, the assumption is 
> that the default value of 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 and that's 
> not entirely accurate. 
> Spark doesn't explicitly set this configuration and instead is inherited from 
> Hadoop's FileOutputCommitter class. The default value is 1 until Hadoop 3.0 
> where this changed to 2.
> I'm proposing that we clarify that this value's default will depend on the 
> Hadoop version in a user's runtime environment, where:
> 1 for < Hadoop 3.0
> 2 for >= Hadoop 3.0
> There are also plans to revert this default again to v1 so might also be 
> useful to reference this JIRA:
> https://issues.apache.org/jira/browse/MAPREDUCE-7282
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Denis Pyshev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184724#comment-17184724
 ] 

Denis Pyshev commented on SPARK-32702:
--

This is definitely better (checked against 3.0.0):
{code:java}
[error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
mimaPreviousArtifacts not set, not analyzing binary compatibility 
[error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility
 
[error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-graphx_2.12:3.0.0! Found 1 potential problems 
(filtered 1) 
[error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
check against org.apache.spark:spark-mllib-local_2.12:3.0.0! Found 2 potential 
problems (filtered 1) 
[error] (streaming / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-streaming_2.12:3.0.0! Found 3 potential problems 
(filtered 1) 
[error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-sql_2.12:3.0.0! Found 11 potential problems 
(filtered 341) 
[error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-mllib_2.12:3.0.0! Found 81 potential problems 
(filtered 496) 
[error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-core_2.12:3.0.0! Found 146 potential problems 
(filtered 846)
{code}

> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
> Attachments: core.txt, graphx.txt, mllib-local.txt, mllib.txt, 
> sql.txt, streaming-kafka-0-10.txt, streaming.txt
>
>
> As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
> plugin) needs to be upgraded too.
> After changes in build file to apply plugin changes between versions 0.3.0 
> and 0.7.0 binary incompatiblity errors are found during 
> `mimaReportBinaryIssues` task run.
> In summary there are of two types:
> unclear, if should be checked or excluded
> {code:java}
> [error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
> analyzing binary compatibility 
> [error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
> mimaPreviousArtifacts not set, not analyzing binary compatibility
> {code}
> and found compatibility errors
> {code:java}
> [error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 
> potential problems (filtered 2) 
> [error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
> (filtered 3) 
> [error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
> compatibility check against 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
> problems (filtered 1) 
> [error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
> (filtered 1451) 
> [error] (streaming / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 
> potential problems (filtered 4) 
> [error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
> (filtered 579) 
> [error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
> (filtered 1047)
> {code}
> I could not take a decision on my own in those cases, so help is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Denis Pyshev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184716#comment-17184716
 ] 

Denis Pyshev edited comment on SPARK-32702 at 8/25/20, 8:07 PM:


>  The "not analyzing binary compatibility" messages look as expected.

There are other modules that are excluded from analysis in the settings, while these ones are not listed and are therefore checked. I assume I could add them to the list of exclusions then, right?

> Hm, one strange thing here is that it seems to be comparing vs 2.4.0
It is per the setting in master 
[https://github.com/apache/spark/blob/master/project/MimaBuild.scala#L91] (and 
at the point when I started the changeset in the PR). Will try to run against 3.0.0.

 


was (Author: gemelen):
>  The "not analyzing binary compatibility" messages look as expected.

There are other modules, that are excluded for analysis in settings, while 
those are not listed and thus are tried. I assume that I could add them to list 
of exclusions then, right?

> Hm, one strange thing here is that it seems to be comparing vs 2.4.0
It is per setting in master 
[https://github.com/apache/spark/blob/master/project/MimaBuild.scala#L91] (and 
at the point when I started changeset in PR). Will try to run against 3.0.0.

 

> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
> Attachments: core.txt, graphx.txt, mllib-local.txt, mllib.txt, 
> sql.txt, streaming-kafka-0-10.txt, streaming.txt
>
>
> As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
> plugin) needs to be upgraded too.
> After changes in build file to apply plugin changes between versions 0.3.0 
> and 0.7.0 binary incompatiblity errors are found during 
> `mimaReportBinaryIssues` task run.
> In summary there are of two types:
> unclear, if should be checked or excluded
> {code:java}
> [error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
> analyzing binary compatibility 
> [error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
> mimaPreviousArtifacts not set, not analyzing binary compatibility
> {code}
> and found compatibility errors
> {code:java}
> [error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 
> potential problems (filtered 2) 
> [error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
> (filtered 3) 
> [error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
> compatibility check against 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
> problems (filtered 1) 
> [error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
> (filtered 1451) 
> [error] (streaming / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 
> potential problems (filtered 4) 
> [error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
> (filtered 579) 
> [error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
> (filtered 1047)
> {code}
> I could not take a decision on my own in those cases, so help is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Denis Pyshev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184716#comment-17184716
 ] 

Denis Pyshev commented on SPARK-32702:
--

>  The "not analyzing binary compatibility" messages look as expected.

There are other modules that are excluded from analysis in the settings, while these ones are not listed and are therefore checked. I assume I could add them to the list of exclusions then, right?

> Hm, one strange thing here is that it seems to be comparing vs 2.4.0
It is per the setting in master 
[https://github.com/apache/spark/blob/master/project/MimaBuild.scala#L91] (and 
at the point when I started the changeset in the PR). Will try to run against 3.0.0.
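
For reference, a minimal sketch of how such a sub-project is typically opted out in the sbt build (this uses the sbt-mima-plugin key; exactly where Spark wires this up is not shown here, and the module coordinates are illustrative):

{code:scala}
// Sketch: an empty mimaPreviousArtifacts set means "skip the MiMa check for this module".
mimaPreviousArtifacts := Set.empty

// Whereas a module that should be checked points at the previous release, e.g.:
mimaPreviousArtifacts := Set("org.apache.spark" %% "spark-sql" % "3.0.0")
{code}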

 

> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
> Attachments: core.txt, graphx.txt, mllib-local.txt, mllib.txt, 
> sql.txt, streaming-kafka-0-10.txt, streaming.txt
>
>
> As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
> plugin) needs to be upgraded too.
> After changes in build file to apply plugin changes between versions 0.3.0 
> and 0.7.0 binary incompatiblity errors are found during 
> `mimaReportBinaryIssues` task run.
> In summary there are of two types:
> unclear, if should be checked or excluded
> {code:java}
> [error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
> analyzing binary compatibility 
> [error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
> mimaPreviousArtifacts not set, not analyzing binary compatibility
> {code}
> and found compatibility errors
> {code:java}
> [error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 
> potential problems (filtered 2) 
> [error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
> (filtered 3) 
> [error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
> compatibility check against 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
> problems (filtered 1) 
> [error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
> (filtered 1451) 
> [error] (streaming / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 
> potential problems (filtered 4) 
> [error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
> (filtered 579) 
> [error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
> (filtered 1047)
> {code}
> I could not take a decision on my own in those cases, so help is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2020-08-25 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32703:
-
Description: Parquet vectorized reader still uses the old API for 
{{filterRowGroups}} and only filters on statistics. It should switch to the new 
API and do dictionary filtering as well.  (was: Dictionary filtering was 
disabled in SPARK-26677 due to a Parquet bug in 1.10.0 but was fixed 1.10.1. 
Since we've upgraded to the latter version, we should consider to re-enable 
this feature.)

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> Parquet vectorized reader still uses the old API for {{filterRowGroups}} and 
> only filters on statistics. It should switch to the new API and do dictionary 
> filtering as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184708#comment-17184708
 ] 

Sean R. Owen commented on SPARK-32702:
--

The "not analyzing binary compatibility" messages look as expected.

So I think MiMa is stricter now or something?
Presumably, hopefully, anything the old version didn't catch isn't a real issue.

Hm, one strange thing here is that it seems to be comparing against 2.4.0. 
MimaBuild.scala should at least be comparing against 3.0.0 at this point in master. 
Maybe that would reduce the changes it's looking at, and thus the new additional 
errors, so I think that's valid to fix.

I think we'd have to review the new warnings and double-check them, yeah, but 
I'd assume they're false positives. Therefore it'd be OK to suppress them in a 
PR and take a look. But is this saying there are thousands more? Wow. We may 
need to bulk-disable some rules or something.

PS: I think it would also be OK to remove the sections of MimaExcludes.scala 
pertaining to, say, Spark 2. Those aren't relevant in the build at master and 
may accidentally retain an exclusion that is no longer valid, covering up an 
actual change.
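
For reviewed false positives, a minimal sketch of what a suppression in project/MimaExcludes.scala typically looks like (the ProblemFilters API is MiMa's own; the excluded member name below is purely hypothetical):

{code:scala}
import com.typesafe.tools.mima.core._

// Hypothetical example entry; real entries name the actual member flagged by MiMa.
lazy val v31excludes = Seq(
  ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.example.SomeClass.removedMethod")
)
{code}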

> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
> Attachments: core.txt, graphx.txt, mllib-local.txt, mllib.txt, 
> sql.txt, streaming-kafka-0-10.txt, streaming.txt
>
>
> As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
> plugin) needs to be upgraded too.
> After changes in build file to apply plugin changes between versions 0.3.0 
> and 0.7.0 binary incompatiblity errors are found during 
> `mimaReportBinaryIssues` task run.
> In summary there are of two types:
> unclear, if should be checked or excluded
> {code:java}
> [error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
> analyzing binary compatibility 
> [error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
> mimaPreviousArtifacts not set, not analyzing binary compatibility
> {code}
> and found compatibility errors
> {code:java}
> [error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 
> potential problems (filtered 2) 
> [error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
> (filtered 3) 
> [error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
> compatibility check against 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
> problems (filtered 1) 
> [error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
> (filtered 1451) 
> [error] (streaming / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 
> potential problems (filtered 4) 
> [error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
> (filtered 579) 
> [error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
> (filtered 1047)
> {code}
> I could not take a decision on my own in those cases, so help is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32703) Enable dictionary filtering for Parquet vectorized reader

2020-08-25 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-32703:
-
Summary: Enable dictionary filtering for Parquet vectorized reader  (was: 
Re-enable dictionary filtering for Parquet)

> Enable dictionary filtering for Parquet vectorized reader
> -
>
> Key: SPARK-32703
> URL: https://issues.apache.org/jira/browse/SPARK-32703
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> Dictionary filtering was disabled in SPARK-26677 due to a Parquet bug in 
> 1.10.0 but was fixed 1.10.1. Since we've upgraded to the latter version, we 
> should consider to re-enable this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32703) Re-enable dictionary filtering for Parquet

2020-08-25 Thread Chao Sun (Jira)
Chao Sun created SPARK-32703:


 Summary: Re-enable dictionary filtering for Parquet
 Key: SPARK-32703
 URL: https://issues.apache.org/jira/browse/SPARK-32703
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Chao Sun


Dictionary filtering was disabled in SPARK-26677 due to a Parquet bug in 1.10.0 
that was fixed in 1.10.1. Since we've upgraded to the latter version, we should 
consider re-enabling this feature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Denis Pyshev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Pyshev updated SPARK-32702:
-
Attachment: core.txt
mllib.txt
streaming.txt
sql.txt
streaming-kafka-0-10.txt
graphx.txt
mllib-local.txt

> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
> Attachments: core.txt, graphx.txt, mllib-local.txt, mllib.txt, 
> sql.txt, streaming-kafka-0-10.txt, streaming.txt
>
>
> As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
> plugin) needs to be upgraded too.
> After changes in build file to apply plugin changes between versions 0.3.0 
> and 0.7.0 binary incompatiblity errors are found during 
> `mimaReportBinaryIssues` task run.
> In summary there are of two types:
> unclear, if should be checked or excluded
> {code:java}
> [error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, 
> not analyzing binary compatibility 
> [error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
> analyzing binary compatibility 
> [error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
> mimaPreviousArtifacts not set, not analyzing binary compatibility
> {code}
> and found compatibility errors
> {code:java}
> [error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 
> potential problems (filtered 2) 
> [error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
> (filtered 3) 
> [error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
> compatibility check against 
> org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
> problems (filtered 1) 
> [error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
> (filtered 1451) 
> [error] (streaming / mimaReportBinaryIssues) Failed binary compatibility 
> check against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 
> potential problems (filtered 4) 
> [error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
> (filtered 579) 
> [error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
> against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
> (filtered 1047)
> {code}
> I could not take a decision on my own in those cases, so help is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Denis Pyshev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Pyshev updated SPARK-32702:
-
Description: 
As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
plugin) needs to be upgraded too.

After changes in the build file to apply plugin changes between versions 0.3.0 and 
0.7.0, binary incompatibility errors are found during the `mimaReportBinaryIssues` 
task run.

In summary, they are of two types:

unclear, if should be checked or excluded
{code:java}
[error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
mimaPreviousArtifacts not set, not analyzing binary compatibility
{code}
and found compatibility errors
{code:java}
[error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 potential 
problems (filtered 2) 
[error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
(filtered 3) 
[error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
compatibility check against 
org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
problems (filtered 1) 
[error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
(filtered 1451) 
[error] (streaming / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 potential 
problems (filtered 4) 
[error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
(filtered 579) 
[error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
(filtered 1047)
{code}
I could not take a decision on my own in those cases, so help is needed.

  was:
As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
plugin) needs to be upgraded too.

After changes in build file to apply plugin changes between versions 0.3.0 and 
0.7.0 binary incompatiblity errors are found during `` run.

In summary there are of two types:

unclear, if should be checked or excluded
{code:java}
[error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
mimaPreviousArtifacts not set, not analyzing binary compatibility
{code}
and found compatibility errors
{code:java}
[error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 potential 
problems (filtered 2) 
[error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
(filtered 3) 
[error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
compatibility check against 
org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
problems (filtered 1) 
[error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
(filtered 1451) 
[error] (streaming / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 potential 
problems (filtered 4) 
[error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
(filtered 579) 
[error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
(filtered 1047)
{code}
I could not take a decision on my own in those cases, so help is needed.


> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> As a part of upgrade to SBT 1.x it was found that MiMa (in form of 

[jira] [Updated] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Denis Pyshev (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Pyshev updated SPARK-32702:
-
Description: 
As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
plugin) needs to be upgraded too.

After changes in the build file to apply plugin changes between versions 0.3.0 and 
0.7.0, binary incompatibility errors are found during the `` run.

In summary, they are of two types:

unclear, if should be checked or excluded
{code:java}
[error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (assembly / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility 
[error] (streaming-kafka-0-10-assembly / mimaReportBinaryIssues) 
mimaPreviousArtifacts not set, not analyzing binary compatibility
{code}
and found compatibility errors
{code:java}
[error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 potential 
problems (filtered 2) 
[error] (graphx / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 potential problems 
(filtered 3) 
[error] (streaming-kafka-0-10 / mimaReportBinaryIssues) Failed binary 
compatibility check against 
org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
problems (filtered 1) 
[error] (sql / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 potential problems 
(filtered 1451) 
[error] (streaming / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 potential 
problems (filtered 4) 
[error] (mllib / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 potential problems 
(filtered 579) 
[error] (core / mimaReportBinaryIssues) Failed binary compatibility check 
against org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems 
(filtered 1047)
{code}
I could not take a decision on my own in those cases, so help is needed.

  was:
As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
plugin) needs to be upgraded too.

After changes in build file to apply plugin changes between versions 0.3.0 and 
0.7.0 binary incompatiblity errors are found during `` run.

In summary there are of two types:

unclear, if should be checked or excluded
{code:java}
[error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility [error] (assembly / mimaReportBinaryIssues) 
mimaPreviousArtifacts not set, not analyzing binary compatibility [error] 
(tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not analyzing 
binary compatibility [error] (streaming-kafka-0-10-assembly / 
mimaReportBinaryIssues) mimaPreviousArtifacts not set, not analyzing binary 
compatibility
{code}
and found compatibility errors
{code:java}
[error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 potential 
problems (filtered 2) [error] (graphx / mimaReportBinaryIssues) Failed binary 
compatibility check against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 
potential problems (filtered 3) [error] (streaming-kafka-0-10 / 
mimaReportBinaryIssues) Failed binary compatibility check against 
org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
problems (filtered 1) [error] (sql / mimaReportBinaryIssues) Failed binary 
compatibility check against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 
potential problems (filtered 1451) [error] (streaming / mimaReportBinaryIssues) 
Failed binary compatibility check against 
org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 potential problems 
(filtered 4) [error] (mllib / mimaReportBinaryIssues) Failed binary 
compatibility check against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 
potential problems (filtered 579) [error] (core / mimaReportBinaryIssues) 
Failed binary compatibility check against 
org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems (filtered 
1047)
{code}
I could not take a decision on my own in those cases, so help is needed.


> Update MiMa plugin
> --
>
> Key: SPARK-32702
> URL: https://issues.apache.org/jira/browse/SPARK-32702
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Denis Pyshev
>Priority: Minor
>
> As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
> plugin) needs to be 

[jira] [Created] (SPARK-32702) Update MiMa plugin

2020-08-25 Thread Denis Pyshev (Jira)
Denis Pyshev created SPARK-32702:


 Summary: Update MiMa plugin
 Key: SPARK-32702
 URL: https://issues.apache.org/jira/browse/SPARK-32702
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 3.1.0
Reporter: Denis Pyshev


As a part of upgrade to SBT 1.x it was found that MiMa (in form of sbt-mima 
plugin) needs to be upgraded too.

After changes in the build file to apply plugin changes between versions 0.3.0 and 
0.7.0, binary incompatibility errors are found during the `` run.

In summary, they are of two types:

unclear, if should be checked or excluded
{code:java}
[error] (examples / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not 
analyzing binary compatibility [error] (assembly / mimaReportBinaryIssues) 
mimaPreviousArtifacts not set, not analyzing binary compatibility [error] 
(tools / mimaReportBinaryIssues) mimaPreviousArtifacts not set, not analyzing 
binary compatibility [error] (streaming-kafka-0-10-assembly / 
mimaReportBinaryIssues) mimaPreviousArtifacts not set, not analyzing binary 
compatibility
{code}
and found compatibility errors
{code:java}
[error] (mllib-local / mimaReportBinaryIssues) Failed binary compatibility 
check against org.apache.spark:spark-mllib-local_2.12:2.4.0! Found 2 potential 
problems (filtered 2) [error] (graphx / mimaReportBinaryIssues) Failed binary 
compatibility check against org.apache.spark:spark-graphx_2.12:2.4.0! Found 3 
potential problems (filtered 3) [error] (streaming-kafka-0-10 / 
mimaReportBinaryIssues) Failed binary compatibility check against 
org.apache.spark:spark-streaming-kafka-0-10_2.12:2.4.0! Found 6 potential 
problems (filtered 1) [error] (sql / mimaReportBinaryIssues) Failed binary 
compatibility check against org.apache.spark:spark-sql_2.12:2.4.0! Found 24 
potential problems (filtered 1451) [error] (streaming / mimaReportBinaryIssues) 
Failed binary compatibility check against 
org.apache.spark:spark-streaming_2.12:2.4.0! Found 124 potential problems 
(filtered 4) [error] (mllib / mimaReportBinaryIssues) Failed binary 
compatibility check against org.apache.spark:spark-mllib_2.12:2.4.0! Found 636 
potential problems (filtered 579) [error] (core / mimaReportBinaryIssues) 
Failed binary compatibility check against 
org.apache.spark:spark-core_2.12:2.4.0! Found 1030 potential problems (filtered 
1047)
{code}
I could not take a decision on my own in those cases, so help is needed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32701) mapreduce.fileoutputcommitter.algorithm.version default depends on runtime environment

2020-08-25 Thread Waleed Fateem (Jira)
Waleed Fateem created SPARK-32701:
-

 Summary: mapreduce.fileoutputcommitter.algorithm.version default 
depends on runtime environment
 Key: SPARK-32701
 URL: https://issues.apache.org/jira/browse/SPARK-32701
 Project: Spark
  Issue Type: Bug
  Components: docs, Documentation
Affects Versions: 3.0.0, 2.4.0
Reporter: Waleed Fateem


When someone reads the documentation in its current state, the assumption is 
that the default value of 
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 and that's 
not entirely accurate. 

Spark doesn't explicitly set this configuration; instead, it is inherited from 
Hadoop's FileOutputCommitter class. The default value is 1 until Hadoop 3.0, 
where it changed to 2.

I'm proposing that we clarify that this value's default will depend on the 
Hadoop version in a user's runtime environment, where:

1 for < Hadoop 3.0
2 for >= Hadoop 3.0

There are also plans to revert this default back to v1, so it might also be useful 
to reference this JIRA:
https://issues.apache.org/jira/browse/MAPREDUCE-7282
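
For users who want behavior that does not depend on the bundled Hadoop version, a minimal sketch is to pin the value explicitly when building the session (the config key is the one discussed above; choosing "1" here is just illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

// Pin the committer algorithm so the effective default no longer depends on the Hadoop version.
val spark = SparkSession.builder()
  .appName("pin-committer-algorithm")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()
{code}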

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2020-08-25 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184623#comment-17184623
 ] 

Cheng Su commented on SPARK-26164:
--

Just an update: I discussed this with [~cloud_fan], this Jira is still valid, and I am 
working on a new PR against the latest master. Thanks.

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Current spark always requires a local sort before writing to output table on 
> partition/bucket columns [1]. The disadvantage is the sort might waste 
> reserved CPU time on executor due to spill. Hive does not require the local 
> sort before writing output table [2], and we saw performance regression when 
> migrating hive workload to spark.
>  
> Proposal:
> We can avoid the local sort by keeping the mapping between file path and 
> output writer. In case of writing row to a new file path, we create a new 
> output writer. Otherwise, re-use the same output writer if the writer already 
> exists (mainly change should be in FileFormatDataWriter.scala). This is very 
> similar to what hive does in [2].
> Given the new behavior (i.e. avoid sort by keeping multiple output writer) 
> consumes more memory on executor (multiple output writer needs to be opened 
> in same time), than the current behavior (i.e. only one output writer 
> opened). We can add the config to switch between the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  
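
A conceptual sketch of the proposed writer reuse (simplified types, not FileFormatDataWriter's actual classes; all names below are made up):

{code:scala}
import scala.collection.mutable

// Simplified stand-ins for Spark's internal writer types.
trait RowWriter { def write(row: Any): Unit; def close(): Unit }

class MultiPathWriter(newWriter: String => RowWriter) {
  private val writers = mutable.HashMap.empty[String, RowWriter]

  // Reuse the writer for an already-seen partition/bucket path; open a new one otherwise.
  def write(partitionPath: String, row: Any): Unit =
    writers.getOrElseUpdate(partitionPath, newWriter(partitionPath)).write(row)

  def closeAll(): Unit = writers.values.foreach(_.close())
}
{code}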



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32694) Pushdown cast to data sources

2020-08-25 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184614#comment-17184614
 ] 

Chao Sun commented on SPARK-32694:
--

Thanks [~rakson] for the pointer! I didn't know there were multiple attempts on 
this issue in the past. I'll think more about the best approach to move this 
forward.

> Pushdown cast to data sources
> -
>
> Key: SPARK-32694
> URL: https://issues.apache.org/jira/browse/SPARK-32694
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Priority: Major
>
> Currently we don't support pushing down cast to data source (see 
> [link|http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSql-Casting-of-Predicate-Literals-tp29956p30035.html]
>  for a discussion). For instance, in the following code snippet:
> {code}
> scala> case class Person(name: String, age: Short)
> scala> Seq(Person("John", 32), Person("David", 25), Person("Mike", 
> 18)).toDS().write.parquet("/tmp/person.parquet")
> scala> val personDS = spark.read.parquet("/tmp/person.parquet")
> scala> personDS.createOrReplaceTempView("person")
> scala> spark.sql("SELECT * FROM person where age < 30")
> {code}
> The predicate won't be pushed down to Parquet data source because in 
> {{DataSourceStrategy}}, {{PushableColumnBase}} only handles a few limited 
> cases such as {{Attribute}} and {{GetStructField}}. Potentially we can handle 
> {{Cast}} here as well.
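
A conceptual sketch of the idea (simplified, hypothetical expression types, not Catalyst's real classes): strip a safe, widening cast around a column reference before deciding whether the predicate column is pushable.

{code:scala}
// Simplified, hypothetical expression tree -- for illustration only.
sealed trait Expr
case class Column(name: String) extends Expr
case class Cast(child: Expr, toType: String) extends Expr

def pushableColumn(e: Expr): Option[String] = e match {
  case Column(name)   => Some(name)
  // Only valid when the cast cannot change comparison semantics (e.g. short -> int widening).
  case Cast(child, _) => pushableColumn(child)
  case _              => None
}
{code}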



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32699) Add percentage of missingness to df.summary()

2020-08-25 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184176#comment-17184176
 ] 

Sean R. Owen commented on SPARK-32699:
--

Pretty easy to compute this if needed, no?
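
For example, a minimal sketch with the DataFrame API (df is assumed to be the DataFrame in question):

{code:scala}
import org.apache.spark.sql.functions.{col, count, lit, round}

// count(col) skips nulls, so count / total gives the non-null fraction per column.
val total = df.count()
val nonNullPct = df.select(df.columns.map { c =>
  round(count(col(c)) / lit(total) * 100, 2).alias(s"${c}_non_null_pct")
}: _*)
nonNullPct.show()
{code}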

> Add percentage of missingness to df.summary()
> -
>
> Key: SPARK-32699
> URL: https://issues.apache.org/jira/browse/SPARK-32699
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Chengyin Eng
>Priority: Major
>
>  
> h2. In df.summary(), we are returned counts of non-nulls for each column. It 
> would be really helpful to have a percentage of non-nulls, since percentage 
> of missingness is often an indicator for data scientists to discard the 
> columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32699) Add percentage of missingness to df.summary()

2020-08-25 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-32699:
-
Priority: Minor  (was: Major)

> Add percentage of missingness to df.summary()
> -
>
> Key: SPARK-32699
> URL: https://issues.apache.org/jira/browse/SPARK-32699
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Chengyin Eng
>Priority: Minor
>
>  
> h2. In df.summary(), we are returned counts of non-nulls for each column. It 
> would be really helpful to have a percentage of non-nulls, since percentage 
> of missingness is often an indicator for data scientists to discard the 
> columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32700) select from table TABLESAMPLE gives wrong resultset.

2020-08-25 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184175#comment-17184175
 ] 

Sean R. Owen commented on SPARK-32700:
--

I don't think you're guaranteed to get exactly 50% here. With a small data set, 
due to random sampling, getting even 3 rows out of 10 happens with not-small probability.
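
Assuming Bernoulli-style sampling, each row is kept independently with probability 0.5, so getting exactly 3 of 10 rows has probability C(10,3)/2^10, roughly 12%. If an exact row count is needed, the ROWS form of TABLESAMPLE (or LIMIT) can be used instead, e.g.:

{code:scala}
// Deterministic row count instead of a probabilistic percentage.
spark.sql("SELECT * FROM test TABLESAMPLE (5 ROWS)").show()
{code}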

> select from table TABLESAMPLE gives wrong resultset.
> 
>
> Key: SPARK-32700
> URL: https://issues.apache.org/jira/browse/SPARK-32700
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Spark 3.0.0
>Reporter: Chetan Bhat
>Priority: Minor
>
> create table test(id int,name string) stored as parquet;
> insert into test values 
> (5,'Alex'),(8,'Lucy'),(2,'Mary'),(4,'Fred'),(1,'Lisa'),(9,'Eric'),(10,'Adam'),(6,'Mark'),(7,'Lily'),(3,'Evan');
> SELECT * FROM test TABLESAMPLE (50 PERCENT); --> output is giving only 3 rows.
> spark-sql> SELECT * FROM test TABLESAMPLE (50 PERCENT);
> 5 Alex
> 10 Adam
> 4 Fred
>  
> Expected as per the link is 5 rows 
> -->[https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html]
>  
> Also the bucket parameter for select from table TABLESAMPLE gives wrong 
> resultset.
> spark-sql> SELECT * FROM test TABLESAMPLE (BUCKET 4 OUT OF 10);
> 5 Alex
> 8 Lucy
> 9 Eric
> 1 Lisa
> 3 Evan
> Expected is 4 records.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Target Version/s: 3.0.1

> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{Set}}'s contains function has better performance:
> https://www.baeldung.com/java-hashset-arraylist-contains-performance
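
A tiny sketch of the lookup-cost difference being referenced (standard Scala collections, nothing Spark-specific):

{code:scala}
val values: Array[Int] = (1 to 100000).toArray
val valueSet = values.toSet

values.contains(99999)   // linear scan: O(n) per lookup
valueSet.contains(99999) // hash lookup: O(1) expected per lookup
{code}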



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Issue Type: Bug  (was: Improvement)

> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{Set}}'s contains function has better performance:
> https://www.baeldung.com/java-hashset-arraylist-contains-performance



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-25 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-32659:

Summary: Fix the data issue of inserted DPP on non-atomic type  (was: 
Replace Array with Set in InSubqueryExec)

> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{Set}}'s contains function has better performance:
> https://www.baeldung.com/java-hashset-arraylist-contains-performance



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-08-25 Thread Micah Kornfield (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184144#comment-17184144
 ] 

Micah Kornfield commented on SPARK-32037:
-

FWIW, one argument for using 'denylist' is that it is the shortest option (which 
only matters when all other things are equal).

 

An argument against 'blocklist' is that it is easy to have regressions via typo, 
since it differs from 'blacklist' by only a single letter.

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32614.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29516
[https://github.com/apache/spark/pull/29516]

> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.3, 2.4.5, 3.0.0
>Reporter: Chandan Ray
>Assignee: Sean R. Owen
>Priority: Minor
>  Labels: correctness
> Fix For: 3.0.1, 3.1.0
>
> Attachments: screenshot-1.png
>
>
> In most data warehousing scenarios, files do not have comment records, and every 
> line needs to be treated as a valid record even if it starts with the default 
> comment character \u0000 (the null character). Though the user can set a comment 
> character other than \u0000, there is a chance that actual records start with those 
> characters.
> Currently, for the below piece of code and the given test data, where the first row 
> starts with the null (\u0000) character, the below error is thrown.
> *eg: val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);*
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though this is a limitation of the univocity parser, and the workaround is to 
> provide another comment character via .option("comment","#"), if the actual data 
> starts with that character then the corresponding row will be discarded.
> I have pushed code to the univocity parser to handle this scenario as part of the 
> below PR:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept the JIRA so that we can enable this feature in spark-csv by adding a 
> parameter to the Spark CSV options.
>  
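
To make the workaround and its limitation concrete, a hedged spark-shell sketch (the path is the one from the report; any CSV file will do):

{code:java}
// Overriding the comment character avoids the \u0000 parse failure, but any row
// that genuinely begins with the chosen character ('#' here) is silently dropped.
val df = spark.read
  .option("delimiter", ",")
  .option("comment", "#")   // anything other than the default \u0000
  .csv("file:/E:/Data/Testdata.dat")
df.show(false)
{code}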



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32614) Support for treating the line as valid record if it starts with \u0000 or null character, or starts with any character mentioned as comment

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32614:


Assignee: Sean R. Owen

> Support for treating the line as valid record if it starts with \u0000 or 
> null character, or starts with any character mentioned as comment
> ---
>
> Key: SPARK-32614
> URL: https://issues.apache.org/jira/browse/SPARK-32614
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.3, 2.4.5, 3.0.0
>Reporter: Chandan Ray
>Assignee: Sean R. Owen
>Priority: Minor
>  Labels: correctness
> Attachments: screenshot-1.png
>
>
> In most data warehousing scenarios, files do not have comment records, and every 
> line needs to be treated as a valid record even if it starts with the default 
> comment character \u0000 (the null character). Though the user can set a comment 
> character other than \u0000, there is a chance that actual records start with those 
> characters.
> Currently, for the below piece of code and the given test data, where the first row 
> starts with the null (\u0000) character, the below error is thrown.
> *eg: val df = 
> spark.read.option("delimiter",",").csv("file:/E:/Data/Testdata.dat");
>   df.show(false);*
> *+TestData+*
>  
>  !screenshot-1.png! 
> Internal state when error was thrown: line=1, column=0, record=0, charIndex=7
>   at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
>   at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:552)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.inferFromDataset(CSVDataSource.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.csv.TextInputCSVDataSource$.infer(CSVDataSource.scala:148)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVDataSource.inferSchema(CSVDataSource.scala:62)
>   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:57)
> *Note:*
> Though this is a limitation of the univocity parser, and the workaround is to 
> provide another comment character via .option("comment","#"), if the actual data 
> starts with that character then the corresponding row will be discarded.
> I have pushed code to the univocity parser to handle this scenario as part of the 
> below PR:
> https://github.com/uniVocity/univocity-parsers/pull/412
> Please accept the JIRA so that we can enable this feature in spark-csv by adding a 
> parameter to the Spark CSV options.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32691:
--
Environment: ARM64  (was: ARM)

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Priority: Major
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32691:
--
Environment: ARM

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM
>Reporter: huangtianhua
>Priority: Major
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32691:
--
Component/s: Spark Core

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
>Reporter: huangtianhua
>Priority: Major
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-25 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-32691:
--
Issue Type: Bug  (was: Test)

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: huangtianhua
>Priority: Major
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184122#comment-17184122
 ] 

Dongjoon Hyun commented on SPARK-32691:
---

Thank you for reporting, [~huangtianhua]. Yes. I suspects "with replication as 
stream" code path is related to this on ARM.

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: huangtianhua
>Priority: Major
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-25 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184122#comment-17184122
 ] 

Dongjoon Hyun edited comment on SPARK-32691 at 8/25/20, 3:18 PM:
-

Thank you for reporting, [~huangtianhua]. Yes. I suspect "with replication as 
stream" code path is related to this on ARM.


was (Author: dongjoon):
Thank you for reporting, [~huangtianhua]. Yes. I suspects "with replication as 
stream" code path is related to this on ARM.

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: huangtianhua
>Priority: Major
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32700) select from table TABLESAMPLE gives wrong resultset.

2020-08-25 Thread Chetan Bhat (Jira)
Chetan Bhat created SPARK-32700:
---

 Summary: select from table TABLESAMPLE gives wrong resultset.
 Key: SPARK-32700
 URL: https://issues.apache.org/jira/browse/SPARK-32700
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: Spark 3.0.0
Reporter: Chetan Bhat


create table test(id int,name string) stored as parquet;
insert into test values 
(5,'Alex'),(8,'Lucy'),(2,'Mary'),(4,'Fred'),(1,'Lisa'),(9,'Eric'),(10,'Adam'),(6,'Mark'),(7,'Lily'),(3,'Evan');
SELECT * FROM test TABLESAMPLE (50 PERCENT); --> output is giving only 3 rows.
spark-sql> SELECT * FROM test TABLESAMPLE (50 PERCENT);
5 Alex
10 Adam
4 Fred

 

Expected output as per the linked documentation is 5 rows: 
-->[https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-sampling.html]

 

Also the bucket parameter for select from table TABLESAMPLE gives wrong 
resultset.
spark-sql> SELECT * FROM test TABLESAMPLE (BUCKET 4 OUT OF 10);
5 Alex
8 Lucy
9 Eric
1 Lisa
3 Evan

Expected is 4 records.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32110) -0.0 vs 0.0 is inconsistent

2020-08-25 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184101#comment-17184101
 ] 

Takeshi Yamamuro commented on SPARK-32110:
--

Could we provide a boolean configuration (off by default, to avoid the breaking 
change) for normalizing it in the write path? If it is set to true, we enable 
write-path normalization and disable read-path normalization (e.g., 
`NormalizeFloatingNumbers`). IMHO, write-path normalization looks straightforward 
(and lower overhead?), and it might still be meaningful for users who want to use 
Spark like a database system.
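
To illustrate the write-path idea from the user side (only a sketch; no such Spark configuration exists today), the zeros can be normalized explicitly before they are compared or persisted:

{code:java}
// spark-shell sketch: adding a 0.0 literal relies on IEEE 754 (-0.0 + 0.0 == +0.0),
// so both zeros collapse to +0.0 and the comparison from the report becomes true.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((0.0, -0.0)).toDF("a", "b")
val normalized = df.select((col("a") + lit(0.0)).as("a"), (col("b") + lit(0.0)).as("b"))
normalized.selectExpr("a = b").show()  // true, whereas the report shows false without normalization
{code}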


> -0.0 vs 0.0 is inconsistent
> ---
>
> Key: SPARK-32110
> URL: https://issues.apache.org/jira/browse/SPARK-32110
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Robert Joseph Evans
>Priority: Major
>
> This is related to SPARK-26021 where some things were fixed but there is 
> still a lot that is not consistent.
> When parsing SQL {{-0.0}} is turned into {{0.0}}. This can produce quick 
> results that appear to be correct but are totally inconsistent for the same 
> operators.
> {code:java}
> scala> import spark.implicits._
> import spark.implicits._
> scala> spark.sql("SELECT 0.0 = -0.0").collect
> res0: Array[org.apache.spark.sql.Row] = Array([true])
> scala> Seq((0.0, -0.0)).toDF("a", "b").selectExpr("a = b").collect
> res1: Array[org.apache.spark.sql.Row] = Array([false])
> {code}
> This also shows up in sorts
> {code:java}
> scala> Seq((0.0, -100.0), (-0.0, 100.0), (0.0, 100.0), (-0.0, 
> -100.0)).toDF("a", "b").orderBy("a", "b").collect
> res2: Array[org.apache.spark.sql.Row] = Array([-0.0,-100.0], [-0.0,100.0], 
> [0.0,-100.0], [0.0,100.0])
> {code}
> But not for a equi-join or for an aggregate
> {code:java}
> scala> Seq((0.0, -0.0)).toDF("a", "b").join(Seq((-0.0, 0.0)).toDF("r_a", 
> "r_b"), $"a" === $"r_a").collect
> res3: Array[org.apache.spark.sql.Row] = Array([0.0,-0.0,-0.0,0.0])
> scala> Seq((0.0, 1.0), (-0.0, 1.0)).toDF("a", "b").groupBy("a").count.collect
> res6: Array[org.apache.spark.sql.Row] = Array([0.0,2])
> {code}
> This can lead to some very odd results. Like an equi-join with a filter that 
> logically should do nothing, but ends up filtering the result to nothing.
> {code:java}
> scala> Seq((0.0, -0.0)).toDF("a", "b").join(Seq((-0.0, 0.0)).toDF("r_a", 
> "r_b"), $"a" === $"r_a" && $"a" <= $"r_a").collect
> res8: Array[org.apache.spark.sql.Row] = Array()
> scala> Seq((0.0, -0.0)).toDF("a", "b").join(Seq((-0.0, 0.0)).toDF("r_a", 
> "r_b"), $"a" === $"r_a").collect
> res9: Array[org.apache.spark.sql.Row] = Array([0.0,-0.0,-0.0,0.0])
> {code}
> Hive never normalizes -0.0 to 0.0 so this results in non-ieee complaint 
> behavior everywhere, but at least it is consistently odd.
> MySQL, Oracle, Postgres, and SQLite all appear to normalize the {{-0.0}} to 
> {{0.0}}.
> The root cause of this appears to be that the java implementation of 
> {{Double.compare}} and {{Float.compare}} for open JDK places {{-0.0}} < 
> {{0.0}}.
> This is not documented in the java docs but it is clearly documented in the 
> code, so it is not a "bug" that java is going to fix.
> [https://github.com/openjdk/jdk/blob/a0a0539b0d3f9b6809c9759e697bfafd7b138ec1/src/java.base/share/classes/java/lang/Double.java#L1022-L1035]
> It is also consistent with what is in the java docs for {{Double.equals}}
>  
> [https://docs.oracle.com/javase/8/docs/api/java/lang/Double.html#equals-java.lang.Object-]
> To be clear I am filing this mostly to document the current state rather than 
> to think it needs to be fixed ASAP. It is a rare corner case, but ended up 
> being really frustrating for me to debug what was happening.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32699) Add percentage of missingness to df.summary()

2020-08-25 Thread Chengyin Eng (Jira)
Chengyin Eng created SPARK-32699:


 Summary: Add percentage of missingness to df.summary()
 Key: SPARK-32699
 URL: https://issues.apache.org/jira/browse/SPARK-32699
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 3.0.0
Reporter: Chengyin Eng


 
h2. In df.summary(), we are returned counts of non-nulls for each column. It would 
be really helpful to also have the percentage of non-nulls, since the percentage of 
missingness is often the indicator data scientists use to decide whether to discard 
a column.
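
Until summary() exposes this, a hedged spark-shell workaround sketch (the tiny DataFrame below is made up purely for illustration):

{code:java}
// Fraction of missing (null) values per column, computed in a single pass.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((Some(1), Some("a")), (None, Some("b")), (Some(3), None)).toDF("id", "name")
val total = df.count()
val missingness = df.select(df.columns.map(c =>
  (sum(when(col(c).isNull, 1).otherwise(0)) / total).alias(c)): _*)
missingness.show()  // ~0.33 for both columns in this toy example
{code}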



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies

2020-08-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31167:
-
Description: Ideally, we should have a single place to track Python 
development dependencies and reuse it in all the relevant places: developer 
docs, Dockerfile, and GitHub CI. Where appropriate, we should pin dependencies 
to ensure a reproducible Python environment.  (was: Ideally, we should have a 
single place to track Python development dependencies and reuse it in all the 
relevant places: developer docs, Dockerfile, )

> Refactor how we track Python test/build dependencies
> 
>
> Key: SPARK-31167
> URL: https://issues.apache.org/jira/browse/SPARK-31167
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Ideally, we should have a single place to track Python development 
> dependencies and reuse it in all the relevant places: developer docs, 
> Dockerfile, and GitHub CI. Where appropriate, we should pin dependencies to 
> ensure a reproducible Python environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31167) Refactor how we track Python test/build dependencies

2020-08-25 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31167:
-
Description: Ideally, we should have a single place to track Python 
development dependencies and reuse it in all the relevant places: developer 
docs, Dockerfile, 

> Refactor how we track Python test/build dependencies
> 
>
> Key: SPARK-31167
> URL: https://issues.apache.org/jira/browse/SPARK-31167
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Ideally, we should have a single place to track Python development 
> dependencies and reuse it in all the relevant places: developer docs, 
> Dockerfile, 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32664) Getting local shuffle block clutters the executor logs

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32664.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29527
[https://github.com/apache/spark/pull/29527]

> Getting local shuffle block clutters the executor logs
> --
>
> Key: SPARK-32664
> URL: https://issues.apache.org/jira/browse/SPARK-32664
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Daniel Moore
>Priority: Trivial
> Fix For: 3.1.0
>
>
> The below log statement in {{BlockManager.getLocalBlockData}} should be at 
> debug level
> {code:java}
> logInfo(s"Getting local shuffle block ${blockId}")
> {code}
> Currently, the executor logs get cluttered with this
> {code:java}
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6103_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6132_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6137_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6312_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6323_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6402_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6413_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6694_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6709_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6753_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6822_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6894_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6913_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7052_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7073_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7167_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7194_4964
> {code}
> This was added with SPARK-20629.
> cc. [~holden]
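
Until the statement is demoted to debug, a hedged operator-side workaround sketch (spark-shell; Spark 3.0/3.1 ship log4j 1.x, so the classic Logger API applies):

{code:java}
// Raise the level for the BlockManager logger only, leaving other logging untouched.
// This call affects the local JVM; for executors, the equivalent line
// (log4j.logger.org.apache.spark.storage.BlockManager=WARN) goes in log4j.properties.
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org.apache.spark.storage.BlockManager").setLevel(Level.WARN)
{code}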



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32664) Getting local shuffle block clutters the executor logs

2020-08-25 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32664:


Assignee: Daniel Moore

> Getting local shuffle block clutters the executor logs
> --
>
> Key: SPARK-32664
> URL: https://issues.apache.org/jira/browse/SPARK-32664
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: Chandni Singh
>Assignee: Daniel Moore
>Priority: Trivial
>
> The below log statement in {{BlockManager.getLocalBlockData}} should be at 
> debug level
> {code:java}
> logInfo(s"Getting local shuffle block ${blockId}")
> {code}
> Currently, the executor logs get cluttered with this
> {code:java}
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6103_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6132_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6137_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6312_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6323_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6402_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6413_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6694_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6709_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6753_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6822_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6894_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_6913_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7052_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7073_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7167_4964
> 20/08/20 02:07:52 INFO storage.BlockManager: Getting local shuffle block 
> shuffle_0_7194_4964
> {code}
> This was added with SPARK-20629.
> cc. [~holden]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32037) Rename blacklisting feature to avoid language with racist connotation

2020-08-25 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184059#comment-17184059
 ] 

Thomas Graves commented on SPARK-32037:
---

I started a thread on dev to get feedback: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Renaming-blacklisting-feature-input-td29950.html]

> Rename blacklisting feature to avoid language with racist connotation
> -
>
> Key: SPARK-32037
> URL: https://issues.apache.org/jira/browse/SPARK-32037
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Erik Krogen
>Priority: Minor
>
> As per [discussion on the Spark dev 
> list|https://lists.apache.org/thread.html/rf6b2cdcba4d3875350517a2339619e5d54e12e66626a88553f9fe275%40%3Cdev.spark.apache.org%3E],
>  it will be beneficial to remove references to problematic language that can 
> alienate potential community members. One such reference is "blacklist". 
> While it seems to me that there is some valid debate as to whether this term 
> has racist origins, the cultural connotations are inescapable in today's 
> world.
> I've created a separate task, SPARK-32036, to remove references outside of 
> this feature. Given the large surface area of this feature and the 
> public-facing UI / configs / etc., more care will need to be taken here.
> I'd like to start by opening up debate on what the best replacement name 
> would be. Reject-/deny-/ignore-/block-list are common replacements for 
> "blacklist", but I'm not sure that any of them work well for this situation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32333) Drop references to Master

2020-08-25 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184051#comment-17184051
 ] 

Thomas Graves commented on SPARK-32333:
---

I sent an email to the dev list to get feedback; some other suggestions from that 
email are below.
A few name possibilities:
 - ApplicationManager
 - StandaloneClusterManager
 - Coordinator
 - Primary
 - Controller
 
That chain can be found here: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Removing-references-to-Master-td29948.html]

> Drop references to Master
> -
>
> Key: SPARK-32333
> URL: https://issues.apache.org/jira/browse/SPARK-32333
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Major
>
> We have a lot of references to "master" in the code base. It will be 
> beneficial to remove references to problematic language that can alienate 
> potential community members. 
> SPARK-32004 removed references to slave
>  
> Here is an IETF draft that fixes up some of the most egregious examples
> (master/slave, whitelist/blacklist) with proposed alternatives.
> https://tools.ietf.org/id/draft-knodel-terminology-00.html#rfc.section.1.1.1



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32107) Dask faster than Spark with a lot less iterations and better accuracy

2020-08-25 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32107.
--
Resolution: Invalid

> Dask faster than Spark with a lot less iterations and better accuracy
> -
>
> Key: SPARK-32107
> URL: https://issues.apache.org/jira/browse/SPARK-32107
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.4.5
> Environment: Anaconda for Windows with PySpark 2.4.5
>Reporter: Julian
>Priority: Minor
>
> Hello,
> I'm benchmarking k-means clustering Dask versus Spark.
> Right now these are only benchmarks on my laptop, but I've some interesting 
> results and I'm looking for an explanation before I further benchmark this 
> algorithm on a cluster.
> I've logged the execution time, model cluster predictions, iterations. Both 
> benchmarks used the same data with 1.6 million rows.
> The questions are:
>  * Why does Spark need a lot more iterations than Dask?
>  * Why is clustering less accurate in Spark than in Dask?
> I'm unclear why those are different, because they both use the same 
> underlying algorithm and have more or less the same standard parameter.
> *Dask*
> KMeans( n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, 
> tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, 
> n_jobs=1, algorithm='full', init_max_iter=None, )
> *Spark*
>  I've set maxIter to 300 and reset the seed for every benchmark.
> KMeans( featuresCol='features', predictionCol='prediction', k=2, 
> initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, seed=None, 
> distanceMeasure='euclidean', )
> Here you can see the duration of each k-means clustering run together with the 
> iterations used to get a result. Spark is a lot slower than Dask on the overall 
> calculation, and also needs a lot more iterations. Interestingly, Spark is faster 
> per iteration (the slope of a regression line) and faster on initialization (the 
> y-intercept of the regression line). For the Spark benchmarks one can also make out 
> a second line, which I couldn't yet explain.
> [!https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png!|https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png]
> The training data is an equally spaced grid. The circles around the cluster 
> centers are the standard deviation. Clusters are overlapping and it is impossible 
> to get one hundred percent accuracy. The red markers are the predicted cluster 
> centers and the arrows show their corresponding cluster centers. In this example 
> the clustering is not correct: one cluster was on the wrong spot and two predicted 
> cluster centers share one cluster center. I can make these plots for all models.
> [!https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png!|https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png]
> The graph on the right makes everything much weirder. Apparently the Spark 
> implementation is less accurate than the Dask implementation. You can also see the 
> distribution of the duration and iterations much better (these are seaborn 
> boxenplots).
> [!https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png!|https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png]
> I'm using Anaconda for Windows and PySpark 2.4.5 and Dask 2.5.2.
> I filed this issue for [Dask|https://github.com/dask/dask-ml/issues/686] and 
> [Spark|https://issues.apache.org/jira/browse/SPARK-32107].
> Best regards
>  Julian
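
For reference, a hedged Spark MLlib sketch that mirrors the Dask configuration quoted above (k=8, 300 iterations, tol=1e-4) rather than the k=2 / maxIter=20 shown in the Spark snippet; {{features}} stands in for a DataFrame with an assembled "features" vector column:

{code:java}
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(8)            // match Dask's n_clusters=8
  .setMaxIter(300)    // match Dask's max_iter=300
  .setTol(1e-4)
  .setInitMode("k-means||")
  .setInitSteps(2)
val model = kmeans.fit(features)       // `features`: assumed input DataFrame
println(model.clusterCenters.length)   // 8 learned centers
{code}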



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32107) Dask faster than Spark with a lot less iterations and better accuracy

2020-08-25 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184050#comment-17184050
 ] 

Takeshi Yamamuro commented on SPARK-32107:
--

Could you ask the question in the mailing list, first? 
[https://spark.apache.org/community.html]

Basically, this is not a place to do so.

> Dask faster than Spark with a lot less iterations and better accuracy
> -
>
> Key: SPARK-32107
> URL: https://issues.apache.org/jira/browse/SPARK-32107
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.4.5
> Environment: Anaconda for Windows with PySpark 2.4.5
>Reporter: Julian
>Priority: Minor
>
> Hello,
> I'm benchmarking k-means clustering Dask versus Spark.
> Right now these are only benchmarks on my laptop, but I've some interesting 
> results and I'm looking for an explanation before I further benchmark this 
> algorithm on a cluster.
> I've logged the execution time, model cluster predictions, iterations. Both 
> benchmarks used the same data with 1.6 million rows.
> The questions are:
>  * Why does Spark need a lot more iterations than Dask?
>  * Why is clustering less accurate in Spark than in Dask?
> I'm unclear why those are different, because they both use the same 
> underlying algorithm and have more or less the same standard parameter.
> *Dask*
> KMeans( n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, 
> tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, 
> n_jobs=1, algorithm='full', init_max_iter=None, )
> *Spark*
>  I've set maxIter to 300 and reset the seed for every benchmark.
> KMeans( featuresCol='features', predictionCol='prediction', k=2, 
> initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, seed=None, 
> distanceMeasure='euclidean', )
> Here you can see the duration of each k-means clustering run together with the 
> iterations used to get a result. Spark is a lot slower than Dask on the overall 
> calculation, and also needs a lot more iterations. Interestingly, Spark is faster 
> per iteration (the slope of a regression line) and faster on initialization (the 
> y-intercept of the regression line). For the Spark benchmarks one can also make out 
> a second line, which I couldn't yet explain.
> [!https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png!|https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png]
> The training data is an equally spaced grid. The circles around the cluster 
> centers are the standard deviation. Clusters are overlapping and it is impossible 
> to get one hundred percent accuracy. The red markers are the predicted cluster 
> centers and the arrows show their corresponding cluster centers. In this example 
> the clustering is not correct: one cluster was on the wrong spot and two predicted 
> cluster centers share one cluster center. I can make these plots for all models.
> [!https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png!|https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png]
> The graph on the right makes everything much weirder. Apparently the Spark 
> implementation is less accurate than the Dask implementation. You can also see the 
> distribution of the duration and iterations much better (these are seaborn 
> boxenplots).
> [!https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png!|https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png]
> I'm using Anaconda for Windows and PySpark 2.4.5 and Dask 2.5.2.
> I filed this issue for [Dask|https://github.com/dask/dask-ml/issues/686] and 
> [Spark|https://issues.apache.org/jira/browse/SPARK-32107].
> Best regards
>  Julian



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Daeho Ro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184031#comment-17184031
 ] 

Daeho Ro edited comment on SPARK-32683 at 8/25/20, 1:27 PM:


I did not mean to change the doc, but rather the source, or to recover the 
DateFormatter 'W'. Anyway, the function is gone (it was even before), and the 
documentation is now clear, not confusing. 


was (Author: lamanus):
I did not mean to change the doc but the source or recover the DateFormatter W 
but anyway, the function is gone (even before) and the documentation is now 
clear, not confused. 

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
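
For anyone needing the calendar week-of-month from the expected table above, a hedged spark-shell workaround sketch that avoids the pattern letters entirely (Sunday-based weeks, anchored on the weekday of the 1st of the month):

{code:java}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("2020-08-01", "2020-08-02", "2020-08-08", "2020-08-09").toDF("date")
  .withColumn("date", to_timestamp(col("date"), "yyyy-MM-dd"))

// week = ceil((day-of-month + weekday of the month's first day - 1) / 7)
val withWeek = df.withColumn("week",
  ceil((dayofmonth(col("date")) + dayofweek(trunc(col("date"), "month")) - 1) / 7))
withWeek.show()  // weeks 1, 2, 2, 3 for these four dates, matching the table above
{code}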



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Daeho Ro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184031#comment-17184031
 ] 

Daeho Ro commented on SPARK-32683:
--

I did not mean to change the doc but the source or recover the DateFormatter W 
but anyway, the function is gone (even before) and the documentation is now 
clear, not confused. 

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32683:
---

Assignee: Kent Yao

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Assignee: Kent Yao
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32683.
-
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29538
[https://github.com/apache/spark/pull/29538]

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, that is a csv file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to the bug
> I have tested in the scala spark 3.0.0 and pyspark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column is not the week of the month. It is a day of the week as a 
> number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month 
> and from 2nd to 8th should have 2 and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
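For readers who need a calendar-style week-of-month on Spark 3.0, where datetime 
patterns follow java.time semantics, one workaround is to derive the week number 
from existing date functions instead of the 'F' pattern. A minimal Scala sketch, 
assuming a DataFrame named df with a timestamp column date (names are 
illustrative) and Sunday-started weeks as in the reporter's calendar:
{code:java}
import org.apache.spark.sql.functions._

// Week-of-month derived from the day of the month and the weekday of the 1st.
// dayofweek() returns 1 (Sunday) .. 7 (Saturday), so weeks start on Sunday here.
val withWeek = df.withColumn(
  "week",
  ceil((dayofmonth(col("date")) + dayofweek(trunc(col("date"), "MM")) - lit(1)) / 7)
)

withWeek.show(10, false)
// 2020-08-01 -> 1, 2020-08-02 .. 2020-08-08 -> 2, 2020-08-09 -> 3, ...
{code}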



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32698:


Assignee: (was: Apache Spark)

> Do not fall back to default parallelism if the minimum number of coalesced 
> partitions is not set in AQE
> ---
>
> Key: SPARK-32698
> URL: https://issues.apache.org/jira/browse/SPARK-32698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Currently in AQE when coalescing shuffling partitions,
> {quote}We fall back to Spark default parallelism if the minimum number of 
> coalesced partitions is not set, so to avoid perf regressions compared to no 
> coalescing.
> {quote}
> From our experience, this has resulted in a lot of uncertainty in the number 
> of tasks after coalescing, especially with dynamic allocation, and has also 
> led to many small output files. It's complex and hard to reason about.
> Hence, I'm proposing not falling back to the default parallelism but 
> coalescing towards the target size when the minimum number of coalesced 
> partitions is not set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32698:


Assignee: Apache Spark

> Do not fall back to default parallelism if the minimum number of coalesced 
> partitions is not set in AQE
> ---
>
> Key: SPARK-32698
> URL: https://issues.apache.org/jira/browse/SPARK-32698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently in AQE when coalescing shuffling partitions,
> {quote}We fall back to Spark default parallelism if the minimum number of 
> coalesced partitions is not set, so to avoid perf regressions compared to no 
> coalescing.
> {quote}
> From our experience, this has resulted in a lot of uncertainty in the number 
> of tasks after coalescing, especially with dynamic allocation, and has also 
> led to many small output files. It's complex and hard to reason about.
> Hence, I'm proposing not falling back to the default parallelism but 
> coalescing towards the target size when the minimum number of coalesced 
> partitions is not set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32698:


Assignee: (was: Apache Spark)

> Do not fall back to default parallelism if the minimum number of coalesced 
> partitions is not set in AQE
> ---
>
> Key: SPARK-32698
> URL: https://issues.apache.org/jira/browse/SPARK-32698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Currently in AQE when coalescing shuffling partitions,
> {quote}We fall back to Spark default parallelism if the minimum number of 
> coalesced partitions is not set, so to avoid perf regressions compared to no 
> coalescing.
> {quote}
> From our experience, this has resulted in a lot of uncertainty in the number 
> of tasks after coalescing, especially with dynamic allocation, and has also 
> led to many small output files. It's complex and hard to reason about.
> Hence, I'm proposing not falling back to the default parallelism but 
> coalescing towards the target size when the minimum number of coalesced 
> partitions is not set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184009#comment-17184009
 ] 

Apache Spark commented on SPARK-32698:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29540

> Do not fall back to default parallelism if the minimum number of coalesced 
> partitions is not set in AQE
> ---
>
> Key: SPARK-32698
> URL: https://issues.apache.org/jira/browse/SPARK-32698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Currently in AQE when coalescing shuffling partitions,
> {quote}We fall back to Spark default parallelism if the minimum number of 
> coalesced partitions is not set, so to avoid perf regressions compared to no 
> coalescing.
> {quote}
> From our experience, this has resulted in a lot of uncertainty in the number 
> of tasks after coalescing, especially with dynamic allocation, and has also 
> led to many small output files. It's complex and hard to reason about.
> Hence, I'm proposing not falling back to the default parallelism but 
> coalescing towards the target size when the minimum number of coalesced 
> partitions is not set.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32107) Dask faster than Spark with a lot less iterations and better accuracy

2020-08-25 Thread Julian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17184007#comment-17184007
 ] 

Julian commented on SPARK-32107:


Dear Spark-Team,

 

Will there be any effort to resolve this issue? Further investigation has 
confirmed the suspicion and shown that Spark's implementation is very 
unstable.

 

I want to give you the chance to react, before I publish the findings.

 

With best regards,

Julian

> Dask faster than Spark with a lot less iterations and better accuracy
> -
>
> Key: SPARK-32107
> URL: https://issues.apache.org/jira/browse/SPARK-32107
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.4.5
> Environment: Anaconda for Windows with PySpark 2.4.5
>Reporter: Julian
>Priority: Minor
>
> Hello,
> I'm benchmarking k-means clustering Dask versus Spark.
> Right now these are only benchmarks on my laptop, but I've some interesting 
> results and I'm looking for an explanation before I further benchmark this 
> algorithm on a cluster.
> I've logged the execution time, model cluster predictions, iterations. Both 
> benchmarks used the same data with 1.6 million rows.
> The questions are:
>  * Why does Spark need a lot more iterations than Dask?
>  * Why is clustering less accurate in Spark than in Dask?
> I'm unclear why these differ, because both use the same underlying algorithm 
> and have more or less the same default parameters.
> *Dask*
> KMeans( n_clusters=8, init='k-means||', oversampling_factor=2, max_iter=300, 
> tol=0.0001, precompute_distances='auto', random_state=None, copy_x=True, 
> n_jobs=1, algorithm='full', init_max_iter=None, )
> *Spark*
>  I've set maxIter to 300 and reset the seed for every benchmark.
> KMeans( featuresCol='features', predictionCol='prediction', k=2, 
> initMode='k-means||', initSteps=2, tol=0.0001, maxIter=20, seed=None, 
> distanceMeasure='euclidean', )
> Here you can see the execution duration of each k-means clustering run 
> together with the iterations needed to get a result. Spark is a lot slower 
> than Dask on the overall calculation, but it also needs a lot more iterations. 
> Interestingly, Spark is faster per iteration (the slope of a regression line) 
> and faster on initialization (the y-intercept of the regression line). For 
> the Spark benchmarks one can also make out a second line which I could not 
> yet explain.
> [!https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png!|https://user-images.githubusercontent.com/31596773/85844596-4564af00-b7a3-11ea-90fb-9c525d9afaad.png]
> The training data is an equally spaced grid. The circles around the cluster 
> centers show the standard deviation. The clusters overlap, so it is impossible 
> to get a hundred percent accuracy. The red markers are the predicted cluster 
> centers and the arrows point to their corresponding true cluster centers. In 
> this example the clustering is not correct: one predicted center is in the 
> wrong spot and two predicted centers share one true cluster center. I can 
> make these plots for all models.
> [!https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png!|https://user-images.githubusercontent.com/31596773/85845362-6974c000-b7a4-11ea-9709-4b32833fe238.png]
> The graph on the right makes everything much weirder. Apparently the Spark 
> implementation is less accurate than the Dask implementation. You can also 
> see the distribution of the durations and iterations much better (these are 
> seaborn boxenplots).
> [!https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png!|https://user-images.githubusercontent.com/31596773/85865158-c2088500-b7c5-11ea-83c2-dbd6808338a5.png]
> I'm using Anaconda for Windows and PySpark 2.4.5 and Dask 2.5.2.
> I filed this issue for [Dask|https://github.com/dask/dask-ml/issues/686] and 
> [Spark|https://issues.apache.org/jira/browse/SPARK-32107].
> Best regards
>  Julian
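For reference, a minimal Scala sketch of a Spark ML KMeans configured to mirror 
the Dask settings quoted above (k=8, k-means|| initialization, maxIter=300, 
tol=0.0001); the feature column name and the fixed seed are illustrative 
assumptions, not part of the original benchmark code:
{code:java}
import org.apache.spark.ml.clustering.KMeans

val kmeans = new KMeans()
  .setK(8)                         // n_clusters=8
  .setInitMode("k-means||")        // init='k-means||'
  .setInitSteps(2)                 // Spark's default number of k-means|| steps
  .setMaxIter(300)                 // max_iter=300
  .setTol(1e-4)                    // tol=0.0001
  .setDistanceMeasure("euclidean") // distanceMeasure='euclidean'
  .setSeed(42L)                    // illustrative fixed seed per run
  .setFeaturesCol("features")      // assumes an assembled feature vector column

// val model = kmeans.fit(trainingData)  // trainingData: DataFrame with a "features" column
{code}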



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2020-08-25 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-32698:
--

 Summary: Do not fall back to default parallelism if the minimum 
number of coalesced partitions is not set in AQE
 Key: SPARK-32698
 URL: https://issues.apache.org/jira/browse/SPARK-32698
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


Currently in AQE when coalescing shuffling partitions,
{quote}We fall back to Spark default parallelism if the minimum number of 
coalesced partitions is not set, so to avoid perf regressions compared to no 
coalescing.
{quote}
From our experience, this has resulted in a lot of uncertainty in the number 
of tasks after coalescing, especially with dynamic allocation, and has also led 
to many small output files. It's complex and hard to reason about.

Hence, I'm proposing not falling back to the default parallelism but coalescing 
towards the target size when the minimum number of coalesced partitions is not 
set.
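For context, a minimal sketch of the AQE knobs involved (configuration names as 
of Spark 3.0; the values are illustrative): setting the minimum number of 
coalesced partitions explicitly already avoids the default-parallelism fallback, 
and the proposal above would make the advisory target size the only driver when 
that minimum is left unset.
{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-coalesce-sketch")
  // Enable adaptive execution and partition coalescing.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // With this set, coalescing does not fall back to the default parallelism.
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
  // Advisory target size per coalesced partition.
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
  .getOrCreate()
{code}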



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32697) Direct Date and timestamp format data insertion fails

2020-08-25 Thread Chetan Bhat (Jira)
Chetan Bhat created SPARK-32697:
---

 Summary: Direct Date and timestamp format data insertion fails
 Key: SPARK-32697
 URL: https://issues.apache.org/jira/browse/SPARK-32697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: Spark 3.0.0
Reporter: Chetan Bhat


Inserting date and timestamp values directly as plain string literals fails, as 
shown below.

spark-sql> create table test(no timestamp) stored as parquet;
Time taken: 0.561 seconds


spark-sql> insert into test select '1979-04-27 00:00:00';
Error in query: Cannot write incompatible data to table '`default`.`test`':
- Cannot safely cast 'no': string to timestamp;
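A minimal sketch of two ways to make the insert type-compatible under the 
default ANSI store-assignment policy, assuming the same `test` table from the 
report (run from a Scala session; the equivalent statements work in spark-sql):
{code:java}
// 1. Use a typed timestamp literal instead of a plain string.
spark.sql("insert into test select timestamp'1979-04-27 00:00:00'")

// 2. Or cast the string to timestamp explicitly.
spark.sql("insert into test select cast('1979-04-27 00:00:00' as timestamp)")
{code}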



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32696) Get columns operation should handle interval column properly

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183911#comment-17183911
 ] 

Apache Spark commented on SPARK-32696:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29539

> Get columns operation should handle interval column properly
> 
>
> Key: SPARK-32696
> URL: https://issues.apache.org/jira/browse/SPARK-32696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Views can contain interval columns, which should be handled properly via 
> SparkGetColumnsOperation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32696) Get columns operation should handle interval column properly

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183910#comment-17183910
 ] 

Apache Spark commented on SPARK-32696:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29539

> Get columns operation should handle interval column properly
> 
>
> Key: SPARK-32696
> URL: https://issues.apache.org/jira/browse/SPARK-32696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Views can contain interval columns, which should be handled properly via 
> SparkGetColumnsOperation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32696) Get columns operation should handle interval column properly

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32696:


Assignee: Apache Spark

> Get columns operation should handle interval column properly
> 
>
> Key: SPARK-32696
> URL: https://issues.apache.org/jira/browse/SPARK-32696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> Views can contain interval columns, which should be handled properly via 
> SparkGetColumnsOperation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32696) Get columns operation should handle interval column properly

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32696:


Assignee: (was: Apache Spark)

> Get columns operation should handle interval column properly
> 
>
> Key: SPARK-32696
> URL: https://issues.apache.org/jira/browse/SPARK-32696
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Views can contain interval columns, which should be handled properly via 
> SparkGetColumnsOperation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32696) Get columns operation should handle interval column properly

2020-08-25 Thread Kent Yao (Jira)
Kent Yao created SPARK-32696:


 Summary: Get columns operation should handle interval column 
properly
 Key: SPARK-32696
 URL: https://issues.apache.org/jira/browse/SPARK-32696
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


Views can contain interval columns, which should be handled properly via 
SparkGetColumnsOperation.
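For illustration, a minimal sketch of the kind of view the get-columns metadata 
path has to describe (view and column names are arbitrary; assumes a Scala 
session):
{code:java}
// The single column has CalendarIntervalType, which SparkGetColumnsOperation
// (the Thrift server's getColumns handler) needs to report with a sensible type.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW interval_view AS SELECT interval 1 day AS i")
spark.table("interval_view").printSchema()  // column i is shown with type "interval"
{code}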



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32683:


Assignee: Apache Spark

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Assignee: Apache Spark
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, which is a CSV file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to reproduce the bug
> I have tested this in Scala Spark 3.0.0 and PySpark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column does not contain the week of the month. It cycles from 1 to 
> 7 over the days of the month instead of giving a week number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month, 
> and the 2nd through the 8th should have 2, and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32683:


Assignee: (was: Apache Spark)

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, which is a CSV file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to reproduce the bug
> I have tested this in Scala Spark 3.0.0 and PySpark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column does not contain the week of the month. It cycles from 1 to 
> 7 over the days of the month instead of giving a week number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month, 
> and the 2nd through the 8th should have 2, and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183770#comment-17183770
 ] 

Apache Spark commented on SPARK-32683:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29538

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, which is a CSV file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to reproduce the bug
> I have tested this in Scala Spark 3.0.0 and PySpark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column does not contain the week of the month. It cycles from 1 to 
> 7 over the days of the month instead of giving a week number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month, 
> and the 2nd through the 8th should have 2, and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32683) Datetime Pattern F not working as expected

2020-08-25 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183769#comment-17183769
 ] 

Apache Spark commented on SPARK-32683:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/29538

> Datetime Pattern F not working as expected
> --
>
> Key: SPARK-32683
> URL: https://issues.apache.org/jira/browse/SPARK-32683
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Windows 10 Pro
>  * with Jupyter Lab - Docker Image 
>  ** jupyter/all-spark-notebook:f1811928b3dd 
>  *** spark 3.0.0
>  *** python 3.8.5
>  *** openjdk 11.0.8
>Reporter: Daeho Ro
>Priority: Major
> Attachments: comment.png
>
>
> h3. Background
> From the 
> [documentation|https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html],
>  the pattern F should give a week of the month.
> |*Symbol*|*Meaning*|*Presentation*|*Example*|
> |F|week-of-month|number(1)|3|
> h3. Test Data
> Here is my test data, which is a CSV file.
> {code:java}
> date
> 2020-08-01
> 2020-08-02
> 2020-08-03
> 2020-08-04
> 2020-08-05
> 2020-08-06
> 2020-08-07
> 2020-08-08
> 2020-08-09
> 2020-08-10 {code}
> h3. Steps to reproduce the bug
> I have tested this in Scala Spark 3.0.0 and PySpark 3.0.0:
> {code:java}
> // Spark
> df.withColumn("date", to_timestamp('date, "yyyy-MM-dd"))
>   .withColumn("week", date_format('date, "F")).show
> +-------------------+----+
> |               date|week|
> +-------------------+----+
> |2020-08-01 00:00:00|   1|
> |2020-08-02 00:00:00|   2|
> |2020-08-03 00:00:00|   3|
> |2020-08-04 00:00:00|   4|
> |2020-08-05 00:00:00|   5|
> |2020-08-06 00:00:00|   6|
> |2020-08-07 00:00:00|   7|
> |2020-08-08 00:00:00|   1|
> |2020-08-09 00:00:00|   2|
> |2020-08-10 00:00:00|   3|
> +-------------------+----+
> # pyspark
> df.withColumn('date', to_timestamp('date', 'yyyy-MM-dd')) \
>   .withColumn('week', date_format('date', 'F')) \
>   .show(10, False)
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|3   |
> |2020-08-04 00:00:00|4   |
> |2020-08-05 00:00:00|5   |
> |2020-08-06 00:00:00|6   |
> |2020-08-07 00:00:00|7   |
> |2020-08-08 00:00:00|1   |
> |2020-08-09 00:00:00|2   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}
> h3. Expected result
> The `week` column does not contain the week of the month. It cycles from 1 to 
> 7 over the days of the month instead of giving a week number.
>   !comment.png!
> From my calendar, the first day of August should have 1 for the week-of-month, 
> and the 2nd through the 8th should have 2, and so on.
> {code:java}
> +-------------------+----+
> |date               |week|
> +-------------------+----+
> |2020-08-01 00:00:00|1   |
> |2020-08-02 00:00:00|2   |
> |2020-08-03 00:00:00|2   |
> |2020-08-04 00:00:00|2   |
> |2020-08-05 00:00:00|2   |
> |2020-08-06 00:00:00|2   |
> |2020-08-07 00:00:00|2   |
> |2020-08-08 00:00:00|2   |
> |2020-08-09 00:00:00|3   |
> |2020-08-10 00:00:00|3   |
> +-------------------+----+{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32500) Query and Batch Id not set for Structured Streaming Jobs in case of ForeachBatch in PySpark

2020-08-25 Thread Abhishek Dixit (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183762#comment-17183762
 ] 

Abhishek Dixit commented on SPARK-32500:


Thanks [~JinxinTang] and [~hyukjin.kwon] !!

> Query and Batch Id not set for Structured Streaming Jobs in case of 
> ForeachBatch in PySpark
> ---
>
> Key: SPARK-32500
> URL: https://issues.apache.org/jira/browse/SPARK-32500
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Structured Streaming
>Affects Versions: 2.4.6
>Reporter: Abhishek Dixit
>Priority: Major
> Attachments: Screen Shot 2020-07-26 at 6.50.39 PM.png, Screen Shot 
> 2020-07-30 at 9.04.21 PM.png, image-2020-08-01-10-21-51-246.png
>
>
> Query ID and batch ID information is not available for jobs started by a 
> structured streaming query when the _foreachBatch_ API is used in PySpark.
> This happens only with foreachBatch in PySpark. foreachBatch in Scala works 
> fine, and other structured streaming sinks in PySpark also work fine. I am 
> attaching a screenshot of the jobs pages.
> I think the job group is not set properly when _foreachBatch_ is used via 
> PySpark. I have a framework that depends on the _queryId_ and _batchId_ 
> information available in the job properties, and so my framework doesn't work 
> for the PySpark foreachBatch use case.
>  
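For reference, a minimal Scala sketch of the foreachBatch API being discussed: 
the batch id is passed straight into the callback, independent of the job-group 
properties that are reported missing on the PySpark side. The source and sink 
here are illustrative placeholders.
{code:java}
import org.apache.spark.sql.DataFrame

// An explicitly typed function value keeps the call unambiguous between the
// Scala and Java foreachBatch overloads.
val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch.write.mode("append").parquet(s"/tmp/out/batch=$batchId")  // illustrative sink

val query = streamingDf.writeStream   // streamingDf: a streaming DataFrame (placeholder)
  .foreachBatch(writeBatch)
  .start()
{code}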



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


