[jira] [Resolved] (SPARK-30730) Wrong results of `convertTz` for different session and system time zones

2020-02-06 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-30730.

Resolution: Won't Fix

> Wrong results of `convertTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are 
> cast to TimestampType using the JVM system time zone, but in fact the session 
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in 
> the casting. This leads to wrong results of from_utc_timestamp and 
> to_utc_timestamp when the session time zone is different from the JVM time 
> zone. The issue can be reproduced by the code below:
> {code:java}
>   test("to_utc_timestamp in various system and session time zones") {
> val localTs = "2020-02-04T22:42:10"
> val defaultTz = TimeZone.getDefault
> try {
>   DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
> TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
> DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>   withSQLConf(
> SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
> SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
> DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>   val instant = LocalDateTime
> .parse(localTs)
> .atZone(DateTimeUtils.getZoneId(toTz))
> .toInstant
>   val df = Seq(localTs).toDF("localTs")
>   val res = df.select(to_utc_timestamp(col("localTs"), 
> toTz)).first().apply(0)
>   if (instant != res) {
> println(s"system = $systemTz session = $sessionTz to = $toTz")
>   }
> }
>   }
> }
>   }
> } catch {
>   case NonFatal(_) => TimeZone.setDefault(defaultTz)
> }
>   }
> {code}
> {code:java}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}
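
For reference, the mismatch above only shows up when the JVM default time zone differs from the session time zone, so a minimal way to sidestep it (a sketch, not part of the original report) is to align the two explicitly:
{code:java}
import java.util.TimeZone

// Keep the SQL session time zone in sync with the JVM default so that casting
// timestamp strings and the convertTz-based functions agree on the zone.
spark.conf.set("spark.sql.session.timeZone", TimeZone.getDefault.getID)
{code}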






[jira] [Resolved] (SPARK-30708) first_value/last_value window function throws ParseException

2020-02-06 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-30708.

Resolution: Won't Fix

> first_value/last_value window function throws ParseException
> 
>
> Key: SPARK-30708
> URL: https://issues.apache.org/jira/browse/SPARK-30708
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> first_value/last_value throws ParseException
>  
> {code:java}
> SELECT first_value(unique1) over w,
> last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
>  
> org.apache.spark.sql.catalyst.parser.ParseException
>  
> no viable alternative at input 'first_value'(line 1, pos 7)
>  
> == SQL ==
> SELECT first_value(unique1) over w,
> ---^^^
> last_value(unique1) over w, unique1, four
> FROM tenk1 WHERE unique1 < 10
> WINDOW w AS (order by four range between current row and unbounded following)
> {code}
>  
> Maybe we need to fix this issue.






[jira] [Commented] (SPARK-30619) org.slf4j.Logger and org.apache.commons.collections classes not built as part of hadoop-provided profile

2020-02-06 Thread Abhishek Rao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032106#comment-17032106
 ] 

Abhishek Rao commented on SPARK-30619:
--

Hi,

Any updates on this?

> org.slf4j.Logger and org.apache.commons.collections classes not built as part 
> of hadoop-provided profile
> 
>
> Key: SPARK-30619
> URL: https://issues.apache.org/jira/browse/SPARK-30619
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.2, 2.4.4
> Environment: Spark on kubernetes
>Reporter: Abhishek Rao
>Priority: Major
>
> We're using spark-2.4.4-bin-without-hadoop.tgz and executing the Java word count 
> example (org.apache.spark.examples.JavaWordCount) on local files.
> But we're seeing that it expects the org.slf4j.Logger and 
> org.apache.commons.collections classes to be available for executing this.
> We expected the binary to work as-is for local files. Is there anything 
> we're missing?






[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032102#comment-17032102
 ] 

Hyukjin Kwon commented on SPARK-30712:
--

SPARK-24914 JIRA and PR are still open.

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes, this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.
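
As a rough illustration of the proposal (a sketch only; the per-field widths are assumptions, not Spark's actual accounting), the in-memory size could be derived from the Parquet row count and the schema instead of a flat compression factor:
{code:java}
import org.apache.spark.sql.types._

// Hypothetical helper (not Spark code): estimate the in-memory size as
// rowCount * estimated UnsafeRow width derived from the schema.
def estimateSizeInBytes(rowCount: Long, schema: StructType): Long = {
  val fixedWidth = schema.fields.map { f =>
    f.dataType match {
      case LongType | DoubleType | TimestampType => 8L
      case IntegerType | FloatType | DateType    => 8L  // fixed-width values still occupy an 8-byte slot
      case StringType                            => 24L // assumed average payload plus the offset/length word
      case _                                     => 16L // rough fallback for other types
    }
  }.sum
  val nullBitset = ((schema.length + 63) / 64) * 8L
  rowCount * (nullBitset + fixedWidth)
}
{code}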






[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-06 Thread liupengcheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032076#comment-17032076
 ] 

liupengcheng commented on SPARK-30712:
--

[~hyukjin.kwon] SPARK-24914 seems to be closed already; I left some comments.

I also created another related Jira: 

[SPARK-30394|https://issues.apache.org/jira/browse/SPARK-30394]

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes, this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.






[jira] [Updated] (SPARK-30394) Skip collecting stats in DetermineTableStats rule when hive table is convertible to datasource tables

2020-02-06 Thread liupengcheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-30394:
-
Description: 
Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will 
scan HDFS files to collect table stats in the `DetermineTableStats` rule. But this 
can be expensive and inaccurate (it only measures the file size on disk, without 
accounting for the compression factor). Actually we can skip this if the hive table 
can be converted to a datasource table (parquet etc.), and do a better estimation in 
`HadoopFsRelation`.

Before SPARK-28573, the implementation would update the CatalogTableStatistics, 
which caused the improper stats (for parquet, this size is much smaller than the 
real size in memory) to be used in joinSelection when the hive table can be 
converted to a datasource table.

In our production environment, users' highly compressed parquet tables can cause 
OOMs when doing `broadcastHashJoin` due to these improper stats.

  was:
Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will 
scan HDFS files to collect table stats in the `DetermineTableStats` rule. But this 
can be expensive in some cases; actually we can skip this if the hive table 
can be converted to a datasource table (parquet etc.).

Before [SPARK-28573|https://issues.apache.org/jira/browse/SPARK-28573], the 
implementation would update the CatalogTableStatistics, which caused the 
improper stats to be used in joinSelection when the hive table can be converted to a 
datasource table.

In our production environment, users' highly compressed parquet tables can cause 
OOMs when doing `broadcastHashJoin` due to these improper stats.


> Skip collecting stats in DetermineTableStats rule when hive table is 
> convertible to datasource tables
> --
>
> Key: SPARK-30394
> URL: https://issues.apache.org/jira/browse/SPARK-30394
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.2, 3.0.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, if `spark.sql.statistics.fallBackToHdfs` is enabled, Spark will 
> scan HDFS files to collect table stats in the `DetermineTableStats` rule. 
> But this can be expensive and inaccurate (it only measures the file size on disk, 
> without accounting for the compression factor). Actually we can skip this if the 
> hive table can be converted to a datasource table (parquet etc.), and do a better 
> estimation in `HadoopFsRelation`.
> Before SPARK-28573, the implementation would update the CatalogTableStatistics, 
> which caused the improper stats (for parquet, this size is much smaller 
> than the real size in memory) to be used in joinSelection when the hive table can be 
> converted to a datasource table.
> In our production environment, users' highly compressed parquet tables can 
> cause OOMs when doing `broadcastHashJoin` due to these improper stats.
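
For context, a quick way to see which estimate the planner actually uses for join selection (a sketch; the table name is illustrative):
{code:java}
// Compare the estimated size the optimizer uses with the HDFS size fallback enabled.
spark.sql("SET spark.sql.statistics.fallBackToHdfs=true")
val plan = spark.table("compressed_parquet_table").queryExecution.optimizedPlan
// If this estimate falls below spark.sql.autoBroadcastJoinThreshold,
// the table becomes a broadcast candidate.
println(plan.stats.sizeInBytes)
{code}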






[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-06 Thread liupengcheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032066#comment-17032066
 ] 

liupengcheng commented on SPARK-30712:
--

OK, thanks! [~hyukjin.kwon].

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes, this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.






[jira] [Resolved] (SPARK-30735) Improving writing performance by adding repartition based on columns to partitionBy for DataFrameWriter

2020-02-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30735.
--
Resolution: Won't Fix

> Improving writing performance by adding repartition based on columns to 
> partitionBy for DataFrameWriter
> ---
>
> Key: SPARK-30735
> URL: https://issues.apache.org/jira/browse/SPARK-30735
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
> Environment: * Spark-3.0.0
>  * Scala: version 2.12.10
>  * sbt 0.13.18, script ver: 1.3.7 (Built using sbt)
>  * Java: 1.8.0_231
>  ** Java(TM) SE Runtime Environment (build 1.8.0_231-b11)
>  ** Java HotSpot(TM) 64-Bit Server VM (build 25.231-b11, mixed mode)
>Reporter: Tomohiro Tanaka
>Priority: Trivial
>  Labels: performance, pull-request-available
> Attachments: repartition-before-partitionby.png
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> h1. New functionality for {{partitionBy}}
> To enhance performance when using {{partitionBy}}, calling the {{repartition}} method 
> on the partition columns before calling {{partitionBy}} helps a lot. I added a new 
> overload of {{partitionBy}} that takes a boolean flag before the columns (see the 
> example below).
>  
> h2. Problems when not using {{repartition}} before {{partitionBy}}.
> When using {{partitionBy}}, the following problems happen because the rows for the 
> specified columns end up spread across many partitions.
>  * A Spark application which includes {{partitionBy}} takes much longer 
> (for example, [partitionBy taking too long while saving a dataset 
> on S3 using Pyspark - Stack 
> Overflow|https://stackoverflow.com/questions/56496387/partitionby-taking-too-long-while-saving-a-dataset-on-s3-using-pyspark]).
>  * When using {{partitionBy}}, memory usage increases much more than when 
> not using {{partitionBy}} (I tested this with Spark 2.4.3).
>  * Additional information about the memory usage impact of {{partitionBy}}: please 
> check the attachment (the left figure shows "using partitionBy", the other 
> shows "not using partitionBy").
> h2. How to use?
> It's very simple. If you want to apply the {{repartition}} method before 
> {{partitionBy}}, you just specify {color:#0747a6}{{true}}{color} as the first 
> argument of {{partitionBy}}.
> Example:
> {code:java}
> val df  = spark.read.format("csv").option("header", true).load()
> df.write.format("json").partitionBy(true, columns).save(){code}
>  
>  
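
For comparison, roughly the same effect is achievable today with an explicit repartition on the partition columns before the write (a sketch; paths and column names are illustrative):
{code:java}
import org.apache.spark.sql.functions.col

// Co-locate the rows of each output partition before the write so that each
// task writes far fewer files per partition directory.
val df = spark.read.format("csv").option("header", true).load("/path/to/input")
df.repartition(col("year"), col("month"))
  .write.format("json")
  .partitionBy("year", "month")
  .save("/path/to/output")
{code}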






[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-06 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17032036#comment-17032036
 ] 

Hyukjin Kwon commented on SPARK-30712:
--

SPARK-24914 is trying to add the base for this mechanism in general. You should 
probably take a look at the PRs and help review them first.

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes, this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.






[jira] [Commented] (SPARK-30732) BroadcastExchangeExec does not fully honor "spark.broadcast.compress"

2020-02-06 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031996#comment-17031996
 ] 

L. C. Hsieh commented on SPARK-30732:
-

`getByteArrayRdd` is not used only there. And as the comment says, UnsafeRow is 
highly compressible, so it does not sound like a good idea to disable compression there.

I think `spark.broadcast.compress` provides an option to disable compression 
because you might have objects in an RDD that are hardly compressible.

For `getByteArrayRdd`, the purpose is to collect highly compressible rows back to the driver.
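
For reference, the existing flag is a core Spark setting that only governs the broadcast blocks sent from the driver to executors; a minimal way to set it (a sketch, values illustrative):
{code:java}
import org.apache.spark.sql.SparkSession

// Disables compression of broadcast blocks shipped to executors; it does not
// change the compressed stream that getByteArrayRdd produces on the collect path.
val spark = SparkSession.builder()
  .appName("broadcast-compress-demo")
  .config("spark.broadcast.compress", "false")
  .getOrCreate()
{code}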

> BroadcastExchangeExec does not fully honor "spark.broadcast.compress"
> -
>
> Key: SPARK-30732
> URL: https://issues.apache.org/jira/browse/SPARK-30732
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Puneet
>Priority: Major
>
> Setting {{spark.broadcast.compress}} to false disables compression while 
> sending broadcast variables to executors 
> ([https://spark.apache.org/docs/latest/configuration.html#compression-and-serialization]).
> However, this does not disable compression for any child relations sent by the 
> executors to the driver. 
> Setting spark.broadcast.compress to false should disable both sides of the 
> traffic, allowing users to disable compression for the whole broadcast join, 
> for example.
> [https://github.com/puneetguptanitj/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L89]
> Here, `executeCollectIterator` calls `getByteArrayRdd`, which by default 
> always gets a compressed stream.






[jira] [Commented] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-02-06 Thread Sunitha Kambhampati (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031938#comment-17031938
 ] 

Sunitha Kambhampati commented on SPARK-27298:
-

Thanks for trying it out in your environment. That is good to know that you are 
getting the right result on Spark 2.4.4 and not on Spark 2.3.0.

Based on that, I ran this test on Spark 2.3.0 in my Linux environment and I can 
see the wrong count. I generated the explain output to debug this, and the plan is 
optimized to:

Spark 2.3.0

{code:java}
== Optimized Logical Plan ==
Aggregate [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, 
Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101], [CustID#92, 
DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, 
Surname#99, Telephone#100L, Title#101]
+- Filter ((isnotnull(Gender#94) && (Gender#94 = M)) && NOT Income#96 IN 
(503.65,495.54,486.82,481.28,479.79))
   +- 
Relation[CustID#92,DOB#93,Gender#94,HouseholdID#95,Income#96,Initials#97,Occupation#98,Surname#99,Telephone#100L,Title#101]
 parquet
{code}

With Spark 2.3.3, where it generates the correct count, I see the optimized 
plan:
{code:java}
== Optimized Logical Plan ==
Aggregate [CustID#92, DOB#93, Gender#94, HouseholdID#95, Income#96, 
Initials#97, Occupation#98, Surname#99, Telephone#100L, Title#101], [CustID#92, 
DOB#93, Gender#94, HouseholdID#95, Income#96, Initials#97, Occupation#98, 
Surname#99, Telephone#100L, Title#101]
+- Filter ((isnotnull(Gender#94) && (Gender#94 = M)) && NOT coalesce(Income#96 
IN (503.65,495.54,486.82,481.28,479.79), false))
   +- 
Relation[CustID#92,DOB#93,Gender#94,HouseholdID#95,Income#96,Initials#97,Occupation#98,Surname#99,Telephone#100L,Title#101]
 parquet
{code}

Observations:

This issue is caused by rows with null Income values being filtered out 
incorrectly. This comes from the optimizer rule 'ReplaceExceptWithFilter'.

This bug was fixed in SPARK-26366 and backported to Spark 2.3.3.

-

There doesn't seem to be anything OS-specific in this fix, so I am not sure 
why you were seeing the correct count on Windows with Spark 2.3.0 (which has the 
bug). For this, we will need the explain output and also the count of how many 
rows have a null Income for Gender = 'M' on Windows (as mentioned in 
my earlier comments).
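
To illustrate why the two plans behave differently (a minimal sketch with hypothetical data, not the reporter's dataset): `NOT Income IN (...)` evaluates to NULL for NULL incomes and the filter drops those rows, whereas `NOT coalesce(Income IN (...), false)` keeps them, which matches EXCEPT semantics.
{code:java}
// Hypothetical data: two non-null incomes and one NULL income.
val df = spark.sql(
  "SELECT Income FROM VALUES (503.65), (100.0), (CAST(NULL AS DOUBLE)) AS t(Income)")
df.where("NOT Income IN (503.65, 495.54)").show()                   // the NULL row is dropped
df.where("NOT coalesce(Income IN (503.65, 495.54), false)").show()  // the NULL row is kept
{code}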

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Blocker
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> Linux-spark-2.3.0_result.txt, Linux-spark-2.4.4_result.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
> public class ExcludeInTesting {
> public static void main(String[] args) {
> SparkSession spark = SparkSession.builder()
> .appName("ExcludeInTesting")
> .config("spark.some.config.option", "some-value")
> .getOrCreate();
> Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
> .option("header", "true")
> .option("delimiter", "|")
> .option("inferSchema", "true")
> //.load("E:/resources/customer.csv"); // local path; the path below is for the VM
> .load("/home/myproject/bda/home/bin/customer.csv");
> dataReadFromCSV.printSchema();
> dataReadFromCSV.show();
> // Adding an extra step of saving to a table and then loading it again
> dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
> Dataset<Row> dataLoaded = spark.sql("select * from customer");
> // Gender EQ M
> Column genderCol = dataLoaded.col("Gender");
> Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
> //Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where 
> Gender='M'");
> onlyMaleDS.show();
> System.out.println("The count of Male customers is :"+ onlyMaleDS.count());
> System.out.println("*");
> // Income in the list
> Object[] valuesArray = new Object[5];
> valuesArray[0]=503.65;
> valuesArray[1]=495.54;
> valuesArray[2]=486.82;
> valuesArray[3]=481.28;
> valuesArray[4]=479.79;
> 

[jira] [Commented] (SPARK-24655) [K8S] Custom Docker Image Expectations and Documentation

2020-02-06 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031870#comment-17031870
 ] 

Thomas Graves commented on SPARK-24655:
---

some other discussions on this from https://github.com/apache/spark/pull/23347

> [K8S] Custom Docker Image Expectations and Documentation
> 
>
> Key: SPARK-24655
> URL: https://issues.apache.org/jira/browse/SPARK-24655
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.1
>Reporter: Matt Cheah
>Priority: Major
>
> A common use case we want to support with Kubernetes is the usage of custom 
> Docker images. Some examples include:
>  * A user builds an application using Gradle or Maven, using Spark as a 
> compile-time dependency. The application's jars (both the custom-written jars 
> and the dependencies) need to be packaged in a docker image that can be run 
> via spark-submit.
>  * A user builds a PySpark or R application and desires to include custom 
> dependencies
>  * A user wants to switch the base image from Alpine to CentOS while using 
> either built-in or custom jars
> We currently do not document how these custom Docker images are supposed to 
> be built, nor do we guarantee stability of these Docker images with various 
> spark-submit versions. To illustrate how this can break down, suppose for 
> example we decide to change the names of environment variables that denote 
> the driver/executor extra JVM options specified by 
> {{spark.[driver|executor].extraJavaOptions}}. If we change the environment 
> variable spark-submit provides then the user must update their custom 
> Dockerfile and build new images.
> Rather than jumping to an implementation immediately though, it's worth 
> taking a step back and considering these matters from the perspective of the 
> end user. Towards that end, this ticket will serve as a forum where we can 
> answer at least the following questions, and any others pertaining to the 
> matter:
>  # What would be the steps a user would need to take to build a custom Docker 
> image, given their desire to customize the dependencies and the content (OS 
> or otherwise) of said images?
>  # How can we ensure the user does not need to rebuild the image if only the 
> spark-submit version changes?
> The end deliverable for this ticket is a design document, and then we'll 
> create sub-issues for the technical implementation and documentation of the 
> contract.






[jira] [Commented] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-02-06 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031860#comment-17031860
 ] 

Xiao Li commented on SPARK-30668:
-

I think this is still not resolved. Spark 3.0 should not silently return a 
wrong result for a query whose pattern was right in the previous versions. I 
did not see the fallback mentioned by [~cloud_fan].

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but returns NULL in the latest 
> master.
> **2.4.5 RC2**
> {code}
> scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show
> +-----------------------------------------------------------------------------+
> |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')|
> +-----------------------------------------------------------------------------+
> |                                                          2020-01-27 20:06:11|
> +-----------------------------------------------------------------------------+
> {code}
> **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 doesn't have `to_timestamp`).
> {code}
> spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz");
> 2020-01-27 20:06:11
> {code}






[jira] [Reopened] (SPARK-30668) to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern "yyyy-MM-dd'T'HH:mm:ss.SSSz"

2020-02-06 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-30668:
-

> to_timestamp failed to parse 2020-01-27T20:06:11.847-0800 using pattern 
> "-MM-dd'T'HH:mm:ss.SSSz"
> 
>
> Key: SPARK-30668
> URL: https://issues.apache.org/jira/browse/SPARK-30668
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Blocker
> Fix For: 3.0.0
>
>
> {code:java}
> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")
> {code}
> This can return a valid value in Spark 2.4 but returns NULL in the latest 
> master.
> **2.4.5 RC2**
> {code}
> scala> sql("""SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "yyyy-MM-dd'T'HH:mm:ss.SSSz")""").show
> +-----------------------------------------------------------------------------+
> |to_timestamp('2020-01-27T20:06:11.847-0800', 'yyyy-MM-dd\'T\'HH:mm:ss.SSSz')|
> +-----------------------------------------------------------------------------+
> |                                                          2020-01-27 20:06:11|
> +-----------------------------------------------------------------------------+
> {code}
> **2.2.3 ~ 2.4.4** (2.0.2 ~ 2.1.3 doesn't have `to_timestamp`).
> {code}
> spark-sql> SELECT to_timestamp("2020-01-27T20:06:11.847-0800", 
> "-MM-dd'T'HH:mm:ss.SSSz");
> 2020-01-27 20:06:11
> {code}






[jira] [Commented] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day

2020-02-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031834#comment-17031834
 ] 

Dongjoon Hyun commented on SPARK-30752:
---

The next release 2.4.6 is scheduled but there is no release manager for that 
yet.

> Wrong result of to_utc_timestamp() on daylight saving day
> -
>
> Key: SPARK-30752
> URL: https://issues.apache.org/jira/browse/SPARK-30752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The to_utc_timestamp() function returns a wrong result when:
> * the JVM system time zone is PST
> * the session local time zone is UTC
> * fromZone is Asia/Hong_Kong
> For the local timestamp '2019-11-03T12:00:00', the result must be 
> '2019-11-03T04:00:00':
> {code}
> scala> import java.util.TimeZone
> import java.util.TimeZone
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> TimeZone.setDefault(getTimeZone("PST"))
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs")
> df: org.apache.spark.sql.DataFrame = [localTs: string]
> scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show
> +-----------------------------------------+
> |to_utc_timestamp(localTs, Asia/Hong_Kong)|
> +-----------------------------------------+
> |                      2019-11-03 03:00:00|
> +-----------------------------------------+
> {code}
>  
> See 
> https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22






[jira] [Commented] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day

2020-02-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031833#comment-17031833
 ] 

Dongjoon Hyun commented on SPARK-30752:
---

Thanks, but you don't need to ping me from today on because I'm not a release 
manager any more, [~maxgekk]. :)

> Wrong result of to_utc_timestamp() on daylight saving day
> -
>
> Key: SPARK-30752
> URL: https://issues.apache.org/jira/browse/SPARK-30752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The to_utc_timestamp() function returns a wrong result when:
> * the JVM system time zone is PST
> * the session local time zone is UTC
> * fromZone is Asia/Hong_Kong
> For the local timestamp '2019-11-03T12:00:00', the result must be 
> '2019-11-03T04:00:00':
> {code}
> scala> import java.util.TimeZone
> import java.util.TimeZone
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> TimeZone.setDefault(getTimeZone("PST"))
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs")
> df: org.apache.spark.sql.DataFrame = [localTs: string]
> scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show
> +-----------------------------------------+
> |to_utc_timestamp(localTs, Asia/Hong_Kong)|
> +-----------------------------------------+
> |                      2019-11-03 03:00:00|
> +-----------------------------------------+
> {code}
>  
> See 
> https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22






[jira] [Resolved] (SPARK-30719) AQE should not issue a "not supported" warning for queries being by-passed

2020-02-06 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-30719.
-
Fix Version/s: 3.0.0
 Assignee: Wenchen Fan
   Resolution: Fixed

> AQE should not issue a "not supported" warning for queries being by-passed
> --
>
> Key: SPARK-30719
> URL: https://issues.apache.org/jira/browse/SPARK-30719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.0.0
>
>
> This is a follow up for [https://github.com/apache/spark/pull/26813].
> AQE bypasses queries that don't have exchanges or subqueries. This is not a 
> limitation and it is different from queries that are not supported in AQE. 
> Issuing a warning in this case can be confusing and annoying.
> It would also be good to add an internal conf for this bypassing behavior.






[jira] [Commented] (SPARK-30730) Wrong results of `convertTz` for different session and system time zones

2020-02-06 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031776#comment-17031776
 ] 

Maxim Gekk commented on SPARK-30730:


I am going to close this ticket w.r.t. 
https://issues.apache.org/jira/browse/SPARK-30752 and the bug fix 
[https://github.com/apache/spark/pull/27474], where I eliminated any assumptions.

> Wrong results of `convertTz` for different session and system time zones
> ---
>
> Key: SPARK-30730
> URL: https://issues.apache.org/jira/browse/SPARK-30730
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, DateTimeUtils.convertTz() assumes that timestamp strings are 
> cast to TimestampType using the JVM system time zone, but in fact the session 
> time zone defined by the SQL config *spark.sql.session.timeZone* is used in 
> the casting. This leads to wrong results of from_utc_timestamp and 
> to_utc_timestamp when the session time zone is different from the JVM time 
> zone. The issue can be reproduced by the code below:
> {code:java}
>   test("to_utc_timestamp in various system and session time zones") {
> val localTs = "2020-02-04T22:42:10"
> val defaultTz = TimeZone.getDefault
> try {
>   DateTimeTestUtils.outstandingTimezonesIds.foreach { systemTz =>
> TimeZone.setDefault(DateTimeUtils.getTimeZone(systemTz))
> DateTimeTestUtils.outstandingTimezonesIds.foreach { sessionTz =>
>   withSQLConf(
> SQLConf.DATETIME_JAVA8API_ENABLED.key -> "true",
> SQLConf.SESSION_LOCAL_TIMEZONE.key -> sessionTz) {
> DateTimeTestUtils.outstandingTimezonesIds.foreach { toTz =>
>   val instant = LocalDateTime
> .parse(localTs)
> .atZone(DateTimeUtils.getZoneId(toTz))
> .toInstant
>   val df = Seq(localTs).toDF("localTs")
>   val res = df.select(to_utc_timestamp(col("localTs"), 
> toTz)).first().apply(0)
>   if (instant != res) {
> println(s"system = $systemTz session = $sessionTz to = $toTz")
>   }
> }
>   }
> }
>   }
> } catch {
>   case NonFatal(_) => TimeZone.setDefault(defaultTz)
> }
>   }
> {code}
> {code:java}
> system = UTC session = PST to = UTC
> system = UTC session = PST to = PST
> system = UTC session = PST to = CET
> system = UTC session = PST to = Africa/Dakar
> system = UTC session = PST to = America/Los_Angeles
> system = UTC session = PST to = Antarctica/Vostok
> system = UTC session = PST to = Asia/Hong_Kong
> system = UTC session = PST to = Europe/Amsterdam
> ...
> {code}






[jira] [Commented] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day

2020-02-06 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031774#comment-17031774
 ] 

Maxim Gekk commented on SPARK-30752:


[~dongjoon] FYI, 2.4 has the bug.

> Wrong result of to_utc_timestamp() on daylight saving day
> -
>
> Key: SPARK-30752
> URL: https://issues.apache.org/jira/browse/SPARK-30752
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> The to_utc_timestamp() function returns a wrong result when:
> * the JVM system time zone is PST
> * the session local time zone is UTC
> * fromZone is Asia/Hong_Kong
> For the local timestamp '2019-11-03T12:00:00', the result must be 
> '2019-11-03T04:00:00':
> {code}
> scala> import java.util.TimeZone
> import java.util.TimeZone
> scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
> scala> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.functions._
> scala> TimeZone.setDefault(getTimeZone("PST"))
> scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
> scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs")
> df: org.apache.spark.sql.DataFrame = [localTs: string]
> scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show
> +-----------------------------------------+
> |to_utc_timestamp(localTs, Asia/Hong_Kong)|
> +-----------------------------------------+
> |                      2019-11-03 03:00:00|
> +-----------------------------------------+
> {code}
>  
> See 
> https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22






[jira] [Created] (SPARK-30752) Wrong result of to_utc_timestamp() on daylight saving day

2020-02-06 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30752:
--

 Summary: Wrong result of to_utc_timestamp() on daylight saving day
 Key: SPARK-30752
 URL: https://issues.apache.org/jira/browse/SPARK-30752
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Maxim Gekk


The to_utc_timestamp() function returns a wrong result when:
* the JVM system time zone is PST
* the session local time zone is UTC
* fromZone is Asia/Hong_Kong
For the local timestamp '2019-11-03T12:00:00', the result must be 
'2019-11-03T04:00:00':
{code}
scala> import java.util.TimeZone
import java.util.TimeZone

scala> import org.apache.spark.sql.catalyst.util.DateTimeUtils._
import org.apache.spark.sql.catalyst.util.DateTimeUtils._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> TimeZone.setDefault(getTimeZone("PST"))
scala> spark.conf.set("spark.sql.session.timeZone", "UTC")
scala> val df = Seq("2019-11-03T12:00:00").toDF("localTs")
df: org.apache.spark.sql.DataFrame = [localTs: string]

scala> df.select(to_utc_timestamp(col("localTs"), "Asia/Hong_Kong")).show
+-----------------------------------------+
|to_utc_timestamp(localTs, Asia/Hong_Kong)|
+-----------------------------------------+
|                      2019-11-03 03:00:00|
+-----------------------------------------+
{code}
 
See 
https://www.worldtimebuddy.com/?qm=1=8,1819729,100=8=2019-11-2=21-22






[jira] [Updated] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations

2020-02-06 Thread Wei Xue (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei Xue updated SPARK-30751:

Description: 
Assume we have N partitions based on the original join keys, and for a specific 
partition id {{Pi}} (i = 1 to N), we slice the left partition into {{Li}} 
sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into 
{{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current 
approach, we’ll end up with a sum of {{Li}} * {{Mi}} (i = 1 to N where Li > 1 
or Mi > 1) plus one (for the rest of the partitions without skew) joins. *This 
can be a serious performance concern as the size of the query plan now depends 
on the number and size of skewed partitions.*

Now instead of generating so many joins we can create a “repeated” reader for 
either side of the join so that:
 # for the left side, with each partition id Pi and any given slice {{Sj}} in 
{{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective 
join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}}

 # for the right side, with each partition id Pi and any given slice {{Tk}} in 
{{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with respective 
join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}}

That way, we can have one SMJ for all the partitions and only one type of 
special reader.

  was:
Assume we have N partitions based on the original join keys, and for a specific 
partition id Pi (i = 1 to N), we slice the left partition into L(i) 
sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into 
M(i) sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current 
approach, we’ll end up with a sum of L(i) * M(i) (i = 1 to N where L(i) > 1 or 
M(i) > 1) plus one joins. *This can be a serious performance concern as the 
size of the query plan now depends on the number and size of skewed partitions.*

Now instead of generating so many joins we can create a “repeated” reader for 
either side of the join so that:
 # for the left side, with each partition id Pi and any given slice Sj in Pi (j 
= 1 to L(i)), it generates M(i) repeated partitions with respective join keys 
as PiSjT1, PiSjT2, …, PiSjTm

 # for the right side, with each partition id Pi and any given slice Tk in Pi 
(k = 1 to M(i)), it generates L(i) repeated partitions with respective join 
keys as PiS1Tk, PiS2Tk, …, PiSlTk

That way, we can have one SMJ for all the partitions and only one type of 
special reader.


> Combine the skewed readers into one in AQE skew join optimizations
> --
>
> Key: SPARK-30751
> URL: https://issues.apache.org/jira/browse/SPARK-30751
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Priority: Major
>
> Assume we have N partitions based on the original join keys, and for a 
> specific partition id {{Pi}} (i = 1 to N), we slice the left partition into 
> {{Li}} sub-partitions (L = 1 if no skew; L > 1 if skewed), the right 
> partition into {{Mi}} sub-partitions (M = 1 if no skew; M > 1 if skewed). 
> With the current approach, we’ll end up with a sum of {{Li}} * {{Mi}} (i = 1 
> to N where Li > 1 or Mi > 1) plus one (for the rest of the partitions without 
> skew) joins. *This can be a serious performance concern as the size of the 
> query plan now depends on the number and size of skewed partitions.*
> Now instead of generating so many joins we can create a “repeated” reader for 
> either side of the join so that:
>  # for the left side, with each partition id Pi and any given slice {{Sj}} in 
> {{Pi}} (j = 1 to Li), it generates {{Mi}} repeated partitions with respective 
> join keys as {{PiSjT1}}, {{PiSjT2}}, …, {{PiSjTm}}
>  # for the right side, with each partition id Pi and any given slice {{Tk}} 
> in {{Pi}} (k = 1 to Mi), it generates {{Li}} repeated partitions with 
> respective join keys as {{PiS1Tk}}, {{PiS2Tk}}, …, {{PiSlTk}}
> That way, we can have one SMJ for all the partitions and only one type of 
> special reader.






[jira] [Created] (SPARK-30751) Combine the skewed readers into one in AQE skew join optimizations

2020-02-06 Thread Wei Xue (Jira)
Wei Xue created SPARK-30751:
---

 Summary: Combine the skewed readers into one in AQE skew join 
optimizations
 Key: SPARK-30751
 URL: https://issues.apache.org/jira/browse/SPARK-30751
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue


Assume we have N partitions based on the original join keys, and for a specific 
partition id Pi (i = 1 to N), we slice the left partition into L(i) 
sub-partitions (L = 1 if no skew; L > 1 if skewed), the right partition into 
M(i) sub-partitions (M = 1 if no skew; M > 1 if skewed). With the current 
approach, we’ll end up with a sum of L(i) * M(i) (i = 1 to N where L(i) > 1 or 
M(i) > 1) plus one joins. *This can be a serious performance concern as the 
size of the query plan now depends on the number and size of skewed partitions.*

Now instead of generating so many joins we can create a “repeated” reader for 
either side of the join so that:
 # for the left side, with each partition id Pi and any given slice Sj in Pi (j 
= 1 to L(i)), it generates M(i) repeated partitions with respective join keys 
as PiSjT1, PiSjT2, …, PiSjTm

 # for the right side, with each partition id Pi and any given slice Tk in Pi 
(k = 1 to M(i)), it generates L(i) repeated partitions with respective join 
keys as PiS1Tk, PiS2Tk, …, PiSlTk

That way, we can have one SMJ for all the partitions and only one type of 
special reader.
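
As a back-of-the-envelope illustration of the plan-size concern (the slice counts below are made up):
{code:java}
// Hypothetical slice counts (Li, Mi) for two skewed partitions.
val slices = Seq((3, 4), (2, 2))
// Current approach: one join per (left slice, right slice) pair per skewed partition,
// plus one join for all non-skewed partitions => 3*4 + 2*2 + 1 = 17 join branches.
val currentJoins = slices.map { case (l, m) => l * m }.sum + 1
// With the combined "repeated reader" approach, all partitions go through a single SMJ.
{code}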






[jira] [Created] (SPARK-30750) stage level scheduling: Add ability to set dynamic allocation configs

2020-02-06 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30750:
-

 Summary: stage level scheduling: Add ability to set dynamic 
allocation configs
 Key: SPARK-30750
 URL: https://issues.apache.org/jira/browse/SPARK-30750
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Thomas Graves


The initial Jira to modify dynamic allocation for stage level scheduling 
applies the configs (minimum, initial, and max number of executors) to each profile 
individually. This means that you can't set them differently per resource 
profile.

Ideally those would be configurable per ResourceProfile, so look at adding 
support for that.
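
For context, today these knobs are global, e.g. (a sketch; the values are illustrative and apply to every resource profile alike):
{code:java}
import org.apache.spark.SparkConf

// Today these limits are global; the same values apply to every resource profile.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.initialExecutors", "4")
  .set("spark.dynamicAllocation.maxExecutors", "50")
{code}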






[jira] [Updated] (SPARK-30749) stage level scheduling: Better cleanup of Resource profiles

2020-02-06 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-30749:
--
Summary: stage level scheduling: Better cleanup of Resource profiles  (was: 
Better cleanup of Resource profiles)

> stage level scheduling: Better cleanup of Resource profiles
> ---
>
> Key: SPARK-30749
> URL: https://issues.apache.org/jira/browse/SPARK-30749
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Thomas Graves
>Priority: Major
>
> In the initial stage level scheduling jiras we aren't doing a great job of 
> cleaning up data structures when the ResourceProfile is done being used. This 
> is mostly because it's hard to tell when it's done being used.
> We should find a way to clean up better.
> Also in the dynamic allocation manager, if the resource profile is not being 
> used anymore, we should be more active about getting rid of the executors that 
> are up. Especially if there is a minimum number that is set, we can kill 
> those off.
>  
> I think this can be done as a followup to the main feature






[jira] [Created] (SPARK-30749) Better cleanup of Resource profiles

2020-02-06 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-30749:
-

 Summary: Better cleanup of Resource profiles
 Key: SPARK-30749
 URL: https://issues.apache.org/jira/browse/SPARK-30749
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Thomas Graves


In the initial stage level scheduling jiras we aren't doing a great job of 
cleaning up data structures when the ResourceProfile is done being used. This is 
mostly because it's hard to tell when it's done being used.

We should find a way to clean up better.

Also in the dynamic allocation manager, if the resource profile is not being 
used anymore, we should be more active about getting rid of the executors that 
are up. Especially if there is a minimum number that is set, we can kill those 
off.

I think this can be done as a followup to the main feature






[jira] [Commented] (SPARK-27570) java.io.EOFException Reached the end of stream - Reading Parquet from Swift

2020-02-06 Thread Hadrien Negros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031683#comment-17031683
 ] 

Hadrien Negros commented on SPARK-27570:


I have the same problem reading pretty large parquet files stored in 
Openstack Swift with Spark 2.4.4.

Has anyone found a solution?

> java.io.EOFException Reached the end of stream - Reading Parquet from Swift
> ---
>
> Key: SPARK-27570
> URL: https://issues.apache.org/jira/browse/SPARK-27570
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Harry Hough
>Priority: Major
>
> I did see issue SPARK-25966, but it seems there are some differences, as that 
> problem was resolved after rebuilding the parquet files on write. This is 
> 100% reproducible for me across many different days of data.
> I get exceptions such as "Reached the end of stream with 750477 bytes left to 
> read" during some read operations on parquet files. I am reading these files 
> from Openstack Swift using openstack-hadoop 2.7.7 on Spark 2.4.
> The issue seems to happen with the where statement. I have also tried filter, 
> and combining the statements into one, as well as the Dataset method with a 
> Column, without any luck. Which column, or what the actual filter on the 
> where is, also doesn't seem to make a difference to whether the error occurs.
>  
> {code:java}
> val engagementDS = spark
>   .read
>   .parquet(createSwiftAddr("engagements", folder))
>   .where("engtype != 0")
>   .where("engtype != 1000")
>   .groupBy($"accid", $"sessionkey")
>   .agg(collect_list(struct($"time", $"pid", $"engtype", $"pageid", 
> $"testid")).as("engagements"))
> // Exiting paste mode, now interpreting.
> [Stage 53:> (0 + 32) / 32]2019-04-25 19:02:12 ERROR Executor:91 - Exception 
> in task 24.0 in stage 53.0 (TID 688)
> java.io.EOFException: Reached the end of stream with 1323959 bytes left to 
> read
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:104)
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFullyHeapBuffer(DelegatingSeekableInputStream.java:127)
> at 
> org.apache.parquet.io.DelegatingSeekableInputStream.readFully(DelegatingSeekableInputStream.java:91)
> at 
> org.apache.parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:1174)
> at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:805)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:301)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:256)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:159)
> at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:107)
> at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:105)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
> at 

[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-06 Thread Sujith Chacko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031647#comment-17031647
 ] 

Sujith Chacko commented on SPARK-24615:
---

Great! Thanks for the update.

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-06 Thread liupengcheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031644#comment-17031644
 ] 

liupengcheng commented on SPARK-30712:
--

[~hyukjin.kwon] We use the rowCount info from the metadata together with the schema to infer 
the in-memory size of the `UnsafeRow`s. It works fine.
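For illustration, a minimal sketch of that kind of estimate (this is not the actual Spark code 
path; `estimateSizeInBytes` is a hypothetical helper, and `DataType.defaultSize` is only a 
coarse per-field width, so the result is a rough approximation rather than an exact measurement):
{code:java}
import org.apache.spark.sql.types._

// Hypothetical helper: estimate the in-memory size of a relation from the row
// count recorded in the Parquet footers and the table schema. defaultSize is a
// rough per-field width, so this only approximates the UnsafeRow footprint.
def estimateSizeInBytes(rowCount: Long, schema: StructType): BigInt =
  BigInt(rowCount) * schema.map(_.dataType.defaultSize).sum

val schema = StructType(Seq(
  StructField("id", LongType),
  StructField("name", StringType)))

// e.g. 1 billion rows -> roughly rowCount * (8 + 20) bytes
println(estimateSizeInBytes(1000000000L, schema))
{code}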

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-06 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031637#comment-17031637
 ] 

Thomas Graves commented on SPARK-24615:
---

Yes, it will be in 3.0; the feature is complete, other than Mesos support if 
someone wants that. See the linked JIRAs in the epic.

 

You can find the documentation checked into the master branch:

https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview
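For reference, a minimal sketch of the resource configs that documentation describes, assuming 
one GPU per executor and per task; the discovery-script path below is only a placeholder:
{code:java}
import org.apache.spark.SparkConf

// Sketch of accelerator-aware scheduling configs (Spark 3.0+).
// "/opt/spark/getGpus.sh" stands in for a site-specific script that prints the
// available GPU addresses in the JSON format Spark expects.
val conf = new SparkConf()
  .set("spark.executor.resource.gpu.amount", "1")
  .set("spark.task.resource.gpu.amount", "1")
  .set("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
{code}
Inside a task, the assigned GPU addresses should then be visible through TaskContext.get().resources().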

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30712) Estimate sizeInBytes from file metadata for parquet files

2020-02-06 Thread liupengcheng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031636#comment-17031636
 ] 

liupengcheng commented on SPARK-30712:
--

[~hyukjin.kwon] Yes, in our customized Spark version we use the Parquet metadata to 
compute this size; it is more accurate and works well for some tables.

I think we still scan files to get the file size in the `DetermineTableStats` rule 
when `fallBackToHdfs` is true. If that is a concern, we can also add a 
config for this.

Also, in many cases we can make use of the summary metadata of Parquet files 
to speed up this estimation.
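As a sketch of the footer-based approach (assumptions: parquet-hadoop is on the classpath, the 
file path is a placeholder, and ParquetFileReader.readFooter is deprecated in newer Parquet releases):
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

// Read the row count straight from the Parquet footer; no data pages are scanned.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/data/events/part-00000.parquet"))
val rowCount = footer.getBlocks.asScala.map(_.getRowCount).sum
println(s"rows in file: $rowCount")
{code}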

> Estimate sizeInBytes from file metadata for parquet files
> -
>
> Key: SPARK-30712
> URL: https://issues.apache.org/jira/browse/SPARK-30712
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: liupengcheng
>Priority: Major
>
> Currently, Spark will use a compressionFactor when calculating `sizeInBytes` 
> for `HadoopFsRelation`, but this is not accurate and it's hard to choose the 
> best `compressionFactor`. Sometimes this can cause OOMs due to an improper 
> BroadcastHashJoin.
> So I propose to use the rowCount in the BlockMetadata to estimate the size in 
> memory, which can be more accurate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-06 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614
 ] 

Attila Zsolt Piros edited comment on SPARK-30688 at 2/6/20 2:22 PM:


I have checked on Spark 3.0.0-preview2 and week-in-year fails there too (but 
for an invalid format I would say this is fine):
{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
| null|
+-+
{noformat}
But it fails for a correct pattern too:
{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
++
|unix_timestamp(2020-10, -ww)|
++
|null|
++
{noformat}
But for this, strangely, it works:
{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
++
|unix_timestamp(2020-01, -ww)|
++
|  1577833200|
++
{noformat}


was (Author: attilapiros):
I have checked on Spark 3.0.0-preview2 and week-in-year fails there too:

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
| null|
+-+
{noformat}

As you can see, it fails even for a simpler pattern too:

{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
++
|unix_timestamp(2020-10, -ww)|
++
|null|
++
{noformat}

But this strangely works:

{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
++
|unix_timestamp(2020-01, -ww)|
++
|  1577833200|
++
{noformat}


> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> -+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen for leap years only. I dug deeper into it and it seems 
> that Spark uses java.text.SimpleDateFormat and tries to parse the 
> expression here
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
>  but the parse fails: SimpleDateFormat is unable to parse the date and throws an 
> Unparseable exception, which Spark handles silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy 
> java.text.SimpleDateFormat but the Java date/time API; the date/time API seems 
> to expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse
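For anyone who wants to reproduce the legacy behaviour outside Spark, here is a standalone 
sketch (not Spark's actual implementation; the printed values depend on the JVM's default 
locale and calendar settings):
{code:java}
import java.text.{ParseException, SimpleDateFormat}

// Legacy path in miniature: SimpleDateFormat either yields a (possibly
// surprising) timestamp or throws ParseException, which Spark 2.x catches and
// surfaces as NULL in SQL.
def legacyUnixTimestamp(value: String, pattern: String): Option[Long] =
  try Some(new SimpleDateFormat(pattern).parse(value).getTime / 1000L)
  catch { case _: ParseException => None } // this is what becomes NULL

println(legacyUnixTimestamp("20201", "ww"))
println(legacyUnixTimestamp("20202", "ww"))
{code}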



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-30688) Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF

2020-02-06 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031614#comment-17031614
 ] 

Attila Zsolt Piros commented on SPARK-30688:


I have checked on Spark 3.0.0-preview2 and week-in-year fails there too:

{noformat}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-preview2
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_202)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
+-+
|unix_timestamp(20201, ww)|
+-+
| null|
+-+
{noformat}

As you can see, it fails even for a simpler pattern too:

{noformat}
scala> spark.sql("select unix_timestamp('2020-10', '-ww')").show();
++
|unix_timestamp(2020-10, -ww)|
++
|null|
++
{noformat}

But this strangely works:

{noformat}
scala> spark.sql("select unix_timestamp('2020-01', '-ww')").show();
++
|unix_timestamp(2020-01, -ww)|
++
|  1577833200|
++
{noformat}


> Spark SQL Unix Timestamp produces incorrect result with unix_timestamp UDF
> --
>
> Key: SPARK-30688
> URL: https://issues.apache.org/jira/browse/SPARK-30688
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Rajkumar Singh
>Priority: Major
>
>  
> {code:java}
> scala> spark.sql("select unix_timestamp('20201', 'ww')").show();
> +-+
> |unix_timestamp(20201, ww)|
> +-+
> |                         null|
> +-+
>  
> scala> spark.sql("select unix_timestamp('20202', 'ww')").show();
> -+
> |unix_timestamp(20202, ww)|
> +-+
> |                   1578182400|
> +-+
>  
> {code}
>  
>  
> This seems to happen for leap years only. I dug deeper into it and it seems 
> that Spark uses java.text.SimpleDateFormat and tries to parse the 
> expression here
> [org.apache.spark.sql.catalyst.expressions.UnixTime#eval|https://github.com/hortonworks/spark2/blob/49ec35bbb40ec6220282d932c9411773228725be/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala#L652]
> {code:java}
> formatter.parse(
>  t.asInstanceOf[UTF8String].toString).getTime / 1000L{code}
>  but the parse fails: SimpleDateFormat is unable to parse the date and throws an 
> Unparseable exception, which Spark handles silently and returns NULL.
>  
> *Spark-3.0:* I did some tests where Spark no longer uses the legacy 
> java.text.SimpleDateFormat but the Java date/time API; the date/time API seems 
> to expect a valid date with a valid format:
>  org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter#parse



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30748) Storage Memory in Spark Web UI means

2020-02-06 Thread islandshinji (Jira)
islandshinji created SPARK-30748:


 Summary: Storage Memory in Spark Web UI means
 Key: SPARK-30748
 URL: https://issues.apache.org/jira/browse/SPARK-30748
 Project: Spark
  Issue Type: Question
  Components: Web UI
Affects Versions: 2.4.0
Reporter: islandshinji


Does the denominator of 'Storage Memory' in the Spark Web UI include execution 
memory?
In my environment, 'spark.executor.memory' is set to 20g and the denominator of 
'Storage Memory' is 11.3g. That seems too big to be storage memory alone.
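For context, a rough sketch of how the unified memory manager in Spark 2.x arrives at that 
total; the pool it reports covers both storage and execution (either side can borrow from the 
other), and the numbers below assume the default spark.memory.fraction of 0.6 and the 300 MB 
of reserved system memory:
{code:java}
// Back-of-the-envelope sketch; the real logic lives in
// org.apache.spark.memory.UnifiedMemoryManager.
val heapBytes      = 20L * 1024 * 1024 * 1024 // spark.executor.memory = 20g (the JVM reports a bit less)
val reservedBytes  = 300L * 1024 * 1024       // reserved system memory
val memoryFraction = 0.6                      // default spark.memory.fraction
val unifiedBytes   = ((heapBytes - reservedBytes) * memoryFraction).toLong
println(f"${unifiedBytes / math.pow(1024, 3)}%.1f GB") // ~11.8 GB; with the real JVM heap this lands near the observed 11.3g
{code}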



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30744) Optimize AnalyzePartitionCommand by calculating location sizes in parallel

2020-02-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30744.
-
Fix Version/s: 3.1.0
 Assignee: wuyi
   Resolution: Fixed

> Optimize AnalyzePartitionCommand by calculating location sizes in parallel
> --
>
> Key: SPARK-30744
> URL: https://issues.apache.org/jira/browse/SPARK-30744
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> AnalyzePartitionCommand could use CommandUtils.calculateTotalLocationSize to 
> calculate location sizes in parallel to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30747) Update roxygen2 to 7.0.1

2020-02-06 Thread Maciej Szymkiewicz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031471#comment-17031471
 ] 

Maciej Szymkiewicz commented on SPARK-30747:


CC [~felixcheung] [~hyukjin.kwon] [~shaneknapp]

> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old 
> (2015-11-11) so it could be a good idea to use current R updates to update it 
> as well.
> At crude inspection:
> * SPARK-22430 has been resolved a while ago.
> * SPARK-30737, SPARK-27262, https://github.com/apache/spark/pull/27437 and 
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved persisting warnings
> * Documentation builds and CRAN checks pass
> * Generated HTML docs are identical to 5.0.1
> Since {{roxygen2}} shares some potentially unstable dependencies with 
> {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in 
> sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being 
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30747) Update roxygen2 to 7.0.1

2020-02-06 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-30747:
---
Description: 
Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old (2015-11-11) 
so it could be a good idea to use current R updates to update it as well.

At crude inspection:

* SPARK-22430 has been resolved a while ago.
* SPARK-30737, SPARK-27262, https://github.com/apache/spark/pull/27437 and 
https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed 
resolved persisting warnings
* Documentation builds and CRAN checks pass
* Generated HTML docs are identical to 5.0.1

Since {{roxygen2}} shares some potentially unstable dependencies with 
{{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in 
sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being 
overwritten by local tests).


  was:
Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old (2015-11-11) 
so it could be a good idea to use current R updates to update it as well.

At crude inspection:

* SPARK-22430 has been resolved a while ago.
* https://github.com/apache/spark/pull/27437 and 
https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed 
resolved persisting warnings
* Documentation builds and CRAN checks pass
* Generated HTML docs are identical to 5.0.1




> Update roxygen2 to 7.0.1
> 
>
> Key: SPARK-30747
> URL: https://issues.apache.org/jira/browse/SPARK-30747
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, Tests
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old 
> (2015-11-11) so it could be a good idea to use current R updates to update it 
> as well.
> At crude inspection:
> * SPARK-22430 has been resolved a while ago.
> * SPARK-30737, SPARK-27262, https://github.com/apache/spark/pull/27437 and 
> https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed
>  resolved persisting warnings
> * Documentation builds and CRAN checks pass
> * Generated HTML docs are identical to 5.0.1
> Since {{roxygen2}} shares some potentially unstable dependencies with 
> {{devtools}} (primarily {{rlang}}) it might be a good idea to keep these in 
> sync (as a bonus we wouldn't have to worry about {{DESCRIPTION}} being 
> overwritten by local tests).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30747) Update roxygen2 to 7.0.1

2020-02-06 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-30747:
--

 Summary: Update roxygen2 to 7.0.1
 Key: SPARK-30747
 URL: https://issues.apache.org/jira/browse/SPARK-30747
 Project: Spark
  Issue Type: Improvement
  Components: SparkR, Tests
Affects Versions: 3.0.0
Reporter: Maciej Szymkiewicz


Currently Spark uses {{roxygen2}} 5.0.1. It is already pretty old (2015-11-11) 
so it could be a good idea to use current R updates to update it as well.

At crude inspection:

* SPARK-22430 has been resolved a while ago.
* https://github.com/apache/spark/pull/27437 and 
https://github.com/apache/spark/commit/b95ccb1d8b726b11435789cdb5882df6643430ed 
resolved persisting warnings
* Documentation builds and CRAN checks pass
* Generated HTML docs are identical to 5.0.1





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30739) unable to turn off Hadoop's trash feature

2020-02-06 Thread Ohad Raviv (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031437#comment-17031437
 ] 

Ohad Raviv commented on SPARK-30739:


Closing as I realized this is actually the documented behaviour 
[here|https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/core-default.xml].

_fs.trash.checkpoint.interval_

_Number of minutes between trash checkpoints. Should be smaller or equal to 
fs.trash.interval. If zero, the value is set to the value of fs.trash.interval. 
Every time the checkpointer runs it creates a new checkpoint out of current and 
removes checkpoints created more than fs.trash.interval minutes ago._

So I decided to use the _fs.trash.classname_ approach.
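For anyone debugging a similar override, a small sketch for checking the value the Hadoop side 
actually sees (assuming a spark-shell where `spark` is the active session):
{code:java}
// The Hadoop configuration that file systems (and hence TrashPolicyDefault)
// are initialized with; spark.hadoop.* settings and the *-site.xml files have
// already been merged at this point.
val hadoopConf = spark.sparkContext.hadoopConfiguration
println(hadoopConf.get("fs.trash.interval"))
println(hadoopConf.get("fs.trash.classname"))
{code}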

> unable to turn off Hadoop's trash feature
> -
>
> Key: SPARK-30739
> URL: https://issues.apache.org/jira/browse/SPARK-30739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We're trying to turn off the `TrashPolicyDefault` in one of our Spark 
> applications by setting `spark.hadoop.fs.trash.interval=0`, but it just stays 
> `360` as configured in our cluster's `core-site.xml`.
> Trying to debug it, we managed to set 
> `spark.hadoop.fs.trash.classname=OtherTrashPolicy` and it worked. The main 
> difference seems to be that `spark.hadoop.fs.trash.classname` does not appear 
> in any of the `*-site.xml` files.
> When we print the conf that gets initialized in `TrashPolicyDefault` we get:
> ```
> Configuration: core-default.xml, core-site.xml, yarn-default.xml, 
> yarn-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, 
> hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@561f0431, 
> file:/hadoop03/yarn/local/usercache/.../hive-site.xml
> ```
> and:
> `fs.trash.interval=360 [programmatically]`
> `fs.trash.classname=OtherTrashPolicy [programmatically]`
>  
> Any idea why `fs.trash.classname` works but `fs.trash.interval` doesn't?
> This seems maybe related to: -SPARK-9825.-
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30739) unable to turn off Hadoop's trash feature

2020-02-06 Thread Ohad Raviv (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ohad Raviv resolved SPARK-30739.

Resolution: Workaround

> unable to turn off Hadoop's trash feature
> -
>
> Key: SPARK-30739
> URL: https://issues.apache.org/jira/browse/SPARK-30739
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ohad Raviv
>Priority: Minor
>
> We're trying to turn off the `TrashPolicyDefault` in one of our Spark 
> applications by setting `spark.hadoop.fs.trash.interval=0`, but it just stays 
> `360` as configured in our cluster's `core-site.xml`.
> Trying to debug it, we managed to set 
> `spark.hadoop.fs.trash.classname=OtherTrashPolicy` and it worked. The main 
> difference seems to be that `spark.hadoop.fs.trash.classname` does not appear 
> in any of the `*-site.xml` files.
> When we print the conf that gets initialized in `TrashPolicyDefault` we get:
> ```
> Configuration: core-default.xml, core-site.xml, yarn-default.xml, 
> yarn-site.xml, mapred-default.xml, mapred-site.xml, hdfs-default.xml, 
> hdfs-site.xml, 
> org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@561f0431, 
> file:/hadoop03/yarn/local/usercache/.../hive-site.xml
> ```
> and:
> `fs.trash.interval=360 [programmatically]`
> `fs.trash.classname=OtherTrashPolicy [programmatically]`
>  
> Any idea why `fs.trash.classname` works but `fs.trash.interval` doesn't?
> This seems maybe related to: -SPARK-9825.-
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24615) SPIP: Accelerator-aware task scheduling for Spark

2020-02-06 Thread Sujith Chacko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17031378#comment-17031378
 ] 

Sujith Chacko commented on SPARK-24615:
---

Will this feature be a part of Spark 3.0? Any update on the release timeline for 
this feature? Thanks.

> SPIP: Accelerator-aware task scheduling for Spark
> -
>
> Key: SPARK-24615
> URL: https://issues.apache.org/jira/browse/SPARK-24615
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Saisai Shao
>Assignee: Thomas Graves
>Priority: Major
>  Labels: Hydrogen, SPIP
> Attachments: Accelerator-aware scheduling in Apache Spark 3.0.pdf, 
> SPIP_ Accelerator-aware scheduling.pdf
>
>
> (The JIRA received a major update on 2019/02/28. Some comments were based on 
> an earlier version. Please ignore them. New comments start at 
> [#comment-16778026].)
> h2. Background and Motivation
> GPUs and other accelerators have been widely used for accelerating special 
> workloads, e.g., deep learning and signal processing. While users from the AI 
> community use GPUs heavily, they often need Apache Spark to load and process 
> large datasets and to handle complex data scenarios like streaming. YARN and 
> Kubernetes already support GPUs in their recent releases. Although Spark 
> supports those two cluster managers, Spark itself is not aware of GPUs 
> exposed by them and hence Spark cannot properly request GPUs and schedule 
> them for users. This leaves a critical gap to unify big data and AI workloads 
> and make life simpler for end users.
> To make Spark be aware of GPUs, we shall make two major changes at high level:
> * At cluster manager level, we update or upgrade cluster managers to include 
> GPU support. Then we expose user interfaces for Spark to request GPUs from 
> them.
> * Within Spark, we update its scheduler to understand available GPUs 
> allocated to executors, user task requests, and assign GPUs to tasks properly.
> Based on the work done in YARN and Kubernetes to support GPUs and some 
> offline prototypes, we could have necessary features implemented in the next 
> major release of Spark. You can find a detailed scoping doc here, where we 
> listed user stories and their priorities.
> h2. Goals
> * Make Spark 3.0 GPU-aware in standalone, YARN, and Kubernetes.
> * No regression on scheduler performance for normal jobs.
> h2. Non-goals
> * Fine-grained scheduling within one GPU card.
> ** We treat one GPU card and its memory together as a non-divisible unit.
> * Support TPU.
> * Support Mesos.
> * Support Windows.
> h2. Target Personas
> * Admins who need to configure clusters to run Spark with GPU nodes.
> * Data scientists who need to build DL applications on Spark.
> * Developers who need to integrate DL features on Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org