[jira] [Updated] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27298:
--
Affects Version/s: 2.4.2

> Dataset except operation gives different results(dataset count) on Spark 
> 2.3.0 Windows and Spark 2.3.0 Linux environment
> 
>
> Key: SPARK-27298
> URL: https://issues.apache.org/jira/browse/SPARK-27298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.2
>Reporter: Mahima Khatri
>Priority: Major
>  Labels: data-loss
> Attachments: Console-Result-Windows.txt, 
> console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, 
> console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, 
> console-result-spark-2.4.2-windows, customer.csv, pom.xml
>
>
> {code:java}
> // package com.verifyfilter.example;
> import org.apache.spark.SparkConf;
> import org.apache.spark.SparkContext;
> import org.apache.spark.sql.SparkSession;
> import org.apache.spark.sql.Column;
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SaveMode;
>
> public class ExcludeInTesting {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("ExcludeInTesting")
>         .config("spark.some.config.option", "some-value")
>         .getOrCreate();
>
>     Dataset<Row> dataReadFromCSV = spark.read().format("com.databricks.spark.csv")
>         .option("header", "true")
>         .option("delimiter", "|")
>         .option("inferSchema", "true")
>         //.load("E:/resources/customer.csv"); // local; below path for VM
>         .load("/home/myproject/bda/home/bin/customer.csv");
>     dataReadFromCSV.printSchema();
>     dataReadFromCSV.show();
>
>     // Adding an extra step of saving to db and then loading it again
>     dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer");
>     Dataset<Row> dataLoaded = spark.sql("select * from customer");
>
>     // Gender EQ M
>     Column genderCol = dataLoaded.col("Gender");
>     Dataset<Row> onlyMaleDS = dataLoaded.where(genderCol.equalTo("M"));
>     // Dataset<Row> onlyMaleDS = spark.sql("select count(*) from customer where Gender='M'");
>     onlyMaleDS.show();
>     System.out.println("The count of Male customers is :" + onlyMaleDS.count());
>     System.out.println("*");
>
>     // Income in the list
>     Object[] valuesArray = new Object[5];
>     valuesArray[0] = 503.65;
>     valuesArray[1] = 495.54;
>     valuesArray[2] = 486.82;
>     valuesArray[3] = 481.28;
>     valuesArray[4] = 479.79;
>     Column incomeCol = dataLoaded.col("Income");
>     Dataset<Row> incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) valuesArray));
>     System.out.println("The count of customers satisfying Income is :" + incomeMatchingSet.count());
>     System.out.println("*");
>
>     Dataset<Row> maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet);
>     System.out.println("The count of final customers is :" + maleExcptIncomeMatch.count());
>     System.out.println("*");
>   }
> }
> {code}
> When the above code is executed on Spark 2.3.0, it gives different results:
> *Windows*: the code gives the correct dataset count, 148237.
> *Linux*: the code gives a different dataset count, 129532.
>  
> Some more info related to this bug:
> 1. Application code (attached)
> 2. CSV file used (attached)
> 3. Windows spec: Windows 10, 64-bit OS
> 4. Linux spec (running on an Oracle VM VirtualBox), as captured from VBox.log:
>         00:00:26.112908 VMMDev: Guest Additions information report: Version 5.0.32 r112930          '5.0.32_Ubuntu'
>         00:00:26.112996 VMMDev: Guest Additions information report: Interface = 0x00010004         osType = 0x00053100 (Linux >= 2.6, 64-bit)
> 5. Snapshots of output in both cases (attached)
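For anyone narrowing this down in spark-shell, a minimal Scala equivalent of the attached Java program (same customer.csv columns and filter values; the intermediate saveAsTable round-trip is skipped and the path is illustrative):

{code}
// Minimal spark-shell sketch of the same pipeline (path is illustrative).
import spark.implicits._

val customers = spark.read
  .option("header", "true")
  .option("delimiter", "|")
  .option("inferSchema", "true")
  .csv("/tmp/customer.csv")

val onlyMale       = customers.where($"Gender" === "M")
val incomeMatching = customers.where($"Income".isin(503.65, 495.54, 486.82, 481.28, 479.79))

// This except() count is what differs between the Windows and Linux runs.
println(onlyMale.except(incomeMatching).count())
{code}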



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28024:
--
Affects Version/s: 2.0.2
   2.1.3

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Critical
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> For example
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> {code}
> Case 4:
> {code:sql}
> spark-sql> select exp(-1.2345678901234E200);
> 0.0
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}
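For reference, the wrapped values in Case 1 and the signed zeros in Cases 2 and 3 match plain JVM primitive arithmetic; a minimal Scala sketch (illustration only, not Spark code):

{code}
// Two's-complement wrap-around and float underflow on the JVM,
// mirroring the SQL results above.
val i: Int   = 2147483647 * 2         // -2
val b: Byte  = (128 * 2).toByte       // 0   (low 8 bits of 256)
val s: Short = (-32768 * -1).toShort  // -32768 (32768 overflows Short)
val f: Float = "10e-70".toFloat       // 0.0  (underflows Float's range)
val g: Float = "-10e-70".toFloat      // -0.0
{code}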



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28024:
--
Affects Version/s: 2.2.3
   2.3.4
   2.4.4

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0
>Reporter: Yuming Wang
>Priority: Critical
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> For example
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> {code}
> Case 4:
> {code:sql}
> spark-sql> select exp(-1.2345678901234E200);
> 0.0
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27619:
--
Affects Version/s: 2.3.4

> MapType should be prohibited in hash expressions
> 
>
> Key: SPARK-27619
> URL: https://issues.apache.org/jira/browse/SPARK-27619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>  Labels: correctness
>
> Spark currently allows MapType expressions to be used as input to hash 
> expressions, but I think that this should be prohibited because Spark SQL 
> does not support map equality.
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
> map elements:
> {code:java}
> val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
> val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
> // Demonstration of how Scala Map equality is unaffected by insertion order:
> assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
> assert(Map(1->1, 2->2) == Map(2->2, 1->1))
> assert(a.first() == b.first())
> // In contrast, this will print two different hashcodes:
> println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
> This behavior might be surprising to Scala developers.
> I think there's precedent for banning the use of MapType here because we 
> already prohibit MapType in aggregation / joins / equality comparisons 
> (SPARK-9415) and set operations (SPARK-19893).
> If we decide that we want this to be an error then it might also be a good 
> idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the 
> old and buggy behavior (in case applications were relying on it in cases 
> where it just so happens to be safe-by-accident (e.g. maps which only have 
> one entry)).
> Alternatively, we could support hashing here if we implemented support for 
> comparable map types (SPARK-18134).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27619) MapType should be prohibited in hash expressions

2020-01-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020854#comment-17020854
 ] 

Dongjoon Hyun commented on SPARK-27619:
---

I just checked the result on 3.0.0-preview2. This issue is still valid.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))

// Exiting paste mode, now interpreting.

List([-125051660], [783998793])
a: org.apache.spark.sql.Dataset[scala.collection.immutable.Map[Int,Int]] = 
[value: map]
b: org.apache.spark.sql.Dataset[scala.collection.immutable.Map[Int,Int]] = 
[value: map]

scala> spark.version
res1: String = 3.0.0-preview2
{code}
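A user-level workaround sketch for the comparison above (assuming Spark 2.4+, where map_entries and sort_array are available; "value" is the column name from the repro): hash a canonical, sorted representation of the map instead of the map itself.

{code}
// Order-insensitive alternative: sort the map entries before hashing.
// Illustration of the problem only, not a proposed fix inside Spark.
val ha = a.selectExpr("hash(sort_array(map_entries(value))) AS h")
val hb = b.selectExpr("hash(sort_array(map_entries(value))) AS h")
// ha.first() == hb.first(), unlike hash(*) on the raw map column.
{code}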

> MapType should be prohibited in hash expressions
> 
>
> Key: SPARK-27619
> URL: https://issues.apache.org/jira/browse/SPARK-27619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>  Labels: correctness
>
> Spark currently allows MapType expressions to be used as input to hash 
> expressions, but I think that this should be prohibited because Spark SQL 
> does not support map equality.
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
> map elements:
> {code:java}
> val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
> val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
> // Demonstration of how Scala Map equality is unaffected by insertion order:
> assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
> assert(Map(1->1, 2->2) == Map(2->2, 1->1))
> assert(a.first() == b.first())
> // In contrast, this will print two different hashcodes:
> println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
> This behavior might be surprising to Scala developers.
> I think there's precedent for banning the use of MapType here because we 
> already prohibit MapType in aggregation / joins / equality comparisons 
> (SPARK-9415) and set operations (SPARK-19893).
> If we decide that we want this to be an error then it might also be a good 
> idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the 
> old and buggy behavior (in case applications were relying on it in cases 
> where it just so happens to be safe-by-accident (e.g. maps which only have 
> one entry)).
> Alternatively, we could support hashing here if we implemented support for 
> comparable map types (SPARK-18134).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27619:
--
Affects Version/s: 3.0.0
   2.4.1
   2.4.2
   2.4.3
   2.4.4

> MapType should be prohibited in hash expressions
> 
>
> Key: SPARK-27619
> URL: https://issues.apache.org/jira/browse/SPARK-27619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>  Labels: correctness
>
> Spark currently allows MapType expressions to be used as input to hash 
> expressions, but I think that this should be prohibited because Spark SQL 
> does not support map equality.
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of 
> map elements:
> {code:java}
> val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
> val b = spark.createDataset(Map(2->2, 1->1) :: Nil)
> // Demonstration of how Scala Map equality is unaffected by insertion order:
> assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
> assert(Map(1->1, 2->2) == Map(2->2, 1->1))
> assert(a.first() == b.first())
> // In contrast, this will print two different hashcodes:
> println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code}
> This behavior might be surprising to Scala developers.
> I think there's precedent for banning the use of MapType here because we 
> already prohibit MapType in aggregation / joins / equality comparisons 
> (SPARK-9415) and set operations (SPARK-19893).
> If we decide that we want this to be an error then it might also be a good 
> idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the 
> old and buggy behavior (in case applications were relying on it in cases 
> where it just so happens to be safe-by-accident (e.g. maps which only have 
> one entry)).
> Alternatively, we could support hashing here if we implemented support for 
> comparable map types (SPARK-18134).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage

2020-01-21 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020837#comment-17020837
 ] 

Wenchen Fan commented on SPARK-30218:
-

I believe it has been fixed in master by 
https://issues.apache.org/jira/browse/SPARK-28344 . [~FC] can you try your 
query with the master branch? Thanks!

> Columns used in inequality conditions for joins not resolved correctly in 
> case of common lineage
> 
>
> Key: SPARK-30218
> URL: https://issues.apache.org/jira/browse/SPARK-30218
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.4, 2.4.4
>Reporter: Francesco Cavrini
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: correctness
>
> When columns from different data-frames that have a common lineage are used 
> in inequality conditions in joins, they are not resolved correctly. In 
> particular, both the column from the left DF and the one from the right DF 
> are resolved to the same column, thus making the inequality condition either 
> always satisfied or always not-satisfied.
> Minimal example to reproduce follows.
> {code:python}
> import pyspark.sql.functions as F
> data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", 
> 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], 
> ["id", "kind", "timestamp"])
> df_left = data.where(F.col("kind") == "A").alias("left")
> df_right = data.where(F.col("kind") == "B").alias("right")
> conds = [df_left["id"] == df_right["id"]]
> conds.append(df_right["timestamp"].between(df_left["timestamp"], 
> df_left["timestamp"] + 2))
> res = df_left.join(df_right, conds, how="left")
> {code}
> The result is:
> | id|kind|timestamp| id|kind|timestamp|
> |id1|   A|0|id1|   B|1|
> |id1|   A|0|id1|   B|5|
> |id1|   A|1|id1|   B|1|
> |id1|   A|1|id1|   B|5|
> |id2|   A|2|id2|   B|   10|
> |id2|   A|3|id2|   B|   10|
> which violates the condition that the timestamp from the right DF should be 
> between df_left["timestamp"] and  df_left["timestamp"] + 2.
> The plan shows the problem in the column resolution.
> {code:bash}
> == Parsed Logical Plan ==
> Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && 
> (timestamp#2L <= (timestamp#2L + cast(2 as bigint)
> :- SubqueryAlias `left`
> :  +- Filter (kind#1 = A)
> : +- LogicalRDD [id#0, kind#1, timestamp#2L], false
> +- SubqueryAlias `right`
>+- Filter (kind#37 = B)
>   +- LogicalRDD [id#36, kind#37, timestamp#38L], false
> {code}
> Note, the columns used in the equality condition of the join have been 
> correctly resolved.
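For reference, the same join pattern expressed in Scala (a sketch mirroring the PySpark reproduction above; whether it exhibits the same mis-resolution is worth checking when re-testing on the master branch, as suggested in the comment):

{code}
// Scala sketch of the same reproduction.
import org.apache.spark.sql.functions.col
import spark.implicits._

val data = Seq(("id1", "A", 0L), ("id1", "A", 1L), ("id2", "A", 2L), ("id2", "A", 3L),
               ("id1", "B", 1L), ("id1", "B", 5L), ("id2", "B", 10L))
  .toDF("id", "kind", "timestamp")

val dfLeft  = data.where(col("kind") === "A").alias("left")
val dfRight = data.where(col("kind") === "B").alias("right")

val conds = dfLeft("id") === dfRight("id") &&
  dfRight("timestamp").between(dfLeft("timestamp"), dfLeft("timestamp") + 2)

val res = dfLeft.join(dfRight, conds, "left")
res.show()
{code}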



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30546) Make interval type more future-proofing

2020-01-21 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30546:

Priority: Blocker  (was: Major)

> Make interval type more future-proofing
> ---
>
> Key: SPARK-30546
> URL: https://issues.apache.org/jira/browse/SPARK-30546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Blocker
>
> Before 3.0 we may want to make some effort to make the current interval type
> more future-proof, e.g.:
> 1. Add an unstable annotation to the CalendarInterval class. People already use
> it as UDF input, so it's better to make it clear that it's unstable.
> 2. Add a schema checker to prohibit creating v2 custom catalog tables with
> intervals, the same as what we do for the built-in catalog (see the sketch after
> this list).
> 3. Add a schema checker for DataFrameWriterV2 too.
> 4. Make the interval type incomparable, as in version 2.4, to disambiguate
> comparisons between year-month and day-time fields.
> 5. The newly added to_csv in 3.0 should not support outputting intervals, the
> same as the CSV file format.
> 6. The function to_json should not allow using an interval as a key field, the
> same as for the value field and the JSON datasource, with a legacy config to
> restore the old behavior.
> 7. Revert the interval ISO/ANSI SQL standard output, since we decided not to
> follow ANSI, so there is no round trip.
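A rough sketch of the kind of recursive schema check items 2 and 3 describe (the helper name and its wiring into the catalog are hypothetical; only the types come from Spark's public API):

{code}
import org.apache.spark.sql.types._

// Hypothetical helper: reject schemas that contain CalendarIntervalType anywhere.
def containsIntervalType(dt: DataType): Boolean = dt match {
  case _: CalendarIntervalType => true
  case ArrayType(et, _)        => containsIntervalType(et)
  case MapType(kt, vt, _)      => containsIntervalType(kt) || containsIntervalType(vt)
  case st: StructType          => st.fields.exists(f => containsIntervalType(f.dataType))
  case _                       => false
}
{code}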



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30546) Make interval type more future-proofing

2020-01-21 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-30546:

Issue Type: New Feature  (was: Improvement)

> Make interval type more future-proofing
> ---
>
> Key: SPARK-30546
> URL: https://issues.apache.org/jira/browse/SPARK-30546
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Blocker
>
> Before 3.0 we may want to make some effort to make the current interval type
> more future-proof, e.g.:
> 1. Add an unstable annotation to the CalendarInterval class. People already use
> it as UDF input, so it's better to make it clear that it's unstable.
> 2. Add a schema checker to prohibit creating v2 custom catalog tables with
> intervals, the same as what we do for the built-in catalog.
> 3. Add a schema checker for DataFrameWriterV2 too.
> 4. Make the interval type incomparable, as in version 2.4, to disambiguate
> comparisons between year-month and day-time fields.
> 5. The newly added to_csv in 3.0 should not support outputting intervals, the
> same as the CSV file format.
> 6. The function to_json should not allow using an interval as a key field, the
> same as for the value field and the JSON datasource, with a legacy config to
> restore the old behavior.
> 7. Revert the interval ISO/ANSI SQL standard output, since we decided not to
> follow ANSI, so there is no round trip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28264) Revisiting Python / pandas UDF

2020-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28264.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27165
[https://github.com/apache/spark/pull/27165]

> Revisiting Python / pandas UDF
> --
>
> Key: SPARK-28264
> URL: https://issues.apache.org/jira/browse/SPARK-28264
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Reynold Xin
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 3.0.0
>
>
> In the past two years, the pandas UDFs are perhaps the most important changes 
> to Spark for Python data science. However, these functionalities have evolved 
> organically, leading to some inconsistencies and confusions among users. This 
> document revisits UDF definition and naming, as a result of discussions among 
> Xiangrui, Li Jin, Hyukjin, and Reynold.
> -See document here: 
> [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]-
>  New proposal: 
> https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30528) DPP issues

2020-01-21 Thread Mayur Bhosale (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020803#comment-17020803
 ] 

Mayur Bhosale edited comment on SPARK-30528 at 1/22/20 6:13 AM:


Thanks for the explanation [~maryannxue]
 # Should DPP be turned off by default until the heuristics are improved, or should it stay on by default but skip DPP when column-level stats are not available? In some cases this can be really disastrous.
 # Can we use the bloom filter to store the pruning values (for non-Broadcast 
Hash Join)? This will have multiple advantages -
 ## The size of the result returned to the driver would be way smaller
 ## Faster lookups compared to hashSet
 ## Reuse of the exchange will happen (because we won't be adding Aggregate on 
top)
 ## Duplicate subqueries because of multiple join conditions on partitioned 
columns will get removed (cases like example 3 in the description above)

         This will require more thought, though. Let me know if this sounds feasible and useful; then I can get back with more details and can pick it up as well. A rough sketch of the Bloom-filter idea follows below.
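As a rough illustration of the Bloom-filter idea in point 2 above (not the proposed DPP change itself; the table and column names are taken from the plan below, and the expectedNumItems/fpp values are made up), the existing public sketch API can already build and probe such a filter:

{code}
// Build a Bloom filter over the dimension-side join key and probe it from the fact side.
// User-level sketch only.
import org.apache.spark.sql.functions.{col, udf}

val dim  = spark.table("nonpartitionedtable").filter("id > 0")
val bf   = dim.stat.bloomFilter("col1", 1000000L, 0.03)  // org.apache.spark.util.sketch.BloomFilter
val bfBc = spark.sparkContext.broadcast(bf)
val mightContain = udf((k: Long) => bfBc.value.mightContainLong(k))

val pruned = spark.table("partitionedtable").filter(mightContain(col("partCol1")))
{code}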

       3. Yes, one of the subqueries selects `col1` and the other selects 
`col2`.
{code:java}
== Physical Plan == 
*(5) SortMergeJoin [partcol1#2L, partcol2#3], [col1#5L, col2#6], Inner
:- *(2) Sort [partcol1#2L ASC NULLS FIRST, partcol2#3 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(partcol1#2L, partcol2#3, 200), true, [id=#103]
: +- *(1) ColumnarToRow
:+- FileScan parquet 
default.partitionedtable[id#0L,name#1,partCol1#2L,partCol2#3] Batched: true, 
DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/sp...,
 PartitionFilters: [isnotnull(partCol2#3), isnotnull(partCol1#2L), 
dynamicpruningexpression(partCol1#2L IN subquery#..., PushedFilters: [], 
ReadSchema: struct
:  :- Subquery subquery#19, [id=#49]
:  :  +- *(2) HashAggregate(keys=[col1#5L], functions=[])
:  : +- Exchange hashpartitioning(col1#5L, 200), true, [id=#45]
:  :+- *(1) HashAggregate(keys=[col1#5L], functions=[])
:  :   +- *(1) Project [col1#5L]
:  :  +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 
0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L))
:  : +- *(1) ColumnarToRow
:  :+- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: 
[isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: 
Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct
:  +- Subquery subquery#21, [id=#82]
: +- *(2) HashAggregate(keys=[col2#6], functions=[])
:+- Exchange hashpartitioning(col2#6, 200), true, [id=#78]
:   +- *(1) HashAggregate(keys=[col2#6], functions=[])
:  +- *(1) Project [col2#6]
: +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 
0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L))
:+- *(1) ColumnarToRow
:   +- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: 
[isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: 
Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct
+- *(4) Sort [col1#5L ASC NULLS FIRST, col2#6 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#5L, col2#6, 200), true, [id=#113]
  +- *(3) Project [id#4L, col1#5L, col2#6, name#7]
 +- *(3) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND 
isnotnull(col2#6)) AND isnotnull(col1#5L))
+- *(3) ColumnarToRow
   +- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6,name#7] Batched: true, 
DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), 
isnotnull(col1#5L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct 
{code}
 

           If we don't decide to go with removing Aggregate (not using 
BloomFilter), should we combine such DPP subqueries into a                     
single sub-query? We can 

[jira] [Comment Edited] (SPARK-30528) DPP issues

2020-01-21 Thread Mayur Bhosale (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020803#comment-17020803
 ] 

Mayur Bhosale edited comment on SPARK-30528 at 1/22/20 6:11 AM:


Thanks for the explanation [~maryannxue]
 # Should DPP be turned off by default until the heuristics are improved, or should it stay on by default but skip DPP when column-level stats are not available? In some cases this can be really disastrous.
 # Can we use the bloom filter to store the pruning values (for non-Broadcast 
Hash Join)? This will have multiple advantages -
 ## The size of the result returned to the driver would be way smaller
 ## Faster lookups compared to hashSet
 ## Reuse of the exchange will happen (because we won't be adding Aggregate on 
top)
 ## Duplicate subqueries because of multiple join conditions on partitioned 
columns will get removed (cases like example 3 in the description above)

          With a Bloom filter, the DPP subquery should look something like this:

            | Bloom filter | <-- | Other operations | <-- | Exchange | <-- | Scan |

         This will require more thought, though. Let me know if this sounds feasible and useful; then I can get back with more details and can pick it up as well.

       3. Yes, one of the subqueries selects `col1` and the other selects 
`col2`.
{code:java}
== Physical Plan == 
*(5) SortMergeJoin [partcol1#2L, partcol2#3], [col1#5L, col2#6], Inner
:- *(2) Sort [partcol1#2L ASC NULLS FIRST, partcol2#3 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(partcol1#2L, partcol2#3, 200), true, [id=#103]
: +- *(1) ColumnarToRow
:+- FileScan parquet 
default.partitionedtable[id#0L,name#1,partCol1#2L,partCol2#3] Batched: true, 
DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/sp...,
 PartitionFilters: [isnotnull(partCol2#3), isnotnull(partCol1#2L), 
dynamicpruningexpression(partCol1#2L IN subquery#..., PushedFilters: [], 
ReadSchema: struct
:  :- Subquery subquery#19, [id=#49]
:  :  +- *(2) HashAggregate(keys=[col1#5L], functions=[])
:  : +- Exchange hashpartitioning(col1#5L, 200), true, [id=#45]
:  :+- *(1) HashAggregate(keys=[col1#5L], functions=[])
:  :   +- *(1) Project [col1#5L]
:  :  +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 
0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L))
:  : +- *(1) ColumnarToRow
:  :+- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: 
[isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: 
Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct
:  +- Subquery subquery#21, [id=#82]
: +- *(2) HashAggregate(keys=[col2#6], functions=[])
:+- Exchange hashpartitioning(col2#6, 200), true, [id=#78]
:   +- *(1) HashAggregate(keys=[col2#6], functions=[])
:  +- *(1) Project [col2#6]
: +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 
0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L))
:+- *(1) ColumnarToRow
:   +- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: 
[isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: 
Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct
+- *(4) Sort [col1#5L ASC NULLS FIRST, col2#6 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#5L, col2#6, 200), true, [id=#113]
  +- *(3) Project [id#4L, col1#5L, col2#6, name#7]
 +- *(3) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND 
isnotnull(col2#6)) AND isnotnull(col1#5L))
+- *(3) ColumnarToRow
   +- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6,name#7] Batched: true, 
DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), 
isnotnull(col1#5L)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct 
{code}
 


[jira] [Commented] (SPARK-30528) DPP issues

2020-01-21 Thread Mayur Bhosale (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020803#comment-17020803
 ] 

Mayur Bhosale commented on SPARK-30528:
---

Thanks for the explanation [~maryannxue]
 # Should DPP be turned off by default until the heuristics are improved, or should it stay on by default but skip DPP when column-level stats are not available? In some cases this can be really disastrous.
 # Can we use the bloom filter to store the pruning values (for non-Broadcast 
Hash Join)? This will have multiple advantages -
 ## The size of the result returned to the driver would be way smaller
 ## Faster lookups compared to hashSet
 ## Reuse of the exchange will happen (because we won't be adding Aggregate on 
top)
 ## Duplicate subqueries because of multiple join conditions on partitioned 
columns will get removed (cases like example 3 in the description above)

          With a Bloom filter, the DPP subquery should look something like this:

           | Bloom filter | <-- | Other operations | <-- | Exchange | <-- | Scan |

         This will require more thought, though. Let me know if this sounds feasible and useful; then I can get back with more details and can pick it up as well.

       3. Yes, one of the subqueries selects `col1` and the other selects 
`col2`.
{code:java}
== Physical Plan == 
*(5) SortMergeJoin [partcol1#2L, partcol2#3], [col1#5L, col2#6], Inner
:- *(2) Sort [partcol1#2L ASC NULLS FIRST, partcol2#3 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(partcol1#2L, partcol2#3, 200), true, [id=#103]
: +- *(1) ColumnarToRow
:+- FileScan parquet 
default.partitionedtable[id#0L,name#1,partCol1#2L,partCol2#3] Batched: true, 
DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/sp...,
 PartitionFilters: [isnotnull(partCol2#3), isnotnull(partCol1#2L), 
dynamicpruningexpression(partCol1#2L IN subquery#..., PushedFilters: [], 
ReadSchema: struct
:  :- Subquery subquery#19, [id=#49]
:  :  +- *(2) HashAggregate(keys=[col1#5L], functions=[])
:  : +- Exchange hashpartitioning(col1#5L, 200), true, [id=#45]
:  :+- *(1) HashAggregate(keys=[col1#5L], functions=[])
:  :   +- *(1) Project [col1#5L]
:  :  +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 
0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L))
:  : +- *(1) ColumnarToRow
:  :+- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: 
[isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: 
Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct
:  +- Subquery subquery#21, [id=#82]
: +- *(2) HashAggregate(keys=[col2#6], functions=[])
:+- Exchange hashpartitioning(col2#6, 200), true, [id=#78]
:   +- *(1) HashAggregate(keys=[col2#6], functions=[])
:  +- *(1) Project [col2#6]
: +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 
0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L))
:+- *(1) ColumnarToRow
:   +- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: 
[isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: 
Parquet, Location: 
InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa...,
 PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), 
IsNotNull(col2), IsNotNull(col1)], ReadSchema: 
struct
+- *(4) Sort [col1#5L ASC NULLS FIRST, col2#6 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(col1#5L, col2#6, 200), true, [id=#113]
  +- *(3) Project [id#4L, col1#5L, col2#6, name#7]
 +- *(3) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND 
isnotnull(col2#6)) AND isnotnull(col1#5L))
+- *(3) ColumnarToRow
   +- FileScan parquet 
default.nonpartitionedtable[id#4L,col1#5L,col2#6,name#7] Batched: true, 
DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), 
isnotnull(col1#5L)], Format: Parquet, Location: 

[jira] [Updated] (SPARK-30553) Fix structured-streaming java example error

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30553:
--
Affects Version/s: 2.1.1
   2.2.3

> Fix structured-streaming java example error
> ---
>
> Key: SPARK-30553
> URL: https://issues.apache.org/jira/browse/SPARK-30553
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.1, 2.2.3, 2.3.4, 2.4.4
>Reporter: bettermouse
>Assignee: bettermouse
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>
> [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]
> I wrote code following this in both Java and Scala.
> Java:
> {code:java}
> public static void main(String[] args) throws StreamingQueryException {
>     SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
>             .config("spark.sql.shuffle.partitions", 1)
>             .getOrCreate();
>     Dataset<Row> lines = spark.readStream().format("socket")
>             .option("host", "skynet")
>             .option("includeTimestamp", true)
>             .option("port", ).load();
>     Dataset<Row> words = lines.select("timestamp", "value");
>     Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
>             .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
>                     words.col("value")).count();
>     StreamingQuery start = count.writeStream()
>             .outputMode("update")
>             .format("console").start();
>     start.awaitTermination();
> }
> {code}
> scala
>  
> {code:java}
>  def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder.appName("test").
>   master("local[*]").
>   config("spark.sql.shuffle.partitions", 1)
>   .getOrCreate
> import spark.implicits._
> val lines = spark.readStream.format("socket").
>   option("host", "skynet").option("includeTimestamp", true).
>   option("port", ).load
> val words = lines.select("timestamp", "value")
> val count = words.withWatermark("timestamp", "10 seconds").
>   groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
>   .count()
> val start = count.writeStream.outputMode("update").format("console").start
> start.awaitTermination()
>   }
> {code}
> Both versions follow the official documentation. With the Java version I found that the metric
> "stateOnCurrentVersionSizeBytes" always increases, but the Scala version is fine.
>  
> java
>  
> {code:java}
> == Physical Plan ==
> WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
> +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
> output=[window#11, value#0, count#10L])
>+- StateStoreSave [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], Update, 1579274372624, 2
>   +- *(3) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>  +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], 2
> +- *(2) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>+- Exchange hashpartitioning(window#11, value#0, 1)
>   +- *(1) HashAggregate(keys=[window#11, value#0], 
> functions=[partial_count(1)], output=[window#11, value#0, count#21L])
>  +- *(1) Project [named_struct(start, 
> precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) + 1) ELSE 
> CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
> TimestampType), end, precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> 

[jira] [Updated] (SPARK-30553) Fix structured-streaming java example error

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30553:
--
Affects Version/s: 2.3.4

> Fix structured-streaming java example error
> ---
>
> Key: SPARK-30553
> URL: https://issues.apache.org/jira/browse/SPARK-30553
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.3.4, 2.4.4
>Reporter: bettermouse
>Assignee: bettermouse
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>
> [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]
> I wrote code following this in both Java and Scala.
> Java:
> {code:java}
> public static void main(String[] args) throws StreamingQueryException {
>     SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
>             .config("spark.sql.shuffle.partitions", 1)
>             .getOrCreate();
>     Dataset<Row> lines = spark.readStream().format("socket")
>             .option("host", "skynet")
>             .option("includeTimestamp", true)
>             .option("port", ).load();
>     Dataset<Row> words = lines.select("timestamp", "value");
>     Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
>             .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
>                     words.col("value")).count();
>     StreamingQuery start = count.writeStream()
>             .outputMode("update")
>             .format("console").start();
>     start.awaitTermination();
> }
> {code}
> scala
>  
> {code:java}
>  def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder.appName("test").
>   master("local[*]").
>   config("spark.sql.shuffle.partitions", 1)
>   .getOrCreate
> import spark.implicits._
> val lines = spark.readStream.format("socket").
>   option("host", "skynet").option("includeTimestamp", true).
>   option("port", ).load
> val words = lines.select("timestamp", "value")
> val count = words.withWatermark("timestamp", "10 seconds").
>   groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
>   .count()
> val start = count.writeStream.outputMode("update").format("console").start
> start.awaitTermination()
>   }
> {code}
> Both versions follow the official documentation. With the Java version I found that the metric
> "stateOnCurrentVersionSizeBytes" always increases, but the Scala version is fine.
>  
> java
>  
> {code:java}
> == Physical Plan ==
> WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
> +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
> output=[window#11, value#0, count#10L])
>+- StateStoreSave [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], Update, 1579274372624, 2
>   +- *(3) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>  +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], 2
> +- *(2) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>+- Exchange hashpartitioning(window#11, value#0, 1)
>   +- *(1) HashAggregate(keys=[window#11, value#0], 
> functions=[partial_count(1)], output=[window#11, value#0, count#21L])
>  +- *(1) Project [named_struct(start, 
> precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) + 1) ELSE 
> CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
> TimestampType), end, precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> 

[jira] [Updated] (SPARK-30553) Fix structured-streaming java example error

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30553:
--
Summary: Fix structured-streaming java example error  (was: 
structured-streaming documentation  java   watermark group by)

> Fix structured-streaming java example error
> ---
>
> Key: SPARK-30553
> URL: https://issues.apache.org/jira/browse/SPARK-30553
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: bettermouse
>Assignee: bettermouse
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>
> [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]
> I wrote code following this in both Java and Scala.
> Java:
> {code:java}
> public static void main(String[] args) throws StreamingQueryException {
>     SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
>             .config("spark.sql.shuffle.partitions", 1)
>             .getOrCreate();
>     Dataset<Row> lines = spark.readStream().format("socket")
>             .option("host", "skynet")
>             .option("includeTimestamp", true)
>             .option("port", ).load();
>     Dataset<Row> words = lines.select("timestamp", "value");
>     Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
>             .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
>                     words.col("value")).count();
>     StreamingQuery start = count.writeStream()
>             .outputMode("update")
>             .format("console").start();
>     start.awaitTermination();
> }
> {code}
> scala
>  
> {code:java}
>  def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder.appName("test").
>   master("local[*]").
>   config("spark.sql.shuffle.partitions", 1)
>   .getOrCreate
> import spark.implicits._
> val lines = spark.readStream.format("socket").
>   option("host", "skynet").option("includeTimestamp", true).
>   option("port", ).load
> val words = lines.select("timestamp", "value")
> val count = words.withWatermark("timestamp", "10 seconds").
>   groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
>   .count()
> val start = count.writeStream.outputMode("update").format("console").start
> start.awaitTermination()
>   }
> {code}
> Both versions follow the official documentation. With the Java version I found that the metric
> "stateOnCurrentVersionSizeBytes" always increases, but the Scala version is fine.
>  
> java
>  
> {code:java}
> == Physical Plan ==
> WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
> +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
> output=[window#11, value#0, count#10L])
>+- StateStoreSave [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], Update, 1579274372624, 2
>   +- *(3) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>  +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], 2
> +- *(2) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>+- Exchange hashpartitioning(window#11, value#0, 1)
>   +- *(1) HashAggregate(keys=[window#11, value#0], 
> functions=[partial_count(1)], output=[window#11, value#0, count#21L])
>  +- *(1) Project [named_struct(start, 
> precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) + 1) ELSE 
> CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
> TimestampType), end, precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, 

[jira] [Resolved] (SPARK-30553) structured-streaming documentation java watermark group by

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30553.
---
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27268
[https://github.com/apache/spark/pull/27268]

> structured-streaming documentation  java   watermark group by
> -
>
> Key: SPARK-30553
> URL: https://issues.apache.org/jira/browse/SPARK-30553
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: bettermouse
>Assignee: bettermouse
>Priority: Trivial
> Fix For: 2.4.5, 3.0.0
>
>
> [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]
> I wrote code following this in both Java and Scala.
> Java:
> {code:java}
> public static void main(String[] args) throws StreamingQueryException {
>     SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
>             .config("spark.sql.shuffle.partitions", 1)
>             .getOrCreate();
>     Dataset<Row> lines = spark.readStream().format("socket")
>             .option("host", "skynet")
>             .option("includeTimestamp", true)
>             .option("port", ).load();
>     Dataset<Row> words = lines.select("timestamp", "value");
>     Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
>             .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
>                     words.col("value")).count();
>     StreamingQuery start = count.writeStream()
>             .outputMode("update")
>             .format("console").start();
>     start.awaitTermination();
> }
> {code}
> scala
>  
> {code:java}
>  def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder.appName("test").
>   master("local[*]").
>   config("spark.sql.shuffle.partitions", 1)
>   .getOrCreate
> import spark.implicits._
> val lines = spark.readStream.format("socket").
>   option("host", "skynet").option("includeTimestamp", true).
>   option("port", ).load
> val words = lines.select("timestamp", "value")
> val count = words.withWatermark("timestamp", "10 seconds").
>   groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
>   .count()
> val start = count.writeStream.outputMode("update").format("console").start
> start.awaitTermination()
>   }
> {code}
> Both versions follow the official documentation. With the Java version I found that the metric
> "stateOnCurrentVersionSizeBytes" always increases, but the Scala version is fine.
>  
> java
>  
> {code:java}
> == Physical Plan ==
> WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
> +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
> output=[window#11, value#0, count#10L])
>+- StateStoreSave [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], Update, 1579274372624, 2
>   +- *(3) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>  +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], 2
> +- *(2) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>+- Exchange hashpartitioning(window#11, value#0, 1)
>   +- *(1) HashAggregate(keys=[window#11, value#0], 
> functions=[partial_count(1)], output=[window#11, value#0, count#21L])
>  +- *(1) Project [named_struct(start, 
> precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) + 1) ELSE 
> CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
> TimestampType), end, precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as 

[jira] [Assigned] (SPARK-30553) structured-streaming documentation java watermark group by

2020-01-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30553:
-

Assignee: bettermouse

> structured-streaming documentation  java   watermark group by
> -
>
> Key: SPARK-30553
> URL: https://issues.apache.org/jira/browse/SPARK-30553
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: bettermouse
>Assignee: bettermouse
>Priority: Trivial
>
> [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]
> I wrote code following this in both Java and Scala.
> Java:
> {code:java}
> public static void main(String[] args) throws StreamingQueryException {
>     SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
>             .config("spark.sql.shuffle.partitions", 1)
>             .getOrCreate();
>     Dataset<Row> lines = spark.readStream().format("socket")
>             .option("host", "skynet")
>             .option("includeTimestamp", true)
>             .option("port", ).load();
>     Dataset<Row> words = lines.select("timestamp", "value");
>     Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
>             .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
>                     words.col("value")).count();
>     StreamingQuery start = count.writeStream()
>             .outputMode("update")
>             .format("console").start();
>     start.awaitTermination();
> }
> {code}
> scala
>  
> {code:java}
>  def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder.appName("test").
>   master("local[*]").
>   config("spark.sql.shuffle.partitions", 1)
>   .getOrCreate
> import spark.implicits._
> val lines = spark.readStream.format("socket").
>   option("host", "skynet").option("includeTimestamp", true).
>   option("port", ).load
> val words = lines.select("timestamp", "value")
> val count = words.withWatermark("timestamp", "10 seconds").
>   groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
>   .count()
> val start = count.writeStream.outputMode("update").format("console").start
> start.awaitTermination()
>   }
> {code}
> Both versions follow the official documentation. With the Java version I found that the metric
> "stateOnCurrentVersionSizeBytes" always increases, but the Scala version is fine.
>  
> java
>  
> {code:java}
> == Physical Plan ==
> WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
> +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
> output=[window#11, value#0, count#10L])
>+- StateStoreSave [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], Update, 1579274372624, 2
>   +- *(3) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>  +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], 2
> +- *(2) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>+- Exchange hashpartitioning(window#11, value#0, 1)
>   +- *(1) HashAggregate(keys=[window#11, value#0], 
> functions=[partial_count(1)], output=[window#11, value#0, count#21L])
>  +- *(1) Project [named_struct(start, 
> precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) + 1) ELSE 
> CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
> TimestampType), end, precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> 

[jira] [Commented] (SPARK-30466) remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13

2020-01-21 Thread Michael Burgener (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020700#comment-17020700
 ] 

Michael Burgener commented on SPARK-30466:
--

Fair enough for the 2.x releases, but the Spark 3.x release should update the 
hadoop-client and hadoop-minikdc dependencies to the Hadoop 3.x line, since the 
Hadoop project has already removed these dependencies and migrated to 
jackson-databind. Potentially breaking changes are expected between major 
releases, so now, while 3.0.0 is in preview, would be the time to do it.

 

Is there a particular reason the hadoop libraries have not been updated to the 
3.x releases?
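
Until the artifacts are removed upstream, downstream builds can exclude them 
themselves; a minimal build.sbt sketch, assuming the legacy jars arrive 
transitively under the org.codehaus.jackson organization:

{code:scala}
// drop the deprecated jackson-asl artifacts from spark-core's transitive deps
libraryDependencies += ("org.apache.spark" %% "spark-core" % "2.4.4").excludeAll(
  ExclusionRule("org.codehaus.jackson", "jackson-mapper-asl"),
  ExclusionRule("org.codehaus.jackson", "jackson-core-asl")
)
{code}

This is only a stop-gap for deployments whose scanners block the jars; excluding 
them may break code paths that still need them.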

> remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13
> --
>
> Key: SPARK-30466
> URL: https://issues.apache.org/jira/browse/SPARK-30466
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Michael Burgener
>Priority: Major
>  Labels: security
>
> These two libraries are deprecated and have been replaced by the jackson-databind 
> libraries, which are already included. Both libraries are flagged by our 
> vulnerability scanners as having the following security vulnerabilities. I've set 
> the priority to Major due to the critical severity of the CVEs, and hopefully they 
> can be addressed quickly. Please note, I'm not a developer but work in InfoSec, 
> and this was flagged when we incorporated Spark into our product. If you feel the 
> priority is not set correctly, please change it accordingly. I'll watch the issue 
> and flag our dev team to update once it is resolved.
> jackson-mapper-asl-1.9.13
> CVE-2018-7489 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-7489] 
>  
> CVE-2017-7525 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-7525]
>  
> CVE-2017-17485 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-17485]
>  
> CVE-2017-15095 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-15095]
>  
> CVE-2018-5968 (CVSS 3.0 Score 8.1 High)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-5968]
>  
> jackson-core-asl-1.9.13
> CVE-2016-7051 (CVSS 3.0 Score 8.6 High)
> https://nvd.nist.gov/vuln/detail/CVE-2016-7051



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30602) Support push-based shuffle to improve shuffle efficiency

2020-01-21 Thread Min Shen (Jira)
Min Shen created SPARK-30602:


 Summary: Support push-based shuffle to improve shuffle efficiency
 Key: SPARK-30602
 URL: https://issues.apache.org/jira/browse/SPARK-30602
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Min Shen


In a large deployment of a Spark compute infrastructure, Spark shuffle is 
becoming a potential scaling bottleneck and a source of inefficiency in the 
cluster. When running Spark on YARN in a large-scale deployment, people usually 
enable the Spark external shuffle service and store the intermediate shuffle 
files on HDD. Because the number of blocks generated for a particular shuffle 
grows quadratically with the size of the shuffled data (# mappers and # reducers 
grow linearly with the size of the shuffled data, but # blocks is # mappers * # 
reducers), one general trend we have observed is that the more data a Spark 
application processes, the smaller the block size becomes. In a few production 
clusters we have seen, the average shuffle block size is only tens of KBs. 
Because of the inefficiency of performing random reads on HDD for small amounts 
of data, the overall efficiency of the Spark external shuffle services serving 
the shuffle blocks degrades as we see an increasing number of Spark applications 
processing an increasing amount of data. In addition, because the Spark external 
shuffle service is a shared service in a multi-tenancy cluster, the inefficiency 
of one Spark application can propagate to other applications as well.
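
As a back-of-the-envelope illustration of the # mappers * # reducers growth 
described above (the numbers below are hypothetical, not measurements from our 
clusters):

{code:scala}
// hypothetical shuffle: 10 TiB spread across 10k mappers and 10k reducers
val shuffledBytes = 10L * 1024 * 1024 * 1024 * 1024   // 10 TiB
val mappers       = 10000L
val reducers      = 10000L
val blocks        = mappers * reducers                // grows as M * R
val avgBlockBytes = shuffledBytes / blocks            // ~107 KiB per block
println(s"blocks=$blocks avgBlockKiB=${avgBlockBytes / 1024}")
{code}

Doubling the shuffled data (and hence mappers and reducers) quadruples the block 
count, so the average block size halves, which is the trend described above.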

In this ticket, we propose a solution to improve Spark shuffle efficiency in the 
above-mentioned environments with push-based shuffle. With push-based shuffle, 
shuffle is performed at the end of the mappers, and blocks get pre-merged and 
moved towards the reducers. In our prototype implementation, we have seen 
significant efficiency improvements when performing large shuffles. We take a 
Spark-native approach to achieve this, i.e., extending Spark's existing shuffle 
netty protocol and the behaviors of Spark mappers, reducers and drivers. This 
way, we can bring the benefits of more efficient shuffle to Spark without 
incurring the dependency on, or overhead of, either a specialized storage layer 
or external infrastructure pieces.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30601) Add a Google Maven Central as a primary repository

2020-01-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30601:


 Summary: Add a Google Maven Central as a primary repository
 Key: SPARK-30601
 URL: https://issues.apache.org/jira/browse/SPARK-30601
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.5, 3.0.0
Reporter: Hyukjin Kwon


See 
[http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html]

This Jira aims to switch the main repository to Google Maven Central.
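
For reference, a minimal build.sbt sketch of resolving against the GCS-hosted 
mirror; the mirror name and URL below are taken from the dev-list discussion and 
should be treated as assumptions here:

{code:scala}
// try Google's GCS mirror of Maven Central before the default resolvers
resolvers := Seq(
  "gcs-maven-central-mirror" at
    "https://maven-central.storage-download.googleapis.com/maven2/"
) ++ resolvers.value
{code}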



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30534) Use mvn in `dev/scalastyle`

2020-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30534.
--
Resolution: Later

Actually, the main issue we wanted to fix was already handled by ^.

> Use mvn in `dev/scalastyle`
> ---
>
> Key: SPARK-30534
> URL: https://issues.apache.org/jira/browse/SPARK-30534
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to use `mvn` instead of `sbt`.
> As of now, Apache Spark sbt build is broken by the Maven Central repository 
> policy.
> - 
> https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
> > Effective January 15, 2020, The Central Maven Repository no longer supports 
> > insecure
> > communication over plain HTTP and requires that all requests to the 
> > repository are 
> > encrypted over HTTPS.
> We can reproduce this locally by the following.
> $ rm -rf ~/.m2/repository/org/apache/apache/18/
> $ build/sbt clean
> This issue aims to recover GitHub Action `lint-scala` first by using mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30534) Use mvn in `dev/scalastyle`

2020-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30534:
-
Fix Version/s: (was: 2.4.5)
   (was: 3.0.0)

> Use mvn in `dev/scalastyle`
> ---
>
> Key: SPARK-30534
> URL: https://issues.apache.org/jira/browse/SPARK-30534
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to use `mvn` instead of `sbt`.
> As of now, Apache Spark sbt build is broken by the Maven Central repository 
> policy.
> - 
> https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
> > Effective January 15, 2020, The Central Maven Repository no longer supports 
> > insecure
> > communication over plain HTTP and requires that all requests to the 
> > repository are 
> > encrypted over HTTPS.
> We can reproduce this locally by the following.
> $ rm -rf ~/.m2/repository/org/apache/apache/18/
> $ build/sbt clean
> This issue aims to recover GitHub Action `lint-scala` first by using mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30534) Use mvn in `dev/scalastyle`

2020-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-30534:
--

> Use mvn in `dev/scalastyle`
> ---
>
> Key: SPARK-30534
> URL: https://issues.apache.org/jira/browse/SPARK-30534
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This issue aims to use `mvn` instead of `sbt`.
> As of now, Apache Spark sbt build is broken by the Maven Central repository 
> policy.
> - 
> https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
> > Effective January 15, 2020, The Central Maven Repository no longer supports 
> > insecure
> > communication over plain HTTP and requires that all requests to the 
> > repository are 
> > encrypted over HTTPS.
> We can reproduce this locally by the following.
> $ rm -rf ~/.m2/repository/org/apache/apache/18/
> $ build/sbt clean
> This issue aims to recover GitHub Action `lint-scala` first by using mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27878) Support ARRAY(sub-SELECT) expressions

2020-01-21 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003657#comment-17003657
 ] 

Takeshi Yamamuro edited comment on SPARK-27878 at 1/22/20 12:35 AM:


It seems PgSQL and BigQuery support this feature. If we implement the Pg 
dialect in the future, it might be worth supporting this. So, I'll keep this 
open for now.


was (Author: maropu):
It seems PgSql and BigQuery support this feature. If we implement the Pg 
dialect in future, it might be worth supporting this. So, I'll make this keep 
now.

> Support ARRAY(sub-SELECT) expressions
> -
>
> Key: SPARK-27878
> URL: https://issues.apache.org/jira/browse/SPARK-27878
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Construct an array from the results of a subquery. In this form, the array 
> constructor is written with the key word {{ARRAY}} followed by a 
> parenthesized (not bracketed) subquery. For example:
> {code:sql}
> SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
>  array
> ---
>  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
> (1 row)
> {code}
> More details:
>  
> [https://www.postgresql.org/docs/9.3/sql-expressions.html#SQL-SYNTAX-ARRAY-CONSTRUCTORS]
> [https://github.com/postgres/postgres/commit/730840c9b649a48604083270d48792915ca89233]
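> 
> Until such syntax exists, a rough workaround sketch in current Spark SQL is an 
> uncorrelated scalar subquery that aggregates with collect_list; the table and 
> column names below are hypothetical:
> {code:scala}
> // collect the subquery's results into a single array column
> spark.sql("""
>   SELECT (SELECT collect_list(oid) FROM procs WHERE proname LIKE 'bytea%') AS oids
> """).show(false)
> {code}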



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020685#comment-17020685
 ] 

Takeshi Yamamuro commented on SPARK-30599:
--

+1, too. The fix looks trivial.

> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: fail_trace.txt, success_trace.txt
>
>
> The LogAppender has a limit for the number of logged events. By default, it 
> is 100 entries. All tests that use LogAppender finish successfully with this 
> limit but sometimes one of the tests fails with the exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test *"SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema"* uses 2 log appenders. The 
> `success_trace.txt` contains stored log event in normal runs, another one 
> `fail_trace.txt` which I got while re-running the test in a loop.
> The difference is in:
> {code:java}
> [35] Block broadcast_222 stored as values in memory (estimated size 286.9 
> KiB, free 1998.5 MiB)
> [36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.8 MiB)
> [37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2003.9 MiB)
> [39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.0 MiB)
> [43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.0 MiB)
> [45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.0 MiB)
> [46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.0 MiB)
> [47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.1 MiB)
> [50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.2 MiB)
> [52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.2 MiB)
> [53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.2 MiB)
> [55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.3 MiB)
> [59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.3 MiB)
> [60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.4 MiB)
> [61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.4 MiB)
> [62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 
> 24.2 KiB, free 2002.0 MiB)
> [63] Added broadcast_222_piece0 in memory on 192.168.1.66:52354 

[jira] [Created] (SPARK-30600) Migrate ALTER VIEW SET/UNSET commands to the new resolution framework

2020-01-21 Thread Terry Kim (Jira)
Terry Kim created SPARK-30600:
-

 Summary: Migrate ALTER VIEW SET/UNSET commands to the new 
resolution framework
 Key: SPARK-30600
 URL: https://issues.apache.org/jira/browse/SPARK-30600
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.

2020-01-21 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim reopened SPARK-30420:
---

> Commands involved with namespace go thru the new resolution framework.
> --
>
> Key: SPARK-30420
> URL: https://issues.apache.org/jira/browse/SPARK-30420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> V2 commands that need to resolve namespace should go thru new resolution 
> framework introduced in 
> [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.

2020-01-21 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim resolved SPARK-30420.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Commands involved with namespace go thru the new resolution framework.
> --
>
> Key: SPARK-30420
> URL: https://issues.apache.org/jira/browse/SPARK-30420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> V2 commands that need to resolve namespace should go thru new resolution 
> framework introduced in 
> [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-27878) Support ARRAY(sub-SELECT) expressions

2020-01-21 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-27878:
-

> Support ARRAY(sub-SELECT) expressions
> -
>
> Key: SPARK-27878
> URL: https://issues.apache.org/jira/browse/SPARK-27878
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Construct an array from the results of a subquery. In this form, the array 
> constructor is written with the key word {{ARRAY}} followed by a 
> parenthesized (not bracketed) subquery. For example:
> {code:sql}
> SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%');
>  array
> ---
>  {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413}
> (1 row)
> {code}
> More details:
>  
> [https://www.postgresql.org/docs/9.3/sql-expressions.html#SQL-SYNTAX-ARRAY-CONSTRUCTORS]
> [https://github.com/postgres/postgres/commit/730840c9b649a48604083270d48792915ca89233]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28329) SELECT INTO syntax

2020-01-21 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020597#comment-17020597
 ] 

Xiao Li commented on SPARK-28329:
-

This conflicts with the SQL standard. 

> The SQL standard uses SELECT INTO to represent selecting values into scalar 
> variables of a host program, rather than creating a new table. 



> SELECT INTO syntax
> --
>
> Key: SPARK-28329
> URL: https://issues.apache.org/jira/browse/SPARK-28329
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> h2. Synopsis
> {noformat}
> [ WITH [ RECURSIVE ] with_query [, ...] ]
> SELECT [ ALL | DISTINCT [ ON ( expression [, ...] ) ] ]
> * | expression [ [ AS ] output_name ] [, ...]
> INTO [ TEMPORARY | TEMP | UNLOGGED ] [ TABLE ] new_table
> [ FROM from_item [, ...] ]
> [ WHERE condition ]
> [ GROUP BY expression [, ...] ]
> [ HAVING condition [, ...] ]
> [ WINDOW window_name AS ( window_definition ) [, ...] ]
> [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ]
> [ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | 
> LAST } ] [, ...] ]
> [ LIMIT { count | ALL } ]
> [ OFFSET start [ ROW | ROWS ] ]
> [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ]
> [ FOR { UPDATE | SHARE } [ OF table_name [, ...] ] [ NOWAIT ] [...] ]
> {noformat}
> h2. Description
> {{SELECT INTO}} creates a new table and fills it with data computed by a 
> query. The data is not returned to the client, as it is with a normal 
> {{SELECT}}. The new table's columns have the names and data types associated 
> with the output columns of the {{SELECT}}.
>  
> {{CREATE TABLE AS}} offers a superset of the functionality offered by 
> {{SELECT INTO}}.
> [https://www.postgresql.org/docs/11/sql-selectinto.html]
>  [https://www.postgresql.org/docs/11/sql-createtableas.html]
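> 
> For comparison, the CREATE TABLE AS form that Spark already supports, which 
> covers the same use case; the table names below are hypothetical:
> {code:scala}
> // materialize a query result as a new table (CTAS)
> spark.sql("CREATE TABLE new_table AS SELECT id, data FROM source_table WHERE id > 0")
> {code}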



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions

2020-01-21 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020563#comment-17020563
 ] 

Thomas Graves commented on SPARK-27784:
---

[~rdblue] can you confirm this doesn't exist in master?

> Alias ID reuse can break correctness when substituting foldable expressions
> ---
>
> Key: SPARK-27784
> URL: https://issues.apache.org/jira/browse/SPARK-27784
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.3.2
>Reporter: Ryan Blue
>Priority: Major
>  Labels: correctness
>
> This is a correctness bug when reusing a set of project expressions in the 
> DataFrame API.
> Use case: a user was migrating a table to a new version with an additional 
> column ("data" in the repro case). To migrate, the user unions the old table 
> ("t2") with the new table ("t1") and applies a common set of projections to 
> ensure the union doesn't hit an ordering issue (SPARK-22335). In some cases, 
> this produces an incorrect query plan:
> {code:java}
> Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1")
> Seq(1, 2, 3).toDF("id").write.saveAsTable("t2")
> val dim = Seq(2, 3, 4).toDF("id")
> val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data"))
> val t1 = spark.table("t1").select(outputCols:_*)
> val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*)
> t1.join(dim, t1("id") === dim("id")).select(t1("id"), 
> t1("data")).union(t2).explain(true){code}
> {code:java}
> == Physical Plan ==
> Union
> :- *Project [id#330, _ AS data#237] < THE CONSTANT IS 
> FROM THE OTHER SIDE OF THE UNION
> : +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight
> : :- *Project [id#330]
> : :  +- *Filter isnotnull(id#330)
> : : +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet, 
> Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters: 
> [IsNotNull(id)], ReadSchema: struct
> : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, 
> false] as bigint)))
> :+- LocalTableScan [id#234]
> +- *Project [id#340, _ AS data#237]
>+- *FileScan parquet t2[id#340] Batched: true, Format: Parquet, Location: 
> CatalogFileIndex[s3://.../t2], PartitionFilters: [], PushedFilters: [], 
> ReadSchema: struct{code}
> The problem happens because "outputCols" has an alias. The ID for that alias 
> is created when the projection Seq is created, so it is reused in both sides 
> of the union.
> When FoldablePropagation runs, it identifies that "data" in the t2 side of 
> the union is a foldable expression and replaces all references to it, 
> including the references in the t1 side of the union.
> The join to a dimension table is necessary to reproduce the problem because 
> it requires a Projection on top of the join that uses an AttributeReference 
> for data#237. Otherwise, the projections are collapsed and the projection 
> includes an Alias that does not get rewritten by FoldablePropagation.
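> 
> A workaround sketch implied by the analysis above is to build a fresh projection 
> list, and therefore fresh alias ids, for each side of the union instead of 
> reusing one Seq:
> {code:scala}
> // assumes spark.implicits._ and org.apache.spark.sql.functions._ as in the repro;
> // calling the def twice yields two independent Alias ids for "data"
> def outputCols() = Seq($"id", coalesce($"data", lit("_")).as("data"))
> val t1 = spark.table("t1").select(outputCols(): _*)
> val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols(): _*)
> {code}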



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause

2020-01-21 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020560#comment-17020560
 ] 

Thomas Graves commented on SPARK-27282:
---

[~sofia] can you confirm this isn't fixed in the latest version of Spark?

> Spark incorrect results when using UNION with GROUP BY clause
> -
>
> Key: SPARK-27282
> URL: https://issues.apache.org/jira/browse/SPARK-27282
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, Spark Submit, SQL
>Affects Versions: 2.3.2
> Environment: I'm using :
> IntelliJ  IDEA ==> 2018.1.4
> spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1)
> scala ==> 2.11.8
>Reporter: Sofia
>Priority: Major
>  Labels: correctness
>
> When using a UNION clause after a GROUP BY clause in Spark, the results 
> obtained are wrong.
> The following example illustrates this issue:
> {code:java}
> CREATE TABLE test_un (
> col1 varchar(255),
> col2 varchar(255),
> col3 varchar(255),
> col4 varchar(255)
> );
> INSERT INTO test_un (col1, col2, col3, col4)
> VALUES (1,1,2,4),
> (1,1,2,4),
> (1,1,3,5),
> (2,2,2,null);
> {code}
> I used the following code :
> {code:java}
> val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un")
> val  y = x
>.filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col3")
>   .agg(count(col("col3")).alias("cnt"))
>   .withColumn("col_name", lit("col3"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col3").alias("col_value"), col("cnt"))
> val z = x
>   .filter(col("col4")isNotNull)
>   .groupBy("col1", "col2","col4")
>   .agg(count(col("col4")).alias("cnt"))
>   .withColumn("col_name", lit("col4"))
>   .select(col("col1"), col("col2"), 
> col("col_name"),col("col4").alias("col_value"), col("cnt"))
> y.union(z).show()
> {code}
>  And I obtained the following results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|5|1|
> |1|1|col3|4|2|
> |1|1|col4|5|1|
> |1|1|col4|4|2|
> Expected results:
> ||col1||col2||col_name||col_value||cnt||
> |1|1|col3|3|1|
> |1|1|col3|2|2|
> |1|1|col4|4|2|
> |1|1|col4|5|1|
> But when I remove the last row of the table, I obtain the correct results.
> {code:java}
> (2,2,2,null){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020495#comment-17020495
 ] 

Dongjoon Hyun commented on SPARK-30599:
---

+1 for [~srowen]'s advice on increasing the capacity.

> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: fail_trace.txt, success_trace.txt
>
>
> The LogAppender has a limit for the number of logged events. By default, it 
> is 100 entries. All tests that use LogAppender finish successfully with this 
> limit but sometimes one of the tests fails with the exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test *"SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema"* uses 2 log appenders. The 
> `success_trace.txt` contains stored log event in normal runs, another one 
> `fail_trace.txt` which I got while re-running the test in a loop.
> The difference is in:
> {code:java}
> [35] Block broadcast_222 stored as values in memory (estimated size 286.9 
> KiB, free 1998.5 MiB)
> [36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.8 MiB)
> [37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2003.9 MiB)
> [39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.0 MiB)
> [43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.0 MiB)
> [45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.0 MiB)
> [46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.0 MiB)
> [47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.1 MiB)
> [50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.2 MiB)
> [52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.2 MiB)
> [53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.2 MiB)
> [55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.3 MiB)
> [59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.3 MiB)
> [60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.4 MiB)
> [61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.4 MiB)
> [62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 
> 24.2 KiB, free 2002.0 MiB)
> [63] Added broadcast_222_piece0 in memory on 

[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020480#comment-17020480
 ] 

Maxim Gekk commented on SPARK-30599:


> Or I wonder if we can make the logging less noisy too.

We could try to filter by log level, for instance Level.WARN by default. But 
there is still a risk of getting too many warnings, since we don't control what 
happens on the executors. FYI, LogAppender gathers logs not only on the driver 
but on the executors as well (in local mode).
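
A rough sketch of how the two ideas could combine, using the log4j 1.x API that 
SparkFunSuite already relies on; the class below is hypothetical, not the 
existing test appender:

{code:scala}
import scala.collection.mutable.ArrayBuffer
import org.apache.log4j.{AppenderSkeleton, Level}
import org.apache.log4j.spi.LoggingEvent

// buffer only WARN-and-above events, with a larger cap than today's 100
class BoundedLogAppender(maxEvents: Int = 1000) extends AppenderSkeleton {
  val events = new ArrayBuffer[LoggingEvent]()
  override protected def append(event: LoggingEvent): Unit = {
    if (event.getLevel.isGreaterOrEqual(Level.WARN)) {
      if (events.size >= maxEvents) {
        throw new IllegalStateException(
          s"Number of events reached the limit of $maxEvents")
      }
      events += event
    }
  }
  override def close(): Unit = {}
  override def requiresLayout(): Boolean = false
}
{code}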

> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: fail_trace.txt, success_trace.txt
>
>
> The LogAppender has a limit for the number of logged events. By default, it 
> is 100 entries. All tests that use LogAppender finish successfully with this 
> limit but sometimes one of the tests fails with the exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test *"SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema"* uses 2 log appenders. The 
> `success_trace.txt` contains stored log event in normal runs, another one 
> `fail_trace.txt` which I got while re-running the test in a loop.
> The difference is in:
> {code:java}
> [35] Block broadcast_222 stored as values in memory (estimated size 286.9 
> KiB, free 1998.5 MiB)
> [36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.8 MiB)
> [37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2003.9 MiB)
> [39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.0 MiB)
> [43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.0 MiB)
> [45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.0 MiB)
> [46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.0 MiB)
> [47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.1 MiB)
> [50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.2 MiB)
> [52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.2 MiB)
> [53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.2 MiB)
> [55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.3 MiB)
> [59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.3 MiB)
> [60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> 

[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30599:
---
Description: 
The LogAppender has a limit for the number of logged events. By default, it is 
100 entries. All tests that use LogAppender finish successfully with this limit 
but sometimes one of the tests fails with the exception like:
{code:java}
java.lang.IllegalStateException: Number of events reached the limit of 100 
while logging CSV header matches to schema w/ enforceSchema.
at 
org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856) 
{code}
For example, the CSV test *"SPARK-23786: warning should be printed if CSV 
header doesn't conform to schema"* uses 2 log appenders. The 
`success_trace.txt` contains stored log event in normal runs, another one 
`fail_trace.txt` which I got while re-running the test in a loop.

The difference is in:
{code:java}
[35] Block broadcast_222 stored as values in memory (estimated size 286.9 KiB, 
free 1998.5 MiB)
[36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.8 MiB)
[37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2003.9 MiB)
[39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.0 MiB)
[43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.0 MiB)
[45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.0 MiB)
[46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.0 MiB)
[47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.1 MiB)
[50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.2 MiB)
[52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.2 MiB)
[53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.2 MiB)
[55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.3 MiB)
[59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.3 MiB)
[60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.4 MiB)
[61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.4 MiB)
[62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 24.2 
KiB, free 2002.0 MiB)
[63] Added broadcast_222_piece0 in memory on 192.168.1.66:52354 (size: 24.2 
KiB, free: 2004.4 MiB)
[64] Removed broadcast_203_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[65] Created broadcast 222 from apply at OutcomeOf.scala:85
[66] Removed broadcast_218_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[67] Removed broadcast_211_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.5 MiB)
[68] Removed broadcast_190_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.5 MiB)
[69] Block broadcast_223 stored as values in memory (estimated size 286.9 KiB, 
free 2002.5 MiB)
[70] Removed broadcast_212_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.5 MiB)
[71] 

[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020474#comment-17020474
 ] 

Sean R. Owen commented on SPARK-30599:
--

For now, just increase the capacity? 1000 doesn't seem too large.
Or I wonder if we can make the logging less noisy too.

> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: fail_trace.txt, success_trace.txt
>
>
> The LogAppender has a limit for the number of logged event. By default, it is 
> 100. All tests that use LogAppender finish successfully with this limit but 
> sometimes the test fails with the exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test "SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema" uses 2 log appenders. The 
> `success_trace.txt` contains stored log event in normal runs, another one 
> `fail_trace.txt` which I got while re-running the test in a loop.
> The difference is in:
> {code:java}
> [35] Block broadcast_222 stored as values in memory (estimated size 286.9 
> KiB, free 1998.5 MiB)
> [36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.8 MiB)
> [37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2003.9 MiB)
> [39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.0 MiB)
> [43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.0 MiB)
> [45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.0 MiB)
> [46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.0 MiB)
> [47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.1 MiB)
> [50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.2 MiB)
> [52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.2 MiB)
> [53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.2 MiB)
> [55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.3 MiB)
> [59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.3 MiB)
> [60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.4 MiB)
> [61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.4 MiB)
> [62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 
> 24.2 KiB, free 2002.0 MiB)
> [63] 

[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30599:
---
Description: 
The LogAppender has a limit for the number of logged event. By default, it is 
100. All tests that use LogAppender finish successfully with this limit but 
sometimes the test fails with the exception like:
{code:java}
java.lang.IllegalStateException: Number of events reached the limit of 100 
while logging CSV header matches to schema w/ enforceSchema.
at 
org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856) 
{code}
For example, the CSV test "SPARK-23786: warning should be printed if CSV header 
doesn't conform to schema" uses 2 log appenders. The `success_trace.txt` 
contains stored log event in normal runs, another one `fail_trace.txt` I got 
while re-running the test in a loop.

The difference is in:
{code:java}
[35] Block broadcast_222 stored as values in memory (estimated size 286.9 KiB, 
free 1998.5 MiB)
[36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.8 MiB)
[37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2003.9 MiB)
[39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.0 MiB)
[43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.0 MiB)
[45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.0 MiB)
[46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.0 MiB)
[47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.1 MiB)
[50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.2 MiB)
[52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.2 MiB)
[53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.2 MiB)
[55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.3 MiB)
[59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.3 MiB)
[60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.4 MiB)
[61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.4 MiB)
[62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 24.2 
KiB, free 2002.0 MiB)
[63] Added broadcast_222_piece0 in memory on 192.168.1.66:52354 (size: 24.2 
KiB, free: 2004.4 MiB)
[64] Removed broadcast_203_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[65] Created broadcast 222 from apply at OutcomeOf.scala:85
[66] Removed broadcast_218_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[67] Removed broadcast_211_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.5 MiB)
[68] Removed broadcast_190_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.5 MiB)
[69] Block broadcast_223 stored as values in memory (estimated size 286.9 KiB, 
free 2002.5 MiB)
[70] Removed broadcast_212_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.5 MiB)
[71] Removed broadcast_188_piece0 on 

[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30599:
---
Description: 
The LogAppender has a limit for the number of logged event. By default, it is 
100. All tests that use LogAppender finish successfully with this limit but 
sometimes the test fails with the exception like:
{code:java}
java.lang.IllegalStateException: Number of events reached the limit of 100 
while logging CSV header matches to schema w/ enforceSchema.
at 
org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856) 
{code}
For example, the CSV test "SPARK-23786: warning should be printed if CSV header 
doesn't conform to schema" uses 2 log appenders. The `success_trace.txt` file 
contains the log events stored in normal runs; the other one, `fail_trace.txt`, 
was captured while re-running the test in a loop.

The difference is in:
{code:java}
[35] Block broadcast_222 stored as values in memory (estimated size 286.9 KiB, 
free 1998.5 MiB)
[36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.8 MiB)
[37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2003.9 MiB)
[39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.0 MiB)
[43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.0 MiB)
[45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.0 MiB)
[46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.0 MiB)
[47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.1 MiB)
[50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.2 MiB)
[52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.2 MiB)
[53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.2 MiB)
[55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.3 MiB)
[59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.3 MiB)
[60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.4 MiB)
[61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.4 MiB)
[62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 24.2 
KiB, free 2002.0 MiB)
[63] Added broadcast_222_piece0 in memory on 192.168.1.66:52354 (size: 24.2 
KiB, free: 2004.4 MiB)
[64] Removed broadcast_203_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[65] Created broadcast 222 from apply at OutcomeOf.scala:85
[66] Removed broadcast_218_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[67] Removed broadcast_211_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.5 MiB)
[68] Removed broadcast_190_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.5 MiB)
[69] Block broadcast_223 stored as values in memory (estimated size 286.9 KiB, 
free 2002.5 MiB)
[70] Removed broadcast_212_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.5 MiB)
[71] Removed 

[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020465#comment-17020465
 ] 

Maxim Gekk commented on SPARK-30599:


I see at least 2 solutions:
1. Increase the limit from 100 to a bigger number, for example 1000.
2. Pass a sequence of strings to LogAppender's constructor, and filter incoming 
log events by those strings in the `append()` method (a rough sketch follows 
below).

[~cloud_fan] [~dongjoon] [~hyukjin.kwon] [~maropu] [~srowen] WDYT?
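
A rough sketch of option 2, just to illustrate the idea (the class and field 
names here are hypothetical, not the actual SparkFunSuite code):
{code:scala}
import scala.collection.mutable.ArrayBuffer

import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.spi.LoggingEvent

// Only events whose message contains one of `msgFilters` are kept, so
// unrelated log noise (e.g. BlockManager broadcast cleanup messages) cannot
// push the number of captured events over the limit.
class FilteringLogAppender(msgFilters: Seq[String], maxEvents: Int = 100)
    extends AppenderSkeleton {
  val loggingEvents = new ArrayBuffer[LoggingEvent]()

  override def append(event: LoggingEvent): Unit = {
    val msg = event.getRenderedMessage
    if (msgFilters.isEmpty || msgFilters.exists(s => msg.contains(s))) {
      if (loggingEvents.size >= maxEvents) {
        throw new IllegalStateException(
          s"Number of events reached the limit of $maxEvents")
      }
      loggingEvents += event
    }
  }

  override def close(): Unit = {}
  override def requiresLayout(): Boolean = false
}
{code}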

> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: fail_trace.txt, success_trace.txt
>
>
> The LogAppender has a limit on the number of logged events. By default, it is 
> 100. All tests that use LogAppender finish successfully with this limit, but 
> sometimes a test fails with an exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test "SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema" uses 2 log appenders. The 
> `success_trace.txt` file contains the log events stored in normal runs; the 
> other one, `fail_trace.txt`, was captured while re-running the test in a loop.
> The difference is in:
> {code}
> [35] Block broadcast_222 stored as values in memory (estimated size 286.9 
> KiB, free 1998.5 MiB)
> [36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.8 MiB)
> [37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2003.9 MiB)
> [39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2003.9 MiB)
> [41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2003.9 MiB)
> [42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.0 MiB)
> [43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.0 MiB)
> [45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.0 MiB)
> [46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.0 MiB)
> [47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.0 MiB)
> [48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.1 MiB)
> [50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.1 MiB)
> [51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
> KiB, free: 2004.2 MiB)
> [52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.2 MiB)
> [53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.2 MiB)
> [55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
> KiB, free: 2004.2 MiB)
> [57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
> KiB, free: 2004.2 MiB)
> [58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.3 MiB)
> [59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
> KiB, free: 2004.3 MiB)
> [60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
> KiB, free: 2004.4 MiB)
> [61] Removed broadcast_220_piece0 on 

[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30599:
---
Description: 
The LogAppender has a limit on the number of logged events. By default, it is 
100. All tests that use LogAppender finish successfully with this limit, but 
sometimes a test fails with an exception like:
{code:java}
java.lang.IllegalStateException: Number of events reached the limit of 100 
while logging CSV header matches to schema w/ enforceSchema.
at 
org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856) 
{code}
For example, the CSV test "SPARK-23786: warning should be printed if CSV header 
doesn't conform to schema" uses 2 log appenders. The `success_trace.txt` file 
contains the log events stored in normal runs; the other one, `fail_trace.txt`, 
was captured while re-running the test in a loop.

The difference is in:
{code}
[35] Block broadcast_222 stored as values in memory (estimated size 286.9 KiB, 
free 1998.5 MiB)
[36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.8 MiB)
[37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2003.9 MiB)
[39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2003.9 MiB)
[41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2003.9 MiB)
[42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.0 MiB)
[43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.0 MiB)
[45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.0 MiB)
[46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.0 MiB)
[47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.0 MiB)
[48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.1 MiB)
[50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.1 MiB)
[51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 
KiB, free: 2004.2 MiB)
[52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.2 MiB)
[53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.2 MiB)
[55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.2 MiB)
[57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 
KiB, free: 2004.2 MiB)
[58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.3 MiB)
[59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.3 MiB)
[60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.4 MiB)
[61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.4 MiB)
[62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 24.2 
KiB, free 2002.0 MiB)
[63] Added broadcast_222_piece0 in memory on 192.168.1.66:52354 (size: 24.2 
KiB, free: 2004.4 MiB)
[64] Removed broadcast_203_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[65] Created broadcast 222 from apply at OutcomeOf.scala:85
[66] Removed broadcast_218_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.4 MiB)
[67] Removed broadcast_211_piece0 on 192.168.1.66:52354 in memory (size: 53.4 
KiB, free: 2004.5 MiB)
[68] Removed broadcast_190_piece0 on 192.168.1.66:52354 in memory (size: 3.7 
KiB, free: 2004.5 MiB)
[69] Block broadcast_223 stored as values in memory (estimated size 286.9 KiB, 
free 2002.5 MiB)
[70] Removed broadcast_212_piece0 on 192.168.1.66:52354 in memory (size: 24.2 
KiB, free: 2004.5 MiB)
[71] Removed broadcast_188_piece0 on 

[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30599:
---
Description: 
The LogAppender has a limit on the number of logged events. By default, it is 
100. All tests that use LogAppender finish successfully with this limit, but 
sometimes a test fails with an exception like:
{code:java}
java.lang.IllegalStateException: Number of events reached the limit of 100 
while logging CSV header matches to schema w/ enforceSchema.
at 
org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856) 
{code}
For example, the CSV test "SPARK-23786: warning should be printed if CSV header 
doesn't conform to schema" uses 2 log appenders. The `success_trace.txt` file 
contains the log events stored in normal runs; the other one, `fail_trace.txt`, 
was captured while re-running the test in a loop.

  was:
The LogAppender has a limit on the number of logged events. By default, it is 
100. All tests that use LogAppender finish successfully with this limit, but 
sometimes a test fails with an exception like:
{code:java}
java.lang.IllegalStateException: Number of events reached the limit of 100 
while logging CSV header matches to schema w/ enforceSchema.
at 
org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856) 
{code}
For example, the CSV test "SPARK-23786: warning should be printed if CSV header 
doesn't conform to schema" uses 2 log appenders.


> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: fail_trace.txt, success_trace.txt
>
>
> The LogAppender has a limit on the number of logged events. By default, it is 
> 100. All tests that use LogAppender finish successfully with this limit, but 
> sometimes a test fails with an exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test "SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema" uses 2 log appenders. The 
> `success_trace.txt` file contains the log events stored in normal runs; the 
> other one, `fail_trace.txt`, was captured while re-running the test in a loop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30599:
---
Attachment: fail_trace.txt

> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: fail_trace.txt, success_trace.txt
>
>
> The LogAppender has a limit on the number of logged events. By default, it is 
> 100. All tests that use LogAppender finish successfully with this limit, but 
> sometimes a test fails with an exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test "SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema" uses 2 log appenders.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30599:
--

 Summary: SparkFunSuite.LogAppender throws 
java.lang.IllegalStateException
 Key: SPARK-30599
 URL: https://issues.apache.org/jira/browse/SPARK-30599
 Project: Spark
  Issue Type: Test
  Components: Spark Core, SQL, Tests
Affects Versions: 3.0.0
Reporter: Maxim Gekk
 Attachments: success_trace.txt

The LogAppender has a limit on the number of logged events. By default, it is 
100. All tests that use LogAppender finish successfully with this limit, but 
sometimes a test fails with an exception like:
{code:java}
java.lang.IllegalStateException: Number of events reached the limit of 100 
while logging CSV header matches to schema w/ enforceSchema.
at 
org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
at 
org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
at org.apache.log4j.Category.callAppenders(Category.java:206)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856) 
{code}
For example, the CSV test "SPARK-23786: warning should be printed if CSV header 
doesn't conform to schema" uses 2 log appenders.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException

2020-01-21 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-30599:
---
Attachment: success_trace.txt

> SparkFunSuite.LogAppender throws java.lang.IllegalStateException
> 
>
> Key: SPARK-30599
> URL: https://issues.apache.org/jira/browse/SPARK-30599
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
> Attachments: success_trace.txt
>
>
> The LogAppender has a limit on the number of logged events. By default, it is 
> 100. All tests that use LogAppender finish successfully with this limit, but 
> sometimes a test fails with an exception like:
> {code:java}
> java.lang.IllegalStateException: Number of events reached the limit of 100 
> while logging CSV header matches to schema w/ enforceSchema.
>   at 
> org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200)
>   at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
>   at 
> org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
>   at org.apache.log4j.Category.callAppenders(Category.java:206)
>   at org.apache.log4j.Category.forcedLog(Category.java:391)
>   at org.apache.log4j.Category.log(Category.java:856) 
> {code}
> For example, the CSV test "SPARK-23786: warning should be printed if CSV 
> header doesn't conform to schema" uses 2 log appenders.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30528) DPP issues

2020-01-21 Thread Wei Xue (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020385#comment-17020385
 ] 

Wei Xue commented on SPARK-30528:
-

Good point, [~mayurb31]!

1. Heuristics: yes, we should improve the cost estimate for the filter plan. As 
a quick workaround, though, you could set 
"spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio" to "0.0", 
which would disable this kind of DPP if column stats are not available.
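
For illustration, a minimal way to apply that workaround (assuming the config 
can be set at session level; this only disables the fallback path, it is not a 
fix):
{code:scala}
// In spark-shell (or any SparkSession named `spark`); the same key can also
// be passed via --conf at submit time.
spark.conf.set(
  "spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio", "0.0")
{code}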

2. Reuse: yes, there's a dilemma here: first of all we want to reduce the result 
set returned to the driver for the pruning values, and that's why the Aggregate 
is added. It might kill some potential opportunities for exchange reuse if the 
filter plan contains another join, but that kind of opportunity is not fully 
guaranteed even if we didn't push down the Aggregate, since the join in the 
filter plan can turn out to be a BHJ. However, `pruningHasBenefit` was intended 
to cost the entire DPP subquery as overhead without considering reuse, so this 
takes us back to point 1: we should improve the heuristic's costing so that 
this kind of DPP is not triggered at all if the scan + join would be just too 
much work.

3. Can you attach the query plan instead of the UI screenshot for your last 
example? I think the reuse did not happen because the first subquery selects 
`col1` while the second selects `col2`?

> DPP issues
> --
>
> Key: SPARK-30528
> URL: https://issues.apache.org/jira/browse/SPARK-30528
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: Mayur Bhosale
>Priority: Major
>  Labels: performance
> Attachments: cases.png, dup_subquery.png, plan.png
>
>
> In DPP, the heuristic that decides if DPP is going to benefit relies on the 
> sizes of the tables in the right subtree of the join. This might not be a 
> correct estimate, especially when detailed column-level stats are not 
> available.
> {code:java}
> // the pruning overhead is the total size in bytes of all scan relations
> val overhead = 
> otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat
> filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat
> {code}
> Also, DPP executes the entire right side of the join as a subquery, because of 
> which multiple scans happen for the tables in the right subtree of the join. 
> This can cause issues when the join is not a Broadcast Hash Join (BHJ) and 
> reuse of the subquery result does not happen. Also, I couldn't figure out why 
> the results from the subquery get re-used only for BHJ.
>  
> Consider a query,
> {code:java}
> SELECT * 
> FROM   store_sales_partitioned 
>JOIN (SELECT * 
>  FROM   store_returns_partitioned, 
> date_dim 
>  WHERE  sr_returned_date_sk = d_date_sk) ret_date 
>  ON ss_sold_date_sk = d_date_sk 
> WHERE  d_fy_quarter_seq > 0 
> {code}
> DPP will kick in for both the joins. (Please check the image plan.png attached 
> below for the plan.)
> Some of the observations -
>  * Based on heuristics, DPP would go ahead with pruning if the cost of 
> scanning the tables in the right sub-tree of the join is less than the 
> benefit due to pruning. This is because multiple scans will be needed for an 
> SMJ. But the heuristic simply checks whether the benefits offset the cost of 
> multiple scans and does not take into consideration other operations, like 
> the Join, etc., in the right subtree, which can be quite expensive. This 
> issue will be particularly prominent when detailed column-level stats are 
> not available. In the example above, the decision that pruningHasBenefit was 
> made on the basis of the sizes of the tables store_returns_partitioned and 
> date_dim, but did not take into consideration the join between them, which 
> happens before the join with the store_sales_partitioned table.
>  * Multiple scans are needed when the join is an SMJ, as the reuse of the 
> exchanges does not happen. This is because an Aggregate gets added on top of 
> the right subtree, to be executed as a subquery, in order to keep only the 
> required columns. Here, scanning all the columns just as the right subtree 
> of the join would, and reusing the same exchange, might be more helpful as 
> it avoids duplicate scans.
> This was just a representative example, but in general, for cases such as in 
> the image cases.png below, DPP can cause performance issues.
>  
> Also, for the cases when there are multiple DPP compatible join conditions in 
> the same join, the entire right subtree of the join would be executed as a 
> subquery that many times. Consider an example,
> {code:java}
> SELECT * 
> FROM   partitionedtable 
>    JOIN nonpartitionedtable 
>  ON partcol1 = col1 
> AND partcol2 = col2 
> WHERE  nonpartitionedtable.id > 0 
> {code}
> Here 

[jira] [Commented] (SPARK-30049) SQL fails to parse when comment contains an unmatched quote character

2020-01-21 Thread Javier Fuentes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020378#comment-17020378
 ] 

Javier Fuentes commented on SPARK-30049:


I have raised a PR with a simple fix and unit tests. Let me know if it looks 
good or not. 

> SQL fails to parse when comment contains an unmatched quote character
> -
>
> Key: SPARK-30049
> URL: https://issues.apache.org/jira/browse/SPARK-30049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Darrell Lowe
>Priority: Major
> Attachments: Screen Shot 2019-12-18 at 9.26.29 AM.png
>
>
> A SQL statement that contains a comment with an unmatched quote character can 
> lead to a parse error.  These queries parsed correctly in older versions of 
> Spark.  For example, here's an excerpt from an interactive spark-sql session 
> on a recent Spark-3.0.0-SNAPSHOT build (commit 
> e23c135e568d4401a5659bc1b5ae8fc8bf147693):
> {noformat}
> spark-sql> SELECT 1 -- someone's comment here
>  > ;
> Error in query: 
> extraneous input ';' expecting (line 2, pos 0)
> == SQL ==
> SELECT 1 -- someone's comment here
> ;
> ^^^
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30598) Detect equijoins better

2020-01-21 Thread Peter Toth (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-30598:
---
Issue Type: Improvement  (was: Bug)

> Detect equijoins better
> ---
>
> Key: SPARK-30598
> URL: https://issues.apache.org/jira/browse/SPARK-30598
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Minor
>
> The following 2 queries produce different plans, as the second one is not 
> recognised as an equijoin.
> {noformat}
> SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 AND t1.c = t2.c
> SortMergeJoin [c#225], [c#236], FullOuter, ((c2#226 = 2) AND (c2#237 = 2))
> :- *(2) Sort [c#225 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(c#225, 5), true, [id=#101]
> : +- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226]
> :+- *(1) LocalTableScan [_1#220, _2#221]
> +- *(4) Sort [c#236 ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(c#236, 5), true, [id=#106]
>   +- *(3) Project [_1#231 AS c#236, _2#232 AS c2#237]
>  +- *(3) LocalTableScan [_1#231, _2#232]
> {noformat}
> {noformat}
> SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2
> BroadcastNestedLoopJoin BuildRight, FullOuter, ((c2#226 = 2) AND (c2#237 = 2))
> :- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226]
> :  +- *(1) LocalTableScan [_1#220, _2#221]
> +- BroadcastExchange IdentityBroadcastMode, [id=#146]
>+- *(2) Project [_1#231 AS c#236, _2#232 AS c2#237]
>   +- *(2) LocalTableScan [_1#231, _2#232]
> {noformat}
> We could detect the implicit equalities from the join condition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30598) Detect equijoins better

2020-01-21 Thread Peter Toth (Jira)
Peter Toth created SPARK-30598:
--

 Summary: Detect equijoins better
 Key: SPARK-30598
 URL: https://issues.apache.org/jira/browse/SPARK-30598
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Peter Toth


The following 2 queries produce different plans, as the second one is not 
recognised as an equijoin.

{noformat}
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 AND t1.c = t2.c

SortMergeJoin [c#225], [c#236], FullOuter, ((c2#226 = 2) AND (c2#237 = 2))
:- *(2) Sort [c#225 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(c#225, 5), true, [id=#101]
: +- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226]
:+- *(1) LocalTableScan [_1#220, _2#221]
+- *(4) Sort [c#236 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(c#236, 5), true, [id=#106]
  +- *(3) Project [_1#231 AS c#236, _2#232 AS c2#237]
 +- *(3) LocalTableScan [_1#231, _2#232]
{noformat}

{noformat}
SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2

BroadcastNestedLoopJoin BuildRight, FullOuter, ((c2#226 = 2) AND (c2#237 = 2))
:- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226]
:  +- *(1) LocalTableScan [_1#220, _2#221]
+- BroadcastExchange IdentityBroadcastMode, [id=#146]
   +- *(2) Project [_1#231 AS c#236, _2#232 AS c2#237]
  +- *(2) LocalTableScan [_1#231, _2#232]
{noformat}

We could detect the implicit equalities from the join condition.
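
For illustration only, a manual rewrite of the second query that makes the 
implicit equality explicit (a sketch of the idea, not the proposed optimizer 
change): since both sides are constrained to the same constant, adding 
t1.c2 = t2.c2 is semantically equivalent and gives the planner an equi-join key.
{noformat}
SELECT * FROM t1 FULL OUTER JOIN t2
  ON t1.c2 = 2 AND t2.c2 = 2 AND t1.c2 = t2.c2
{noformat}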



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15616) CatalogRelation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15616.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26805
[https://github.com/apache/spark/pull/26805]

> CatalogRelation should fallback to HDFS size of partitions that are involved 
> in Query if statistics are not available.
> --
>
> Key: SPARK-15616
> URL: https://issues.apache.org/jira/browse/SPARK-15616
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lianhui Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, if some partitions of a partitioned table are used in a join 
> operation, we rely on the Metastore-returned size of the table to decide if 
> we can convert the operation to a broadcast join. 
> If Filter can prune some partitions, Hive prunes partitions before deciding 
> whether to use broadcast joins, according to the HDFS size of the partitions 
> that are involved in the query. Spark SQL needs the same, which can improve 
> join performance for partitioned tables.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30252) Disallow negative scale of Decimal under ansi mode

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30252.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26881
[https://github.com/apache/spark/pull/26881]

> Disallow negative scale of Decimal under ansi mode
> --
>
> Key: SPARK-30252
> URL: https://issues.apache.org/jira/browse/SPARK-30252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> According to SQL standard,
> {quote}4.4.2 Characteristics of numbers
> An exact numeric type has a precision P and a scale S. P is a positive 
> integer that determines the number of significant digits in a particular 
> radix R, where R is either 2 or 10. S is a non-negative integer.
> {quote}
> the scale of Decimal should always be non-negative. Other mainstream 
> databases, like Presto and PostgreSQL, also don't allow negative scale.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30252) Disallow negative scale of Decimal under ansi mode

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30252:
---

Assignee: wuyi

> Disallow negative scale of Decimal under ansi mode
> --
>
> Key: SPARK-30252
> URL: https://issues.apache.org/jira/browse/SPARK-30252
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> According to SQL standard,
> {quote}4.4.2 Characteristics of numbers
> An exact numeric type has a precision P and a scale S. P is a positive 
> integer that determines the number of significant digits in a particular 
> radix R, where R is either 2 or 10. S is a non-negative integer.
> {quote}
> the scale of Decimal should always be non-negative. Other mainstream 
> databases, like Presto and PostgreSQL, also don't allow negative scale.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30593) Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI and no round trip.

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30593:
---

Assignee: Kent Yao

> Revert interval ISO/ANSI SQL Standard output since we decide not to follow 
> ANSI and no round trip.
> --
>
> Key: SPARK-30593
> URL: https://issues.apache.org/jira/browse/SPARK-30593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Revert the interval ISO/ANSI SQL standard output since we decided not to
> follow ANSI, so there is no round trip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30593) Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI and no round trip.

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30593.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27304
[https://github.com/apache/spark/pull/27304]

> Revert interval ISO/ANSI SQL Standard output since we decide not to follow 
> ANSI and no round trip.
> --
>
> Key: SPARK-30593
> URL: https://issues.apache.org/jira/browse/SPARK-30593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Revert the interval ISO/ANSI SQL standard output since we decided not to
> follow ANSI, so there is no round trip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29783) Support SQL Standard output style for interval type

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-29783:
-

> Support SQL Standard output style for interval type
> ---
>
> Key: SPARK-29783
> URL: https://issues.apache.org/jira/browse/SPARK-29783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Support sql standard interval-style for output.
>  
> ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval||
> |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06|
> |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes 
> 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29783) Support SQL Standard output style for interval type

2020-01-21 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020205#comment-17020205
 ] 

Wenchen Fan commented on SPARK-29783:
-

This has been reverted in https://github.com/apache/spark/pull/27304

> Support SQL Standard output style for interval type
> ---
>
> Key: SPARK-29783
> URL: https://issues.apache.org/jira/browse/SPARK-29783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Support sql standard interval-style for output.
>  
> ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval||
> |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06|
> |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes 
> 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29783) Support SQL Standard output style for interval type

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29783.
-
Resolution: Won't Do

> Support SQL Standard output style for interval type
> ---
>
> Key: SPARK-29783
> URL: https://issues.apache.org/jira/browse/SPARK-29783
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Support sql standard interval-style for output.
>  
> ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval||
> |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06|
> |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes 
> 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30597) Unable to load properties file in Spark Standalone HDFS mode

2020-01-21 Thread Gajanan Hebbar (Jira)
Gajanan Hebbar created SPARK-30597:
--

 Summary: Unable to load properties file in Spark Standalone HDFS 
mode
 Key: SPARK-30597
 URL: https://issues.apache.org/jira/browse/SPARK-30597
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3
 Environment: 2.4.3


standalone with HDFS 
Reporter: Gajanan Hebbar


We run the Spark application in YARN HDFS/NFS/WebHDFS and standalone HDFS/NFS 
modes.

When the application is submitted in standalone HDFS mode, the configuration 
jar (properties file) is not read when the application is started, and this 
makes the logger fall back to the default log file and log level.

So when the application is submitted to standalone HDFS, the configuration 
files are not read.
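
As a possible workaround sketch only (not a fix for the lookup issue itself; 
the property file name is the one from the logs below, the other values are 
placeholders, and the driver may need a file: URL in client mode), the log4j 
configuration could be pointed at explicitly on submit:
{noformat}
spark-submit \
  --master spark://<master-host>:7077 \
  --files /path/to/osa-log.properties \
  --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=osa-log.properties" \
  --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=osa-log.properties" \
  <application-jar>
{noformat}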


STD OUT LOGS from Standalone HDFS  - properties file is not found 

==

log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Trying to find [osa-log.properties] using 
sun.misc.Launcher$AppClassLoader@4cdf35a9 class loader.
log4j: Trying to find [osa-log.properties] using 
ClassLoader.getSystemResource().
log4j: Could not find resource: [osa-log.properties].
log4j: Reading configuration from URL 
jar:file:/osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties

 

 

STD out from Standalone NFS  - properties files are found and able to load it

===

 

log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@4cdf35a9.
log4j: Using URL 
[jar:file:/scratch//gitlocal/soa-osa/out/configdir/ux6m3UQZ/app/sx_DatePipeline_A44C5337_B0D6_4A67_9D60_6BE629DABADA_x0jR7lg5_public/__config__.jar!/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator

 

STD out from YARN HDFS  — properties files are found and able to load it

===

log4j: Trying to find [osa-log.properties] using context classloader 
sun.misc.Launcher$AppClassLoader@2626b418.
log4j: Using URL 
[file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties]
 for automatic log4j configuration.
log4j: Preferred configurator class: OSALogPropertyConfigurator
log4j: configuration 
file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30546) Make interval type more future-proofing

2020-01-21 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-30546:
-
Description: 
Before 3.0 we may make some efforts for the current interval type to make it
more future-proofing. e.g.
1. add unstable annotation to the CalendarInterval class. People already use
it as UDF inputs so it’s better to make it clear it’s unstable.
2. Add a schema checker to prohibit create v2 custom catalog table with
intervals, as same as what we do for the builtin catalog
3. Add a schema checker for DataFrameWriterV2 too
4. Make the interval type incomparable as version 2.4 for disambiguation of
comparison between year-month and day-time fields
5. The 3.0 newly added to_csv should not support output intervals as same as
using CSV file format
6. The function to_json should not allow using interval as a key field as
same as the value field and JSON datasource, with a legacy config to
restore.
7. Revert interval ISO/ANSI SQL Standard output since we decide not to
follow ANSI, so there is no round trip.

  was:


Before 3.0 we maymake some efforts for the current interval type to make it
more future-proofing. e.g.
1. add unstable annotation to the CalendarInterval class. People already use
it as UDF inputs so it’s better to make it clear it’s unstable.
2. Add a schema checker to prohibit create v2 custom catalog table with
intervals, as same as what we do for the builtin catalog
3. Add a schema checker for DataFrameWriterV2 too
4. Make the interval type incomparable as version 2.4 for disambiguation of
comparison between year-month and day-time fields
5. The 3.0 newly added to_csv should not support output intervals as same as
using CSV file format
6. The function to_json should not allow using interval as a key field as
same as the value field and JSON datasource, with a legacy config to
restore.
7. Revert interval ISO/ANSI SQL Standard output since we decide not to
follow ANSI, so there is no round trip.


> Make interval type more future-proofing
> ---
>
> Key: SPARK-30546
> URL: https://issues.apache.org/jira/browse/SPARK-30546
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> Before 3.0 we may make some efforts on the current interval type to make it
> more future-proof, e.g.
> 1. Add the unstable annotation to the CalendarInterval class. People already
> use it as UDF input, so it's better to make it clear it's unstable.
> 2. Add a schema checker to prohibit creating v2 custom catalog tables with
> intervals, the same as what we do for the builtin catalog.
> 3. Add a schema checker for DataFrameWriterV2 too.
> 4. Make the interval type incomparable, as in version 2.4, for disambiguation
> of comparison between year-month and day-time fields.
> 5. The newly added (in 3.0) to_csv should not support outputting intervals,
> the same as the CSV file format.
> 6. The function to_json should not allow using interval as a key field, the
> same as for the value field and the JSON datasource, with a legacy config to
> restore the old behavior.
> 7. Revert the interval ISO/ANSI SQL standard output since we decided not to
> follow ANSI, so there is no round trip.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30571) coalesce shuffle reader with splitting shuffle fetch request fails

2020-01-21 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30571.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27280
[https://github.com/apache/spark/pull/27280]

> coalesce shuffle reader with splitting shuffle fetch request fails
> --
>
> Key: SPARK-30571
> URL: https://issues.apache.org/jira/browse/SPARK-30571
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19798) Query returns stale results when tables are modified on other sessions

2020-01-21 Thread Giambattista (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Giambattista updated SPARK-19798:
-
Affects Version/s: 3.0.0

> Query returns stale results when tables are modified on other sessions
> --
>
> Key: SPARK-19798
> URL: https://issues.apache.org/jira/browse/SPARK-19798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0, 3.0.0
>Reporter: Giambattista
>Priority: Major
>
> I observed the problem on the master branch with the thrift server in 
> multisession mode (the default), but I was able to replicate it with 
> spark-shell as well (see below the sequence for replicating).
> I observed cases where changes made in a session (table insert, table 
> renaming) are not visible to other derived sessions (created with 
> session.newSession).
> The problem seems due to the fact that each session has its own 
> tableRelationCache and it does not get refreshed.
> IMO tableRelationCache should be shared in sharedState, maybe in the 
> cacheManager so that refresh of caches for data that is not session-specific 
> such as temporary tables gets centralized.  
> --- Spark shell script
> val spark2 = spark.newSession
> spark.sql("CREATE TABLE test (a int) using parquet")
> spark2.sql("select * from test").show // OK returns empty
> spark.sql("select * from test").show // OK returns empty
> spark.sql("insert into TABLE test values 1,2,3")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 3,2,1
> spark.sql("create table test2 (a int) using parquet")
> spark.sql("insert into TABLE test2 values 4,5,6")
> spark2.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("select * from test2").show // OK returns 6,4,5
> spark.sql("alter table test rename to test3")
> spark.sql("alter table test2 rename to test")
> spark.sql("alter table test3 rename to test2")
> spark2.sql("select * from test").show // ERROR returns empty
> spark.sql("select * from test").show // OK returns 6,4,5
> spark2.sql("select * from test2").show // ERROR throws 
> java.io.FileNotFoundException
> spark.sql("select * from test2").show // OK returns 3,1,2



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30596) Use a different repository id in pom to work around sbt-pom-reader issue

2020-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30596.
--
Resolution: Invalid

> Use a different repository id in pom to work around sbt-pom-reader issue
> 
>
> Key: SPARK-30596
> URL: https://issues.apache.org/jira/browse/SPARK-30596
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See [https://github.com/apache/spark/pull/27242#issuecomment-576589529]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30596) Use a different repository id in pom to work around sbt-pom-reader issue

2020-01-21 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-30596:


 Summary: Use a different repository id in pom to work around 
sbt-pom-reader issue
 Key: SPARK-30596
 URL: https://issues.apache.org/jira/browse/SPARK-30596
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.5, 3.0.0
Reporter: Hyukjin Kwon


See [https://github.com/apache/spark/pull/27242#issuecomment-576589529]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30596) Use a different repository id in pom to work around sbt-pom-reader issue

2020-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30596:
-
Issue Type: Bug  (was: Improvement)

> Use a different repository id in pom to work around sbt-pom-reader issue
> 
>
> Key: SPARK-30596
> URL: https://issues.apache.org/jira/browse/SPARK-30596
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See [https://github.com/apache/spark/pull/27242#issuecomment-576589529]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30534) Use mvn in `dev/scalastyle`

2020-01-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020076#comment-17020076
 ] 

Hyukjin Kwon commented on SPARK-30534:
--

[https://github.com/apache/spark/pull/27281] fixed this issue.

> Use mvn in `dev/scalastyle`
> ---
>
> Key: SPARK-30534
> URL: https://issues.apache.org/jira/browse/SPARK-30534
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> This issue aims to use `mvn` instead of `sbt`.
> As of now, the Apache Spark sbt build is broken by the Maven Central 
> repository policy.
> - 
> https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an
> > Effective January 15, 2020, The Central Maven Repository no longer supports 
> > insecure
> > communication over plain HTTP and requires that all requests to the 
> > repository are 
> > encrypted over HTTPS.
> We can reproduce this locally by the following.
> $ rm -rf ~/.m2/repository/org/apache/apache/18/
> $ build/sbt clean
> This issue aims to recover the GitHub Actions `lint-scala` job first by using 
> mvn.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30572) Add a fallback Maven repository

2020-01-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30572:
-
Comment: was deleted

(was: [https://github.com/apache/spark/pull/27281] fixed this issue.)

> Add a fallback Maven repository
> ---
>
> Key: SPARK-30572
> URL: https://issues.apache.org/jira/browse/SPARK-30572
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> This issue aims to add a fallback Maven repository for when a mirror of 
> `central` fails.
> For example, when we use Google Maven Central in GitHub Actions as a mirror 
> of `central`, this fallback will be used when Google Maven Central is out of 
> sync due to its late sync cycle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30572) Add a fallback Maven repository

2020-01-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020073#comment-17020073
 ] 

Hyukjin Kwon commented on SPARK-30572:
--

[https://github.com/apache/spark/pull/27281] fixed this issue.

> Add a fallback Maven repository
> ---
>
> Key: SPARK-30572
> URL: https://issues.apache.org/jira/browse/SPARK-30572
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> This issue aims to add a fallback Maven repository for when a mirror of `central` 
> fails.
> For example, when we use Google Maven Central in GitHub Actions as a mirror of 
> `central`, the fallback will be used whenever Google Maven Central is out of sync 
> due to its lagging sync cycle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30296) Dataset diffing transformation

2020-01-21 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30296:
--
Priority: Minor  (was: Major)

> Dataset diffing transformation
> --
>
> Key: SPARK-30296
> URL: https://issues.apache.org/jira/browse/SPARK-30296
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Minor
>
> Evolving Spark code needs frequent regression testing to prove it still 
> produces identical results, or, if changes are expected, to investigate those 
> changes. Diffing the Datasets of two code paths provides confidence.
> Diffing small schemata is easy, but with a wide schema the Spark query becomes 
> laborious and error-prone. With a single proven and tested method, diffing 
> becomes easier and more reliable. As a Dataset transformation, you get this 
> operation first-hand through the Dataset API.
> This has proven useful for interactive Spark as well as deployed 
> production code.
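> A minimal sketch of what such a diff can look like today with the plain Dataset 
> API (the table names result_old and result_new are hypothetical); the proposed 
> transformation would package this kind of logic behind a single, tested method:
> {code:java}
> import org.apache.spark.sql.Dataset;
> import org.apache.spark.sql.Row;
> import org.apache.spark.sql.SparkSession;
> 
> public class DiffDatasets {
>   public static void main(String[] args) {
>     SparkSession spark = SparkSession.builder()
>         .appName("DiffDatasets")
>         .getOrCreate();
> 
>     // Hypothetical tables holding the results of the two code paths.
>     Dataset<Row> oldResult = spark.table("result_old");
>     Dataset<Row> newResult = spark.table("result_new");
> 
>     // except() in both directions surfaces the rows that differ between the runs.
>     Dataset<Row> onlyInOld = oldResult.except(newResult);
>     Dataset<Row> onlyInNew = newResult.except(oldResult);
> 
>     System.out.println("rows only in the old result: " + onlyInOld.count());
>     System.out.println("rows only in the new result: " + onlyInNew.count());
> 
>     spark.stop();
>   }
> }
> {code}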



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30296) Dataset diffing transformation

2020-01-21 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack resolved SPARK-30296.
---
Resolution: Won't Do

> Dataset diffing transformation
> --
>
> Key: SPARK-30296
> URL: https://issues.apache.org/jira/browse/SPARK-30296
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
>
> Evolving Spark code needs frequent regression testing to prove it still 
> produces identical results, or, if changes are expected, to investigate those 
> changes. Diffing the Datasets of two code paths provides confidence.
> Diffing small schemata is easy, but with a wide schema the Spark query becomes 
> laborious and error-prone. With a single proven and tested method, diffing 
> becomes easier and more reliable. As a Dataset transformation, you get this 
> operation first-hand through the Dataset API.
> This has proven useful for interactive Spark as well as deployed 
> production code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30595) Unable to create local temp dir on spark on k8s mode, with defaults.

2020-01-21 Thread Prashant Sharma (Jira)
Prashant Sharma created SPARK-30595:
---

 Summary: Unable to create local temp dir on spark on k8s mode, 
with defaults.
 Key: SPARK-30595
 URL: https://issues.apache.org/jira/browse/SPARK-30595
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Prashant Sharma


Unless we configure the property {code}spark.local.dir /tmp{code}, the following 
error occurs:

{noformat}
20/01/21 08:33:17 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
20/01/21 08:33:17 ERROR DiskBlockManager: Failed to create local dir in /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4. Ignoring this directory.
java.io.IOException: Failed to create a temp directory (under /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4) after 10 attempts!
	at org.apache.spark.util.Utils$.createDirectory(Utils.scala:304)
	at org.apache.spark.storage.DiskBlockManager.$anonfun$createLocalDirs$1(DiskBlockManager.scala:164)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
{noformat}

I have not yet fully understood the root cause; I will post my findings once it 
is clear.
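
A minimal sketch of the workaround mentioned above, assuming the application builds 
its own SparkSession (the same property can also be passed via --conf at submit 
time); whether this should be needed at all with defaults is what this issue tracks:

{code:java}
import org.apache.spark.sql.SparkSession;

public class LocalDirWorkaround {
  public static void main(String[] args) {
    // Point spark.local.dir at a writable path so DiskBlockManager can create
    // its temp directories; /tmp is only an example value.
    SparkSession spark = SparkSession.builder()
        .appName("LocalDirWorkaround")
        .config("spark.local.dir", "/tmp")
        .getOrCreate();

    spark.range(10).count();  // any small job that allocates blocks
    spark.stop();
  }
}
{code}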



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30594) Do not post SparkListenerBlockUpdated when updateBlockInfo returns false

2020-01-21 Thread wuyi (Jira)
wuyi created SPARK-30594:


 Summary: Do not post SparkListenerBlockUpdated when 
updateBlockInfo returns false
 Key: SPARK-30594
 URL: https://issues.apache.org/jira/browse/SPARK-30594
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 3.0.0
Reporter: wuyi


We should not post a SparkListenerBlockUpdated event when updateBlockInfo returns 
false, as doing so may show negative memory in the UI (see the snapshot in the PR 
for SPARK-30465).
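
For context, a minimal sketch (not part of the proposed fix) of a listener consuming 
SparkListenerBlockUpdated events, the same way the UI's status listener does; an 
event posted for an update that updateBlockInfo rejected would still reach such 
consumers and feed them bogus sizes:

{code:java}
import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerBlockUpdated;
import org.apache.spark.sql.SparkSession;

public class BlockUpdateLogger {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("BlockUpdateLogger")
        .getOrCreate();

    // Log every block update delivered by the listener bus; the Web UI is
    // driven by this same event stream.
    spark.sparkContext().addSparkListener(new SparkListener() {
      @Override
      public void onBlockUpdated(SparkListenerBlockUpdated blockUpdated) {
        System.out.println("block " + blockUpdated.blockUpdatedInfo().blockId()
            + " memSize=" + blockUpdated.blockUpdatedInfo().memSize());
      }
    });

    spark.range(1000).cache().count();  // caching triggers block updates
    spark.stop();
  }
}
{code}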



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org