[jira] [Updated] (SPARK-27298) Dataset except operation gives different results(dataset count) on Spark 2.3.0 Windows and Spark 2.3.0 Linux environment
[ https://issues.apache.org/jira/browse/SPARK-27298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27298: -- Affects Version/s: 2.4.2 > Dataset except operation gives different results(dataset count) on Spark > 2.3.0 Windows and Spark 2.3.0 Linux environment > > > Key: SPARK-27298 > URL: https://issues.apache.org/jira/browse/SPARK-27298 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.2 >Reporter: Mahima Khatri >Priority: Major > Labels: data-loss > Attachments: Console-Result-Windows.txt, > console-reslt-2.3.3-linux.txt, console-result-2.3.3-windows.txt, > console-result-LinuxonVM.txt, console-result-spark-2.4.2-linux, > console-result-spark-2.4.2-windows, customer.csv, pom.xml > > > {code:java} > // package com.verifyfilter.example; > import org.apache.spark.SparkConf; > import org.apache.spark.SparkContext; > import org.apache.spark.sql.SparkSession; > import org.apache.spark.sql.Column; > import org.apache.spark.sql.Dataset; > import org.apache.spark.sql.Row; > import org.apache.spark.sql.SaveMode; > public class ExcludeInTesting { > public static void main(String[] args) { > SparkSession spark = SparkSession.builder() > .appName("ExcludeInTesting") > .config("spark.some.config.option", "some-value") > .getOrCreate(); > Dataset dataReadFromCSV = spark.read().format("com.databricks.spark.csv") > .option("header", "true") > .option("delimiter", "|") > .option("inferSchema", "true") > //.load("E:/resources/customer.csv"); local //below path for VM > .load("/home/myproject/bda/home/bin/customer.csv"); > dataReadFromCSV.printSchema(); > dataReadFromCSV.show(); > //Adding an extra step of saving to db and then loading it again > dataReadFromCSV.write().mode(SaveMode.Overwrite).saveAsTable("customer"); > Dataset dataLoaded = spark.sql("select * from customer"); > //Gender EQ M > Column genderCol = dataLoaded.col("Gender"); > Dataset onlyMaleDS = dataLoaded.where(genderCol.equalTo("M")); 
> //Dataset onlyMaleDS = spark.sql("select count(*) from customer where > Gender='M'"); > onlyMaleDS.show(); > System.out.println("The count of Male customers is :"+ onlyMaleDS.count()); > System.out.println("*"); > // Income in the list > Object[] valuesArray = new Object[5]; > valuesArray[0]=503.65; > valuesArray[1]=495.54; > valuesArray[2]=486.82; > valuesArray[3]=481.28; > valuesArray[4]=479.79; > Column incomeCol = dataLoaded.col("Income"); > Dataset incomeMatchingSet = dataLoaded.where(incomeCol.isin((Object[]) > valuesArray)); > System.out.println("The count of customers satisfaying Income is :"+ > incomeMatchingSet.count()); > System.out.println("*"); > Dataset maleExcptIncomeMatch = onlyMaleDS.except(incomeMatchingSet); > System.out.println("The count of final customers is :"+ > maleExcptIncomeMatch.count()); > System.out.println("*"); > } > } > {code} > When the above code is executed on Spark 2.3.0 ,it gives below different > results: > *Windows* : The code gives correct count of dataset 148237, > *Linux :* The code gives different {color:#172b4d}count of dataset > 129532 {color} > > {color:#172b4d}Some more info related to this bug:{color} > {color:#172b4d}1. Application Code (attached) > 2. CSV file used(attached) > 3. Windows spec > Windows 10- 64 bit OS > 4. Linux spec (Running on Oracle VM virtual box) > Specifications: \{as captured from Vbox.log} > 00:00:26.112908 VMMDev: Guest Additions information report: Version > 5.0.32 r112930 '5.0.32_Ubuntu' > 00:00:26.112996 VMMDev: Guest Additions information report: Interface > = 0x00010004 osType = 0x00053100 (Linux >= 2.6, 64-bit) > 5. Snapshots of output in both cases (attached){color} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
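One detail worth pinning down when comparing the counts above: `except` implements SQL EXCEPT, i.e. a distinct set difference, so the final count need not equal onlyMaleDS.count() minus incomeMatchingSet.count(). A minimal Spark-free model of that semantics (the sample rows below are hypothetical, not taken from the attached customer.csv) can serve as an oracle when comparing counts across environments:

```python
# A minimal model of Dataset.except semantics: the DISTINCT rows of the left
# dataset that do not appear in the right one.
def except_distinct(left, right):
    right_set = set(right)
    seen, out = set(), []
    for row in left:
        if row not in right_set and row not in seen:
            seen.add(row)
            out.append(row)
    return out

# Hypothetical rows, purely for illustration
males = [("c1", "M", 503.65), ("c2", "M", 100.0), ("c2", "M", 100.0)]
income_matches = [("c1", "M", 503.65)]

result = except_distinct(males, income_matches)
assert result == [("c2", "M", 100.0)]  # the duplicate row collapses to one
```

Because duplicates collapse, a correct `except` can legitimately return fewer rows than a naive subtraction predicts; the bug here is that the two platforms disagree with each other, not that the count differs from the subtraction.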
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28024: -- Affects Version/s: 2.0.2 2.1.3 > Incorrect numeric values when out of range > -- > > Key: SPARK-28024 > URL: https://issues.apache.org/jira/browse/SPARK-28024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Critical > Labels: correctness > Attachments: SPARK-28024.png > > > For example > Case 1: > {code:sql} > select tinyint(128) * tinyint(2); -- 0 > select smallint(2147483647) * smallint(2); -- -2 > select int(2147483647) * int(2); -- -2 > SELECT smallint((-32768)) * smallint(-1); -- -32768 > {code} > Case 2: > {code:sql} > spark-sql> select cast('10e-70' as float), cast('-10e-70' as float); > 0.0 -0.0 > {code} > Case 3: > {code:sql} > spark-sql> select cast('10e-400' as double), cast('-10e-400' as double); > 0.0 -0.0 > {code} > Case 4: > {code:sql} > spark-sql> select exp(-1.2345678901234E200); > 0.0 > postgres=# select exp(-1.2345678901234E200); > ERROR: value overflows numeric format > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
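The Case 1 results are consistent with the casts and products wrapping modulo 2^n in two's complement, and Cases 2-3 with silent floating-point underflow to signed zero. A plain-Python simulation of that interpretation (this models the reported behavior; it is not Spark code):

```python
import struct

def wrap(value, bits):
    """Reduce an integer into the two's-complement range of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= 1 << (bits - 1) else value

# Case 1: out-of-range casts and products wrap instead of raising an error
assert wrap(128, 8) == -128                      # tinyint(128)
assert wrap(wrap(128, 8) * 2, 8) == 0            # tinyint(128) * tinyint(2)
assert wrap(2147483647, 16) == -1                # smallint(2147483647)
assert wrap(wrap(2147483647, 16) * 2, 16) == -2  # ... * smallint(2)
assert wrap(2147483647 * 2, 32) == -2            # int(2147483647) * int(2)
assert wrap(-32768 * -1, 16) == -32768           # smallint(-32768) * smallint(-1)

# Cases 2-3: magnitudes below the type's smallest subnormal silently
# underflow to zero, and the sign survives, which is why Spark prints -0.0
f32 = struct.unpack('f', struct.pack('f', 10e-70))[0]
assert f32 == 0.0                       # float cannot represent 1e-69
assert float('10e-400') == 0.0          # double cannot represent 1e-399
assert str(float('-10e-400')) == '-0.0'
```

Postgres raises an error in these situations (as Case 4 shows for `exp`), which is the contrast the report is drawing.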
[jira] [Updated] (SPARK-28024) Incorrect numeric values when out of range
[ https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28024: -- Affects Version/s: 2.2.3 2.3.4 2.4.4 > Incorrect numeric values when out of range > -- > > Key: SPARK-28024 > URL: https://issues.apache.org/jira/browse/SPARK-28024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4, 3.0.0 >Reporter: Yuming Wang >Priority: Critical > Labels: correctness > Attachments: SPARK-28024.png > > > For example > Case 1: > {code:sql} > select tinyint(128) * tinyint(2); -- 0 > select smallint(2147483647) * smallint(2); -- -2 > select int(2147483647) * int(2); -- -2 > SELECT smallint((-32768)) * smallint(-1); -- -32768 > {code} > Case 2: > {code:sql} > spark-sql> select cast('10e-70' as float), cast('-10e-70' as float); > 0.0 -0.0 > {code} > Case 3: > {code:sql} > spark-sql> select cast('10e-400' as double), cast('-10e-400' as double); > 0.0 -0.0 > {code} > Case 4: > {code:sql} > spark-sql> select exp(-1.2345678901234E200); > 0.0 > postgres=# select exp(-1.2345678901234E200); > ERROR: value overflows numeric format > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions
[ https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27619: -- Affects Version/s: 2.3.4 > MapType should be prohibited in hash expressions > > > Key: SPARK-27619 > URL: https://issues.apache.org/jira/browse/SPARK-27619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Josh Rosen >Priority: Major > Labels: correctness > > Spark currently allows MapType expressions to be used as input to hash > expressions, but I think that this should be prohibited because Spark SQL > does not support map equality. > Currently, Spark SQL's map hashcodes are sensitive to the insertion order of > map elements: > {code:java} > val a = spark.createDataset(Map(1->1, 2->2) :: Nil) > val b = spark.createDataset(Map(2->2, 1->1) :: Nil) > // Demonstration of how Scala Map equality is unaffected by insertion order: > assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) > assert(Map(1->1, 2->2) == Map(2->2, 1->1)) > assert(a.first() == b.first()) > // In contrast, this will print two different hashcodes: > println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} > This behavior might be surprising to Scala developers. > I think there's precedence for banning the use of MapType here because we > already prohibit MapType in aggregation / joins / equality comparisons > (SPARK-9415) and set operations (SPARK-19893). > If we decide that we want this to be an error then it might also be a good > idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the > old and buggy behavior (in case applications were relying on it in cases > where it just so happens to be safe-by-accident (e.g. maps which only have > one entry)). > Alternatively, we could support hashing here if we implemented support for > comparable map types (SPARK-18134). 
[jira] [Commented] (SPARK-27619) MapType should be prohibited in hash expressions
[ https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020854#comment-17020854 ] Dongjoon Hyun commented on SPARK-27619: --- I just checked the result on 3.0.0-preview2. This issue is still valid. {code} scala> :paste // Entering paste mode (ctrl-D to finish) val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())) // Exiting paste mode, now interpreting. List([-125051660], [783998793]) a: org.apache.spark.sql.Dataset[scala.collection.immutable.Map[Int,Int]] = [value: map] b: org.apache.spark.sql.Dataset[scala.collection.immutable.Map[Int,Int]] = [value: map] scala> spark.version res1: String = 3.0.0-preview2 {code} > MapType should be prohibited in hash expressions > > > Key: SPARK-27619 > URL: https://issues.apache.org/jira/browse/SPARK-27619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Josh Rosen >Priority: Major > Labels: correctness > > Spark currently allows MapType expressions to be used as input to hash > expressions, but I think that this should be prohibited because Spark SQL > does not support map equality. 
> Currently, Spark SQL's map hashcodes are sensitive to the insertion order of > map elements: > {code:java} > val a = spark.createDataset(Map(1->1, 2->2) :: Nil) > val b = spark.createDataset(Map(2->2, 1->1) :: Nil) > // Demonstration of how Scala Map equality is unaffected by insertion order: > assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) > assert(Map(1->1, 2->2) == Map(2->2, 1->1)) > assert(a.first() == b.first()) > // In contrast, this will print two different hashcodes: > println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} > This behavior might be surprising to Scala developers. > I think there's precedence for banning the use of MapType here because we > already prohibit MapType in aggregation / joins / equality comparisons > (SPARK-9415) and set operations (SPARK-19893). > If we decide that we want this to be an error then it might also be a good > idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the > old and buggy behavior (in case applications were relying on it in cases > where it just so happens to be safe-by-accident (e.g. maps which only have > one entry)). > Alternatively, we could support hashing here if we implemented support for > comparable map types (SPARK-18134). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
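The underlying problem is that hashing a map's entries in iteration order is order-sensitive while map equality is not, breaking the usual hashCode/equals contract. A small Python sketch (the mixing functions are illustrative, not Spark's Murmur3) contrasts sequential folding with a commutative combine, the kind of order-insensitive scheme a comparable-map design like SPARK-18134 would need:

```python
MASK = (1 << 32) - 1

def entry_hash(k, v):
    # Per-entry hash; any deterministic mix works for the illustration
    return (k * 1000003 + v * 8191) & MASK

def seq_hash(entries, seed=42):
    # Order-SENSITIVE: fold entries in iteration order, the way hashing a
    # map's entries as if they were a sequence behaves
    h = seed
    for k, v in entries:
        h = (h * 31 + entry_hash(k, v)) & MASK
    return h

def unordered_hash(entries, seed=42):
    # Order-INSENSITIVE: combine per-entry hashes with a commutative op
    h = seed
    for k, v in entries:
        h = (h + entry_hash(k, v)) & MASK
    return h

m1 = [(1, 1), (2, 2)]
m2 = [(2, 2), (1, 1)]  # same map, different insertion order
assert seq_hash(m1) != seq_hash(m2)              # equal maps, unequal hashes
assert unordered_hash(m1) == unordered_hash(m2)  # contract restored
```

Until such a scheme exists, rejecting MapType inputs to `hash(...)` (as the issue proposes) is the safe behavior.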
[jira] [Updated] (SPARK-27619) MapType should be prohibited in hash expressions
[ https://issues.apache.org/jira/browse/SPARK-27619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27619: -- Affects Version/s: 3.0.0 2.4.1 2.4.2 2.4.3 2.4.4 > MapType should be prohibited in hash expressions > > > Key: SPARK-27619 > URL: https://issues.apache.org/jira/browse/SPARK-27619 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Josh Rosen >Priority: Major > Labels: correctness > > Spark currently allows MapType expressions to be used as input to hash > expressions, but I think that this should be prohibited because Spark SQL > does not support map equality. > Currently, Spark SQL's map hashcodes are sensitive to the insertion order of > map elements: > {code:java} > val a = spark.createDataset(Map(1->1, 2->2) :: Nil) > val b = spark.createDataset(Map(2->2, 1->1) :: Nil) > // Demonstration of how Scala Map equality is unaffected by insertion order: > assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) > assert(Map(1->1, 2->2) == Map(2->2, 1->1)) > assert(a.first() == b.first()) > // In contrast, this will print two different hashcodes: > println(Seq(a, b).map(_.selectExpr("hash(*)").first())){code} > This behavior might be surprising to Scala developers. > I think there's precedence for banning the use of MapType here because we > already prohibit MapType in aggregation / joins / equality comparisons > (SPARK-9415) and set operations (SPARK-19893). > If we decide that we want this to be an error then it might also be a good > idea to add a {{spark.sql.legacy}} flag as an escape-hatch to re-enable the > old and buggy behavior (in case applications were relying on it in cases > where it just so happens to be safe-by-accident (e.g. maps which only have > one entry)). > Alternatively, we could support hashing here if we implemented support for > comparable map types (SPARK-18134). 
[jira] [Commented] (SPARK-30218) Columns used in inequality conditions for joins not resolved correctly in case of common lineage
[ https://issues.apache.org/jira/browse/SPARK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020837#comment-17020837 ] Wenchen Fan commented on SPARK-30218: - I believe it has been fixed in master by https://issues.apache.org/jira/browse/SPARK-28344 . [~FC] can you try your query with the master branch? Thanks! > Columns used in inequality conditions for joins not resolved correctly in > case of common lineage > > > Key: SPARK-30218 > URL: https://issues.apache.org/jira/browse/SPARK-30218 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.4, 2.4.4 >Reporter: Francesco Cavrini >Assignee: Dongjoon Hyun >Priority: Major > Labels: correctness > > When columns from different data-frames that have a common lineage are used > in inequality conditions in joins, they are not resolved correctly. In > particular, both the column from the left DF and the one from the right DF > are resolved to the same column, thus making the inequality condition either > always satisfied or always not-satisfied. > Minimal example to reproduce follows. > {code:python} > import pyspark.sql.functions as F > data = spark.createDataFrame([["id1", "A", 0], ["id1", "A", 1], ["id2", "A", > 2], ["id2", "A", 3], ["id1", "B", 1] , ["id1", "B", 5], ["id2", "B", 10]], > ["id", "kind", "timestamp"]) > df_left = data.where(F.col("kind") == "A").alias("left") > df_right = data.where(F.col("kind") == "B").alias("right") > conds = [df_left["id"] == df_right["id"]] > conds.append(df_right["timestamp"].between(df_left["timestamp"], > df_left["timestamp"] + 2)) > res = df_left.join(df_right, conds, how="left") > {code} > The result is: > | id|kind|timestamp| id|kind|timestamp| > |id1| A|0|id1| B|1| > |id1| A|0|id1| B|5| > |id1| A|1|id1| B|1| > |id1| A|1|id1| B|5| > |id2| A|2|id2| B| 10| > |id2| A|3|id2| B| 10| > which violates the condition that the timestamp from the right DF should be > between df_left["timestamp"] and df_left["timestamp"] + 2. 
> The plan shows the problem in the column resolution. > {code:bash} > == Parsed Logical Plan == > Join LeftOuter, ((id#0 = id#36) && ((timestamp#2L >= timestamp#2L) && > (timestamp#2L <= (timestamp#2L + cast(2 as bigint) > :- SubqueryAlias `left` > : +- Filter (kind#1 = A) > : +- LogicalRDD [id#0, kind#1, timestamp#2L], false > +- SubqueryAlias `right` >+- Filter (kind#37 = B) > +- LogicalRDD [id#36, kind#37, timestamp#38L], false > {code} > Note, the columns used in the equality condition of the join have been > correctly resolved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
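Replaying the reported rows and the intended condition in plain Python (no Spark involved) confirms what the correct output should be, and why the reported output looks like the inequality collapsed to always-true:

```python
# The PySpark example's rows and the INTENDED join condition, as an oracle
data = [("id1", "A", 0), ("id1", "A", 1), ("id2", "A", 2), ("id2", "A", 3),
        ("id1", "B", 1), ("id1", "B", 5), ("id2", "B", 10)]
left = [r for r in data if r[1] == "A"]
right = [r for r in data if r[1] == "B"]

# right.timestamp BETWEEN left.timestamp AND left.timestamp + 2
expected = [(l, r) for l in left for r in right
            if l[0] == r[0] and l[2] <= r[2] <= l[2] + 2]
assert len(expected) == 2  # only (A,0)-(B,1) and (A,1)-(B,1) qualify

# The buggy plan compares timestamp#2L with itself, so the between-clause is
# always true and every id-matching pair survives: exactly the 6 rows reported
id_matching = [(l, r) for l in left for r in right if l[0] == r[0]]
assert len(id_matching) == 6
```

This matches the parsed plan above, where both sides of the inequality resolved to `timestamp#2L` while the equality condition resolved correctly.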
[jira] [Updated] (SPARK-30546) Make interval type more future-proofing
[ https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30546: Priority: Blocker (was: Major) > Make interval type more future-proofing > --- > > Key: SPARK-30546 > URL: https://issues.apache.org/jira/browse/SPARK-30546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Blocker > > Before 3.0 we may make some efforts for the current interval type to make it > more future-proofing. e.g. > 1. add unstable annotation to the CalendarInterval class. People already use > it as UDF inputs so it’s better to make it clear it’s unstable. > 2. Add a schema checker to prohibit create v2 custom catalog table with > intervals, as same as what we do for the builtin catalog > 3. Add a schema checker for DataFrameWriterV2 too > 4. Make the interval type incomparable as version 2.4 for disambiguation of > comparison between year-month and day-time fields > 5. The 3.0 newly added to_csv should not support output intervals as same as > using CSV file format > 6. The function to_json should not allow using interval as a key field as > same as the value field and JSON datasource, with a legacy config to > restore. > 7. Revert interval ISO/ANSI SQL Standard output since we decide not to > follow ANSI, so there is no round trip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30546) Make interval type more future-proofing
[ https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-30546: Issue Type: New Feature (was: Improvement) > Make interval type more future-proofing > --- > > Key: SPARK-30546 > URL: https://issues.apache.org/jira/browse/SPARK-30546 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Blocker > > Before 3.0 we may make some efforts for the current interval type to make it > more future-proofing. e.g. > 1. add unstable annotation to the CalendarInterval class. People already use > it as UDF inputs so it’s better to make it clear it’s unstable. > 2. Add a schema checker to prohibit create v2 custom catalog table with > intervals, as same as what we do for the builtin catalog > 3. Add a schema checker for DataFrameWriterV2 too > 4. Make the interval type incomparable as version 2.4 for disambiguation of > comparison between year-month and day-time fields > 5. The 3.0 newly added to_csv should not support output intervals as same as > using CSV file format > 6. The function to_json should not allow using interval as a key field as > same as the value field and JSON datasource, with a legacy config to > restore. > 7. Revert interval ISO/ANSI SQL Standard output since we decide not to > follow ANSI, so there is no round trip. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28264) Revisiting Python / pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-28264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28264. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27165 [https://github.com/apache/spark/pull/27165] > Revisiting Python / pandas UDF > -- > > Key: SPARK-28264 > URL: https://issues.apache.org/jira/browse/SPARK-28264 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Reynold Xin >Assignee: Hyukjin Kwon >Priority: Blocker > Fix For: 3.0.0 > > > In the past two years, the pandas UDFs are perhaps the most important changes > to Spark for Python data science. However, these functionalities have evolved > organically, leading to some inconsistencies and confusions among users. This > document revisits UDF definition and naming, as a result of discussions among > Xiangrui, Li Jin, Hyukjin, and Reynold. > -See document here: > [https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit#|https://docs.google.com/document/d/10Pkl-rqygGao2xQf6sddt0b-4FYK4g8qr_bXLKTL65A/edit]- > New proposal: > https://docs.google.com/document/d/1-kV0FS_LF2zvaRh_GhkV32Uqksm_Sq8SvnBBmRyxm30/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-30528) DPP issues
[ https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020803#comment-17020803 ] Mayur Bhosale edited comment on SPARK-30528 at 1/22/20 6:13 AM: Thanks for the explanation [~maryannxue] # Should DPP be turned off by default till the heuristics are improved or keep having it turned on by default but don't do DPP when the column level stats are not available? Because for some cases this can be really disastrous. # Can we use the bloom filter to store the pruning values (for non-Broadcast Hash Join)? This will have multiple advantages - ## The size of the result returned to the driver would be way smaller ## Faster lookups compared to hashSet ## Reuse of the exchange will happen (because we won't be adding Aggregate on top) ## Duplicate subqueries because of multiple join conditions on partitioned columns will get removed (cases like example 3 in the description above) This will require more thoughts though. Let me know if this sounds feasible and useful, then I can get back with more details and can pick it up as well. 3. Yes, one of the subqueries selects `col1` and the other selects `col2`. 
{code:java} == Physical Plan == *(5) SortMergeJoin [partcol1#2L, partcol2#3], [col1#5L, col2#6], Inner :- *(2) Sort [partcol1#2L ASC NULLS FIRST, partcol2#3 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(partcol1#2L, partcol2#3, 200), true, [id=#103] : +- *(1) ColumnarToRow :+- FileScan parquet default.partitionedtable[id#0L,name#1,partCol1#2L,partCol2#3] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/sp..., PartitionFilters: [isnotnull(partCol2#3), isnotnull(partCol1#2L), dynamicpruningexpression(partCol1#2L IN subquery#..., PushedFilters: [], ReadSchema: struct : :- Subquery subquery#19, [id=#49] : : +- *(2) HashAggregate(keys=[col1#5L], functions=[]) : : +- Exchange hashpartitioning(col1#5L, 200), true, [id=#45] : :+- *(1) HashAggregate(keys=[col1#5L], functions=[]) : : +- *(1) Project [col1#5L] : : +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) : : +- *(1) ColumnarToRow : :+- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct : +- Subquery subquery#21, [id=#82] : +- *(2) HashAggregate(keys=[col2#6], functions=[]) :+- Exchange hashpartitioning(col2#6, 200), true, [id=#78] : +- *(1) HashAggregate(keys=[col2#6], functions=[]) : +- *(1) Project [col2#6] : +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) :+- *(1) ColumnarToRow : +- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), 
isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct +- *(4) Sort [col1#5L ASC NULLS FIRST, col2#6 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col1#5L, col2#6, 200), true, [id=#113] +- *(3) Project [id#4L, col1#5L, col2#6, name#7] +- *(3) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) +- *(3) ColumnarToRow +- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6,name#7] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct {code} If we don't decide to go with removing Aggregate (not using BloomFilter), should we combine such DPP subqueries into a single sub-query? We can
[jira] [Comment Edited] (SPARK-30528) DPP issues
[ https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020803#comment-17020803 ] Mayur Bhosale edited comment on SPARK-30528 at 1/22/20 6:11 AM: Thanks for the explanation [~maryannxue] # Should DPP be turned off by default till the heuristics are improved or keep having it turned on by default but don't do DPP when the column level stats are not available? Because for some cases this can be really disastrous. # Can we use the bloom filter to store the pruning values (for non-Broadcast Hash Join)? This will have multiple advantages - ## The size of the result returned to the driver would be way smaller ## Faster lookups compared to hashSet ## Reuse of the exchange will happen (because we won't be adding Aggregate on top) ## Duplicate subqueries because of multiple join conditions on partitioned columns will get removed (cases like example 3 in the description above) With Bloom Filter, DPP subquery should look something like this - | Bloom Filter | <-- | Other operations | <-- | Exchange | <-- | Scan | This will require more thoughts though. Let me know if this sounds feasible and useful, then I can get back with more details and can pick it up as well. 3. Yes, one of the subqueries selects `col1` and the other selects `col2`. 
{code:java} == Physical Plan == *(5) SortMergeJoin [partcol1#2L, partcol2#3], [col1#5L, col2#6], Inner :- *(2) Sort [partcol1#2L ASC NULLS FIRST, partcol2#3 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(partcol1#2L, partcol2#3, 200), true, [id=#103] : +- *(1) ColumnarToRow :+- FileScan parquet default.partitionedtable[id#0L,name#1,partCol1#2L,partCol2#3] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/sp..., PartitionFilters: [isnotnull(partCol2#3), isnotnull(partCol1#2L), dynamicpruningexpression(partCol1#2L IN subquery#..., PushedFilters: [], ReadSchema: struct : :- Subquery subquery#19, [id=#49] : : +- *(2) HashAggregate(keys=[col1#5L], functions=[]) : : +- Exchange hashpartitioning(col1#5L, 200), true, [id=#45] : :+- *(1) HashAggregate(keys=[col1#5L], functions=[]) : : +- *(1) Project [col1#5L] : : +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) : : +- *(1) ColumnarToRow : :+- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct : +- Subquery subquery#21, [id=#82] : +- *(2) HashAggregate(keys=[col2#6], functions=[]) :+- Exchange hashpartitioning(col2#6, 200), true, [id=#78] : +- *(1) HashAggregate(keys=[col2#6], functions=[]) : +- *(1) Project [col2#6] : +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) :+- *(1) ColumnarToRow : +- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), 
isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct +- *(4) Sort [col1#5L ASC NULLS FIRST, col2#6 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col1#5L, col2#6, 200), true, [id=#113] +- *(3) Project [id#4L, col1#5L, col2#6, name#7] +- *(3) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) +- *(3) ColumnarToRow +- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6,name#7] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct {code}
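The bloom-filter variant proposed in the comment can be sketched without Spark. Everything below (the class shape, the 2^16-bit size, SHA-256-derived bit positions) is illustrative, not an existing Spark API; a real implementation would size the filter from the expected number of distinct pruning values and a target false-positive rate:

```python
import hashlib

class BloomFilter:
    """Toy bloom filter for DPP-style partition pruning values."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive k bit positions from one digest; illustrative, not Murmur3
        digest = hashlib.sha256(repr(value).encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))

# Driver side: build the filter from the join-side pruning values
bf = BloomFilter()
for v in [503, 495, 486]:
    bf.add(v)

# Scan side: keep a partition if its value MIGHT be in the filter; bloom
# filters have no false negatives, so false positives only cost extra reads,
# never correctness
assert all(bf.might_contain(v) for v in [503, 495, 486])
```

The fixed-size bit array is what makes the result shipped to the driver small regardless of cardinality, and dropping the deduplicating Aggregate is what would let exchange reuse kick in, as the comment argues.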
[jira] [Commented] (SPARK-30528) DPP issues
[ https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020803#comment-17020803 ] Mayur Bhosale commented on SPARK-30528: --- Thanks for the explanation [~maryannxue] # Should DPP be turned off by default till the heuristics are improved or keep having it turned on by default but don't do DPP when the column level stats are not available? Because for some cases this can be really disastrous. # Can we use the bloom filter to store the pruning values (for non-Broadcast Hash Join)? This will have multiple advantages - ## The size of the result returned to the driver would be way smaller ## Faster lookups compared to hashSet ## Reuse of the exchange will happen (because we won't be adding Aggregate on top) ## Duplicate subqueries because of multiple join conditions on partitioned columns will get removed (cases like example 3 in the description above) With Bloom Filter, DPP subquery should look something like this - | Bloom filter | <-- | Other operations | <-- | Exchange | <-- | Scan | This will require more thoughts though. Let me know if this sounds feasible and useful, then I can get back with more details and can pick it up as well. 3. Yes, one of the subqueries selects `col1` and the other selects `col2`. 
{code:java} == Physical Plan == *(5) SortMergeJoin [partcol1#2L, partcol2#3], [col1#5L, col2#6], Inner :- *(2) Sort [partcol1#2L ASC NULLS FIRST, partcol2#3 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(partcol1#2L, partcol2#3, 200), true, [id=#103] : +- *(1) ColumnarToRow :+- FileScan parquet default.partitionedtable[id#0L,name#1,partCol1#2L,partCol2#3] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/sp..., PartitionFilters: [isnotnull(partCol2#3), isnotnull(partCol1#2L), dynamicpruningexpression(partCol1#2L IN subquery#..., PushedFilters: [], ReadSchema: struct : :- Subquery subquery#19, [id=#49] : : +- *(2) HashAggregate(keys=[col1#5L], functions=[]) : : +- Exchange hashpartitioning(col1#5L, 200), true, [id=#45] : :+- *(1) HashAggregate(keys=[col1#5L], functions=[]) : : +- *(1) Project [col1#5L] : : +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) : : +- *(1) ColumnarToRow : :+- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct : +- Subquery subquery#21, [id=#82] : +- *(2) HashAggregate(keys=[col2#6], functions=[]) :+- Exchange hashpartitioning(col2#6, 200), true, [id=#78] : +- *(1) HashAggregate(keys=[col2#6], functions=[]) : +- *(1) Project [col2#6] : +- *(1) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) :+- *(1) ColumnarToRow : +- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), 
isnotnull(col1#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/saurabhc/src/spark_version_merge/spark300preview/spark/bin/spark-wa..., PartitionFilters: [], PushedFilters: [IsNotNull(id), GreaterThan(id,0), IsNotNull(col2), IsNotNull(col1)], ReadSchema: struct +- *(4) Sort [col1#5L ASC NULLS FIRST, col2#6 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(col1#5L, col2#6, 200), true, [id=#113] +- *(3) Project [id#4L, col1#5L, col2#6, name#7] +- *(3) Filter (((isnotnull(id#4L) AND (id#4L > 0)) AND isnotnull(col2#6)) AND isnotnull(col1#5L)) +- *(3) ColumnarToRow +- FileScan parquet default.nonpartitionedtable[id#4L,col1#5L,col2#6,name#7] Batched: true, DataFilters: [isnotnull(id#4L), (id#4L > 0), isnotnull(col2#6), isnotnull(col1#5L)], Format: Parquet, Location:
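The bloom-filter idea in the comment above can be sketched without any Spark dependency. The toy class below is only an illustration of why shipping a fixed-size bit array to the driver beats shipping a full hash set of partition values; the bit-array size and hash mixing are made up for the example, and a real implementation would more likely reuse something like Spark's existing `org.apache.spark.util.sketch.BloomFilter`.

```java
import java.util.BitSet;

// Toy Bloom filter illustrating the proposed DPP change: instead of collecting
// a full hash set of partition values at the driver, collect a fixed-size bit
// array. Sizes and hash constants here are illustrative, not Spark's.
public class BloomDppSketch {
    static final int NUM_BITS = 1 << 16;
    static final int NUM_HASHES = 3;
    final BitSet bits = new BitSet(NUM_BITS);

    // Cheap mixing hash; any decent 64-bit mixer works for the illustration.
    static int hash(long value, int seed) {
        long h = value * 0x9E3779B97F4A7C15L + seed * 0xC2B2AE3D27D4EB4FL;
        h ^= h >>> 33;
        return (int) Math.floorMod(h, (long) NUM_BITS);
    }

    void put(long value) {
        for (int i = 0; i < NUM_HASHES; i++) bits.set(hash(value, i));
    }

    // May return a false positive, but never a false negative -- which is safe
    // for pruning: we only ever scan extra partitions, never drop a needed one.
    boolean mightContain(long value) {
        for (int i = 0; i < NUM_HASHES; i++) {
            if (!bits.get(hash(value, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        BloomDppSketch filter = new BloomDppSketch();
        // Pretend the dimension-side subquery produced these partition keys.
        for (long key : new long[] {3L, 7L, 42L}) filter.put(key);
        // Partition pruning: scan only partitions the filter might contain.
        System.out.println("keep partition 7:  " + filter.mightContain(7L));
        System.out.println("keep partition 1001: " + filter.mightContain(1001L));
    }
}
```

Because lookups never produce false negatives, the scan side can safely skip any partition for which `mightContain` returns false, which is exactly the property DPP needs.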
[jira] [Updated] (SPARK-30553) Fix structured-streaming java example error
[ https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30553: -- Affects Version/s: 2.1.1 2.2.3 > Fix structured-streaming java example error > --- > > Key: SPARK-30553 > URL: https://issues.apache.org/jira/browse/SPARK-30553 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.1, 2.2.3, 2.3.4, 2.4.4 >Reporter: bettermouse >Assignee: bettermouse >Priority: Trivial > Fix For: 2.4.5, 3.0.0 > > > [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking] > I write code according to this by java and scala. > java > {code:java} > public static void main(String[] args) throws StreamingQueryException { > SparkSession spark = > SparkSession.builder().appName("test").master("local[*]") > .config("spark.sql.shuffle.partitions", 1) > .getOrCreate();Dataset lines = > spark.readStream().format("socket") > .option("host", "skynet") > .option("includeTimestamp", true) > .option("port", ).load(); > Dataset words = lines.select("timestamp", "value"); > Dataset count = words.withWatermark("timestamp", "10 seconds") > .groupBy(functions.window(words.col("timestamp"), "10 > seconds", "10 seconds") > , words.col("value")).count(); > StreamingQuery start = count.writeStream() > .outputMode("update") > .format("console").start(); > start.awaitTermination();} > {code} > scala > > {code:java} > def main(args: Array[String]): Unit = { > val spark = SparkSession.builder.appName("test"). > master("local[*]"). > config("spark.sql.shuffle.partitions", 1) > .getOrCreate > import spark.implicits._ > val lines = spark.readStream.format("socket"). > option("host", "skynet").option("includeTimestamp", true). > option("port", ).load > val words = lines.select("timestamp", "value") > val count = words.withWatermark("timestamp", "10 seconds"). 
> groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value") > .count() > val start = count.writeStream.outputMode("update").format("console").start > start.awaitTermination() > } > {code} > Both follow the official documentation. In the Java version I found that the > metric "stateOnCurrentVersionSizeBytes" always increases, but the Scala version is ok. > > java > > {code:java} > == Physical Plan == > WriteToDataSourceV2 > org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001 > +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], > output=[window#11, value#0, count#10L]) >+- StateStoreSave [window#11, value#0], state info [ checkpoint = > file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, > runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, > numPartitions = 1], Update, 1579274372624, 2 > +- *(3) HashAggregate(keys=[window#11, value#0], > functions=[merge_count(1)], output=[window#11, value#0, count#21L]) > +- StateStoreRestore [window#11, value#0], state info [ checkpoint = > file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state, > runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, > numPartitions = 1], 2 > +- *(2) HashAggregate(keys=[window#11, value#0], > functions=[merge_count(1)], output=[window#11, value#0, count#21L]) >+- Exchange hashpartitioning(window#11, value#0, 1) > +- *(1) HashAggregate(keys=[window#11, value#0], > functions=[partial_count(1)], output=[window#11, value#0, count#21L]) > +- *(1) Project [named_struct(start, > precisetimestampconversion(CASE WHEN > (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, > LongType) - 0) as double) / 1.0E7)) as double) = > (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) > as double) / 1.0E7)) THEN > (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) > - 0) as double) / 1.0E7)) + 1) ELSE > 
CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) > - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, > TimestampType), end, precisetimestampconversion(CASE WHEN > (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, > LongType) - 0) as double) / 1.0E7)) as double) = > (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) > as double) / 1.0E7)) THEN >
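The deeply nested CASE WHEN expression in the plan above is just Spark's arithmetic for assigning each event timestamp to a 10-second tumbling window: timestamps are in microseconds, so the `1.0E7` is the 10-second window length. A stand-alone sketch of the equivalent bucketing for the common (non-boundary) case, not Spark's actual code:

```java
// Stand-alone illustration of tumbling-window assignment, mirroring the
// windowing arithmetic in the physical plan above (timestamps in microseconds).
public class WindowBucket {
    static final long WINDOW_MICROS = 10_000_000L; // 10 seconds, the 1.0E7 in the plan

    // Start of the 10-second window containing the given event time.
    static long windowStart(long eventTimeMicros) {
        return eventTimeMicros - Math.floorMod(eventTimeMicros, WINDOW_MICROS);
    }

    public static void main(String[] args) {
        long t = 1_579_274_372_624_000L; // an event time in microseconds
        long start = windowStart(t);
        // Every timestamp in [start, start + 10s) maps to the same window,
        // which is what makes the streaming aggregation keys stable.
        System.out.println("[" + start + ", " + (start + WINDOW_MICROS) + ")");
    }
}
```

The window key produced this way is identical for the Java and Scala programs; the bug reported here was in the example code, not in the windowing arithmetic itself.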
[jira] [Updated] (SPARK-30553) Fix structured-streaming java example error
[ https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30553: -- Affects Version/s: 2.3.4
[jira] [Updated] (SPARK-30553) Fix structured-streaming java example error
[ https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30553: -- Summary: Fix structured-streaming java example error (was: structured-streaming documentation java watermark group by)
[jira] [Resolved] (SPARK-30553) structured-streaming documentation java watermark group by
[ https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30553. --- Fix Version/s: 3.0.0 2.4.5 Resolution: Fixed Issue resolved by pull request 27268 [https://github.com/apache/spark/pull/27268]
[jira] [Assigned] (SPARK-30553) structured-streaming documentation java watermark group by
[ https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30553: - Assignee: bettermouse
[jira] [Commented] (SPARK-30466) remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13
[ https://issues.apache.org/jira/browse/SPARK-30466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020700#comment-17020700 ] Michael Burgener commented on SPARK-30466: -- Fair enough for the 2.x releases but the Spark 3.x release should update the dependencies for hadoop-client and hadoop-minikdc to the 3.x release as the Hadoop project has removed the dependencies and migrated to jackson-databind already. One would expect potentially breaking changes between major releases so now would be the time to do that while 3.0.0 is in preview. Is there a particular reason the hadoop libraries have not been updated to the 3.x releases? > remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13 > -- > > Key: SPARK-30466 > URL: https://issues.apache.org/jira/browse/SPARK-30466 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Michael Burgener >Priority: Major > Labels: security > > These 2 libraries are deprecated and replaced by the jackson-databind > libraries which are already included. These two libraries are flagged by our > vulnerability scanners as having the following security vulnerabilities. > I've set the priority to Major due to the Critical nature and hopefully they > can be addressed quickly. Please note, I'm not a developer but work in > InfoSec and this was flagged when we incorporated spark into our product. If > you feel the priority is not set correctly please change accordingly. I'll > watch the issue and flag our dev team to update once resolved. 
> jackson-mapper-asl-1.9.13 > CVE-2018-7489 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2018-7489] > > CVE-2017-7525 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2017-7525] > > CVE-2017-17485 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2017-17485] > > CVE-2017-15095 (CVSS 3.0 Score 9.8 CRITICAL) > [https://nvd.nist.gov/vuln/detail/CVE-2017-15095] > > CVE-2018-5968 (CVSS 3.0 Score 8.1 High) > [https://nvd.nist.gov/vuln/detail/CVE-2018-5968] > > jackson-core-asl-1.9.13 > CVE-2016-7051 (CVSS 3.0 Score 8.6 High) > https://nvd.nist.gov/vuln/detail/CVE-2016-7051 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
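Until the dependency is removed upstream, one possible consumer-side workaround (a sketch, not something this ticket recommends officially) is to exclude the deprecated codehaus artifacts from Spark's transitive dependencies in the application's `pom.xml`. Note this can break Hive/Hadoop code paths that still rely on the legacy Jackson 1.x classes, so it needs testing against the application's actual workload:

```xml
<!-- Hypothetical consumer-side exclusion of the deprecated Jackson 1.x
     (codehaus) artifacts; verify nothing on your classpath still needs them. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.12</artifactId>
  <version>2.4.4</version>
  <exclusions>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-mapper-asl</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.codehaus.jackson</groupId>
      <artifactId>jackson-core-asl</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```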
[jira] [Created] (SPARK-30602) Support push-based shuffle to improve shuffle efficiency
Min Shen created SPARK-30602: Summary: Support push-based shuffle to improve shuffle efficiency Key: SPARK-30602 URL: https://issues.apache.org/jira/browse/SPARK-30602 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.0.0 Reporter: Min Shen In a large deployment of a Spark compute infrastructure, Spark shuffle is becoming a potential scaling bottleneck and a source of inefficiency in the cluster. When doing Spark on YARN for a large-scale deployment, people usually enable Spark external shuffle service and store the intermediate shuffle files on HDD. Because the number of blocks generated for a particular shuffle grows quadratically compared to the size of shuffled data (# mappers and reducers grows linearly with the size of shuffled data, but # blocks is # mappers * # reducers), one general trend we have observed is that the more data a Spark application processes, the smaller the block size becomes. In a few production clusters we have seen, the average shuffle block size is only 10s of KBs. Because of the inefficiency of performing random reads on HDD for small amount of data, the overall efficiency of the Spark external shuffle services serving the shuffle blocks degrades as we see an increasing # of Spark applications processing an increasing amount of data. In addition, because Spark external shuffle service is a shared service in a multi-tenancy cluster, the inefficiency with one Spark application could propagate to other applications as well. In this ticket, we propose a solution to improve Spark shuffle efficiency in above mentioned environments with push-based shuffle. With push-based shuffle, shuffle is performed at the end of mappers and blocks get pre-merged and move towards reducers. In our prototype implementation, we have seen significant efficiency improvements when performing large shuffles. 
We take a Spark-native approach to achieve this, i.e., extending Spark’s existing shuffle netty protocol and the behaviors of Spark mappers, reducers and drivers. This way, we can bring the benefits of more efficient shuffle to Spark without incurring the dependency or overhead of either a specialized storage layer or external infrastructure pieces.
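The quadratic block growth described above can be made concrete with some back-of-the-envelope arithmetic. The task size below is an assumption chosen for the illustration, not a measurement from the clusters mentioned in the ticket:

```java
// Illustrative arithmetic for the scaling claim above: mapper and reducer
// counts each grow roughly linearly with input size, so the block count
// (#mappers * #reducers) grows quadratically and average block size shrinks.
public class ShuffleBlockMath {
    public static void main(String[] args) {
        long[] shuffledBytes = {10L << 30, 100L << 30, 1024L << 30}; // 10 GB .. 1 TB
        long bytesPerTask = 256L << 20; // assumed ~256 MB per map/reduce task
        for (long total : shuffledBytes) {
            long mappers  = total / bytesPerTask;
            long reducers = total / bytesPerTask;
            long blocks   = mappers * reducers;
            long avgBlock = total / blocks; // shrinks as total data grows
            System.out.printf("%5d GB shuffled -> %10d blocks, avg block %8d bytes%n",
                    total >> 30, blocks, avgBlock);
        }
    }
}
```

With these assumptions a 1 TB shuffle already averages only 64 KB per block, matching the "10s of KBs" observation and explaining why random HDD reads dominate.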
[jira] [Created] (SPARK-30601) Add a Google Maven Central as a primary repository
Hyukjin Kwon created SPARK-30601: Summary: Add a Google Maven Central as a primary repository Key: SPARK-30601 URL: https://issues.apache.org/jira/browse/SPARK-30601 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.5, 3.0.0 Reporter: Hyukjin Kwon See [http://apache-spark-developers-list.1001551.n3.nabble.com/Adding-Maven-Central-mirror-from-Google-to-the-build-td28728.html] This Jira aims to switch the main repo to Google Maven Central.
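For reference, a sketch of what such a repository switch might look like in a Maven `pom.xml`. The mirror URL below is the Google Cloud Storage mirror of Maven Central as publicly documented; treat the exact `id` and URL here as assumptions rather than the final values chosen by this ticket:

```xml
<!-- Sketch: prefer the GCS mirror of Maven Central, falling back to
     central via Maven's default behavior. -->
<repositories>
  <repository>
    <id>gcs-maven-central-mirror</id>
    <url>https://maven-central.storage-download.googleapis.com/maven2/</url>
    <releases><enabled>true</enabled></releases>
    <snapshots><enabled>false</enabled></snapshots>
  </repository>
</repositories>
```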
[jira] [Resolved] (SPARK-30534) Use mvn in `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30534. -- Resolution: Later Actually, the main issue we wanted to fix was handled by ^. > Use mvn in `dev/scalastyle` > --- > > Key: SPARK-30534 > URL: https://issues.apache.org/jira/browse/SPARK-30534 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This issue aims to use `mvn` instead of `sbt`. > As of now, Apache Spark sbt build is broken by the Maven Central repository > policy. > - > https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an > > Effective January 15, 2020, The Central Maven Repository no longer supports > > insecure > > communication over plain HTTP and requires that all requests to the > > repository are > > encrypted over HTTPS. > We can reproduce this locally by the following. > $ rm -rf ~/.m2/repository/org/apache/apache/18/ > $ build/sbt clean > This issue aims to recover GitHub Action `lint-scala` first by using mvn.
[jira] [Updated] (SPARK-30534) Use mvn in `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30534: - Fix Version/s: (was: 2.4.5) (was: 3.0.0)
[jira] [Reopened] (SPARK-30534) Use mvn in `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-30534: --
[jira] [Comment Edited] (SPARK-27878) Support ARRAY(sub-SELECT) expressions
[ https://issues.apache.org/jira/browse/SPARK-27878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17003657#comment-17003657 ] Takeshi Yamamuro edited comment on SPARK-27878 at 1/22/20 12:35 AM: It seems PgSql and BigQuery support this feature. If we implement the Pg dialect in future, it might be worth supporting this. So, I'll make this keep open now. was (Author: maropu): It seems PgSql and BigQuery support this feature. If we implement the Pg dialect in future, it might be worth supporting this. So, I'll make this keep now. > Support ARRAY(sub-SELECT) expressions > - > > Key: SPARK-27878 > URL: https://issues.apache.org/jira/browse/SPARK-27878 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > Construct an array from the results of a subquery. In this form, the array > constructor is written with the key word {{ARRAY}} followed by a > parenthesized (not bracketed) subquery. For example: > {code:sql} > SELECT ARRAY(SELECT oid FROM pg_proc WHERE proname LIKE 'bytea%'); > array > --- > {2011,1954,1948,1952,1951,1244,1950,2005,1949,1953,2006,31,2412,2413} > (1 row) > {code} > More details: > > [https://www.postgresql.org/docs/9.3/sql-expressions.html#SQL-SYNTAX-ARRAY-CONSTRUCTORS] > [https://github.com/postgres/postgres/commit/730840c9b649a48604083270d48792915ca89233]
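The semantics of the PostgreSQL `ARRAY(sub-SELECT)` constructor quoted above can be illustrated outside Spark. SQLite, which like Spark lacks this construct, can emulate it by aggregating a subquery into a JSON array. This is a toy analogue, not a proposed Spark implementation; the table and column names are invented for the example, and it assumes a Python build whose sqlite3 includes the JSON1 extension:

```python
import json
import sqlite3

# Toy stand-in for PostgreSQL's ARRAY(SELECT ...): aggregate the rows of a
# subquery into a single array value. Table/column names are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE procs (oid INTEGER, proname TEXT)")
conn.executemany("INSERT INTO procs VALUES (?, ?)",
                 [(2011, "byteain"), (1954, "byteaout"), (42, "textin")])

# In spirit: SELECT ARRAY(SELECT oid FROM procs WHERE proname LIKE 'bytea%')
row = conn.execute(
    "SELECT json_group_array(oid)"
    " FROM (SELECT oid FROM procs WHERE proname LIKE 'bytea%')"
).fetchone()
oids = sorted(json.loads(row[0]))
print(oids)
```

The subquery's rows are collapsed into one array-valued result, which is the behavior the ticket asks `ARRAY(...)` to provide.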
[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020685#comment-17020685 ] Takeshi Yamamuro commented on SPARK-30599: -- +1, too. The fix looks trivial. > SparkFunSuite.LogAppender throws java.lang.IllegalStateException > > > Key: SPARK-30599 > URL: https://issues.apache.org/jira/browse/SPARK-30599 > Project: Spark > Issue Type: Test > Components: Spark Core, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > Attachments: fail_trace.txt, success_trace.txt > > > The LogAppender has a limit for the number of logged events. By default, it > is 100 entries. All tests that use LogAppender finish successfully with this > limit but sometimes one of the tests fails with the exception like: > {code:java} > java.lang.IllegalStateException: Number of events reached the limit of 100 > while logging CSV header matches to schema w/ enforceSchema. > at > org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200) > at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > at > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > at org.apache.log4j.Category.callAppenders(Category.java:206) > at org.apache.log4j.Category.forcedLog(Category.java:391) > at org.apache.log4j.Category.log(Category.java:856) > {code} > For example, the CSV test *"SPARK-23786: warning should be printed if CSV > header doesn't conform to schema"* uses 2 log appenders. The > `success_trace.txt` contains stored log event in normal runs, another one > `fail_trace.txt` which I got while re-running the test in a loop. 
> The difference is in: > {code:java} > [35] Block broadcast_222 stored as values in memory (estimated size 286.9 > KiB, free 1998.5 MiB) > [36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 > KiB, free: 2003.8 MiB) > [37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 > KiB, free: 2003.9 MiB) > [38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 > KiB, free: 2003.9 MiB) > [39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 > KiB, free: 2003.9 MiB) > [40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 > KiB, free: 2003.9 MiB) > [41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 > KiB, free: 2003.9 MiB) > [42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 > KiB, free: 2004.0 MiB) > [43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 > KiB, free: 2004.0 MiB) > [44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 > KiB, free: 2004.0 MiB) > [45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 > KiB, free: 2004.0 MiB) > [46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 > KiB, free: 2004.0 MiB) > [47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 > KiB, free: 2004.0 MiB) > [48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 > KiB, free: 2004.1 MiB) > [49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 > KiB, free: 2004.1 MiB) > [50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 > KiB, free: 2004.1 MiB) > [51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 > KiB, free: 2004.2 MiB) > [52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 > KiB, free: 2004.2 MiB) > [53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 > 
KiB, free: 2004.2 MiB) > [54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 > KiB, free: 2004.2 MiB) > [55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 > KiB, free: 2004.2 MiB) > [56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 > KiB, free: 2004.2 MiB) > [57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 > KiB, free: 2004.2 MiB) > [58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 > KiB, free: 2004.3 MiB) > [59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 > KiB, free: 2004.3 MiB) > [60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 > KiB, free: 2004.4 MiB) > [61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 > KiB, free: 2004.4 MiB) > [62] Block broadcast_222_piece0 stored as bytes in memory (estimated size > 24.2 KiB, free 2002.0 MiB) > [63] Added broadcast_222_piece0 in memory on 192.168.1.66:52354
[jira] [Created] (SPARK-30600) Migrate ALTER VIEW SET/UNSET commands to the new resolution framework
Terry Kim created SPARK-30600: - Summary: Migrate ALTER VIEW SET/UNSET commands to the new resolution framework Key: SPARK-30600 URL: https://issues.apache.org/jira/browse/SPARK-30600 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Terry Kim
[jira] [Reopened] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.
[ https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Terry Kim reopened SPARK-30420: --- > Commands involved with namespace go thru the new resolution framework. > -- > > Key: SPARK-30420 > URL: https://issues.apache.org/jira/browse/SPARK-30420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Terry Kim >Priority: Major > Fix For: 3.0.0 > > > V2 commands that need to resolve namespace should go thru new resolution > framework introduced in > [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]
[jira] [Resolved] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.
[ https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Terry Kim resolved SPARK-30420. --- Fix Version/s: 3.0.0 Resolution: Fixed
[jira] [Reopened] (SPARK-27878) Support ARRAY(sub-SELECT) expressions
[ https://issues.apache.org/jira/browse/SPARK-27878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reopened SPARK-27878: -
[jira] [Commented] (SPARK-28329) SELECT INTO syntax
[ https://issues.apache.org/jira/browse/SPARK-28329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020597#comment-17020597 ] Xiao Li commented on SPARK-28329: - This conflicts with the SQL standard. > The SQL standard uses SELECT INTO to represent selecting values into scalar > variables of a host program, rather than creating a new table. > SELECT INTO syntax > -- > > Key: SPARK-28329 > URL: https://issues.apache.org/jira/browse/SPARK-28329 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > h2. Synopsis > {noformat} > [ WITH [ RECURSIVE ] with_query [, ...] ] > SELECT [ ALL | DISTINCT [ ON ( expression [, ...] ) ] ] > * | expression [ [ AS ] output_name ] [, ...] > INTO [ TEMPORARY | TEMP | UNLOGGED ] [ TABLE ] new_table > [ FROM from_item [, ...] ] > [ WHERE condition ] > [ GROUP BY expression [, ...] ] > [ HAVING condition [, ...] ] > [ WINDOW window_name AS ( window_definition ) [, ...] ] > [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select ] > [ ORDER BY expression [ ASC | DESC | USING operator ] [ NULLS { FIRST | > LAST } ] [, ...] ] > [ LIMIT { count | ALL } ] > [ OFFSET start [ ROW | ROWS ] ] > [ FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } ONLY ] > [ FOR { UPDATE | SHARE } [ OF table_name [, ...] ] [ NOWAIT ] [...] ] > {noformat} > h2. Description > {{SELECT INTO}} creates a new table and fills it with data computed by a > query. The data is not returned to the client, as it is with a normal > {{SELECT}}. The new table's columns have the names and data types associated > with the output columns of the {{SELECT}}. > > {{CREATE TABLE AS}} offers a superset of the functionality offered by > {{SELECT INTO}}. 
> [https://www.postgresql.org/docs/11/sql-selectinto.html] > [https://www.postgresql.org/docs/11/sql-createtableas.html]
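Since the description above notes that {{CREATE TABLE AS}} offers a superset of the {{SELECT INTO}} functionality, the CTAS form can be sketched with Python's built-in sqlite3 (SQLite supports CTAS but not SELECT INTO; the table names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER, data TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])

# CREATE TABLE AS: the new table's columns take the names and types of the
# SELECT's output, and the data is not returned to the client.
conn.execute("CREATE TABLE dst AS SELECT id, data FROM src WHERE id > 1")
rows = conn.execute("SELECT id, data FROM dst ORDER BY id").fetchall()
print(rows)
```

This is the behavior {{SELECT ... INTO new_table}} would provide, expressed in the CTAS form that avoids the conflict with the SQL standard's use of SELECT INTO for host-program variables.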
[jira] [Commented] (SPARK-27784) Alias ID reuse can break correctness when substituting foldable expressions
[ https://issues.apache.org/jira/browse/SPARK-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020563#comment-17020563 ] Thomas Graves commented on SPARK-27784: --- [~rdblue] can you confirm this doesn't exist in master? > Alias ID reuse can break correctness when substituting foldable expressions > --- > > Key: SPARK-27784 > URL: https://issues.apache.org/jira/browse/SPARK-27784 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.3.2 >Reporter: Ryan Blue >Priority: Major > Labels: correctness > > This is a correctness bug when reusing a set of project expressions in the > DataFrame API. > Use case: a user was migrating a table to a new version with an additional > column ("data" in the repro case). To migrate the user unions the old table > ("t2") with the new table ("t1"), and applies a common set of projections to > ensure the union doesn't hit an issue with ordering (SPARK-22335). In some > cases, this produces an incorrect query plan: > {code:java} > Seq((4, "a"), (5, "b"), (6, "c")).toDF("id", "data").write.saveAsTable("t1") > Seq(1, 2, 3).toDF("id").write.saveAsTable("t2") > val dim = Seq(2, 3, 4).toDF("id") > val outputCols = Seq($"id", coalesce($"data", lit("_")).as("data")) > val t1 = spark.table("t1").select(outputCols:_*) > val t2 = spark.table("t2").withColumn("data", lit(null)).select(outputCols:_*) > t1.join(dim, t1("id") === dim("id")).select(t1("id"), > t1("data")).union(t2).explain(true){code} > {code:java} > == Physical Plan == > Union > :- *Project [id#330, _ AS data#237] < THE CONSTANT IS > FROM THE OTHER SIDE OF THE UNION > : +- *BroadcastHashJoin [id#330], [id#234], Inner, BuildRight > : :- *Project [id#330] > : : +- *Filter isnotnull(id#330) > : : +- *FileScan parquet t1[id#330] Batched: true, Format: Parquet, > Location: CatalogFileIndex[s3://.../t1], PartitionFilters: [], PushedFilters: > [IsNotNull(id)], ReadSchema: struct > : +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, > false] as bigint))) > :+- LocalTableScan [id#234] > +- *Project [id#340, _ AS data#237] >+- *FileScan parquet t2[id#340] Batched: true, Format: Parquet, Location: > CatalogFileIndex[s3://.../t2], PartitionFilters: [], PushedFilters: [], > ReadSchema: struct{code} > The problem happens because "outputCols" has an alias. The ID for that alias > is created when the projection Seq is created, so it is reused in both sides > of the union. > When FoldablePropagation runs, it identifies that "data" in the t2 side of > the union is a foldable expression and replaces all references to it, > including the references in the t1 side of the union. > The join to a dimension table is necessary to reproduce the problem because > it requires a Projection on top of the join that uses an AttributeReference > for data#237. Otherwise, the projections are collapsed and the projection > includes an Alias that does not get rewritten by FoldablePropagation.
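The ID-reuse mechanism described in the report can be sketched with a toy model. This is an illustration of the pitfall, not Spark's actual FoldablePropagation code; all names are invented:

```python
# Toy model of the pitfall: attribute references are resolved purely by an
# expression ID, and the same Alias (hence the same ID) is reused on both
# sides of a union, so a substitution keyed only by ID rewrites both branches.
import itertools

_ids = itertools.count()

def alias(name):
    """An aliased output column; its ID is fixed when the list is built."""
    return {"kind": "alias", "name": name, "id": next(_ids)}

def ref(a):
    """A reference to an alias, carrying only its ID."""
    return {"kind": "ref", "id": a["id"]}

# One shared projection list, used by both branches (like outputCols above).
data_col = alias("data")

# Branch t2 computes "data" from a literal => it is foldable there.
t2_foldables = {data_col["id"]: {"kind": "lit", "value": "_"}}

# Branch t1's plan refers to "data" by ID only.
t1_plan = [ref(data_col)]

def propagate(plan, foldables):
    """Substitute references by ID, with no notion of which branch owns them."""
    return [foldables.get(e["id"], e) for e in plan]

# The substitution derived from t2 leaks into t1, because the IDs collide.
broken_t1 = propagate(t1_plan, t2_foldables)
print(broken_t1)  # t1 now projects the constant from the other branch
```

Building a fresh projection list (fresh alias IDs) for each side of the union avoids the collision, which is why the workaround of not sharing `outputCols` across both branches sidesteps the bug.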
[jira] [Commented] (SPARK-27282) Spark incorrect results when using UNION with GROUP BY clause
[ https://issues.apache.org/jira/browse/SPARK-27282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020560#comment-17020560 ] Thomas Graves commented on SPARK-27282: --- [~sofia] can you confirm this isn't fixed in the last version of Spark? > Spark incorrect results when using UNION with GROUP BY clause > - > > Key: SPARK-27282 > URL: https://issues.apache.org/jira/browse/SPARK-27282 > Project: Spark > Issue Type: Bug > Components: Spark Shell, Spark Submit, SQL >Affects Versions: 2.3.2 > Environment: I'm using : > IntelliJ IDEA ==> 2018.1.4 > spark-sql and spark-core ==> 2.3.2.3.1.0.0-78 (for HDP 3.1) > scala ==> 2.11.8 >Reporter: Sofia >Priority: Major > Labels: correctness > > When using UNION clause after a GROUP BY clause in spark, the results > obtained are wrong. > The following example explicit this issue: > {code:java} > CREATE TABLE test_un ( > col1 varchar(255), > col2 varchar(255), > col3 varchar(255), > col4 varchar(255) > ); > INSERT INTO test_un (col1, col2, col3, col4) > VALUES (1,1,2,4), > (1,1,2,4), > (1,1,3,5), > (2,2,2,null); > {code} > I used the following code : > {code:java} > val x = Toolkit.HiveToolkit.getDataFromHive("test","test_un") > val y = x >.filter(col("col4")isNotNull) > .groupBy("col1", "col2","col3") > .agg(count(col("col3")).alias("cnt")) > .withColumn("col_name", lit("col3")) > .select(col("col1"), col("col2"), > col("col_name"),col("col3").alias("col_value"), col("cnt")) > val z = x > .filter(col("col4")isNotNull) > .groupBy("col1", "col2","col4") > .agg(count(col("col4")).alias("cnt")) > .withColumn("col_name", lit("col4")) > .select(col("col1"), col("col2"), > col("col_name"),col("col4").alias("col_value"), col("cnt")) > y.union(z).show() > {code} > And i obtained the following results: > ||col1||col2||col_name||col_value||cnt|| > |1|1|col3|5|1| > |1|1|col3|4|2| > |1|1|col4|5|1| > |1|1|col4|4|2| > Expected results: > ||col1||col2||col_name||col_value||cnt|| > |1|1|col3|3|1| > |1|1|col3|2|2| > 
|1|1|col4|4|2| > |1|1|col4|5|1| > But when I remove the last row of the table, I obtain the correct results. > {code:java} > (2,2,2,null){code}
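The expected counts in the table above can be checked outside Spark with a few lines of plain Python (a sketch of the intended filter/group-by/union semantics, not of Spark's execution):

```python
from collections import Counter

# The four rows inserted into test_un: (col1, col2, col3, col4)
rows = [(1, 1, 2, 4), (1, 1, 2, 4), (1, 1, 3, 5), (2, 2, 2, None)]
kept = [r for r in rows if r[3] is not None]  # filter(col("col4").isNotNull)

# y groups by (col1, col2, col3); z groups by (col1, col2, col4)
y = Counter((c1, c2, c3) for c1, c2, c3, _ in kept)
z = Counter((c1, c2, c4) for c1, c2, _, c4 in kept)

union = [(c1, c2, "col3", v, n) for (c1, c2, v), n in y.items()] + \
        [(c1, c2, "col4", v, n) for (c1, c2, v), n in z.items()]
for row in sorted(union):
    print(row)  # (col1, col2, col_name, col_value, cnt)
```

This reproduces the expected table from the report: (1,1,col3,2,2), (1,1,col3,3,1), (1,1,col4,4,2), (1,1,col4,5,1), confirming the Spark output shown above is incorrect.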
[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020495#comment-17020495 ] Dongjoon Hyun commented on SPARK-30599: --- +1 for [~srowen]'s advice on increasing the capacity.
[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020480#comment-17020480 ] Maxim Gekk commented on SPARK-30599: > Or I wonder if we can make the logging less noisy too. We could try to filter by the log level, for instance by Level.WARN by default. But still there is a risk of getting so many warnings since we don't control what could happen on executors. FYI, LogAppender gathers logs not only on the driver but on executors as well (in local mode).
[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30599: --- Description: The LogAppender has a limit for the number of logged events. By default, it is 100 entries. All tests that use LogAppender finish successfully with this limit but sometimes one of the tests fails with the exception like: {code:java} java.lang.IllegalStateException: Number of events reached the limit of 100 while logging CSV header matches to schema w/ enforceSchema. at org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200) at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) at org.apache.log4j.Category.callAppenders(Category.java:206) at org.apache.log4j.Category.forcedLog(Category.java:391) at org.apache.log4j.Category.log(Category.java:856) {code} For example, the CSV test *"SPARK-23786: warning should be printed if CSV header doesn't conform to schema"* uses 2 log appenders. The `success_trace.txt` contains stored log event in normal runs, another one `fail_trace.txt` which I got while re-running the test in a loop. 
The difference is in:
{code:java}
[35] Block broadcast_222 stored as values in memory (estimated size 286.9 KiB, free 1998.5 MiB)
[36] Removed broadcast_214_piece0 on 192.168.1.66:52354 in memory (size: 5.7 KiB, free: 2003.8 MiB)
[37] Removed broadcast_204_piece0 on 192.168.1.66:52354 in memory (size: 5.7 KiB, free: 2003.9 MiB)
[38] Removed broadcast_200_piece0 on 192.168.1.66:52354 in memory (size: 3.7 KiB, free: 2003.9 MiB)
[39] Removed broadcast_207_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2003.9 MiB)
[40] Removed broadcast_208_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2003.9 MiB)
[41] Removed broadcast_194_piece0 on 192.168.1.66:52354 in memory (size: 5.7 KiB, free: 2003.9 MiB)
[42] Removed broadcast_216_piece0 on 192.168.1.66:52354 in memory (size: 53.3 KiB, free: 2004.0 MiB)
[43] Removed broadcast_195_piece0 on 192.168.1.66:52354 in memory (size: 3.7 KiB, free: 2004.0 MiB)
[44] Removed broadcast_219_piece0 on 192.168.1.66:52354 in memory (size: 5.7 KiB, free: 2004.0 MiB)
[45] Removed broadcast_198_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.0 MiB)
[46] Removed broadcast_201_piece0 on 192.168.1.66:52354 in memory (size: 53.4 KiB, free: 2004.0 MiB)
[47] Removed broadcast_205_piece0 on 192.168.1.66:52354 in memory (size: 3.7 KiB, free: 2004.0 MiB)
[48] Removed broadcast_197_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.1 MiB)
[49] Removed broadcast_189_piece0 on 192.168.1.66:52354 in memory (size: 5.7 KiB, free: 2004.1 MiB)
[50] Removed broadcast_193_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.1 MiB)
[51] Removed broadcast_196_piece0 on 192.168.1.66:52354 in memory (size: 53.3 KiB, free: 2004.2 MiB)
[52] Removed broadcast_206_piece0 on 192.168.1.66:52354 in memory (size: 53.4 KiB, free: 2004.2 MiB)
[53] Removed broadcast_209_piece0 on 192.168.1.66:52354 in memory (size: 5.7 KiB, free: 2004.2 MiB)
[54] Removed broadcast_192_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.2 MiB)
[55] Removed broadcast_215_piece0 on 192.168.1.66:52354 in memory (size: 3.7 KiB, free: 2004.2 MiB)
[56] Removed broadcast_210_piece0 on 192.168.1.66:52354 in memory (size: 3.7 KiB, free: 2004.2 MiB)
[57] Removed broadcast_199_piece0 on 192.168.1.66:52354 in memory (size: 5.7 KiB, free: 2004.2 MiB)
[58] Removed broadcast_221_piece0 on 192.168.1.66:52354 in memory (size: 53.4 KiB, free: 2004.3 MiB)
[59] Removed broadcast_213_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.3 MiB)
[60] Removed broadcast_191_piece0 on 192.168.1.66:52354 in memory (size: 53.4 KiB, free: 2004.4 MiB)
[61] Removed broadcast_220_piece0 on 192.168.1.66:52354 in memory (size: 3.7 KiB, free: 2004.4 MiB)
[62] Block broadcast_222_piece0 stored as bytes in memory (estimated size 24.2 KiB, free 2002.0 MiB)
[63] Added broadcast_222_piece0 in memory on 192.168.1.66:52354 (size: 24.2 KiB, free: 2004.4 MiB)
[64] Removed broadcast_203_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.4 MiB)
[65] Created broadcast 222 from apply at OutcomeOf.scala:85
[66] Removed broadcast_218_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.4 MiB)
[67] Removed broadcast_211_piece0 on 192.168.1.66:52354 in memory (size: 53.4 KiB, free: 2004.5 MiB)
[68] Removed broadcast_190_piece0 on 192.168.1.66:52354 in memory (size: 3.7 KiB, free: 2004.5 MiB)
[69] Block broadcast_223 stored as values in memory (estimated size 286.9 KiB, free 2002.5 MiB)
[70] Removed broadcast_212_piece0 on 192.168.1.66:52354 in memory (size: 24.2 KiB, free: 2004.5 MiB)
[71]
{code}
[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020474#comment-17020474 ] Sean R. Owen commented on SPARK-30599:
--
For now, just increase the capacity? 1000 doesn't seem too large. Or I wonder if we can make the logging less noisy too.
[jira] [Commented] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020465#comment-17020465 ] Maxim Gekk commented on SPARK-30599:
--
I see at least 2 solutions:
1. Increase the limit from 100 to a bigger number, for example 1000.
2. Pass a sequence of strings to LogAppender's constructor, and filter incoming log events by those strings in the `append()` method.
[~cloud_fan] [~dongjoon] [~hyukjin.kwon] [~maropu] [~srowen] WDYT?
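The second option could look roughly like the following. This is a hypothetical standalone sketch, not Spark's actual `SparkFunSuite.LogAppender` code: it models log events as plain strings and has no log4j dependency. The idea is that filtering happens before the capacity check, so unrelated noise (such as block manager "Removed broadcast" messages) never counts against the limit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of solution 2: a capped appender that only stores
// events containing one of the expected substrings.
class FilteringLogAppender {
    private final int maxEvents;
    private final List<String> patterns;  // substrings to keep; empty list keeps everything
    private final List<String> events = new ArrayList<>();

    FilteringLogAppender(int maxEvents, List<String> patterns) {
        this.maxEvents = maxEvents;
        this.patterns = patterns;
    }

    void append(String message) {
        // Filter first: noisy, unrelated events are dropped silently
        // and cannot push the appender over its capacity.
        boolean keep = patterns.isEmpty()
                || patterns.stream().anyMatch(message::contains);
        if (!keep) {
            return;
        }
        if (events.size() >= maxEvents) {
            throw new IllegalStateException(
                    "Number of events reached the limit of " + maxEvents);
        }
        events.add(message);
    }

    List<String> events() {
        return events;
    }

    public static void main(String[] args) {
        FilteringLogAppender appender =
                new FilteringLogAppender(100, List.of("CSV header"));
        // Noise is filtered out; only the matching warning is stored.
        appender.append("Removed broadcast_214_piece0 on 192.168.1.66:52354");
        appender.append("CSV header does not conform to the schema");
        System.out.println(appender.events());
    }
}
```

With this shape, a test that asserts on specific warnings would pass the expected substrings to the constructor, and a flood of broadcast cleanup messages between the interesting events would no longer trip the limit.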
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30599:
---
Attachment: fail_trace.txt
[jira] [Created] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
Maxim Gekk created SPARK-30599:
--
Summary: SparkFunSuite.LogAppender throws java.lang.IllegalStateException
Key: SPARK-30599
URL: https://issues.apache.org/jira/browse/SPARK-30599
Project: Spark
Issue Type: Test
Components: Spark Core, SQL, Tests
Affects Versions: 3.0.0
Reporter: Maxim Gekk
Attachments: success_trace.txt
[jira] [Updated] (SPARK-30599) SparkFunSuite.LogAppender throws java.lang.IllegalStateException
[ https://issues.apache.org/jira/browse/SPARK-30599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-30599: --- Attachment: success_trace.txt > SparkFunSuite.LogAppender throws java.lang.IllegalStateException > > > Key: SPARK-30599 > URL: https://issues.apache.org/jira/browse/SPARK-30599 > Project: Spark > Issue Type: Test > Components: Spark Core, SQL, Tests >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Minor > Attachments: success_trace.txt > > > The LogAppender has a limit for the number of logged events. By default, it is > 100. All tests that use LogAppender finish successfully with this limit, but > sometimes a test fails with an exception like: > {code:java} > java.lang.IllegalStateException: Number of events reached the limit of 100 > while logging CSV header matches to schema w/ enforceSchema. > at > org.apache.spark.SparkFunSuite$LogAppender.append(SparkFunSuite.scala:200) > at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251) > at > org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) > at org.apache.log4j.Category.callAppenders(Category.java:206) > at org.apache.log4j.Category.forcedLog(Category.java:391) > at org.apache.log4j.Category.log(Category.java:856) > {code} > For example, the CSV test "SPARK-23786: warning should be printed if CSV > header doesn't conform to schema" uses 2 log appenders.
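The capped-buffer behavior this report describes can be sketched as a tiny standalone class. This is a hypothetical illustration, not Spark's actual `SparkFunSuite.LogAppender`: events are collected up to a fixed limit, and exceeding the limit throws `IllegalStateException`, matching the message in the stack trace above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a log collector with a hard cap on buffered events.
class CappedLogBuffer {
    private final int maxEvents;
    private final List<String> events = new ArrayList<>();

    CappedLogBuffer(int maxEvents) {
        this.maxEvents = maxEvents;
    }

    // Buffers one event; fails fast once the cap is reached.
    void append(String event) {
        if (events.size() >= maxEvents) {
            throw new IllegalStateException(
                "Number of events reached the limit of " + maxEvents);
        }
        events.add(event);
    }

    int size() {
        return events.size();
    }
}
```

A test that exercises more log statements than the cap allows (as the failing CSV test apparently does) would hit the exception on the first append past the limit, which is why raising the limit or scoping the appender more narrowly are the obvious remedies.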
[jira] [Commented] (SPARK-30528) DPP issues
[ https://issues.apache.org/jira/browse/SPARK-30528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020385#comment-17020385 ] Wei Xue commented on SPARK-30528: - Good point, [~mayurb31]! 1. Heuristics: yes, we should improve the cost estimate for the filter plan. As a quick workaround, though, you could set "spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio" to "0.0", which would disable this kind of DPP if columns stats are not available. 2. Reuse: yes, that's a dilemma here: first of all we wanna reduce the result set returned to the driver for pruning values, and that's why the Aggregate is added. It might kill some potential opportunities for exchange reuse if the filter plan contains another join, but that kind of potential opportunity is not fully guaranteed even if we didn't push down the Aggregate, for the join in the filter plan can turn out to be a BHJ. However, `pruningHasBenefit` had intended to cost the entire DPP subquery as overhead without considering reuse, so this takes us back to point 1: we should improve the costing of heuristics so that this kind of DPP should not be triggered at all if the scan + join would be just too much work. 3. Can you attach the query plan instead of UI for your last example? I think the reuse did not happen because the first subquery selects `col1` while the second `col2`? > DPP issues > -- > > Key: SPARK-30528 > URL: https://issues.apache.org/jira/browse/SPARK-30528 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.0.0 >Reporter: Mayur Bhosale >Priority: Major > Labels: performance > Attachments: cases.png, dup_subquery.png, plan.png > > > In DPP, heuristics to decide if DPP is going to benefit relies on the sizes > of the tables in the right subtree of the join. This might not be a correct > estimate especially when the detailed column level stats are not available. 
> {code:java} > // the pruning overhead is the total size in bytes of all scan relations > val overhead = > otherPlan.collectLeaves().map(_.stats.sizeInBytes).sum.toFloat > filterRatio * partPlan.stats.sizeInBytes.toFloat > overhead.toFloat > {code} > Also, DPP executes the entire right side of the join as a subquery, because of > which multiple scans happen for the tables in the right subtree of the join. > This can cause issues when the join is a non-Broadcast Hash Join (BHJ) and reuse of > the subquery result does not happen. Also, I couldn’t figure out why the > results from the subquery get re-used only for BHJ. > > Consider a query, > {code:java} > SELECT * > FROM store_sales_partitioned >JOIN (SELECT * > FROM store_returns_partitioned, > date_dim > WHERE sr_returned_date_sk = d_date_sk) ret_date > ON ss_sold_date_sk = d_date_sk > WHERE d_fy_quarter_seq > 0 > {code} > DPP will kick in for both the joins. (Please check the image plan.png attached > below for the plan) > Some of the observations - > * Based on heuristics, DPP would go ahead with pruning if the cost of > scanning the tables in the right sub-tree of the join is less than the > benefit due to pruning. This is due to the reason that multiple scans will be > needed for an SMJ. But the heuristic simply checks whether the benefits offset the > cost of multiple scans and does not take into consideration other operations, > like joins, in the right subtree, which can be quite expensive. This issue > will be particularly prominent when detailed column-level stats are not > available. In the example above, the decision that pruningHasBenefit was made > on the basis of the sizes of the tables store_returns_partitioned and date_dim, > but did not take into consideration the join between them that happens before the join > with the store_sales_partitioned table. > * Multiple scans are needed when the join is SMJ as the reuse of the > exchanges does not happen.
This is because an Aggregate gets added on top of the > right subtree to be executed as a subquery in order to prune only the required > columns. Here, scanning all the columns (as the right subtree of the join > would) and reusing the same exchange might be more helpful, as it avoids > duplicate scans. > This was just a representative example, but in general, for cases such as in > the image cases.png below, DPP can cause performance issues. > > Also, when there are multiple DPP-compatible join conditions in > the same join, the entire right subtree of the join would be executed as a > subquery that many times. Consider an example, > {code:java} > SELECT * > FROM partitionedtable > JOIN nonpartitionedtable > ON partcol1 = col1 > AND partcol2 = col2 > WHERE nonpartitionedtable.id > 0 > {code} > Here
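The cost heuristic quoted earlier in this thread can be sketched as follows. The names and signature are illustrative, not Spark's internals: pruning is judged worthwhile only when the estimated benefit on the partitioned side exceeds the total scan size of the filter side's leaf relations. It also shows why setting the fallback filter ratio to 0.0, as suggested in the comment above, disables this kind of DPP when column stats are missing.

```java
// Hedged sketch of the DPP pruningHasBenefit heuristic discussed above.
// overhead  = total size in bytes of all scan relations on the filter side
// benefit   = filterRatio * size of the partitioned plan
// Illustrative only; not Spark's actual API.
class DppHeuristic {
    static boolean pruningHasBenefit(long[] filterSideLeafSizes,
                                     long partitionedPlanSize,
                                     double filterRatio) {
        long overhead = 0;
        for (long s : filterSideLeafSizes) {
            overhead += s;  // sum of leaf scan sizes, as in the quoted code
        }
        double benefit = filterRatio * partitionedPlanSize;
        return benefit > overhead;
    }
}
```

Note that, as the reporter points out, this only counts leaf scan sizes as overhead; any join or aggregate in the filter-side subtree contributes nothing to `overhead`, so the estimate is optimistic when the subtree is expensive.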
[jira] [Commented] (SPARK-30049) SQL fails to parse when comment contains an unmatched quote character
[ https://issues.apache.org/jira/browse/SPARK-30049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020378#comment-17020378 ] Javier Fuentes commented on SPARK-30049: I have raised a PR with a simple fix and unit tests. Let me know if it looks good or not. > SQL fails to parse when comment contains an unmatched quote character > - > > Key: SPARK-30049 > URL: https://issues.apache.org/jira/browse/SPARK-30049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jason Darrell Lowe >Priority: Major > Attachments: Screen Shot 2019-12-18 at 9.26.29 AM.png > > > A SQL statement that contains a comment with an unmatched quote character can > lead to a parse error. These queries parsed correctly in older versions of > Spark. For example, here's an excerpt from an interactive spark-sql session > on a recent Spark-3.0.0-SNAPSHOT build (commit > e23c135e568d4401a5659bc1b5ae8fc8bf147693): > {noformat} > spark-sql> SELECT 1 -- someone's comment here > > ; > Error in query: > extraneous input ';' expecting (line 2, pos 0) > == SQL == > SELECT 1 -- someone's comment here > ; > ^^^ > {noformat}
[jira] [Updated] (SPARK-30598) Detect equijoins better
[ https://issues.apache.org/jira/browse/SPARK-30598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-30598: --- Issue Type: Improvement (was: Bug) > Detect equijoins better > --- > > Key: SPARK-30598 > URL: https://issues.apache.org/jira/browse/SPARK-30598 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Minor > > The following two queries produce different plans, as the second one is not > recognised as an equijoin. > {noformat} > SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 AND t1.c = t2.c > SortMergeJoin [c#225], [c#236], FullOuter, ((c2#226 = 2) AND (c2#237 = 2)) > :- *(2) Sort [c#225 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(c#225, 5), true, [id=#101] > : +- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226] > :+- *(1) LocalTableScan [_1#220, _2#221] > +- *(4) Sort [c#236 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(c#236, 5), true, [id=#106] > +- *(3) Project [_1#231 AS c#236, _2#232 AS c2#237] > +- *(3) LocalTableScan [_1#231, _2#232] > {noformat} > {noformat} > SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 > BroadcastNestedLoopJoin BuildRight, FullOuter, ((c2#226 = 2) AND (c2#237 = 2)) > :- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226] > : +- *(1) LocalTableScan [_1#220, _2#221] > +- BroadcastExchange IdentityBroadcastMode, [id=#146] >+- *(2) Project [_1#231 AS c#236, _2#232 AS c2#237] > +- *(2) LocalTableScan [_1#231, _2#232] > {noformat} > We could detect the implicit equalities from the join condition.
[jira] [Created] (SPARK-30598) Detect equijoins better
Peter Toth created SPARK-30598: -- Summary: Detect equijoins better Key: SPARK-30598 URL: https://issues.apache.org/jira/browse/SPARK-30598 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth The following two queries produce different plans, as the second one is not recognised as an equijoin. {noformat} SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 AND t1.c = t2.c SortMergeJoin [c#225], [c#236], FullOuter, ((c2#226 = 2) AND (c2#237 = 2)) :- *(2) Sort [c#225 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(c#225, 5), true, [id=#101] : +- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226] :+- *(1) LocalTableScan [_1#220, _2#221] +- *(4) Sort [c#236 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(c#236, 5), true, [id=#106] +- *(3) Project [_1#231 AS c#236, _2#232 AS c2#237] +- *(3) LocalTableScan [_1#231, _2#232] {noformat} {noformat} SELECT * FROM t1 FULL OUTER JOIN t2 ON t1.c2 = 2 AND t2.c2 = 2 BroadcastNestedLoopJoin BuildRight, FullOuter, ((c2#226 = 2) AND (c2#237 = 2)) :- *(1) Project [_1#220 AS c#225, _2#221 AS c2#226] : +- *(1) LocalTableScan [_1#220, _2#221] +- BroadcastExchange IdentityBroadcastMode, [id=#146] +- *(2) Project [_1#231 AS c#236, _2#232 AS c2#237] +- *(2) LocalTableScan [_1#231, _2#232] {noformat} We could detect the implicit equalities from the join condition.
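The "implicit equalities" idea in this ticket can be illustrated with a toy sketch (hypothetical, not Spark's optimizer code): if each side of the join equates one of its columns to the same literal, as in `t1.c2 = 2 AND t2.c2 = 2`, an equality `t1.c2 = t2.c2` between the two columns can be inferred and used as an equijoin key, allowing a hash- or sort-merge-based join instead of a nested loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy sketch of inferring cross-side equalities from literal predicates.
// Each map records: column name -> literal the join condition compares it to.
class ImplicitEquality {
    static List<String> inferEqualities(Map<String, Integer> leftPreds,
                                        Map<String, Integer> rightPreds) {
        List<String> equalities = new ArrayList<>();
        for (Map.Entry<String, Integer> l : leftPreds.entrySet()) {
            for (Map.Entry<String, Integer> r : rightPreds.entrySet()) {
                // same literal on both sides implies the columns are equal
                if (l.getValue().equals(r.getValue())) {
                    equalities.add(l.getKey() + " = " + r.getKey());
                }
            }
        }
        return equalities;
    }
}
```

In the second query above, this would recover `t1.c2 = t2.c2` as an equijoin key even though it never appears literally in the ON clause.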
[jira] [Resolved] (SPARK-15616) CatalogRelation should fallback to HDFS size of partitions that are involved in Query if statistics are not available.
[ https://issues.apache.org/jira/browse/SPARK-15616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15616. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26805 [https://github.com/apache/spark/pull/26805] > CatalogRelation should fallback to HDFS size of partitions that are involved > in Query if statistics are not available. > -- > > Key: SPARK-15616 > URL: https://issues.apache.org/jira/browse/SPARK-15616 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Lianhui Wang >Priority: Major > Fix For: 3.0.0 > > > Currently, if some partitions of a partitioned table are used in a join > operation, we rely on the Metastore-returned table size to decide whether we can > convert the operation to a broadcast join. > If a Filter can prune some partitions, Hive can prune partitions before > deciding to use broadcast joins, according to the HDFS size of the partitions that > are involved in the query. Spark SQL needs the same capability, as it can improve join > performance for partitioned tables.
[jira] [Resolved] (SPARK-30252) Disallow negative scale of Decimal under ansi mode
[ https://issues.apache.org/jira/browse/SPARK-30252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30252. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 26881 [https://github.com/apache/spark/pull/26881] > Disallow negative scale of Decimal under ansi mode > -- > > Key: SPARK-30252 > URL: https://issues.apache.org/jira/browse/SPARK-30252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > > According to SQL standard, > {quote}4.4.2 Characteristics of numbers > An exact numeric type has a precision P and a scale S. P is a positive > integer that determines the number of significant digits in a particular > radix R, where R is either 2 or 10. S is a non-negative integer. > {quote} > scale of Decimal should always be non-negative. And other mainstream > databases, like Presto, PostgreSQL, also don't allow negative scale.
[jira] [Assigned] (SPARK-30252) Disallow negative scale of Decimal under ansi mode
[ https://issues.apache.org/jira/browse/SPARK-30252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30252: --- Assignee: wuyi > Disallow negative scale of Decimal under ansi mode > -- > > Key: SPARK-30252 > URL: https://issues.apache.org/jira/browse/SPARK-30252 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > > According to SQL standard, > {quote}4.4.2 Characteristics of numbers > An exact numeric type has a precision P and a scale S. P is a positive > integer that determines the number of significant digits in a particular > radix R, where R is either 2 or 10. S is a non-negative integer. > {quote} > scale of Decimal should always be non-negative. And other mainstream > databases, like Presto, PostgreSQL, also don't allow negative scale.
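The check this ticket asks for can be sketched with a minimal, hypothetical validation helper (not Spark's internal Decimal API): under ANSI mode a negative scale is rejected outright, matching the SQL-standard requirement quoted above that S be a non-negative integer, while the legacy (non-ANSI) mode keeps accepting it for compatibility.

```java
// Minimal sketch of an ANSI-mode decimal scale check.
// Names are illustrative; this is not Spark's internal API.
class DecimalScaleCheck {
    static void checkScale(int precision, int scale, boolean ansiEnabled) {
        // SQL standard 4.4.2: scale S is a non-negative integer
        if (ansiEnabled && scale < 0) {
            throw new IllegalArgumentException(
                "Negative scale is not allowed under ANSI mode: " + scale);
        }
        // basic sanity: precision positive, scale not exceeding precision
        if (precision <= 0 || scale > precision) {
            throw new IllegalArgumentException(
                "Invalid decimal precision/scale: (" + precision + ", " + scale + ")");
        }
    }
}
```

A literal like `1e2` parsed as `DECIMAL(1, -2)` would pass in legacy mode but fail the first branch once ANSI mode is enabled.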
[jira] [Assigned] (SPARK-30593) Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI and no round trip.
[ https://issues.apache.org/jira/browse/SPARK-30593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30593: --- Assignee: Kent Yao > Revert interval ISO/ANSI SQL Standard output since we decide not to follow > ANSI and no round trip. > -- > > Key: SPARK-30593 > URL: https://issues.apache.org/jira/browse/SPARK-30593 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > Revert interval ISO/ANSI SQL Standard output since we decide not to > follow ANSI, so there is no round trip.
[jira] [Resolved] (SPARK-30593) Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI and no round trip.
[ https://issues.apache.org/jira/browse/SPARK-30593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30593. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27304 [https://github.com/apache/spark/pull/27304] > Revert interval ISO/ANSI SQL Standard output since we decide not to follow > ANSI and no round trip. > -- > > Key: SPARK-30593 > URL: https://issues.apache.org/jira/browse/SPARK-30593 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Revert interval ISO/ANSI SQL Standard output since we decide not to > follow ANSI, so there is no round trip.
[jira] [Reopened] (SPARK-29783) Support SQL Standard output style for interval type
[ https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-29783: - > Support SQL Standard output style for interval type > --- > > Key: SPARK-29783 > URL: https://issues.apache.org/jira/browse/SPARK-29783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Support sql standard interval-style for output. > > ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval|| > |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06| > |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes > 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|
[jira] [Commented] (SPARK-29783) Support SQL Standard output style for interval type
[ https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020205#comment-17020205 ] Wenchen Fan commented on SPARK-29783: - This has been reverted in https://github.com/apache/spark/pull/27304 > Support SQL Standard output style for interval type > --- > > Key: SPARK-29783 > URL: https://issues.apache.org/jira/browse/SPARK-29783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Support sql standard interval-style for output. > > ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval|| > |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06| > |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes > 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|
[jira] [Resolved] (SPARK-29783) Support SQL Standard output style for interval type
[ https://issues.apache.org/jira/browse/SPARK-29783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29783. - Resolution: Won't Do > Support SQL Standard output style for interval type > --- > > Key: SPARK-29783 > URL: https://issues.apache.org/jira/browse/SPARK-29783 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > Support sql standard interval-style for output. > > ||Style ||conf||Year-Month Interval||Day-Time Interval||Mixed Interval|| > |{{sql_standard}}|ANSI enabled|1-2|3 4:05:06|-1-2 3 -4:05:06| > |{{spark's current}}|ansi disabled|1 year 2 mons|1 days 2 hours 3 minutes > 4.123456 seconds|interval 1 days 2 hours 3 minutes 4.123456 seconds|
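The two year-month output styles in the table above can be illustrated with a small formatter. This is a hypothetical helper, not Spark's code: the sql_standard style renders 14 months as "1-2", while the current Spark style renders "1 year 2 mons".

```java
// Illustrative formatter for the two year-month interval styles from the
// table above (sql_standard vs Spark's current style). Not Spark's API.
class IntervalStyle {
    static String formatYearMonth(int totalMonths, boolean sqlStandard) {
        int years = totalMonths / 12;
        int months = totalMonths % 12;
        if (sqlStandard) {
            return years + "-" + months;  // e.g. "1-2"
        }
        // Spark's current pluralization: "1 year 2 mons"
        return years + " year" + (years == 1 ? "" : "s")
             + " " + months + " mon" + (months == 1 ? "" : "s");
    }
}
```

The lack of a round trip mentioned in SPARK-30593 is visible here: the sql_standard string carries no unit names, so a parser has to know the style in advance to read it back.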
[jira] [Created] (SPARK-30597) Unable to load properties file in SparkStandalone HDFS mode
Gajanan Hebbar created SPARK-30597: -- Summary: Unable to load properties file in SparkStandalone HDFS mode Key: SPARK-30597 URL: https://issues.apache.org/jira/browse/SPARK-30597 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3 Environment: 2.4.3 standalone with HDFS Reporter: Gajanan Hebbar We run the Spark application in YARN HDFS/NFS/WebHDFS and standalone HDFS/NFS modes. When the application is submitted in standalone HDFS mode, the configuration jar (properties file) is not read when the application starts, and this makes the logger fall back to the default log file and log level. So when the application is submitted to standalone HDFS, configuration files are not read. STD OUT LOGS from Standalone HDFS - properties file is not found == log4j: Trying to find [osa-log.properties] using context classloader sun.misc.Launcher$AppClassLoader@4cdf35a9. log4j: Trying to find [osa-log.properties] using sun.misc.Launcher$AppClassLoader@4cdf35a9 class loader. log4j: Trying to find [osa-log.properties] using ClassLoader.getSystemResource(). log4j: Could not find resource: [osa-log.properties]. log4j: Reading configuration from URL jar:file:/osa/spark-2.4.3-bin-hadoop2.7/jars/spark-core_2.11-2.4.3.jar!/org/apache/spark/log4j-defaults.properties STD out from Standalone NFS - properties files are found and able to load it === log4j: Trying to find [osa-log.properties] using context classloader sun.misc.Launcher$AppClassLoader@4cdf35a9. log4j: Using URL [jar:file:/scratch//gitlocal/soa-osa/out/configdir/ux6m3UQZ/app/sx_DatePipeline_A44C5337_B0D6_4A67_9D60_6BE629DABADA_x0jR7lg5_public/__config__.jar!/osa-log.properties] for automatic log4j configuration. log4j: Preferred configurator class: OSALogPropertyConfigurator STD out from YARN HDFS — properties files are found and able to load it === log4j: Trying to find [osa-log.properties] using context classloader sun.misc.Launcher$AppClassLoader@2626b418.
log4j: Using URL [file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties] for automatic log4j configuration. log4j: Preferred configurator class: OSALogPropertyConfigurator log4j: configuration file:/tmp/hadoop-yarn/nm-local-dir/usercache/osa/filecache/16/__osa_config__8272345390991627491.jar/osa-log.properties
[jira] [Updated] (SPARK-30546) Make interval type more future-proofing
[ https://issues.apache.org/jira/browse/SPARK-30546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-30546: - Description: Before 3.0 we may make some efforts for the current interval type to make it more future-proofing. e.g. 1. add unstable annotation to the CalendarInterval class. People already use it as UDF inputs so it’s better to make it clear it’s unstable. 2. Add a schema checker to prohibit create v2 custom catalog table with intervals, as same as what we do for the builtin catalog 3. Add a schema checker for DataFrameWriterV2 too 4. Make the interval type incomparable as version 2.4 for disambiguation of comparison between year-month and day-time fields 5. The 3.0 newly added to_csv should not support output intervals as same as using CSV file format 6. The function to_json should not allow using interval as a key field as same as the value field and JSON datasource, with a legacy config to restore. 7. Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI, so there is no round trip. was: Before 3.0 we maymake some efforts for the current interval type to make it more future-proofing. e.g. 1. add unstable annotation to the CalendarInterval class. People already use it as UDF inputs so it’s better to make it clear it’s unstable. 2. Add a schema checker to prohibit create v2 custom catalog table with intervals, as same as what we do for the builtin catalog 3. Add a schema checker for DataFrameWriterV2 too 4. Make the interval type incomparable as version 2.4 for disambiguation of comparison between year-month and day-time fields 5. The 3.0 newly added to_csv should not support output intervals as same as using CSV file format 6. The function to_json should not allow using interval as a key field as same as the value field and JSON datasource, with a legacy config to restore. 7. Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI, so there is no round trip. 
> Make interval type more future-proofing > --- > > Key: SPARK-30546 > URL: https://issues.apache.org/jira/browse/SPARK-30546 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > Before 3.0 we may make some efforts for the current interval type to make it > more future-proofing. e.g. > 1. add unstable annotation to the CalendarInterval class. People already use > it as UDF inputs so it’s better to make it clear it’s unstable. > 2. Add a schema checker to prohibit create v2 custom catalog table with > intervals, as same as what we do for the builtin catalog > 3. Add a schema checker for DataFrameWriterV2 too > 4. Make the interval type incomparable as version 2.4 for disambiguation of > comparison between year-month and day-time fields > 5. The 3.0 newly added to_csv should not support output intervals as same as > using CSV file format > 6. The function to_json should not allow using interval as a key field as > same as the value field and JSON datasource, with a legacy config to > restore. > 7. Revert interval ISO/ANSI SQL Standard output since we decide not to > follow ANSI, so there is no round trip.
[jira] [Resolved] (SPARK-30571) coalesce shuffle reader with splitting shuffle fetch request fails
[ https://issues.apache.org/jira/browse/SPARK-30571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30571. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27280 [https://github.com/apache/spark/pull/27280] > coalesce shuffle reader with splitting shuffle fetch request fails > -- > > Key: SPARK-30571 > URL: https://issues.apache.org/jira/browse/SPARK-30571 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0
[jira] [Updated] (SPARK-19798) Query returns stale results when tables are modified on other sessions
[ https://issues.apache.org/jira/browse/SPARK-19798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Giambattista updated SPARK-19798: - Affects Version/s: 3.0.0 > Query returns stale results when tables are modified on other sessions > -- > > Key: SPARK-19798 > URL: https://issues.apache.org/jira/browse/SPARK-19798 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0, 3.0.0 >Reporter: Giambattista >Priority: Major > > I observed the problem on the master branch with the thrift server in multisession > mode (the default), but I was able to replicate it with spark-shell as well > (see below for the sequence to replicate). > I observed cases where changes made in a session (table insert, table > renaming) are not visible to other derived sessions (created with > session.newSession). > The problem seems due to the fact that each session has its own > tableRelationCache and it does not get refreshed. > IMO tableRelationCache should be shared in sharedState, maybe in the > cacheManager, so that refresh of caches for data that is not session-specific > such as temporary tables gets centralized.
> --- Spark shell script > val spark2 = spark.newSession > spark.sql("CREATE TABLE test (a int) using parquet") > spark2.sql("select * from test").show // OK returns empty > spark.sql("select * from test").show // OK returns empty > spark.sql("insert into TABLE test values 1,2,3") > spark2.sql("select * from test").show // ERROR returns empty > spark.sql("select * from test").show // OK returns 3,2,1 > spark.sql("create table test2 (a int) using parquet") > spark.sql("insert into TABLE test2 values 4,5,6") > spark2.sql("select * from test2").show // OK returns 6,4,5 > spark.sql("select * from test2").show // OK returns 6,4,5 > spark.sql("alter table test rename to test3") > spark.sql("alter table test2 rename to test") > spark.sql("alter table test3 rename to test2") > spark2.sql("select * from test").show // ERROR returns empty > spark.sql("select * from test").show // OK returns 6,4,5 > spark2.sql("select * from test2").show // ERROR throws > java.io.FileNotFoundException > spark.sql("select * from test2").show // OK returns 3,1,2
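The fix direction suggested in this report (one shared, invalidatable relation cache instead of per-session copies) can be sketched with a toy cache. Everything here is illustrative: the "catalog" is just a map standing in for table data, and none of these names are Spark's actual CacheManager API. The staleness in the repro script corresponds to reading a cached entry after the underlying table changed without invalidation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy sketch: a single relation cache shared by all sessions.
// Writes must invalidate the entry so every session sees fresh data.
class SharedRelationCache {
    private final Map<String, List<Integer>> cache = new ConcurrentHashMap<>();

    // Reads through the cache; populates from the "catalog" on a miss.
    List<Integer> lookup(String table, Map<String, List<Integer>> catalog) {
        return cache.computeIfAbsent(table,
            t -> new ArrayList<>(catalog.getOrDefault(t, List.of())));
    }

    // Called on INSERT / ALTER TABLE RENAME etc.; next lookup re-reads.
    void invalidate(String table) {
        cache.remove(table);
    }
}
```

With per-session caches (the current behavior described above), there is no single place to call `invalidate`, which is exactly why `spark2` keeps seeing the pre-insert, pre-rename state.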
[jira] [Resolved] (SPARK-30596) Use a different repository id in pom to work around sbt-pom-reader issue
[ https://issues.apache.org/jira/browse/SPARK-30596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30596. -- Resolution: Invalid > Use a different repository id in pom to work around sbt-pom-reader issue > > > Key: SPARK-30596 > URL: https://issues.apache.org/jira/browse/SPARK-30596 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See [https://github.com/apache/spark/pull/27242#issuecomment-576589529]
[jira] [Created] (SPARK-30596) Use a different repository id in pom to work around sbt-pom-reader issue
Hyukjin Kwon created SPARK-30596: Summary: Use a different repository id in pom to work around sbt-pom-reader issue Key: SPARK-30596 URL: https://issues.apache.org/jira/browse/SPARK-30596 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.5, 3.0.0 Reporter: Hyukjin Kwon See [https://github.com/apache/spark/pull/27242#issuecomment-576589529]
[jira] [Updated] (SPARK-30596) Use a different repository id in pom to work around sbt-pom-reader issue
[ https://issues.apache.org/jira/browse/SPARK-30596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30596: - Issue Type: Bug (was: Improvement) > Use a different repository id in pom to work around sbt-pom-reader issue > > > Key: SPARK-30596 > URL: https://issues.apache.org/jira/browse/SPARK-30596 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > See [https://github.com/apache/spark/pull/27242#issuecomment-576589529]
[jira] [Commented] (SPARK-30534) Use mvn in `dev/scalastyle`
[ https://issues.apache.org/jira/browse/SPARK-30534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020076#comment-17020076 ] Hyukjin Kwon commented on SPARK-30534: -- [https://github.com/apache/spark/pull/27281] fixed this issue. > Use mvn in `dev/scalastyle` > --- > > Key: SPARK-30534 > URL: https://issues.apache.org/jira/browse/SPARK-30534 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > This issue aims to use `mvn` instead of `sbt`. > As of now, the Apache Spark sbt build is broken by the Maven Central repository > policy. > - > https://stackoverflow.com/questions/59764749/requests-to-http-repo1-maven-org-maven2-return-a-501-https-required-status-an > > Effective January 15, 2020, the Central Maven Repository no longer supports > > insecure communication over plain HTTP and requires that all requests to the > > repository are encrypted over HTTPS. > We can reproduce this locally as follows: > {code} > $ rm -rf ~/.m2/repository/org/apache/apache/18/ > $ build/sbt clean > {code} > This issue aims to recover the GitHub Action `lint-scala` first by using mvn.
[jira] [Issue Comment Deleted] (SPARK-30572) Add a fallback Maven repository
[ https://issues.apache.org/jira/browse/SPARK-30572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30572: - Comment: was deleted (was: [https://github.com/apache/spark/pull/27281] fixed this issue.) > Add a fallback Maven repository > --- > > Key: SPARK-30572 > URL: https://issues.apache.org/jira/browse/SPARK-30572 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > This issue aims to add a fallback Maven repository for when a mirror to `central` > fails. > For example, when we use Google Maven Central in GitHub Actions as a mirror of > `central`, this fallback will be used when Google Maven Central is out of sync due to its late > sync cycle.
[jira] [Commented] (SPARK-30572) Add a fallback Maven repository
[ https://issues.apache.org/jira/browse/SPARK-30572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020073#comment-17020073 ] Hyukjin Kwon commented on SPARK-30572: -- [https://github.com/apache/spark/pull/27281] fixed this issue. > Add a fallback Maven repository > --- > > Key: SPARK-30572 > URL: https://issues.apache.org/jira/browse/SPARK-30572 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.4, 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > This issue aims to add a fallback Maven repository for when a mirror to `central` > fails. > For example, when we use Google Maven Central in GitHub Actions as a mirror of > `central`, this fallback will be used when Google Maven Central is out of sync due to its late > sync cycle.
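As a sketch of the fallback idea (the repository id and the choice of URLs here are illustrative assumptions, not necessarily the exact change the PR merged): when a build mirrors `central` to Google Maven Central, an extra repository entry under a *different* id in pom.xml is not intercepted by the mirror, so Maven can fall back to it when the mirror is out of sync:

```xml
<repositories>
  <repository>
    <!-- hypothetical fallback entry: a mirror with mirrorOf=central does not
         apply to this id, so artifacts the mirror has not yet synced can
         still be fetched from the canonical HTTPS Central URL -->
    <id>central-fallback</id>
    <url>https://repo.maven.apache.org/maven2</url>
  </repository>
</repositories>
```

This also hints at why a distinct repository id matters (compare SPARK-30596): Maven mirror rules match on the id, so reusing `central` would defeat the fallback.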
[jira] [Updated] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack updated SPARK-30296: -- Priority: Minor (was: Major) > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Minor > > Evolving Spark code needs frequent regression testing to prove it still > produces identical results, or if changes are expected, to investigate those > changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes > laborious and error-prone. With a single proven and tested method, diffing > becomes easier and a more reliable operation. As a Dataset transformation, > you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed > production code.
[jira] [Resolved] (SPARK-30296) Dataset diffing transformation
[ https://issues.apache.org/jira/browse/SPARK-30296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Enrico Minack resolved SPARK-30296. --- Resolution: Won't Do > Dataset diffing transformation > -- > > Key: SPARK-30296 > URL: https://issues.apache.org/jira/browse/SPARK-30296 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Enrico Minack >Priority: Major > > Evolving Spark code needs frequent regression testing to prove it still > produces identical results, or if changes are expected, to investigate those > changes. Diffing the Datasets of two code paths provides confidence. > Diffing small schemata is easy, but with a wide schema the Spark query becomes > laborious and error-prone. With a single proven and tested method, diffing > becomes easier and a more reliable operation. As a Dataset transformation, > you get this operation first-hand with your Dataset API. > This has proven to be useful for interactive Spark as well as deployed > production code.
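The semantics of the proposed diff can be sketched without Spark at all. A minimal Python analogue, keyed on an id column, tagging each row as unchanged (N), changed (C), inserted (I), or deleted (D) — the function name `diff_rows` and the tag letters are hypothetical, not the proposed Dataset API:

```python
def diff_rows(left, right, key):
    """Diff two lists of row dicts by a key column.

    Returns (tag, row) pairs sorted by key:
      N = present in both and equal, C = present in both but changed
      (the right-hand row is kept), I = only in right, D = only in left.
    """
    left_by_key = {row[key]: row for row in left}
    right_by_key = {row[key]: row for row in right}
    result = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        l, r = left_by_key.get(k), right_by_key.get(k)
        if l is None:
            result.append(("I", r))       # inserted: only on the right
        elif r is None:
            result.append(("D", l))       # deleted: only on the left
        elif l == r:
            result.append(("N", l))       # no change
        else:
            result.append(("C", r))       # changed: keep the new version
    return result
```

On a Dataset this would be a join on the key columns followed by a per-column comparison; the point of the ticket is that hand-writing that comparison for a wide schema is exactly the laborious, error-prone part.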
[jira] [Created] (SPARK-30595) Unable to create local temp dir on spark on k8s mode, with defaults.
Prashant Sharma created SPARK-30595: --- Summary: Unable to create local temp dir on spark on k8s mode, with defaults. Key: SPARK-30595 URL: https://issues.apache.org/jira/browse/SPARK-30595 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.0 Reporter: Prashant Sharma Unless we configure the property {code} spark.local.dir /tmp {code} the following error occurs: {noformat} 20/01/21 08:33:17 INFO SparkEnv: Registering BlockManagerMasterHeartbeat 20/01/21 08:33:17 ERROR DiskBlockManager: Failed to create local dir in /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4. Ignoring this directory. java.io.IOException: Failed to create a temp directory (under /var/data/spark-284c6844-8969-4288-9a6b-b72679c5b8e4) after 10 attempts! at org.apache.spark.util.Utils$.createDirectory(Utils.scala:304) at org.apache.spark.storage.DiskBlockManager.$anonfun$createLocalDirs$1(DiskBlockManager.scala:164) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) {noformat} I have not yet fully understood the root cause; I will post my findings once it is clear.
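The failing call in the stack trace is a retry loop: Spark's `Utils.createDirectory` repeatedly tries to create a uniquely named directory under the configured local dir and gives up with the IOException above after a fixed number of attempts. A rough Python analogue of that loop (the function name, the directory-name scheme, and the default of 10 attempts here are assumptions for illustration, not the Scala implementation):

```python
import os
import uuid


def create_directory(root, max_attempts=10):
    """Try to create a uniquely named directory under `root`.

    Each attempt swallows the OS error (e.g. root missing or not writable,
    as on the k8s volume in the report) and retries with a fresh name;
    after `max_attempts` failures it raises, mirroring the message in the log.
    """
    for _ in range(max_attempts):
        candidate = os.path.join(root, "spark-" + uuid.uuid4().hex)
        try:
            os.makedirs(candidate)
            return candidate
        except OSError:
            continue
    raise IOError(
        f"Failed to create a temp directory (under {root}) after {max_attempts} attempts!"
    )
```

This shapes the debugging question the reporter is asking: the retries cannot succeed if `/var/data/spark-…` itself is unusable, which is why pointing `spark.local.dir` at `/tmp` sidesteps the failure.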
[jira] [Created] (SPARK-30594) Do not post SparkListenerBlockUpdated when updateBlockInfo returns false
wuyi created SPARK-30594: Summary: Do not post SparkListenerBlockUpdated when updateBlockInfo returns false Key: SPARK-30594 URL: https://issues.apache.org/jira/browse/SPARK-30594 Project: Spark Issue Type: Improvement Components: Spark Core, Web UI Affects Versions: 3.0.0 Reporter: wuyi We should not post the SparkListenerBlockUpdated event when updateBlockInfo returns false, which may possibly show negative memory in the UI (see the snapshot in the PR of SPARK-30465).