[jira] [Created] (SPARK-41200) BytesToBytesMap's longArray size can be up to MAX_CAPACITY
EdisonWang created SPARK-41200: -- Summary: BytesToBytesMap's longArray size can be up to MAX_CAPACITY Key: SPARK-41200 URL: https://issues.apache.org/jira/browse/SPARK-41200 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40035) Avoid applying filter twice when listing files
EdisonWang created SPARK-40035: -- Summary: Avoid applying filter twice when listing files Key: SPARK-40035 URL: https://issues.apache.org/jira/browse/SPARK-40035 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39476) Disable Unwrap cast optimization when casting from Long to Float/Double or from Integer to Float
EdisonWang created SPARK-39476: -- Summary: Disable Unwrap cast optimization when casting from Long to Float/Double or from Integer to Float Key: SPARK-39476 URL: https://issues.apache.org/jira/browse/SPARK-39476 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.3.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39249) Improve subexpression elimination for conditional expressions
[ https://issues.apache.org/jira/browse/SPARK-39249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-39249: --- Description: Currently we can do subexpression elimination for conditional expressions when the subexpression is common across all {{{}branchGroups{}}}. In fact, we can further improve this when there are common expressions between {{alwaysEvaluatedInputs}} and {{{}branchGroups{}}}. > Improve subexpression elimination for conditional expressions > - > > Key: SPARK-39249 > URL: https://issues.apache.org/jira/browse/SPARK-39249 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: EdisonWang >Priority: Minor > > Currently we can do subexpression elimination for conditional expressions > when the subexpression is common across all {{{}branchGroups{}}}. In fact, we > can further improve this when there are common expressions between > {{alwaysEvaluatedInputs}} and {{{}branchGroups{}}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
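A hedged illustration of the proposed improvement (run in spark-shell, where spark is predefined; the table t and the function udf are hypothetical): udf(a) appears in the always-evaluated predicate of the first WHEN clause and again inside the branch values, so it could be shared between alwaysEvaluatedInputs and branchGroups.

```scala
// udf(a) is evaluated for the first WHEN predicate (always evaluated)
// and again in both branch values; the improvement described above could
// compute it once and reuse the result across the two groups.
spark.sql("SELECT CASE WHEN udf(a) > 0 THEN udf(a) ELSE -udf(a) END FROM t")
```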
[jira] [Created] (SPARK-39249) Improve subexpression elimination for conditional expressions
EdisonWang created SPARK-39249: -- Summary: Improve subexpression elimination for conditional expressions Key: SPARK-39249 URL: https://issues.apache.org/jira/browse/SPARK-39249 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39002) Support pushing down StringEndsWith/Contains to Parquet so that we can leverage dictionary filtering
EdisonWang created SPARK-39002: -- Summary: Support pushing down StringEndsWith/Contains to Parquet so that we can leverage dictionary filtering Key: SPARK-39002 URL: https://issues.apache.org/jira/browse/SPARK-39002 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: EdisonWang Support pushing down StringEndsWith/StringContains to Parquet so that we can leverage parquet dictionary filtering -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
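A hedged sketch of queries that would benefit (spark-shell; the path and column names are illustrative): these predicates translate to StringEndsWith/StringContains data source filters, and once pushed to Parquet, dictionary filtering can skip row groups whose dictionaries contain no matching value.

```scala
import org.apache.spark.sql.functions.col

// Predicates of this shape become StringEndsWith/StringContains filters;
// with pushdown, row groups whose dictionary has no match are skipped.
val df = spark.read.parquet("/data/events")
df.filter(col("path").endsWith(".log")).count()
df.filter(col("message").contains("timeout")).count()
```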
[jira] [Created] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened
EdisonWang created SPARK-38160: -- Summary: Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened Key: SPARK-38160 URL: https://issues.apache.org/jira/browse/SPARK-38160 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: EdisonWang When we do a shuffle on indeterminate expressions such as rand, and a ShuffleFetchFailed happens, we may get an incorrect result since only the failed map tasks are retried. We try to fix this by retrying all upstream map tasks in this situation. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
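A hedged sketch of the failure mode (spark-shell; the table t is hypothetical): repartitioning by a nondeterministic expression makes the map output indeterminate.

```scala
// rand() yields different values when a map task is rerun after a fetch
// failure, so re-executed rows can land in different reduce partitions
// than in the first attempt, producing duplicated or lost rows.
spark.sql("SELECT * FROM t DISTRIBUTE BY rand()")
```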
[jira] [Updated] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste when using G1GC
[ https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-37593: --- Description: Spark's tungsten memory model usually tries to allocate memory by one `page` each time, allocated as long[pageSizeBytes/8] in HeapMemoryAllocator.allocate. Remember that a Java long array needs an extra object header (usually 16 bytes on a 64-bit system), so the bytes actually allocated are pageSize+16. Assume that the G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every allocation needs 4M+16 bytes of memory, two regions are used, with one region occupied by only 16 bytes. This wastes about 50% of the memory. It can happen under different combinations of G1HeapRegionSize (varying from 1M to 32M) and pageSizeBytes (varying from 1M to 64M). was: As we may know, a phenomenon called humongous allocations exists in G1GC when allocations are larger than 50% of the region size. Spark's tungsten memory model usually tries to allocate memory by one `page` each time, allocated as long[pageSizeBytes/8] in HeapMemoryAllocator.allocate. Remember that a Java long array needs an extra object header (usually 16 bytes on a 64-bit system), so the bytes actually allocated are pageSize+16. Assume that the G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every allocation needs 4M+16 bytes of memory, two regions are used, with one region occupied by only 16 bytes. This wastes about 50% of the memory. It can happen under different combinations of G1HeapRegionSize (varying from 1M to 32M) and pageSizeBytes (varying from 1M to 64M). > Optimize HeapMemoryAllocator to avoid memory waste when using G1GC > -- > > Key: SPARK-37593 > URL: https://issues.apache.org/jira/browse/SPARK-37593 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor > Fix For: 3.3.0 > > > Spark's tungsten memory model usually tries to allocate memory by one `page` > each time, allocated as long[pageSizeBytes/8] in > HeapMemoryAllocator.allocate. > Remember that a Java long array needs an extra object header (usually 16 bytes > on a 64-bit system), so the bytes actually allocated are pageSize+16. > Assume that the G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since > every allocation needs 4M+16 bytes of memory, two regions are used, with > one region occupied by only 16 bytes. This wastes about 50% of the memory. > It can happen under different combinations of G1HeapRegionSize (varying from > 1M to 32M) and pageSizeBytes (varying from 1M to 64M). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
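A minimal sketch of the arithmetic described above, using the report's stated assumptions (4 MB G1 regions, a 4 MB Spark page, and a 16-byte array header on a 64-bit JVM):

```scala
// What HeapMemoryAllocator effectively allocates for one page:
val pageSizeBytes = 4L * 1024 * 1024
val page = new Array[Long]((pageSizeBytes / 8).toInt) // 4 MB of longs
val headerBytes = 16L                                 // long[] object header
val totalBytes = pageSizeBytes + headerBytes          // 4 MB + 16 B
// 4 MB + 16 B is a humongous allocation spanning two 4 MB G1 regions,
// so almost all of the second region (~4 MB - 16 B) is wasted.
```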
[jira] [Updated] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste when using G1GC
[ https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-37593: --- Summary: Optimize HeapMemoryAllocator to avoid memory waste when using G1GC (was: Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC) > Optimize HeapMemoryAllocator to avoid memory waste when using G1GC > -- > > Key: SPARK-37593 > URL: https://issues.apache.org/jira/browse/SPARK-37593 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor > Fix For: 3.3.0 > > > As we may know, a phenomenon called humongous allocations exists in G1GC when > allocations are larger than 50% of the region size. > Spark's tungsten memory model usually tries to allocate memory by one `page` > each time, allocated as long[pageSizeBytes/8] in > HeapMemoryAllocator.allocate. > Remember that a Java long array needs an extra object header (usually 16 bytes > on a 64-bit system), so the bytes actually allocated are pageSize+16. > Assume that the G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since > every allocation needs 4M+16 bytes of memory, two regions are used, with > one region occupied by only 16 bytes. This wastes about 50% of the memory. > It can happen under different combinations of G1HeapRegionSize (varying from > 1M to 32M) and pageSizeBytes (varying from 1M to 64M). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37593) Optimize HeapMmeoryAllocator to avoid memory waste in humongous allocation when using G1GC
EdisonWang created SPARK-37593: -- Summary: Optimize HeapMmeoryAllocator to avoid memory waste in humongous allocation when using G1GC Key: SPARK-37593 URL: https://issues.apache.org/jira/browse/SPARK-37593 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.3.0 Reporter: EdisonWang Fix For: 3.3.0 As we may know, a phenomenon called humongous allocations exists in G1GC when allocations are larger than 50% of the region size. Spark's tungsten memory model usually tries to allocate memory by one `page` each time, allocated as long[pageSizeBytes/8] in HeapMemoryAllocator.allocate. Remember that a Java long array needs an extra object header (usually 16 bytes on a 64-bit system), so the bytes actually allocated are pageSize+16. Assume that the G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every allocation needs 4M+16 bytes of memory, two regions are used, with one region occupied by only 16 bytes. This wastes about 50% of the memory. It can happen under different combinations of G1HeapRegionSize (varying from 1M to 32M) and pageSizeBytes (varying from 1M to 64M). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC
[ https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-37593: --- Summary: Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC (was: Optimize HeapMmeoryAllocator to avoid memory waste in humongous allocation when using G1GC) > Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation > when using G1GC > -- > > Key: SPARK-37593 > URL: https://issues.apache.org/jira/browse/SPARK-37593 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: EdisonWang >Priority: Minor > Fix For: 3.3.0 > > > As we may know, a phenomenon called humongous allocations exists in G1GC when > allocations are larger than 50% of the region size. > Spark's tungsten memory model usually tries to allocate memory by one `page` > each time, allocated as long[pageSizeBytes/8] in > HeapMemoryAllocator.allocate. > Remember that a Java long array needs an extra object header (usually 16 bytes > on a 64-bit system), so the bytes actually allocated are pageSize+16. > Assume that the G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since > every allocation needs 4M+16 bytes of memory, two regions are used, with > one region occupied by only 16 bytes. This wastes about 50% of the memory. > It can happen under different combinations of G1HeapRegionSize (varying from > 1M to 32M) and pageSizeBytes (varying from 1M to 64M). -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34819) MapType supports orderable semantics
[ https://issues.apache.org/jira/browse/SPARK-34819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-34819: --- Issue Type: New Feature (was: Bug) > MapType supports orderable semantics > > > Key: SPARK-34819 > URL: https://issues.apache.org/jira/browse/SPARK-34819 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.1 >Reporter: EdisonWang >Priority: Minor > > Comparable/orderable semantics for map types are useful in some scenarios, and > they are implemented in Hive/Presto. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34819) MapType supports orderable semantics
EdisonWang created SPARK-34819: -- Summary: MapType supports orderable semantics Key: SPARK-34819 URL: https://issues.apache.org/jira/browse/SPARK-34819 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1 Reporter: EdisonWang Comparable/orderable semantics for map types are useful in some scenarios, and they are implemented in Hive/Presto. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
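A hedged illustration (spark-shell; the table t with a map column m is hypothetical): with orderable map semantics, as in Hive/Presto, both statements below would be allowed, whereas Spark currently rejects them because MapType is not an orderable data type.

```scala
// Today both fail analysis because map is not an orderable data type;
// this feature request would make them legal.
spark.sql("SELECT m FROM t ORDER BY m")
spark.sql("SELECT m, count(*) FROM t GROUP BY m")
```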
[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly
[ https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-34634: --- Affects Version/s: 2.4.0 2.4.1 2.4.2 2.4.3 2.4.4 2.4.5 2.4.6 2.4.7 > Self-join with script transformation failed to resolve attribute correctly > -- > > Key: SPARK-34634 > URL: https://issues.apache.org/jira/browse/SPARK-34634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, > 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > > To reproduce, > {code:java} > // code placeholder > create temporary view t as select * from values 0, 1, 2 as t(a); > WITH temp AS ( > SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t > ) > SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b > {code} > > Spark will throw AnalysisException > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly
[ https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-34634: --- Description: To reproduce, ``` create temporary view t as select * from values 0, 1, 2 as t(a); WITH temp AS ( SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t ) SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b ```, Spark will throw AnalysisException > Self-join with script transformation failed to resolve attribute correctly > -- > > Key: SPARK-34634 > URL: https://issues.apache.org/jira/browse/SPARK-34634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > > To reproduce, > ``` > create temporary view t as select * from values 0, 1, 2 as t(a); > WITH temp AS ( > SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t > ) > SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b > ```, > > Spark will throw AnalysisException > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly
[ https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-34634: --- Description: To reproduce, {code:java} // code placeholder create temporary view t as select * from values 0, 1, 2 as t(a); WITH temp AS ( SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t ) SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b {code} Spark will throw AnalysisException was: To reproduce, {code:java} // code placeholder create temporary view t as select * from values 0, 1, 2 as t(a); WITH temp AS ( SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t ) SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b {code} ``` ```, Spark will throw AnalysisException > Self-join with script transformation failed to resolve attribute correctly > -- > > Key: SPARK-34634 > URL: https://issues.apache.org/jira/browse/SPARK-34634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > > To reproduce, > {code:java} > // code placeholder > create temporary view t as select * from values 0, 1, 2 as t(a); > WITH temp AS ( > SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t > ) > SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b > {code} > > Spark will throw AnalysisException > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly
[ https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-34634: --- Description: To reproduce, {code:java} // code placeholder create temporary view t as select * from values 0, 1, 2 as t(a); WITH temp AS ( SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t ) SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b {code} ``` ```, Spark will throw AnalysisException was: To reproduce, ``` create temporary view t as select * from values 0, 1, 2 as t(a); WITH temp AS ( SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t ) SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b ```, Spark will throw AnalysisException > Self-join with script transformation failed to resolve attribute correctly > -- > > Key: SPARK-34634 > URL: https://issues.apache.org/jira/browse/SPARK-34634 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1 >Reporter: EdisonWang >Assignee: Apache Spark >Priority: Minor > > To reproduce, > {code:java} > // code placeholder > create temporary view t as select * from values 0, 1, 2 as t(a); > WITH temp AS ( > SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t > ) > SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b > {code} > ``` > > ```, > > Spark will throw AnalysisException > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly
EdisonWang created SPARK-34634: -- Summary: Self-join with script transformation failed to resolve attribute correctly Key: SPARK-34634 URL: https://issues.apache.org/jira/browse/SPARK-34634 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34633) Self-join with script transformation failed to resolve attribute correctly
EdisonWang created SPARK-34633: -- Summary: Self-join with script transformation failed to resolve attribute correctly Key: SPARK-34633 URL: https://issues.apache.org/jira/browse/SPARK-34633 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33306) Timezone ID is needed when casting from Date to String
[ https://issues.apache.org/jira/browse/SPARK-33306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-33306: --- Description: A simple way to reproduce this is ``` spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled scala> sql(""" select a.d1 from (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a join (select concat('2000-01-0', id) as d2 from range(1, 2)) b on a.d1 = b.d2 """).show ``` it will throw ``` java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287) ``` was: A simple way to reproduce this is ``` spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled >> sql(""" select a.d1 from (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a join (select concat('2000-01-0', id) as d2 from range(1, 2)) b on a.d1 = b.d2 """).show ``` it will throw ``` java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287) ``` > Timezone ID is needed when casting from Date to String > > > Key: SPARK-33306 > URL: https://issues.apache.org/jira/browse/SPARK-33306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Major > > A simple way to reproduce this is > ``` > spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled > scala> sql(""" > select a.d1 from > (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a > join > (select concat('2000-01-0', id) as d2 from range(1, 2)) b > on a.d1 = b.d2 > """).show > ``` > > it will throw > ``` > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56) > at > org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253) > at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287) > at > org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33306) Timezone ID is needed when casting from Date to String
[ https://issues.apache.org/jira/browse/SPARK-33306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-33306: --- Description: A simple way to reproduce this is ``` spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled >> sql(""" select a.d1 from (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a join (select concat('2000-01-0', id) as d2 from range(1, 2)) b on a.d1 = b.d2 """).show ``` it will throw ``` java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287) at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287) ``` > Timezone ID is needed when casting from Date to String > > > Key: SPARK-33306 > URL: https://issues.apache.org/jira/browse/SPARK-33306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Major > > A simple way to reproduce this is > ``` > spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled > >> sql(""" > select a.d1 from > (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a > join > (select concat('2000-01-0', id) as d2 from range(1, 2)) b > on a.d1 = b.d2 > """).show > ``` > > it will throw > ``` > java.util.NoSuchElementException: None.get > at scala.None$.get(Option.scala:529) > at scala.None$.get(Option.scala:527) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56) > at > org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56) > at > org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253) > at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253) > at > org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287) > at > org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287) > ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33306) Timezone ID is needed when casting from Date to String
EdisonWang created SPARK-33306: -- Summary: Timezone ID is needed when casting from Date to String Key: SPARK-33306 URL: https://issues.apache.org/jira/browse/SPARK-33306 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese characters correctly
[ https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-32559: --- Description: The trim logic in the Cast expression introduced in [https://github.com/apache/spark/pull/26622] will trim Chinese characters unexpectedly. For example, the SQL select cast("1中文" as float) gives 1 instead of null was: The trim logic in the Cast expression introduced in [https://github.com/apache/spark/pull/26622] will trim Chinese characters unexpectedly. For example, !image-2020-08-06-17-01-48-646.png! > Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese characters > correctly > --- > > Key: SPARK-32559 > URL: https://issues.apache.org/jira/browse/SPARK-32559 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > > The trim logic in the Cast expression introduced in > [https://github.com/apache/spark/pull/26622] will trim Chinese characters > unexpectedly. > For example, the SQL select cast("1中文" as float) gives 1 instead of null > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
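The report's reproduction as a spark-shell one-liner (per the description, affected versions returned 1 because the trim logic stripped the Chinese characters; the expected result is null):

```scala
// Expected: null, since the whole string "1中文" is not a valid float;
// on affected versions the cast trimmed "中文" and returned 1.0.
spark.sql("""SELECT CAST("1中文" AS float)""").show()
```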
[jira] [Created] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese characters correctly
EdisonWang created SPARK-32559: -- Summary: Fix the trim logic in UTF8String.toInt/toLong that didn't handle Chinese characters correctly Key: SPARK-32559 URL: https://issues.apache.org/jira/browse/SPARK-32559 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang The trim logic in the Cast expression introduced in [https://github.com/apache/spark/pull/26622] will trim Chinese characters unexpectedly. For example, !image-2020-08-06-17-01-48-646.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31952) The metric of MemoryBytesSpill is incorrect when doing Aggregate
[ https://issues.apache.org/jira/browse/SPARK-31952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-31952: --- Description: When doing Aggregate and a spill occurs, the Spill(memory) metric is zero while the Spill(disk) metric is correct. As shown below !image-2020-06-10-16-35-58-002.png! > The metric of MemoryBytesSpill is incorrect when doing Aggregate > > > Key: SPARK-31952 > URL: https://issues.apache.org/jira/browse/SPARK-31952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > Attachments: image-2020-06-10-16-35-58-002.png > > > When doing Aggregate and a spill occurs, the Spill(memory) metric is zero while > the Spill(disk) metric is correct. As shown below > !image-2020-06-10-16-35-58-002.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31952) The metric of MemoryBytesSpill is incorrect when doing Aggregate
[ https://issues.apache.org/jira/browse/SPARK-31952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-31952: --- Attachment: image-2020-06-10-16-35-58-002.png > The metric of MemoryBytesSpill is incorrect when doing Aggregate > > > Key: SPARK-31952 > URL: https://issues.apache.org/jira/browse/SPARK-31952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > Attachments: image-2020-06-10-16-35-58-002.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31952) Fix incorrect MemoryBytesSpill metric when doing Aggregate
EdisonWang created SPARK-31952: -- Summary: Fix incorrect MemoryBytesSpill metric when doing Aggregate Key: SPARK-31952 URL: https://issues.apache.org/jira/browse/SPARK-31952 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31952) The metric of MemoryBytesSpill is incorrect when doing Aggregate
[ https://issues.apache.org/jira/browse/SPARK-31952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-31952: --- Summary: The metric of MemoryBytesSpill is incorrect when doing Aggregate (was: Fix incorrect MemoryBytesSpill metric when doing Aggregate) > The metric of MemoryBytesSpill is incorrect when doing Aggregate > > > Key: SPARK-31952 > URL: https://issues.apache.org/jira/browse/SPARK-31952 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30806) Evaluate once per group in UnboundedWindowFunctionFrame
EdisonWang created SPARK-30806: -- Summary: Evaluate once per group in UnboundedWindowFunctionFrame Key: SPARK-30806 URL: https://issues.apache.org/jira/browse/SPARK-30806 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
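A hedged illustration of the summary above (spark-shell; the table t with columns k and v is hypothetical): an aggregate over an unbounded frame has the same value for every row of a partition, so UnboundedWindowFunctionFrame only needs to evaluate it once per group rather than once per row.

```scala
// Without ORDER BY, the frame is the entire partition, so max(v) is a
// per-group constant that can be computed once and reused for all rows.
spark.sql("SELECT k, v, max(v) OVER (PARTITION BY k) AS max_v FROM t")
```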
[jira] [Commented] (SPARK-28332) SQLMetric wrong initValue
[ https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002148#comment-17002148 ] EdisonWang commented on SPARK-28332: I've taken it [~cloud_fan] > SQLMetric wrong initValue > -- > > Key: SPARK-28332 > URL: https://issues.apache.org/jira/browse/SPARK-28332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Song Jun >Priority: Minor > Fix For: 3.0.0 > > > Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set > to -1. > If there is a ShuffleMapStage with lots of Tasks that read 0 bytes of data, > these tasks will send the metric (its value still being the initValue of > -1) to the Driver; the Driver then merges the metrics for this Stage in > DAGScheduler.updateAccumulators, which causes the merged metric value of > this Stage to be negative. > This is incorrect; we should set the initValue to 0. > The same applies to SQLMetrics.createTimingMetric. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
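A minimal sketch of the merge problem described in the issue: every task that reads no data reports the initial value -1, so the driver-side sum for the stage goes negative instead of staying at 0.

```scala
// Three tasks of a ShuffleMapStage that each read 0 bytes still carry the
// initial value -1; merging by summation yields -3, a nonsensical size.
val perTaskBytesRead = Seq(-1L, -1L, -1L)
val merged = perTaskBytesRead.sum // -3 instead of 0
```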
[jira] [Created] (SPARK-30088) Adaptive execution should convert SortMergeJoin to BroadcastJoin when plan generates empty result
EdisonWang created SPARK-30088: -- Summary: Adaptive execution should convert SortMergeJoin to BroadcastJoin when plan generates empty result Key: SPARK-30088 URL: https://issues.apache.org/jira/browse/SPARK-30088 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang Adaptive execution tries to convert SortMergeJoin to BroadcastJoin by checking the `dataSize` metric of the spark plan. However, if the spark plan generates an empty result, the `dataSize` metric stays at -1 (SQLMetric's initial value), which leads the following check to return false ``` private def canBroadcast(plan: LogicalPlan): Boolean = { plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <= conf.autoBroadcastJoinThreshold } ``` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29918) RecordBinaryComparator should check endianness when comparing by long
[ https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-29918: --- Labels: correctness (was: ) > RecordBinaryComparator should check endianness when comparing by long > > > Key: SPARK-29918 > URL: https://issues.apache.org/jira/browse/SPARK-29918 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > Labels: correctness > > If the architecture supports unaligned access or the offset is 8-byte aligned, > RecordBinaryComparator compares 8 bytes at a time by reading them as a > long. Otherwise, it compares byte by byte. > However, on a little-endian machine, the result of comparing by a long value > and comparing byte by byte may be different. If the architectures in a yarn > cluster differ (some are unaligned-access capable while others are not), then > the order of two records after sorting is undetermined, which will result > in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29918) RecordBinaryComparator should check endianness when comparing by long
EdisonWang created SPARK-29918: -- Summary: RecordBinaryComparator should check endianness when comparing by long Key: SPARK-29918 URL: https://issues.apache.org/jira/browse/SPARK-29918 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang If the architecture supports unaligned access or the offset is 8-byte aligned, RecordBinaryComparator compares 8 bytes at a time by reading them as a long. Otherwise, it compares byte by byte. However, on a little-endian machine, the result of comparing by a long value and comparing byte by byte may be different. If the architectures in a yarn cluster differ (some are unaligned-access capable while others are not), then the order of two records after sorting is undetermined, which will result in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
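A hedged sketch of why the two strategies can disagree on a little-endian machine (byte values chosen for illustration):

```scala
val a = Array[Byte](1, 0, 0, 0, 0, 0, 0, 0)
val b = Array[Byte](0, 0, 0, 0, 0, 0, 0, 2)
// Byte by byte (lexicographic): a > b, because a(0) = 1 > b(0) = 0.
// Read as little-endian longs:  a = 1, b = 2L << 56, so a < b.
// Mixing the two strategies across machines in one cluster therefore
// makes the sort order of identical data nondeterministic.
```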
[jira] [Resolved] (SPARK-27789) Use stopEarly in codegen of ColumnarBatchScan
[ https://issues.apache.org/jira/browse/SPARK-27789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang resolved SPARK-27789. Resolution: Not A Problem > Use stopEarly in codegen of ColumnarBatchScan > - > > Key: SPARK-27789 > URL: https://issues.apache.org/jira/browse/SPARK-27789 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > > Suppose we have a hive table created with > ```sql("create table parquet_test (id int) using parquet")```, and our query > sql is `select id from parquet_test limit 10`. > With `spark.sql.hive.convertMetastoreParquet` set to false, the sql > execution goes into `InputAdapter`; in its codegen, it can use `stopEarly` > to accelerate the local limit. > But if we set `spark.sql.hive.convertMetastoreParquet` to true, the sql > execution goes into `ColumnarBatchScan`, which does not optimize the local limit. > In this patch, we use `stopEarly` in `ColumnarBatchScan` as well -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29343) Eliminate sorts without limit in the subquery of Join/Aggregation
EdisonWang created SPARK-29343: -- Summary: Eliminate sorts without limit in the subquery of Join/Aggregation Key: SPARK-29343 URL: https://issues.apache.org/jira/browse/SPARK-29343 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang A {{Sort}} without a {{Limit}} operator in a {{Join/GroupBy}} subquery is useless. For example, {{select count(1) from (select a from test1 order by a)}} is equal to {{select count(1) from (select a from test1)}}, and {{select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b}} is equal to {{select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b}}. Removing the useless {{Sort}} operator can improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28123) String Functions: Add support btrim
[ https://issues.apache.org/jira/browse/SPARK-28123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899349#comment-16899349 ] EdisonWang commented on SPARK-28123: Seems it is the same as trim() on both sides? > String Functions: Add support btrim > --- > > Key: SPARK-28123 > URL: https://issues.apache.org/jira/browse/SPARK-28123 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Function||Return Type||Description||Example||Result|| > |{{btrim(string bytea, bytes bytea)}}|{{bytea}}|Remove the longest string containing only bytes appearing in _bytes_ from the start and end of _string_|{{btrim('\000trim\001'::bytea, '\000\001'::bytea)}}|{{trim}}| > More details: https://www.postgresql.org/docs/11/functions-binarystring.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28257) Use ConfigEntry for hardcoded configs in SQL module
EdisonWang created SPARK-28257: -- Summary: Use ConfigEntry for hardcoded configs in SQL module Key: SPARK-28257 URL: https://issues.apache.org/jira/browse/SPARK-28257 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: EdisonWang Use ConfigEntry for hardcoded configs in SQL module -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27789) Use stopEarly in codegen of ColumnarBatchScan
EdisonWang created SPARK-27789: -- Summary: Use stopEarly in codegen of ColumnarBatchScan Key: SPARK-27789 URL: https://issues.apache.org/jira/browse/SPARK-27789 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.3, 2.4.2, 2.4.1, 2.4.0 Reporter: EdisonWang Suppose we have a hive table created with ```sql("create table parquet_test (id int) using parquet")```, and our query sql is `select id from parquet_test limit 10`. With `spark.sql.hive.convertMetastoreParquet` set to false, the sql execution goes into `InputAdapter`; in its codegen, it can use `stopEarly` to accelerate the local limit. But if we set `spark.sql.hive.convertMetastoreParquet` to true, the sql execution goes into `ColumnarBatchScan`, which does not optimize the local limit. In this patch, we use `stopEarly` in `ColumnarBatchScan` as well -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-26500) Add conf to support ignoring hdfs data locality
[ https://issues.apache.org/jira/browse/SPARK-26500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang closed SPARK-26500. -- > Add conf to support ignoring hdfs data locality > - > > Key: SPARK-26500 > URL: https://issues.apache.org/jira/browse/SPARK-26500 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Trivial > > When reading a large hive table/directory with thousands of files, it will > cost up to several minutes or even hours to calculate data locality for each > split in the driver, while executors sit idle. > This situation is even worse when running in SparkThriftServer mode, because > handleJobSubmitted (which calls getPreferedLocation) is handled in a single > thread. One big sql will block all the following sqls. > At the same time, most companies' internal networks use gigabit network > cards, so it is acceptable to read data without locality. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27232) Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0
[ https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27232: --- Summary: Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0 (was: Skip to get file block location if locality is ignored) > Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0 > -- > > Key: SPARK-27232 > URL: https://issues.apache.org/jira/browse/SPARK-27232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27232) Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0
[ https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27232: --- Description: `InMemoryFileIndex` needs to request file block location information in order to do locality scheduling in `TaskSetManager`. Usually this is a time-consuming task. For example, in our production env there are 24 partitions, with 149925 files and 83TB in total. It costs about 10 minutes to request file block locations before submitting a spark job. Even though I set `spark.sql.sources.parallelPartitionDiscovery.threshold` to 24 to parallelize it, it still needs 2 minutes. Anyway, this is a waste if we don't care about the locality of files (for example, when storage and computation are separated). So there should be a conf to control whether we need to send `getFileBlockLocations` requests to the HDFS NN. If the user sets `spark.locality.wait` to 0, file block location information is meaningless. Here in this PR, if `spark.locality.wait` is set to 0, it will not request file location information anymore, which will save several seconds to minutes. > Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0 > -- > > Key: SPARK-27232 > URL: https://issues.apache.org/jira/browse/SPARK-27232 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Minor > > `InMemoryFileIndex` needs to request file block location information in order > to do locality scheduling in `TaskSetManager`. > Usually this is a time-consuming task. For example, in our production env there > are 24 partitions, with 149925 files and 83TB in total. It costs about > 10 minutes to request file block locations before submitting a spark job. Even > though I set `spark.sql.sources.parallelPartitionDiscovery.threshold` to 24 > to parallelize it, it still needs 2 minutes. > Anyway, this is a waste if we don't care about the locality of files (for > example, when storage and computation are separated). > So there should be a conf to control whether we need to send > `getFileBlockLocations` requests to the HDFS NN. If the user sets `spark.locality.wait` > to 0, file block location information is meaningless. > Here in this PR, if `spark.locality.wait` is set to 0, it will not request > file location information anymore, which will save several seconds to minutes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
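A hedged sketch of how the change would be exercised (the path is illustrative, and the skip-locations behavior is this issue's proposal, not documented behavior):

```scala
import org.apache.spark.sql.SparkSession

// With the locality wait at zero, preferred locations are never honored,
// so file listing can skip the costly getFileBlockLocations calls.
val spark = SparkSession.builder()
  .config("spark.locality.wait", "0")
  .getOrCreate()
spark.read.parquet("hdfs:///warehouse/big_table").count()
```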
[jira] [Created] (SPARK-27232) Skip to get file block location if locality is ignored
EdisonWang created SPARK-27232: -- Summary: Skip to get file block location if locality is ignored Key: SPARK-27232 URL: https://issues.apache.org/jira/browse/SPARK-27232 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: EdisonWang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27202) Update comments to keep them consistent with the code
EdisonWang created SPARK-27202: -- Summary: Update comments to keep them consistent with the code Key: SPARK-27202 URL: https://issues.apache.org/jira/browse/SPARK-27202 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: EdisonWang Update comments in InMemoryFileIndex.scala to keep them consistent with the code -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27079) Fix typo & Remove useless imports
EdisonWang created SPARK-27079: -- Summary: Fix typo & Remove useless imports Key: SPARK-27079 URL: https://issues.apache.org/jira/browse/SPARK-27079 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: EdisonWang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27033: --- Affects Version/s: (was: 3.0.0) 2.4.0 > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Minor > > Currently, filters like "select * from table where a + 1 >= 3" cannot be > pushed down; this optimizer rule can convert it to "select * from table where a >= > 3 - 1", which then becomes "select * from table where a >= 2". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27033: --- Description: Currently, filters like "select * from table where a + 1 >= 3" cannot be pushed down; this optimizer rule can convert it to "select * from table where a >= 3 - 1", which then becomes "select * from table where a >= 2". (was: _emphasized text_) > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > > Currently, filters like "select * from table where a + 1 >= 3" cannot be > pushed down; this optimizer rule can convert it to "select * from table where a >= > 3 - 1", which then becomes "select * from table where a >= 2". -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
[ https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-27033: --- Description: _emphasized text_ > Add rule to optimize binary comparisons to its push down format > --- > > Key: SPARK-27033 > URL: https://issues.apache.org/jira/browse/SPARK-27033 > Project: Spark > Issue Type: Improvement > Components: Optimizer >Affects Versions: 3.0.0 >Reporter: EdisonWang >Priority: Minor > > _emphasized text_ -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27033) Add rule to optimize binary comparisons to its push down format
EdisonWang created SPARK-27033: -- Summary: Add rule to optimize binary comparisons to its push down format Key: SPARK-27033 URL: https://issues.apache.org/jira/browse/SPARK-27033 Project: Spark Issue Type: Improvement Components: Optimizer Affects Versions: 3.0.0 Reporter: EdisonWang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26544) escape string when serializing map/array make it a valid json (keep alignment with hive)
[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-26544: --- Summary: escape string when serializing map/array make it a valid json (keep alignment with hive) (was: the string serialized from map/array type is not a valid json (while hive is)) > escape string when serializing map/array make it a valid json (keep alignment > with hive) > -- > > Key: SPARK-26544 > URL: https://issues.apache.org/jira/browse/SPARK-26544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Major > > When reading a hive table with a map/array type, the string serialized by the > thrift server is not valid json, while hive's is. > For example, when selecting a field whose type is map, the spark > thrift server returns > > {code:java} > {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"} > {code} > > while the hive thriftserver returns > > {code:java} > {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"} > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26544) escape string when serializing map/array to make it a valid json (keep alignment with hive)
[ https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang updated SPARK-26544: --- Summary: escape string when serializing map/array to make it a valid json (keep alignment with hive) (was: escape string when serializing map/array make it a valid json (keep alignment with hive)) > escape string when serializing map/array to make it a valid json (keep > alignment with hive) > - > > Key: SPARK-26544 > URL: https://issues.apache.org/jira/browse/SPARK-26544 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Major > > When reading a hive table with a map/array type, the string serialized by the > thrift server is not valid json, while hive's is. > For example, when selecting a field whose type is map, the spark > thrift server returns > > {code:java} > {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"} > {code} > > while the hive thriftserver returns > > {code:java} > {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"} > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26544) the string serialized from map/array type is not a valid json (while hive is)
EdisonWang created SPARK-26544: -- Summary: the string serialized from map/array type is not a valid json (while hive is) Key: SPARK-26544 URL: https://issues.apache.org/jira/browse/SPARK-26544 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: EdisonWang When reading a hive table with a map/array type, the string serialized by the thrift server is not valid json, while hive's is. For example, when selecting a field whose type is map, the spark thrift server returns {code:java} {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"} {code} while the hive thriftserver returns {code:java} {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"} {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26500) Add conf to support ignoring hdfs data locality
[ https://issues.apache.org/jira/browse/SPARK-26500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] EdisonWang resolved SPARK-26500. Resolution: Not A Problem > Add conf to support ignoring hdfs data locality > - > > Key: SPARK-26500 > URL: https://issues.apache.org/jira/browse/SPARK-26500 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: EdisonWang >Priority: Trivial > > When reading a large hive table/directory with thousands of files, it will > cost up to several minutes or even hours to calculate data locality for each > split in the driver, while executors sit idle. > This situation is even worse when running in SparkThriftServer mode, because > handleJobSubmitted (which calls getPreferedLocation) is handled in a single > thread. One big sql will block all the following sqls. > At the same time, most companies' internal networks use gigabit network > cards, so it is acceptable to read data without locality. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26500) Add conf to support ignoring hdfs data locality
EdisonWang created SPARK-26500: -- Summary: Add conf to support ignoring hdfs data locality Key: SPARK-26500 URL: https://issues.apache.org/jira/browse/SPARK-26500 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.4.0 Reporter: EdisonWang When reading a large hive table/directory with thousands of files, it will cost up to several minutes or even hours to calculate data locality for each split in the driver, while executors sit idle. This situation is even worse when running in SparkThriftServer mode, because handleJobSubmitted (which calls getPreferedLocation) is handled in a single thread. One big sql will block all the following sqls. At the same time, most companies' internal networks use gigabit network cards, so it is acceptable to read data without locality. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org