[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
[ https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25728: - External issue ID: https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing > SPIP: Structured Intermediate Representation (Tungsten IR) for generating > Java code > --- > > Key: SPARK-25728 > URL: https://issues.apache.org/jira/browse/SPARK-25728 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 > Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code
Kazuaki Ishizaki created SPARK-25728: Summary: SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code Key: SPARK-25728 URL: https://issues.apache.org/jira/browse/SPARK-25728 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform
Kazuaki Ishizaki created ARROW-3476: --- Summary: [Java] mvn test in memory fails on a big-endian platform Key: ARROW-3476 URL: https://issues.apache.org/jira/browse/ARROW-3476 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Kazuaki Ishizaki On a big-endian platform, {{mvn test}} in the {{java/memory}} module fails due to an assertion. In the {{TestEndianess.testLittleEndian}} test suite, the assertion is triggered while allocating a {{RootAllocator}}. {code} $ uname -a Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC 2016 ppc64 ppc64 ppc64 GNU/Linux $ arch ppc64 $ cd java/memory $ mvn test [INFO] Scanning for projects... [INFO] [INFO] [INFO] Building Arrow Memory 0.12.0-SNAPSHOT [INFO] [INFO] ... [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 s - in org.apache.arrow.memory.TestAccountant [INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 s - in org.apache.arrow.memory.TestLowCostIdentityHashMap [INFO] Running org.apache.arrow.memory.TestBaseAllocator [ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 s <<< FAILURE! - in org.apache.arrow.memory.TestEndianess [ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess) Time elapsed: 0.313 s <<< ERROR! java.lang.ExceptionInInitializerError at org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian systems. at org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31) [ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: 0.055 s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator ... {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
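A minimal sketch of the kind of endianness guard that trips here, using only the JDK's {{ByteOrder}} API; this is not Arrow's actual initialization code, just an illustration of why the allocation fails on a big-endian ppc64 machine.
{code}
import java.nio.ByteOrder

// Arrow's allocator initialization asserts little-endianness; on a big-endian
// platform a check like this throws, which is what the test log above shows.
if (ByteOrder.nativeOrder() != ByteOrder.LITTLE_ENDIAN) {
  throw new IllegalStateException("Arrow only runs on LittleEndian systems.")
}
{code}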
[jira] [Resolved] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs
[ https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-25497. -- Resolution: Fixed Fix Version/s: 3.0.0 > limit operation within whole stage codegen should not consume all the inputs > > > Key: SPARK-25497 > URL: https://issues.apache.org/jira/browse/SPARK-25497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > This issue was discovered during https://github.com/apache/spark/pull/21738 . > It turns out that limit is not whole-stage-codegened correctly and always > consume all the inputs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
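A hedged sketch of the behavior addressed above, assuming a spark-shell with an active {{spark}} session (the range size is arbitrary): with whole-stage codegen enabled, the generated code for the stage should stop after the limit is reached instead of consuming every input row.
{code}
// With whole-stage codegen enabled (the default), the limit should let the stage
// stop after producing 5 rows instead of scanning the whole 100M-row input.
spark.conf.set("spark.sql.codegen.wholeStage", "true")
val df = spark.range(100000000L).selectExpr("id", "id % 100 as k")
df.limit(5).collect()
{code}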
[jira] [Assigned] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs
[ https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reassigned SPARK-25497: Assignee: Wenchen Fan > limit operation within whole stage codegen should not consume all the inputs > > > Key: SPARK-25497 > URL: https://issues.apache.org/jira/browse/SPARK-25497 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > This issue was discovered during https://github.com/apache/spark/pull/21738 . > It turns out that limit is not whole-stage-codegened correctly and always > consume all the inputs -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344 ] Kazuaki Ishizaki edited comment on SPARK-25538 at 10/1/18 5:21 PM: --- This test case does not print {{63}} using master branch. {code} test("test2") { val df = spark.read.parquet("file:///SPARK-25538-repro") val c1 = df.distinct.count val c2 = df.sort("col_0").distinct.count val c3 = df.withColumnRenamed("col_0", "new").distinct.count val c0 = df.count print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n") } c1=64, c2=73, c3=64, c0=123 {code} was (Author: kiszk): This test case does not print {{63}}. {code} test("test2") { val df = spark.read.parquet("file:///SPARK-25538-repro") val c1 = df.distinct.count val c2 = df.sort("col_0").distinct.count val c3 = df.withColumnRenamed("col_0", "new").distinct.count val c0 = df.count print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n") } c1=64, c2=73, c3=64, c0=123 {code} > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Blocker > Labels: correctness > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. > Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344 ] Kazuaki Ishizaki commented on SPARK-25538: -- This test case does not print {{63}}. {code} test("test2") { val df = spark.read.parquet("file:///SPARK-25538-repro") val c1 = df.distinct.count val c2 = df.sort("col_0").distinct.count val c3 = df.withColumnRenamed("col_0", "new").distinct.count val c0 = df.count print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n") } c1=64, c2=73, c3=64, c0=123 {code} > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Blocker > Labels: correctness > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. > Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633568#comment-16633568 ] Kazuaki Ishizaki commented on SPARK-25538: -- Thank you. I will check it tonight in Japan. > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > Attachments: SPARK-25538-repro.tgz > > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. > Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631605#comment-16631605 ] Kazuaki Ishizaki commented on SPARK-25538: -- Thanks for uploading the schema. I have looked at the schema, but I am still not sure about the cause of this problem. I would appreciate it if you could provide input data that reproduces the problem. > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. > Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
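A hypothetical sketch of how a shareable reproducer could be produced without the original data; the column names, types, and row count below are placeholders, not the reporter's actual schema.
{code}
// Write a small synthetic Parquet dataset and re-run the failing pattern on it.
val synthetic = spark.range(123)
  .selectExpr("cast(id % 7 as string) as col_0", "id % 13 as col_1")
synthetic.write.mode("overwrite").parquet("file:///tmp/SPARK-25538-repro")

val df = spark.read.parquet("file:///tmp/SPARK-25538-repro")
println((df.count, df.distinct.count, df.sort("col_0").distinct.count))
{code}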
[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()
[ https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629281#comment-16629281 ] Kazuaki Ishizaki commented on SPARK-25538: -- Hi [~Steven Rand], would it be possible to share the schema of this DataFrame? > incorrect row counts after distinct() > - > > Key: SPARK-25538 > URL: https://issues.apache.org/jira/browse/SPARK-25538 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: Reproduced on a Centos7 VM and from source in Intellij > on OS X. >Reporter: Steven Rand >Priority: Major > Labels: correctness > > It appears that {{df.distinct.count}} can return incorrect values after > SPARK-23713. It's possible that other operations are affected as well; > {{distinct}} just happens to be the one that we noticed. I believe that this > issue was introduced by SPARK-23713 because I can't reproduce it until that > commit, and I've been able to reproduce it after that commit as well as with > {{tags/v2.4.0-rc1}}. > Below are example spark-shell sessions to illustrate the problem. > Unfortunately the data used in these examples can't be uploaded to this Jira > ticket. I'll try to create test data which also reproduces the issue, and > will upload that if I'm able to do so. > Example from Spark 2.3.1, which behaves correctly: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 115 > {code} > Example from Spark 2.4.0-rc1, which returns different output: > {code} > scala> val df = spark.read.parquet("hdfs:///data") > df: org.apache.spark.sql.DataFrame = [] > scala> df.count > res0: Long = 123 > scala> df.distinct.count > res1: Long = 116 > scala> df.sort("col_0").distinct.count > res2: Long = 123 > scala> df.withColumnRenamed("col_0", "newName").distinct.count > res3: Long = 115 > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25487) Refactor PrimitiveArrayBenchmark
[ https://issues.apache.org/jira/browse/SPARK-25487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-25487. -- Resolution: Fixed Assignee: Chenxiao Mao Fix Version/s: 2.5.0 Issue resolved by pull request 22497 https://github.com/apache/spark/pull/22497 > Refactor PrimitiveArrayBenchmark > > > Key: SPARK-25487 > URL: https://issues.apache.org/jira/browse/SPARK-25487 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Chenxiao Mao >Assignee: Chenxiao Mao >Priority: Major > Fix For: 2.5.0 > > > Refactor PrimitiveArrayBenchmark to use main method and print the output as a > separate file. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
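A minimal sketch of the refactoring direction described above: drive the benchmark from a {{main}} method and write the result to a separate file rather than stdout. The object name, output file name, and workload are illustrative, not the actual PrimitiveArrayBenchmark code.
{code}
import java.io.{File, PrintWriter}

object PrimitiveArrayBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val out = new PrintWriter(new File("PrimitiveArrayBenchmark-results.txt"))
    try {
      val start = System.nanoTime()
      // placeholder workload standing in for the real primitive-array measurements
      val sum = Array.tabulate(1 << 20)(i => i.toLong).sum
      out.println(s"sum=$sum, elapsed=${(System.nanoTime() - start) / 1000} us")
    } finally {
      out.close()
    }
  }
}
{code}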
[jira] [Commented] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code
[ https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621416#comment-16621416 ] Kazuaki Ishizaki commented on SPARK-25432: -- nit: description seems to be in {{environment}} now. > Consider if using standard getOrCreate from PySpark into JVM SparkSession > would simplify code > - > > Key: SPARK-25432 > URL: https://issues.apache.org/jira/browse/SPARK-25432 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.0.0 > Environment: As we saw in > [https://github.com/apache/spark/pull/22295/files] the logic can get a bit > out of sync. It _might_ make sense to try and simplify this so there's less > duplicated logic in Python & Scala around session set up. >Reporter: holdenk >Priority: Trivial > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance
[ https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25437: - Comment: was deleted (was: Is such a feature for major release, not for maintenance release?) > Using OpenHashMap replace HashMap improve Encoder Performance > - > > Key: SPARK-25437 > URL: https://issues.apache.org/jira/browse/SPARK-25437 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance
[ https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617102#comment-16617102 ] Kazuaki Ishizaki commented on SPARK-25437: -- Is such a change intended for a major release rather than for a maintenance release? > Using OpenHashMap replace HashMap improve Encoder Performance > - > > Key: SPARK-25437 > URL: https://issues.apache.org/jira/browse/SPARK-25437 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: wangjiaochun >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
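A sketch of the proposed substitution. Note that {{org.apache.spark.util.collection.OpenHashMap}} is Spark-internal ({{private[spark]}}), so this only illustrates how code inside Spark could swap it in for {{scala.collection.mutable.HashMap}}; the keys and merge function are arbitrary.
{code}
import org.apache.spark.util.collection.OpenHashMap

// Open-addressing map specialized for primitive values; avoids per-entry wrappers.
val counts = new OpenHashMap[String, Long]()
counts.changeValue("a", 1L, _ + 1L)  // insert the default or merge with the existing value
counts.changeValue("a", 1L, _ + 1L)
assert(counts("a") == 2L)
{code}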
[jira] [Created] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method
Kazuaki Ishizaki created SPARK-25444: Summary: Refactor GenArrayData.genCodeToCreateArrayData() method Key: SPARK-25444 URL: https://issues.apache.org/jira/browse/SPARK-25444 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.5.0 Reporter: Kazuaki Ishizaki {{GenArrayData.genCodeToCreateArrayData()}} generated Java code to create a temporary Java array to create {{ArrayData}}. It can be eliminated by using {{ArrayData.createArrayData}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611717#comment-16611717 ] Kazuaki Ishizaki commented on SPARK-20184: -- In {{branch-2.4}}, we still see the performance degradation compared to w/o codegen {code:java} OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz SPARK-20184: Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative codegen = T 2915 / 3204 0.0 2915001883.0 1.0X codegen = F 1178 / 1368 0.0 1178020462.0 2.5X {code} > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang >Priority: Major > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
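For reference, a hedged sketch of how the two rows in the table above can be compared by toggling the whole-stage codegen flag around the reporter's query; {{aggtable}} and the shortened column list are placeholders taken from the issue description.
{code}
val query = "SELECT sum(COUNTER_57), DIM_1, DIM_2, DIM_3 FROM aggtable GROUP BY DIM_1, DIM_2, DIM_3 LIMIT 100"

spark.conf.set("spark.sql.codegen.wholeStage", "true")   // codegen = T
spark.time(spark.sql(query).collect())

spark.conf.set("spark.sql.codegen.wholeStage", "false")  // codegen = F
spark.time(spark.sql(query).collect())
{code}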
[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611633#comment-16611633 ] Kazuaki Ishizaki commented on SPARK-16196: -- [~cloud_fan] The PR in this Jira entry proposes two fixes # Read data in a table cache directly from the columnar storage # Generate code to build a table cache We have already implemented 1, but we have not implemented 2. yet. Let us address 2. in the next release. > Optimize in-memory scan performance using ColumnarBatches > - > > Key: SPARK-16196 > URL: https://issues.apache.org/jira/browse/SPARK-16196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Major > > A simple benchmark such as the following reveals inefficiencies in the > existing in-memory scan implementation: > {code} > spark.range(N) > .selectExpr("id", "floor(rand() * 1) as k") > .createOrReplaceTempView("test") > val ds = spark.sql("select count(k), count(id) from test").cache() > ds.collect() > ds.collect() > {code} > There are many reasons why caching is slow. The biggest is that compression > takes a long time. The second is that there are a lot of virtual function > calls in this hot code path since the rows are processed using iterators. > Further, the rows are converted to and from ByteBuffers, which are slow to > read in general. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen
[ https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611502#comment-16611502 ] Kazuaki Ishizaki commented on SPARK-20184: -- Although I created another JIRA https://issues.apache.org/jira/browse/SPARK-20479, there is no PR. Let me check the performance in 2.4 branch. > performance regression for complex/long sql when enable whole stage codegen > --- > > Key: SPARK-20184 > URL: https://issues.apache.org/jira/browse/SPARK-20184 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0, 2.1.0 >Reporter: Fei Wang >Priority: Major > > The performance of following SQL get much worse in spark 2.x in contrast > with codegen off. > SELECT >sum(COUNTER_57) > ,sum(COUNTER_71) > ,sum(COUNTER_3) > ,sum(COUNTER_70) > ,sum(COUNTER_66) > ,sum(COUNTER_75) > ,sum(COUNTER_69) > ,sum(COUNTER_55) > ,sum(COUNTER_63) > ,sum(COUNTER_68) > ,sum(COUNTER_56) > ,sum(COUNTER_37) > ,sum(COUNTER_51) > ,sum(COUNTER_42) > ,sum(COUNTER_43) > ,sum(COUNTER_1) > ,sum(COUNTER_76) > ,sum(COUNTER_54) > ,sum(COUNTER_44) > ,sum(COUNTER_46) > ,DIM_1 > ,DIM_2 > ,DIM_3 > FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100; > Num of rows of aggtable is about 3500. > whole stage codegen on(spark.sql.codegen.wholeStage = true):40s > whole stage codegen off(spark.sql.codegen.wholeStage = false):6s > After some analysis i think this is related to the huge java method(a java > method of thousand lines) which generated by codegen. > And If i config -XX:-DontCompileHugeMethods the performance get much > better(about 7s). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches
[ https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611494#comment-16611494 ] Kazuaki Ishizaki commented on SPARK-16196: -- I see. I will check this. > Optimize in-memory scan performance using ColumnarBatches > - > > Key: SPARK-16196 > URL: https://issues.apache.org/jira/browse/SPARK-16196 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Major > > A simple benchmark such as the following reveals inefficiencies in the > existing in-memory scan implementation: > {code} > spark.range(N) > .selectExpr("id", "floor(rand() * 1) as k") > .createOrReplaceTempView("test") > val ds = spark.sql("select count(k), count(id) from test").cache() > ds.collect() > ds.collect() > {code} > There are many reasons why caching is slow. The biggest is that compression > takes a long time. The second is that there are a lot of virtual function > calls in this hot code path since the rows are processed using iterators. > Further, the rows are converted to and from ByteBuffers, which are slow to > read in general. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result
Kazuaki Ishizaki created SPARK-25388: Summary: checkEvaluation may miss incorrect nullable of DataType in the result Key: SPARK-25388 URL: https://issues.apache.org/jira/browse/SPARK-25388 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0 Reporter: Kazuaki Ishizaki The current {{checkEvaluation}} may fail to detect an incorrect nullable in the result {{DataType}} within {{checkEvaluationWithUnsafeProjection}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604120#comment-16604120 ] Kazuaki Ishizaki commented on SPARK-25317: -- While investigating this issue, I realized that the Java bytecode size of a method can affect performance. I guess that this issue is related to method inlining. However, I have not found the root cause yet. [~mgaido] Would it be possible for you to submit a PR to fix this issue? > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method
Kazuaki Ishizaki created SPARK-25338: Summary: Several tests miss calling super.afterAll() in their afterAll() method Key: SPARK-25338 URL: https://issues.apache.org/jira/browse/SPARK-25338 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki The following tests under {{external}} may not call {{super.afterAll()}} in their {{afterAll()}} method. {code} external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
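The intended pattern is sketched below: suite-specific teardown must not skip the parent's cleanup, so the call to {{super.afterAll()}} goes in a {{finally}} block. The body of the {{try}} is a placeholder for whatever each suite currently does.
{code}
override def afterAll(): Unit = {
  try {
    // suite-specific teardown (stop embedded Kafka/Flume/Kinesis helpers, delete temp dirs, ...)
  } finally {
    super.afterAll()
  }
}
{code}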
Re: Spark JIRA tags clarification and management
Of course, we would like to eliminate all of the following tags: "flaky" or "flakytest". Kazuaki Ishizaki From: Hyukjin Kwon To: dev Cc: Xiao Li , Wenchen Fan Date: 2018/09/04 14:20 Subject: Re: Spark JIRA tags clarification and management Thanks, Reynold. +Adding Xiao and Wenchen, who I have often seen use tags. Would you have some tags you think we should document more? On Tue, Sep 4, 2018 at 9:27 AM, Reynold Xin wrote: The most common ones we do are: releasenotes correctness On Mon, Sep 3, 2018 at 6:23 PM Hyukjin Kwon wrote: Thanks, Felix and Reynold. Would you guys mind if I ask this of anyone who uses the tags frequently? Frankly, I don't use the tags often .. On Tue, Sep 4, 2018 at 2:04 AM, Felix Cheung wrote: +1 good idea. There are a few for organizing but some also are critical to the release process, like rel note. Would be good to clarify. From: Reynold Xin Sent: Sunday, September 2, 2018 11:50 PM To: Hyukjin Kwon Cc: dev Subject: Re: Spark JIRA tags clarification and management It would be great to document the common ones. On Sun, Sep 2, 2018 at 11:49 PM Hyukjin Kwon wrote: Hi all, I recently noticed that tags are often used to classify JIRAs. I was thinking we had better explicitly document what tags are used and explain which tag means what. For instance, we documented "Contributing to JIRA Maintenance" at https://spark.apache.org/contributing.html before (thanks, Sean Owen) - this helps me a lot in managing JIRAs, and they are good standards for, at least, me to take action. It doesn't necessarily mean we should clarify everything, but it might be good to document the tags used often. We can leave this for the committers' scope as well, if that's preferred - I don't have a strong opinion on this. My point is, can we clarify this in the contributing guide so that we can reduce the maintenance cost?
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602567#comment-16602567 ] Kazuaki Ishizaki commented on SPARK-25317: -- I confirmed this performance difference even after adding warmup. Let me investigate furthermore. > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602506#comment-16602506 ] Kazuaki Ishizaki commented on SPARK-25317: -- Let me run this on 2.3 and master. One question. This benchmark does not have an warm up loop. In other words, this benchmark may include execution time on an interpreter, too. Is this behavior intentional? > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performance regression when calculating hash code for UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > In my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
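A minimal sketch of the warm-up question raised in the comment above, reproducing the setup from the quoted benchmark: run the hash enough times before starting the timer so that the measured loop executes JIT-compiled rather than interpreted code. The warm-up iteration count is arbitrary.
{code}
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

val str = UTF8String.fromString("b" * 10001)
val numIter = 10

// Warm-up: give the JIT a chance to compile hashUTF8String before measuring.
for (_ <- 0 until 10000) {
  Murmur3_x86_32.hashUTF8String(str, 0)
}

val start = System.nanoTime
for (_ <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
{code}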
[jira] [Updated] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25317: - Description: eThere is a performance regression when calculating hash code for UTF8String: {code:java} test("hashing") { import org.apache.spark.unsafe.hash.Murmur3_x86_32 import org.apache.spark.unsafe.types.UTF8String val hasher = new Murmur3_x86_32(0) val str = UTF8String.fromString("b" * 10001) val numIter = 10 val start = System.nanoTime for (i <- 0 until numIter) { Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) } val duration = (System.nanoTime() - start) / 1000 / numIter println(s"duration $duration us") } {code} To run this test in 2.3, we need to add {code:java} public static int hashUTF8String(UTF8String str, int seed) { return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed); } {code} to `Murmur3_x86_32` In my laptop, the result for master vs 2.3 is: 120 us vs 40 us was: There is a performance regression when calculating hash code for UTF8String: {code} test("hashing") { import org.apache.spark.unsafe.hash.Murmur3_x86_32 import org.apache.spark.unsafe.types.UTF8String val hasher = new Murmur3_x86_32(0) val str = UTF8String.fromString("b" * 10001) val numIter = 10 val start = System.nanoTime for (i <- 0 until numIter) { Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) Murmur3_x86_32.hashUTF8String(str, 0) 
Murmur3_x86_32.hashUTF8String(str, 0) } val duration = (System.nanoTime() - start) / 1000 / numIter println(s"duration $duration us") } {code} To run this test in 2.3, we need to add {code} public static int hashUTF8String(UTF8String str, int seed) { return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), str.numBytes(), seed); } {code} to `Murmur3_x86_32` In my laptop, the result for master vs 2.3 is: 120 us vs 40 us > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > eThere is a performan
[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25310: - Description: Invoking {{ArraysOverlap}} function with non-nullable array type throws the following error in the code generation phase. {code:java} Code generation of arrays_overlap([1,2,3], [4,5,3]) failed: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an rvalue at com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) at com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) at com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) at com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) at com.google.common.cache.LocalCache.get(LocalCache.java:4000) at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260) {code} > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > Invoking {{ArraysOverlap}} function with non-nullable array type throws the > following error in the code generation phase. 
> {code:java} > Code generation of arrays_overlap([1,2,3], [4,5,3]) failed: > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 56, Column 11: Expression "isNull_0" is not an rvalue > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > at > com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293) > at > com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135) > at > com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410) > at > com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380) > at > com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342) > at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257) > at com.google.common.cache.LocalCache.get(LocalCache.java:4000) > at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004) > at > com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874) > at > org.apache.spar
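A hedged reproduction sketch in spark-shell, building the same non-nullable literal arrays as the expression shown in the error above (assumes an active {{spark}} session):
{code}
import org.apache.spark.sql.functions._

// Literal arrays are non-nullable, which is the case that hits the codegen failure.
spark.range(1)
  .select(arrays_overlap(array(lit(1), lit(2), lit(3)), array(lit(4), lit(5), lit(3))))
  .show()
{code}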
[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException
[ https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25310: - Summary: ArraysOverlap may throw a CompileException (was: ArraysOverlap throws an Exception) > ArraysOverlap may throw a CompileException > -- > > Key: SPARK-25310 > URL: https://issues.apache.org/jira/browse/SPARK-25310 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25310) ArraysOverlap throws an Exception
Kazuaki Ishizaki created SPARK-25310: Summary: ArraysOverlap throws an Exception Key: SPARK-25310 URL: https://issues.apache.org/jira/browse/SPARK-25310 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25178) Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25178: - Summary: Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator (was: Use dummy name for xxxHashMapGenerator key/value schema field) > Directly ship the StructType objects of the keySchema / valueSchema for > xxxHashMapGenerator > --- > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names were being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field
[ https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587897#comment-16587897 ] Kazuaki Ishizaki commented on SPARK-25178: -- [~rednaxelafx] Thank you for opening a JIRA entry :) [~smilegator] I can take this. > Use dummy name for xxxHashMapGenerator key/value schema field > - > > Key: SPARK-25178 > URL: https://issues.apache.org/jira/browse/SPARK-25178 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > Following SPARK-18952 and SPARK-22273, this ticket proposes to change the > generated field name of the keySchema / valueSchema to a dummy name instead > of using {{key.name}}. > In previous discussion from SPARK-18952's PR [1], it was already suggested > that the field names were being used, so it's not worth capturing the strings > as reference objects here. Josh suggested merging the original fix as-is due > to backportability / pickability concerns. Now that we're coming up to a new > release, this can be revisited. > [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
[ https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25036: - Description: When compiling with sbt, the following errors occur: There are -two- three types: 1. {{ExprValue.isNull}} is compared with unexpected type. 2. {{match may not be exhaustive}} is detected at {{match}} 3. discarding unmoored doc comment The first one is more serious since it may also generate incorrect code in Spark 2.3. {code:java} [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): Boolean = (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: match may not be exhaustive. [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, ArrayData()), (_, _) [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: match may not be exhaustive. [error] It would fail on the following inputs: NewFunctionSpec(_, None, Some(_)), NewFunctionSpec(_, Some(_), None) [error] [warn] newFunction match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely always compare unequal [error] [warn] if (eval.isNull != "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: match may not be exhaustive. 
[error] It would fail on the following input: Schema((x: org.apache.spark.sql.types.DataType forSome x not in org.apache.spark.sql.types.StructType), _) [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (inputs.map(_.isNull).forall(_ == "false")) { [error] [warn] {code} {code:java} [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410: discarding unmoored doc comment [error] [warn] /** [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:441: discarding unmoored doc comment [error] [warn] /** [error] [warn] ... [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:440: discarding unmoored doc comment [error] [warn] /** [error] [warn] {code} was: When compiling with sbt, the following errors occur: There are two types: 1. {{ExprValue.isNull}} is compared with unexpected type. 1. {{match may not be exhaustive}} is detected at {{match}} The first one is more serious since it may also generate incorrect code in Spark 2.3. {code}
[jira] [Comment Edited] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
[ https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575137#comment-16575137 ] Kazuaki Ishizaki edited comment on SPARK-25036 at 8/9/18 5:05 PM: -- Another type of compilation error is found. Added the log to the description was (Author: kiszk): Another type of compilation error is found > Scala 2.12 issues: Compilation error with sbt > - > > Key: SPARK-25036 > URL: https://issues.apache.org/jira/browse/SPARK-25036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Kazuaki Ishizaki > Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > > When compiling with sbt, the following errors occur: > There are two types: > 1. {{ExprValue.isNull}} is compared with unexpected type. > 1. {{match may not be exhaustive}} is detected at {{match}} > The first one is more serious since it may also generate incorrect code in > Spark 2.3. > {code} > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): > Boolean = (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: > match may not be exhaustive. > [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, > ArrayData()), (_, _) > [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: > match may not be exhaustive. 
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, > Some(_)), NewFunctionSpec(_, Some(_), None) > [error] [warn] newFunction match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely always compare unequal > [error] [warn] if (eval.isNull != "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: > match may not be exhaustive. > [error] It would fail on the following input: Schema((x: > org.apache.spark.sql.types.DataType forSome x not in > org.apache.spark.sql.types.StructType), _) > [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn]
[jira] [Reopened] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
[ https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reopened SPARK-25036: -- Another type of compilation error is found > Scala 2.12 issues: Compilation error with sbt > - > > Key: SPARK-25036 > URL: https://issues.apache.org/jira/browse/SPARK-25036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: Kazuaki Ishizaki > Assignee: Kazuaki Ishizaki >Priority: Major > Fix For: 2.4.0 > > > When compiling with sbt, the following errors occur: > There are two types: > 1. {{ExprValue.isNull}} is compared with unexpected type. > 1. {{match may not be exhaustive}} is detected at {{match}} > The first one is more serious since it may also generate incorrect code in > Spark 2.3. > {code} > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): > Boolean = (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: > match may not be exhaustive. > [error] It would fail on the following inputs: (NumericValueInterval(_, _), > _), (_, NumericValueInterval(_, _)), (_, _) > [error] [warn] (r1, r2) match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: > match may not be exhaustive. > [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, > ArrayData()), (_, _) > [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: > match may not be exhaustive. 
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, > Some(_)), NewFunctionSpec(_, Some(_), None) > [error] [warn] newFunction match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely always compare unequal > [error] [warn] if (eval.isNull != "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (eval.isNull == "true") { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: > match may not be exhaustive. > [error] It would fail on the following input: Schema((x: > org.apache.spark.sql.types.DataType forSome x not in > org.apache.spark.sql.types.StructType), _) > [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] > match { > [error] [warn] > [error] [warn] > /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: > org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are > unrelated: they will most likely never compare equal > [error] [warn] if (inputs.map(_.isNull).forall(_ == "false")) { > [error] [warn] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
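For readers skimming the log above, a self-contained sketch of the first error type follows; {{ExprValueLike}} is a stand-in for Catalyst's {{ExprValue}}, not the real class, and the point is only that comparing it against a String can never be true, which is why Scala 2.12 flags the comparison.
{code:scala}
// Hypothetical stand-in for ExprValue: the String comparison compiles, but
// Scala 2.12 warns that the types are unrelated and the result is always
// false, which is the latent bug behind the "unrelated" warnings above.
final case class ExprValueLike(code: String)

object UnrelatedComparisonDemo extends App {
  val isNull = ExprValueLike("true")

  println(isNull == "true")                  // always false: String vs ExprValueLike
  println(isNull == ExprValueLike("true"))   // true: compare like with like
}
{code}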
[jira] [Commented] (SPARK-25059) Exception while executing an action on DataFrame that read Json
[ https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575129#comment-16575129 ] Kazuaki Ishizaki commented on SPARK-25059: -- Thank you for reporting the issue. Could you please try this using Spark 2.3? This is because the community extensively investigated and fixed these issues in Spark 2.3 > Exception while executing an action on DataFrame that read Json > --- > > Key: SPARK-25059 > URL: https://issues.apache.org/jira/browse/SPARK-25059 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.2.0 > Environment: AWS EMR 5.8.0 > Spark 2.2.0 > >Reporter: Kunal Goswami >Priority: Major > Labels: Spark-SQL > > When I try to read ~9600 Json files using > {noformat} > val test = spark.read.option("header", true).option("inferSchema", > true).json(paths: _*) {noformat} > > Any action on the above created data frame results in: > {noformat} > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V" > of class "org.apache.spark.sql.catalyst.expressions.Generat[73/1850] > pecificUnsafeProjection" grows beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949) > at org.codehaus.janino.CodeContext.write(CodeContext.java:839) > at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546) > at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762) > at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180) > at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151) > at > org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) > at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377) > at > org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436) > at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370) > at org.codehaus.janino.Java$Block.accept(Java.java:2471) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220) > at 
org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378) > at > org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370) > at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436) > at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376) > at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370) > at org.codehaus.janino.Java$Block.accept(Java.java:2471) > at org.codehaus.janino.
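As a hedged sketch of the failure mode and one possible mitigation (placeholder paths and schema, not the reporter's data, and not a confirmed fix for this ticket): supplying an explicit, trimmed-down schema avoids inference over thousands of files and keeps the generated unsafe projection much smaller than a very wide inferred schema would.
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object JsonWideSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("json-64kb-sketch").getOrCreate()

    val paths = Seq("/data/events/*.json")   // placeholder paths

    // Placeholder schema with only the columns actually needed downstream.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("name", StringType)
    ))

    val df = spark.read.schema(schema).json(paths: _*)
    println(df.count())                      // the kind of action that triggered the error
  }
}
{code}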
[jira] [Updated] (SPARK-25041) genjavadoc-plugin_0.10 is not found with sbt in scala-2.12
[ https://issues.apache.org/jira/browse/SPARK-25041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-25041: - Summary: genjavadoc-plugin_0.10 is not found with sbt in scala-2.12 (was: genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12) > genjavadoc-plugin_0.10 is not found with sbt in scala-2.12 > -- > > Key: SPARK-25041 > URL: https://issues.apache.org/jira/browse/SPARK-25041 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > When the master is build with sbt in scala-2.12, the following error occurs: > {code} > [warn]module not found: > com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10 > [warn] public: tried > [warn] > https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom > [warn] Maven2 Local: tried > [warn] > file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom > [warn] local: tried > [warn] > /gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml > [info] Resolving jline#jline;2.14.3 ... > [warn]:: > [warn]:: UNRESOLVED DEPENDENCIES :: > [warn]:: > [warn]:: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not > found > [warn]:: > [warn] > [warn]Note: Unresolved dependencies path: > [warn]com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 > (/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118) > [warn] +- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT > sbt.ResolveException: unresolved dependency: > com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found > at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320) > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191) > at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168) > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) > at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) > at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133) > at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57) > at sbt.IvySbt$$anon$4.call(Ivy.scala:65) > at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93) > at > xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78) > at > xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97) > at xsbt.boot.Using$.withResource(Using.scala:10) > at xsbt.boot.Using$.apply(Using.scala:9) > at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58) > at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48) > at xsbt.boot.Locks$.apply0(Locks.scala:31) > at xsbt.boot.Locks$.apply(Locks.scala:28) > at sbt.IvySbt.withDefaultLogger(Ivy.scala:65) > at sbt.IvySbt.withIvy(Ivy.scala:128) > at sbt.IvySbt.withIvy(Ivy.scala:125) > at sbt.IvySbt$Module.withModule(Ivy.scala:156) > at sbt.IvyActions$.updateEither(IvyActions.scala:168) > at > sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555) > at > sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551) > at > sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586) > at > sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584) > at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37) > at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589) > at 
sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583) > at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60) > at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606) > at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533) > at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485) > at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) > at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40) > at sbt.std.Transform$$anon$4.work(System.scala:63) > at > sbt.Execute$$anonfun
[jira] [Created] (SPARK-25041) genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12
Kazuaki Ishizaki created SPARK-25041: Summary: genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12 Key: SPARK-25041 URL: https://issues.apache.org/jira/browse/SPARK-25041 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki When the master is build with sbt in scala-2.12, the following error occurs: {code} [warn] module not found: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10 [warn] public: tried [warn] https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom [warn] Maven2 Local: tried [warn] file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom [warn] local: tried [warn] /gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml [info] Resolving jline#jline;2.14.3 ... [warn] :: [warn] :: UNRESOLVED DEPENDENCIES :: [warn] :: [warn] :: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found [warn] :: [warn] [warn] Note: Unresolved dependencies path: [warn] com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 (/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118) [warn]+- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT sbt.ResolveException: unresolved dependency: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320) at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191) at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156) at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133) at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57) at sbt.IvySbt$$anon$4.call(Ivy.scala:65) at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93) at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78) at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97) at xsbt.boot.Using$.withResource(Using.scala:10) at xsbt.boot.Using$.apply(Using.scala:9) at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58) at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48) at xsbt.boot.Locks$.apply0(Locks.scala:31) at xsbt.boot.Locks$.apply(Locks.scala:28) at sbt.IvySbt.withDefaultLogger(Ivy.scala:65) at sbt.IvySbt.withIvy(Ivy.scala:128) at sbt.IvySbt.withIvy(Ivy.scala:125) at sbt.IvySbt$Module.withModule(Ivy.scala:156) at sbt.IvyActions$.updateEither(IvyActions.scala:168) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555) at sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586) at sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584) at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589) at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583) at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60) at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533) at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485) at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47) at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40) at 
sbt.std.Transform$$anon$4.work(System.scala:63) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228) at sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228) at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17) at sbt.Execute.work(Execute.scala:237) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228) at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228) at sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159) at sbt.CompletionService$$anon$2.call(CompletionService.scala:28) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511
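For context, a hedged sbt sketch of how such a compiler plugin is typically declared; the setting and the version number below are illustrative, not the exact contents of Spark's SparkBuild.scala. Because the dependency is resolved against the full Scala version, the artifact name becomes genjavadoc-plugin_2.12.6, so a genjavadoc release that is not published for 2.12.6 cannot be resolved and the version has to be bumped to one that is.
{code:scala}
// build.sbt sketch (illustrative version): the plugin is cross-built per full
// Scala version, hence the genjavadoc-plugin_2.12.6 artifact name in the log.
libraryDependencies += compilerPlugin(
  "com.typesafe.genjavadoc" %% "genjavadoc-plugin" % "0.11" cross CrossVersion.full)
{code}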
[jira] [Created] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt
Kazuaki Ishizaki created SPARK-25036: Summary: Scala 2.12 issues: Compilation error with sbt Key: SPARK-25036 URL: https://issues.apache.org/jira/browse/SPARK-25036 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.4.0 Reporter: Kazuaki Ishizaki When compiling with sbt, the following errors occur: There are two types: 1. {{ExprValue.isNull}} is compared with unexpected type. 1. {{match may not be exhaustive}} is detected at {{match}} The first one is more serious since it may also generate incorrect code in Spark 2.3. {code} [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] def isIntersected(r1: ValueInterval, r2: ValueInterval): Boolean = (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79: match may not be exhaustive. [error] It would fail on the following inputs: (NumericValueInterval(_, _), _), (_, NumericValueInterval(_, _)), (_, _) [error] [warn] (r1, r2) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67: match may not be exhaustive. [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, ArrayData()), (_, _) [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470: match may not be exhaustive. [error] It would fail on the following inputs: NewFunctionSpec(_, None, Some(_)), NewFunctionSpec(_, Some(_), None) [error] [warn] newFunction match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely always compare unequal [error] [warn] if (eval.isNull != "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (eval.isNull == "true") { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709: match may not be exhaustive. 
[error] It would fail on the following input: Schema((x: org.apache.spark.sql.types.DataType forSome x not in org.apache.spark.sql.types.StructType), _) [error] [warn] def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] match { [error] [warn] [error] [warn] /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90: org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are unrelated: they will most likely never compare equal [error] [warn] if (inputs.map(_.isNull).forall(_ == "false")) { [error] [warn] {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
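A self-contained sketch of the second error type follows; the ADT below is hypothetical, not Spark's {{ValueInterval}}, but it shows why scalac 2.12 reports "match may not be exhaustive" and how a wildcard case covering the remaining combinations silences it.
{code:scala}
// Hypothetical ADT standing in for ValueInterval: without the wildcard case,
// scalac 2.12 warns that the match is not exhaustive, and the sbt build
// surfaces that warning as an error.
sealed trait Interval
case object EmptyInterval extends Interval
final case class NumericInterval(min: Double, max: Double) extends Interval

object ExhaustiveMatchDemo extends App {
  def isIntersected(r1: Interval, r2: Interval): Boolean = (r1, r2) match {
    case (NumericInterval(min1, max1), NumericInterval(min2, max2)) =>
      min1 <= max2 && min2 <= max1
    case _ =>                                // covers the remaining combinations
      false
  }

  println(isIntersected(NumericInterval(0, 5), NumericInterval(3, 9)))   // true
}
{code}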
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570440#comment-16570440 ] Kazuaki Ishizaki commented on SPARK-25029: -- [~srowen][~skonto] Thank you for your investigations while I am creating scala-2.12 environment ( I still get compilation errors with scala-2.12 using sbt.) I got the situation... It is related to {{default}} method. We may have to update a method lookup algorithm to consider {{default}} in janino. > Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods > ..." errors > --- > > Key: SPARK-25029 > URL: https://issues.apache.org/jira/browse/SPARK-25029 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > We actually still have some test failures in the Scala 2.12 build. There seem > to be two types. First are that some tests fail with "TaskNotSerializable" > because some code construct now captures a reference to scalatest's > AssertionHelper. Example: > {code:java} > - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode > *** FAILED *** java.io.NotSerializableException: > org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not > serializable (class: org.scalatest.Assertions$AssertionsHelper, value: > org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code} > These seem generally easy to fix by tweaking the test code. It's not clear if > something about closure cleaning in 2.12 could be improved to detect this > situation automatically; given that yet only a handful of tests fail for this > reason, it's unlikely to be a systemic problem. > > The other error is curioser. Janino fails to compile generate code in many > cases with errors like: > {code:java} > - encode/decode for seq of string: List(abc, xyz) *** FAILED *** > java.lang.RuntimeException: Error while encoding: > org.codehaus.janino.InternalCompilerException: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type{code} > > I include the full generated code that failed in one case below. There is no > {{size()}} in the generated code. It's got to be down to some difference in > Scala 2.12, potentially even a Janino problem. > > {code:java} > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Two non-abstract methods "public int > scala.collection.TraversableOnce.size()" have the same parameter types, > declaring type and return type > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342) > ... 
30 more > Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract > methods "public int scala.collection.TraversableOnce.size()" have the same > parameter types, declaring type and return type > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737) > at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070) > at org.codehaus
[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors
[ https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569621#comment-16569621 ] Kazuaki Ishizaki commented on SPARK-25029: -- [~srowen] I see. The following parts have the method. I will try to see it. My first feeling is that the problem may be in the scala collection library or catalyst Java code generator. {code} ... /* 146 */ final int length_1 = MapObjects_loopValue140.size(); ... /* 315 */ final int length_0 = MapObjects_loopValue140.size(); ... {code} > Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods > ..." errors > --- > > Key: SPARK-25029 > URL: https://issues.apache.org/jira/browse/SPARK-25029 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > We actually still have some test failures in the Scala 2.12 build. There seem > to be two types. First are that some tests fail with "TaskNotSerializable" > because some code construct now captures a reference to scalatest's > AssertionHelper. Example: > {code:java} > - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode > *** FAILED *** java.io.NotSerializableException: > org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not > serializable (class: org.scalatest.Assertions$AssertionsHelper, value: > org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code} > These seem generally easy to fix by tweaking the test code. It's not clear if > something about closure cleaning in 2.12 could be improved to detect this > situation automatically; given that yet only a handful of tests fail for this > reason, it's unlikely to be a systemic problem. > > The other error is curioser. Janino fails to compile generate code in many > cases with errors like: > {code:java} > - encode/decode for seq of string: List(abc, xyz) *** FAILED *** > java.lang.RuntimeException: Error while encoding: > org.codehaus.janino.InternalCompilerException: failed to compile: > org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": > Two non-abstract methods "public int scala.collection.TraversableOnce.size()" > have the same parameter types, declaring type and return type{code} > > I include the full generated code that failed in one case below. There is no > {{size()}} in the generated code. It's got to be down to some difference in > Scala 2.12, potentially even a Janino problem. > > {code:java} > Caused by: org.codehaus.janino.InternalCompilerException: Compiling > "GeneratedClass": Two non-abstract methods "public int > scala.collection.TraversableOnce.size()" have the same parameter types, > declaring type and return type > at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) > at > org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) > at > org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) > at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) > at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) > at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342) > ... 
30 more > Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract > methods "public int scala.collection.TraversableOnce.size()" have the same > parameter types, declaring type and return type > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112) > at > org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770) > at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672) > at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737) > at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097) > at > org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070) > at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902) > at org.codehaus.janino.UnitCompiler.compi
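As a rough illustration of the {{default}} method point above: under Scala 2.12, concrete trait methods are emitted as Java interface default methods, so a class can see the same non-abstract signature through more than one interface. The traits below are hypothetical, not scala.collection, and Janino's behaviour is inferred from the error message rather than verified against its source.
{code:scala}
// Hypothetical traits: in 2.12 both compile to Java interfaces that carry a
// non-abstract (default) `int size()` method, so a method lookup expecting a
// single non-abstract candidate per signature can get confused, even though
// scalac and javac resolve the call without ambiguity.
trait HasSize { def size(): Int = 0 }
trait Counted extends HasSize { override def size(): Int = 1 }

class Items extends Counted

object DefaultMethodDemo extends App {
  println(new Items().size())   // 1: linearization picks Counted's implementation
}
{code}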
[jira] [Created] (SPARK-24962) refactor CodeGenerator.createUnsafeArray
Kazuaki Ishizaki created SPARK-24962: Summary: refactor CodeGenerator.createUnsafeArray Key: SPARK-24962 URL: https://issues.apache.org/jira/browse/SPARK-24962 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki {{CodeGenerator.createUnsafeArray()}} generates code for allocating {{UnsafeArrayData}}. This method can be extended to generate code for allocating either {{UnsafeArrayData}} or {{GenericArrayData}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
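A minimal sketch of the refactoring idea only; the helper name, its signature, and the emitted strings below are placeholders rather than the real Catalyst {{CodeGenerator}} API.
{code:scala}
// Placeholder code-generation helper: a single entry point that can emit Java
// source for either array representation, instead of being hard-wired to
// UnsafeArrayData. The emitted snippets are deliberately schematic.
object ArrayAllocationSketch {
  sealed trait ArrayKind
  case object UnsafeArrayKind  extends ArrayKind
  case object GenericArrayKind extends ArrayKind

  def createArrayData(arrayName: String, numElements: String, kind: ArrayKind): String =
    kind match {
      case UnsafeArrayKind =>
        s"/* allocate an UnsafeArrayData named $arrayName for $numElements elements */"
      case GenericArrayKind =>
        s"Object[] ${arrayName}Values = new Object[$numElements];\n" +
        s"ArrayData $arrayName = new GenericArrayData(${arrayName}Values);"
    }
}
{code}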
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560600#comment-16560600 ] Kazuaki Ishizaki commented on SPARK-24895: -- [~ericfchang] Thank you very much for your suggestion. As the first step, I created [a PR|https://github.com/apache/spark/pull/21905] to upgrade maven. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4
Kazuaki Ishizaki created SPARK-24956: Summary: Upgrade maven from 3.3.9 to 3.5.4 Key: SPARK-24956 URL: https://issues.apache.org/jira/browse/SPARK-24956 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki Maven 3.3.9 looks pretty old. It would be good to upgrade it to the latest release. As suggested in SPARK-24895, the current Maven version runs into problems with some plugins. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559987#comment-16559987 ] Kazuaki Ishizaki commented on SPARK-24895: -- I see. Thank you very much. At first, I will try to make a PR to upgrade a maven. BTW, I have no idea to make sure maven central repo works well for now. > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
[ https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559974#comment-16559974 ] Kazuaki Ishizaki commented on SPARK-24895: -- [~yhuai] Thank you. BTW, how can I re-enable spotbugs without this problem? Do you have any suggestion? cc: [~hyukjin.kwon] > Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames > -- > > Key: SPARK-24895 > URL: https://issues.apache.org/jira/browse/SPARK-24895 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Eric Chang >Assignee: Eric Chang >Priority: Major > Fix For: 2.4.0 > > > Spark 2.4.0 has Maven build errors because artifacts uploaded to apache maven > repo has mismatched filenames: > {noformat} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce > (enforce-banned-dependencies) on project spark_2.4: Execution > enforce-banned-dependencies of goal > org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: > org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: > Could not resolve following dependencies: > [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not > resolve dependencies for project com.databricks:spark_2.4:pom:1: The > following artifacts could not be resolved: > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, > org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find > artifact > org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in > apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1] > {noformat} > > If you check the artifact metadata you will see the pom and jar files are > 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177: > {code:xml} > > org.apache.spark > spark-mllib-local_2.11 > 2.4.0-SNAPSHOT > > > 20180723.232411 > 177 > > 20180723232411 > > > jar > 2.4.0-20180723.232411-177 > 20180723232411 > > > pom > 2.4.0-20180723.232411-177 > 20180723232411 > > > tests > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > test-sources > jar > 2.4.0-20180723.232410-177 > 20180723232411 > > > > > {code} > > This behavior is very similar to this issue: > https://issues.apache.org/jira/browse/MDEPLOY-221 > Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy > 2.8.2 plugin, it is highly possible that we introduced a new plugin that > causes this. > The most recent addition is the spot-bugs plugin, which is known to have > incompatibilities with other plugins: > [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21] > We may want to try building without it to sanity check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time
[ https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559972#comment-16559972 ] Kazuaki Ishizaki commented on SPARK-24925: -- Do we need a new test case, or is there an existing test case that covers this PR? > input bytesRead metrics fluctuate from time to time > --- > > Key: SPARK-24925 > URL: https://issues.apache.org/jira/browse/SPARK-24925 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: yucai >Priority: Major > Attachments: bytesRead.gif > > > input bytesRead metrics fluctuate from time to time; it is worse when > pushdown is enabled. > Query > {code:java} > CREATE TABLE dev AS > SELECT > ... > FROM lstg_item cold, lstg_item_vrtn v > WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE) > ... > {code} > Issue > See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, > 54GB, 53GB ... -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24841) Memory leak in converting spark dataframe to pandas dataframe
[ https://issues.apache.org/jira/browse/SPARK-24841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552231#comment-16552231 ] Kazuaki Ishizaki commented on SPARK-24841: -- Thank you for reporting an issue with heap profiling. Would it be possible to post a standalone program that can reproduce this problem? > Memory leak in converting spark dataframe to pandas dataframe > - > > Key: SPARK-24841 > URL: https://issues.apache.org/jira/browse/SPARK-24841 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 > Environment: Running PySpark in standalone mode >Reporter: Piyush Seth >Priority: Minor > > I am running a continuous running application using PySpark. In one of the > operations I have to convert PySpark data frame to Pandas data frame using > toPandas API on pyspark driver. After running for a while I am getting > "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. > I tried running this in a loop and could see that the heap memory is > increasing continuously. When I ran jmap for the first time I had the > following top rows: > num #instances #bytes class name > -- > 1: 1757 411477568 [J > {color:#FF} *2: 124188 266323152 [C*{color} > 3: 167219 46821320 org.apache.spark.status.TaskDataWrapper > 4: 69683 27159536 [B > 5: 359278 8622672 java.lang.Long > 6: 221808 7097856 > java.util.concurrent.ConcurrentHashMap$Node > 7: 283771 6810504 scala.collection.immutable.$colon$colon > After running several iterations I had the following > num #instances #bytes class name > -- > {color:#FF} *1: 110760 3439887928 [C*{color} > 2: 698 411429088 [J > 3: 238096 6880 org.apache.spark.status.TaskDataWrapper > 4: 68819 24050520 [B > 5: 498308 11959392 java.lang.Long > 6: 292741 9367712 > java.util.concurrent.ConcurrentHashMap$Node > 7: 282878 6789072 scala.collection.immutable.$colon$colon -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24754) Minhash integer overflow
[ https://issues.apache.org/jira/browse/SPARK-24754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535657#comment-16535657 ] Kazuaki Ishizaki commented on SPARK-24754: -- In the test cases, we would appreciate it if you could compare the values with those produced by other implementations. > Minhash integer overflow > > > Key: SPARK-24754 > URL: https://issues.apache.org/jira/browse/SPARK-24754 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Jiayuan Ma >Priority: Minor > > Hash computation in MinHashLSHModel has an integer overflow bug. > https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [SPARK ML] Minhash integer overflow
Of course, the hash value can just be negative. I thought that it should be the result of a computation without overflow. When I checked another implementation, it performs computations with int. https://github.com/ALShum/MinHashLSH/blob/master/LSH.java#L89 To jiayuanm, who is on copy: did you compare the hash values generated by Spark with those generated by other implementations? Regards, Kazuaki Ishizaki From: Sean Owen To: jiayuanm Cc: dev@spark.apache.org Date: 2018/07/07 15:46 Subject: Re: [SPARK ML] Minhash integer overflow I think it probably still does its job; the hash value can just be negative. It is likely to be very slightly biased though. Because the intent doesn't seem to be to allow the overflow it's worth changing to use longs for the calculation. On Fri, Jul 6, 2018, 8:36 PM jiayuanm wrote: Hi everyone, I was playing around with LSH/Minhash module from spark ml module. I noticed that hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69 ). Since "a" and "b" are from a uniform distribution of [1, MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue, it's likely for the multiplication to cause Int overflow with a large sparse input vector. I wonder if this is a bug or intended. If it's a bug, one way to fix it is to compute hashes with Long and insert a couple of mod MinHashLSH.HASH_PRIME. Because MinHashLSH.HASH_PRIME is chosen to be smaller than sqrt(2^63 - 1), this won't overflow 64-bit integer. Another option is to use BigInteger. Let me know what you think. Thanks, Jiayuan -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
Re: [SPARK ML] Minhash integer overflow
Thank you for reporting this issue. I think this is a bug regarding integer overflow. IMHO, it would be good to compute hashes with Long. Would it be possible to create a JIRA entry? Do you want to submit a pull request, too? Regards, Kazuaki Ishizaki From: jiayuanm To: dev@spark.apache.org Date: 2018/07/07 10:36 Subject: [SPARK ML] Minhash integer overflow Hi everyone, I was playing around with the LSH/Minhash module from the spark ml module. I noticed that hash computation is done with Int (see https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69 ). Since "a" and "b" are from a uniform distribution of [1, MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue, it's likely for the multiplication to cause Int overflow with a large sparse input vector. I wonder if this is a bug or intended. If it's a bug, one way to fix it is to compute hashes with Long and insert a couple of mod MinHashLSH.HASH_PRIME operations. Because MinHashLSH.HASH_PRIME is chosen to be smaller than sqrt(2^63 - 1), this won't overflow a 64-bit integer. Another option is to use BigInteger. Let me know what you think. Thanks, Jiayuan -- Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/ - To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
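For illustration, the difference between the Int-based hash and the proposed Long-based one can be sketched in a few lines of plain Scala. This is not the actual MinHashLSH code; the coefficient, offset, element index, and the HASH_PRIME value below are assumed for the example only.

{code:scala}
object MinhashOverflowSketch {
  // Assumed prime close to Int.MaxValue, standing in for MinHashLSH.HASH_PRIME.
  val HASH_PRIME = 2038074743

  def main(args: Array[String]): Unit = {
    val a = 1500000000    // random coefficient drawn from [1, HASH_PRIME]
    val b = 12345         // random offset
    val elem = 2000000    // index of a non-zero element in a large sparse vector

    // Int arithmetic: a * elem wraps around, so the hash can come out negative.
    val intHash = 1 + (a * elem + b) % HASH_PRIME

    // Long arithmetic with an extra mod: every intermediate value stays below
    // HASH_PRIME^2 < 2^63 - 1, so nothing overflows and the result is in range.
    val longHash = (1L + (a.toLong * elem + b) % HASH_PRIME).toInt

    println(s"int hash  = $intHash")   // negative on this input
    println(s"long hash = $longHash")  // always in [1, HASH_PRIME]
  }
}
{code}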
[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
[ https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530640#comment-16530640 ] Kazuaki Ishizaki commented on SPARK-24579: -- I cannot see comments on the doc, too. > SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks > > > Key: SPARK-24579 > URL: https://issues.apache.org/jira/browse/SPARK-24579 > Project: Spark > Issue Type: Epic > Components: ML, PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Labels: Hydrogen > Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange > between Apache Spark and DL%2FAI Frameworks .pdf > > > (see attached SPIP pdf for more details) > At the crossroads of big data and AI, we see both the success of Apache Spark > as a unified > analytics engine and the rise of AI frameworks like TensorFlow and Apache > MXNet (incubating). > Both big data and AI are indispensable components to drive business > innovation and there have > been multiple attempts from both communities to bring them together. > We saw efforts from AI community to implement data solutions for AI > frameworks like tf.data and tf.Transform. However, with 50+ data sources and > built-in SQL, DataFrames, and Streaming features, Spark remains the community > choice for big data. This is why we saw many efforts to integrate DL/AI > frameworks with Spark to leverage its power, for example, TFRecords data > source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project > Hydrogen, this SPIP takes a different angle at Spark + AI unification. > None of the integrations are possible without exchanging data between Spark > and external DL/AI frameworks. And the performance matters. However, there > doesn’t exist a standard way to exchange data and hence implementation and > performance optimization fall into pieces. For example, TensorFlowOnSpark > uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and > save data and pass the RDD records to TensorFlow in Python. And TensorFrames > converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s > Java API. How can we reduce the complexity? > The proposal here is to standardize the data exchange interface (or format) > between Spark and DL/AI frameworks and optimize data conversion from/to this > interface. So DL/AI frameworks can leverage Spark to load data virtually > from anywhere without spending extra effort building complex data solutions, > like reading features from a production data warehouse or streaming model > inference. Spark users can use DL/AI frameworks without learning specific > data APIs implemented there. And developers from both sides can work on > performance optimizations independently given the interface itself doesn’t > introduce big overhead. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: [Help] Codegen Stage grows beyond 64 KB
If it is difficult to create the small stand alone program, another approach seems to attach everything (i.e. configuration, data, program, console output, log, history server data, etc.) As a log, the community would recommend the info log with "spark.sql.codegen.logging.maxLines=2147483647". The log has to include the all of the generated Java methods. The community may take more time to address this problem than the case with the small program. Best Regards, Kazuaki Ishizaki From: Aakash Basu To: Kazuaki Ishizaki Cc: vaquar khan , Eyal Zituny , user Date: 2018/06/21 01:29 Subject:Re: [Help] Codegen Stage grows beyond 64 KB Hi Kazuaki, It would be really difficult to produce a small S-A code to reproduce this problem because, I'm running through a big pipeline of feature engineering where I derive a lot of variables based on the present ones to kind of explode the size of the table by many folds. Then, when I do any kind of join, this error shoots up. I tried with wholeStage.codegen=false, but that errors out the entire program rather than running it with a lesser optimized code. Any suggestion on how I can proceed towards a JIRA entry for this? Thanks, Aakash. On Wed, Jun 20, 2018 at 9:41 PM, Kazuaki Ishizaki wrote: Spark 2.3 tried to split a large generated Java methods into small methods as possible. However, this report may remain places that generates a large method. Would it be possible to create a JIRA entry with a small stand alone program that can reproduce this problem? It would be very helpful that the community will address this problem. Best regards, Kazuaki Ishizaki From:vaquar khan To:Eyal Zituny Cc:Aakash Basu , user < user@spark.apache.org> Date:2018/06/18 01:57 Subject:Re: [Help] Codegen Stage grows beyond 64 KB Totally agreed with Eyal . The problem is that when Java programs generated using Catalyst from programs using DataFrame and Dataset are compiled into Java bytecode, the size of byte code of one method must not be 64 KB or more, This conflicts with the limitation of the Java class file, which is an exception that occurs. In order to avoid occurrence of an exception due to this restriction, within Spark, a solution is to split the methods that compile and make Java bytecode that is likely to be over 64 KB into multiple methods when Catalyst generates Java programs It has been done. Use persist or any other logical separation in pipeline. Regards, Vaquar khan On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny wrote: Hi Akash, such errors might appear in large spark pipelines, the root cause is a 64kb jvm limitation. the reason that your job isn't failing at the end is due to spark fallback - if code gen is failing, spark compiler will try to create the flow without the code gen (less optimized) if you do not want to see this error, you can either disable code gen using the flag: spark.sql.codegen.wholeStage= "false" or you can try to split your complex pipeline into several spark flows if possible hope that helps Eyal On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu wrote: Hi, I already went through it, that's one use case. I've a complex and very big pipeline of multiple jobs under one spark session. Not getting, on how to solve this, as it is happening over Logistic Regression and Random Forest models, which I'm just using from Spark ML package rather than doing anything by myself. Thanks, Aakash. On Sun 17 Jun, 2018, 8:21 AM vaquar khan, wrote: Hi Akash, Please check stackoverflow. 
https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe Regards, Vaquar khan On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu wrote: Hi guys, I'm getting an error when I'm feature engineering on 30+ columns to create about 200+ columns. It is not failing the job, but the ERROR shows. I want to know how can I avoid this. Spark - 2.3.1 Python - 3.6 Cluster Config - 1 Master - 32 GB RAM, 16 Cores 4 Slaves - 16 GB RAM, 8 Cores Input data - 8 partitions of parquet file with snappy compression. My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=2 --conf spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt Stack-Trace below - ERROR CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.Genera
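For readers hitting the same error, here is a minimal sketch of how the two settings mentioned in this thread could be applied when building a session; the application name is made up and the rest of the pipeline is omitted.

{code:scala}
import org.apache.spark.sql.SparkSession

object Codegen64KbWorkaroundSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")                          // or rely on spark-submit --master
      .appName("codegen-64kb-workaround")          // illustrative name
      // Disable whole-stage code generation (the less optimized fallback path),
      // as suggested earlier in the thread.
      .config("spark.sql.codegen.wholeStage", "false")
      // Keep the full generated Java source in the INFO log so it can be
      // attached to a report, as suggested above.
      .config("spark.sql.codegen.logging.maxLines", "2147483647")
      .getOrCreate()

    // ... build and run the feature-engineering pipeline here ...

    // Both settings are SQL configs, so they can also be toggled at runtime:
    spark.conf.set("spark.sql.codegen.wholeStage", "true")

    spark.stop()
  }
}
{code}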
Re: [Help] Codegen Stage grows beyond 64 KB
Spark 2.3 tried to split a large generated Java methods into small methods as possible. However, this report may remain places that generates a large method. Would it be possible to create a JIRA entry with a small stand alone program that can reproduce this problem? It would be very helpful that the community will address this problem. Best regards, Kazuaki Ishizaki From: vaquar khan To: Eyal Zituny Cc: Aakash Basu , user Date: 2018/06/18 01:57 Subject:Re: [Help] Codegen Stage grows beyond 64 KB Totally agreed with Eyal . The problem is that when Java programs generated using Catalyst from programs using DataFrame and Dataset are compiled into Java bytecode, the size of byte code of one method must not be 64 KB or more, This conflicts with the limitation of the Java class file, which is an exception that occurs. In order to avoid occurrence of an exception due to this restriction, within Spark, a solution is to split the methods that compile and make Java bytecode that is likely to be over 64 KB into multiple methods when Catalyst generates Java programs It has been done. Use persist or any other logical separation in pipeline. Regards, Vaquar khan On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny wrote: Hi Akash, such errors might appear in large spark pipelines, the root cause is a 64kb jvm limitation. the reason that your job isn't failing at the end is due to spark fallback - if code gen is failing, spark compiler will try to create the flow without the code gen (less optimized) if you do not want to see this error, you can either disable code gen using the flag: spark.sql.codegen.wholeStage= "false" or you can try to split your complex pipeline into several spark flows if possible hope that helps Eyal On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu wrote: Hi, I already went through it, that's one use case. I've a complex and very big pipeline of multiple jobs under one spark session. Not getting, on how to solve this, as it is happening over Logistic Regression and Random Forest models, which I'm just using from Spark ML package rather than doing anything by myself. Thanks, Aakash. On Sun 17 Jun, 2018, 8:21 AM vaquar khan, wrote: Hi Akash, Please check stackoverflow. https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe Regards, Vaquar khan On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu wrote: Hi guys, I'm getting an error when I'm feature engineering on 30+ columns to create about 200+ columns. It is not failing the job, but the ERROR shows. I want to know how can I avoid this. Spark - 2.3.1 Python - 3.6 Cluster Config - 1 Master - 32 GB RAM, 16 Cores 4 Slaves - 16 GB RAM, 8 Cores Input data - 8 partitions of parquet file with snappy compression. 
My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=2 --conf spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt Stack-Trace below - ERROR CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234) at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446) at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313) at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235) at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204) at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490) at org.spark_project.guava.cache.LocalCache$LoadingValueRefer
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517097#comment-16517097 ] Kazuaki Ishizaki commented on SPARK-24498: -- [~maropu] thank you, let us use this as a starting point. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process
[ https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509569#comment-16509569 ] Kazuaki Ishizaki commented on SPARK-24529: -- I am working on this. > Add spotbugs into maven build process > - > > Key: SPARK-24529 > URL: https://issues.apache.org/jira/browse/SPARK-24529 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > We will enable a Java bytecode check tool > [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at > multiplication. Due to a tool limitation, some other checks will also be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24529) Add spotbugs into maven build process
Kazuaki Ishizaki created SPARK-24529: Summary: Add spotbugs into maven build process Key: SPARK-24529 URL: https://issues.apache.org/jira/browse/SPARK-24529 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki We will enable a Java bytecode check tool [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at multiplication. Due to a tool limitation, some other checks will also be enabled. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen
[ https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506239#comment-16506239 ] Kazuaki Ishizaki commented on SPARK-24498: -- Hi [~smilegator] Definitely, I am interested in this task. I will investigate this issue. > Add JDK compiler for runtime codegen > > > Key: SPARK-24498 > URL: https://issues.apache.org/jira/browse/SPARK-24498 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > In some cases, JDK compiler can generate smaller bytecode and take less time > in compilation compared to Janino. However, in some cases, Janino is better. > We should support both for our runtime codegen. Janino will be still our > default runtime codegen compiler. > See the related JIRAs in DRILL: > - https://issues.apache.org/jira/browse/DRILL-1155 > - https://issues.apache.org/jira/browse/DRILL-4778 > - https://issues.apache.org/jira/browse/DRILL-5696 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns
[ https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504845#comment-16504845 ] Kazuaki Ishizaki commented on SPARK-24486: -- Thank you for reporting a problem. Could you please let us know which value is shown for each of three results in `sum(...)`? > Slow performance reading ArrayType columns > -- > > Key: SPARK-24486 > URL: https://issues.apache.org/jira/browse/SPARK-24486 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0 >Reporter: Luca Canali >Priority: Minor > > We have found an issue of slow performance in one of our applications when > running on Spark 2.3.0 (the same workload does not have a performance issue > on Spark 2.2.1). We suspect a regression in the area of handling columns of > ArrayType. I have built a simplified test case showing a manifestation of the > issue to help with troubleshooting: > > > {code:java} > // prepare test data > val stringListValues=Range(1,3).mkString(",") > sql(s"select 1 as myid, Array($stringListValues) as myarray from > range(2)").repartition(1).write.parquet("file:///tmp/deleteme1") > // run test > spark.read.parquet("file:///tmp/deleteme1").limit(1).show(){code} > Performance measurements: > > On a desktop-size test system, the test runs in about 2 sec using Spark 2.2.1 > (runtime goes down to subsecond in subsequent runs) and takes close to 20 sec > on Spark 2.3.0 > > Additional drill-down using Spark task metrics data, show that in Spark 2.2.1 > only 2 records are read by this workload, while on Spark 2.3.0 all rows in > the file are read, which appears anomalous. > Example: > {code:java} > bin/spark-shell --master local[*] --driver-memory 2g --packages > ch.cern.sparkmeasure:spark-measure_2.11:0.11 > val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) > stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show()) > {code} > > > Selected metrics from Spark 2.3.0 run: > > {noformat} > elapsedTime => 17849 (18 s) > sum(numTasks) => 11 > sum(recordsRead) => 2 > sum(bytesRead) => 1136448171 (1083.0 MB){noformat} > > > From Spark 2.2.1 run: > > {noformat} > elapsedTime => 1329 (1 s) > sum(numTasks) => 2 > sum(recordsRead) => 2 > sum(bytesRead) => 269162610 (256.0 MB) > {noformat} > > Note: Using Spark built from master (as I write this, June 7th 2018) shows > the same behavior as found in Spark 2.3.0 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
Re: Strange codegen error for SortMergeJoin in Spark 2.2.1
Thank you for reporting a problem. Would it be possible to create a JIRA entry with a small program that can reproduce this problem? Best Regards, Kazuaki Ishizaki From: Rico Bergmann To: "user@spark.apache.org" Date: 2018/06/05 19:58 Subject:Strange codegen error for SortMergeJoin in Spark 2.2.1 Hi! I get a strange error when executing a complex SQL-query involving 4 tables that are left-outer-joined: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 18: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 37, Column 18: No applicable constructor/method found for actual parameters "int"; candidates are: "org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(org.apache.spark.memory.TaskMemoryManager, org.apache.spark.storage.BlockManager, org.apache.spark.serializer.SerializerManager, org.apache.spark.TaskContext, int, long, int, int)", "org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(int, int)" ... /* 037 */ smj_matches = new org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483647); The same query works with Spark 2.2.0. I checked the Spark source code and saw that in ExternalAppendOnlyUnsafeRowArray a second int was introduced into the constructor in 2.2.1 But looking at the codegeneration part of SortMergeJoinExec: // A list to hold all matched rows from right side. val matches = ctx.freshName("matches") val clsName = classOf[ExternalAppendOnlyUnsafeRowArray].getName val spillThreshold = getSpillThreshold val inMemoryThreshold = getInMemoryThreshold ctx.addMutableState(clsName, matches, s"$matches = new $clsName($inMemoryThreshold, $spillThreshold);") it should get 2 parameters, not just one. May be anyone has an idea? Best, Rico.
[jira] [Created] (SPARK-24452) long = int*int or long = int+int may cause overflow.
Kazuaki Ishizaki created SPARK-24452: Summary: long = int*int or long = int+int may cause overflow. Key: SPARK-24452 URL: https://issues.apache.org/jira/browse/SPARK-24452 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki The following assignments can cause overflow on the right-hand side. As a result, the value assigned may be negative. {code:java} long = int*int long = int+int{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
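A minimal sketch of the problem described here; the variable names and values are made up for the example.

{code:scala}
object LongFromIntOverflowSketch {
  def main(args: Array[String]): Unit = {
    val numRecords: Int = 100000
    val recordSize: Int = 30000

    // The multiplication is evaluated in Int and wraps around before the
    // (already negative) result is widened to Long.
    val wrong: Long = numRecords * recordSize
    println(wrong)                        // -1294967296

    // Fix: widen one operand first so the arithmetic itself happens in Long.
    val right: Long = numRecords.toLong * recordSize
    println(right)                        // 3000000000

    // Math.multiplyExact(numRecords, recordSize) would instead throw on overflow.
  }
}
{code}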
[jira] [Created] (SPARK-24323) Java lint errors
Kazuaki Ishizaki created SPARK-24323: Summary: Java lint errors Key: SPARK-24323 URL: https://issues.apache.org/jira/browse/SPARK-24323 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki The following errors occur when running lint-java: {code:java} [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java:[39] (sizes) LineLength: Line is longer than 100 characters (found 104). [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[26] (sizes) LineLength: Line is longer than 100 characters (found 110). [ERROR] src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[30] (sizes) LineLength: Line is longer than 100 characters (found 104). {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480482#comment-16480482 ] Kazuaki Ishizaki commented on SPARK-24314: -- I am working on this. > interpreted element_at or GetMapValue does not work for complex types > - > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reopened SPARK-24314: -- > interpreted element_at or GetMapValue does not work for complex types > - > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-24314: - Summary: interpreted element_at or GetMapValue does not work for complex types (was: interpreted array_position does not work for complex types) > interpreted element_at or GetMapValue does not work for complex types > - > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24314) interpreted array_position does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-24314. -- Resolution: Duplicate > interpreted array_position does not work for complex types > -- > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24314) interpreted array_position does not work for complex types
Kazuaki Ishizaki created SPARK-24314: Summary: interpreted array_position does not work for complex types Key: SPARK-24314 URL: https://issues.apache.org/jira/browse/SPARK-24314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24273) Failure while using .checkpoint method
[ https://issues.apache.org/jira/browse/SPARK-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475941#comment-16475941 ] Kazuaki Ishizaki commented on SPARK-24273: -- Thank you for reporting this issue. Would it be possible to attach a standalone program that can reproduce it? > Failure while using .checkpoint method > -- > > Key: SPARK-24273 > URL: https://issues.apache.org/jira/browse/SPARK-24273 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.0 >Reporter: Jami Malikzade >Priority: Major > > We are getting the following error: > com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS > Service: Amazon S3, AWS Request ID: > tx14126-005ae9bfd9-9ed9ac2-default, AWS Error Code: > InvalidRange, AWS Error Message: null, S3 Extended Request ID: > 9ed9ac2-default-default" > when we use the checkpoint method as below. > val streamBucketDF = streamPacketDeltaDF > .filter('timeDelta > maxGap && 'timeDelta <= 3) > .withColumn("bucket", when('timeDelta <= mediumGap, "medium") > .otherwise("large") > ) > .checkpoint() > Do you have an idea how to prevent the invalid range header from being sent, or how it > can be worked around or fixed? > Thanks. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24220) java.lang.NullPointerException at org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
[ https://issues.apache.org/jira/browse/SPARK-24220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472195#comment-16472195 ] Kazuaki Ishizaki commented on SPARK-24220: -- Thank you for reporting an issue. Would it be possible to post standalone reproduable program? This program seems to connect to an external database or something thru {{DriverManager.getConnection(adminUrl)}}. > java.lang.NullPointerException at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83) > > > Key: SPARK-24220 > URL: https://issues.apache.org/jira/browse/SPARK-24220 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.2.0 >Reporter: joy-m >Priority: Major > > def getInputStream(rows:Iterator[Row]): PipedInputStream ={ > printMem("before gen string") > val pipedOutputStream = new PipedOutputStream() > (new Thread() { > override def run(){ > if(rows == null){ > logError("rows is null==>") > }else{ > println(s"record-start-${rows.length}") > try { > while (rows.hasNext) { > val row = rows.next() > println(row) > val str = row.mkString("\001") + "\r\n" > println(str) > pipedOutputStream.write(str.getBytes(StandardCharsets.UTF_8)) > } > println("record-end-") > pipedOutputStream.close() > } catch { > case ex:Exception => > ex.printStackTrace() > } > } > } > }).start() > println("pipedInPutStream--") > val pipedInPutStream = new PipedInputStream() > pipedInPutStream.connect(pipedOutputStream) > println("pipedInPutStream--- conn---") > printMem("after gen string") > pipedInPutStream > } > resDf.coalesce(15).foreachPartition(rows=>{ > if(rows == null){ > logError("rows is null=>") > }else{ > val copyCmd = s"COPY ${tableName} FROM STDIN with DELIMITER as '\001' NULL > as 'null string'" > var con: Connection = null > try { > con = DriverManager.getConnection(adminUrl) > val copyManager = new CopyManager(con.asInstanceOf[BaseConnection]) > val start = System.currentTimeMillis() > var count: Long = 0 > var copyCount: Long = 0 > println("before copyManager=>") > copyCount += copyManager.copyIn(copyCmd, getInputStream(rows)) > println("after copyManager=>") > val finish = System.currentTimeMillis() > println("copyCount:" + copyCount + " count:" + count + " time(s):" + (finish > - start) / 1000) > con.close() > } catch { > case ex:Exception => > ex.printStackTrace() > println(s"copyIn error!${ex.toString}") > } finally { > try { > if (con != null) { > con.close() > } > } catch { > case ex:SQLException => > ex.printStackTrace() > println(s"copyIn error!${ex.toString}") > } > } > } > > 18/05/09 13:31:30 ERROR util.SparkUncaughtExceptionHandler: Uncaught > exception in thread Thread[Thread-4,5,main] > java.lang.NullPointerException > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83) > at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown > Source) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:392) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:389) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) >
Re: SparkR test failures in PR builder
I am not familiar with SparkR or CRAN. However, I remember that we had a similar situation before. Here is great work from that time. Having just visited this PR, I think that we have a similar situation (i.e. a format error) again. https://github.com/apache/spark/pull/20005 Any other comments are appreciated. Regards, Kazuaki Ishizaki From: Joseph Bradley <jos...@databricks.com> To: dev <dev@spark.apache.org> Cc: Hossein Falaki <hoss...@databricks.com> Date: 2018/05/03 07:31 Subject: SparkR test failures in PR builder Hi all, Does anyone know why the PR builder keeps failing on SparkR's CRAN checks? I've seen this in a lot of unrelated PRs. E.g.: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console Hossein spotted this line: ``` * checking CRAN incoming feasibility ...Error in .check_package_CRAN_incoming(pkgdir) : dims [product 24] do not match the length of object [0] ``` and suggested that it could be CRAN flakiness. I'm not familiar with CRAN, but do others have thoughts about how to fix this? Thanks! Joseph -- Joseph Bradley Software Engineer - Machine Learning Databricks, Inc.
[jira] [Commented] (SPARK-24119) Add interpreted execution to SortPrefix expression
[ https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458297#comment-16458297 ] Kazuaki Ishizaki commented on SPARK-24119: -- It seems to make sense. It would be good to set this JIRA as a subtask of SPARK-23580. > Add interpreted execution to SortPrefix expression > -- > > Key: SPARK-24119 > URL: https://issues.apache.org/jira/browse/SPARK-24119 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Bruce Robbins >Priority: Minor > > [~hvanhovell] [~kiszk] > I noticed SortPrefix did not support interpreted execution when I was testing > the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for > adding interpreted execution (SPARK-23580). > Since I had to implement interpreted execution for SortPrefix to complete > testing, I am creating this Jira. If there's no good reason why eval wasn't > implemented, I will make the PR in a few days. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452404#comment-16452404 ] Kazuaki Ishizaki edited comment on SPARK-23933 at 4/25/18 6:48 PM: --- Thank you for your comment. The current map can take the even number of arguments (e.g. 2, 4, 6, 8 ...) due to a pair of key and map. We can determine {{map(1.0, '2', 3.0, '4') or map(1.0, '2')}} should be behave as currently. How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}? Or How about {{CreateMap(Seq(CreateArray(sSeq.map(Literal(\_))), CreateArray(iSeq.map(Literal(\_)}}? was (Author: kiszk): Thank you for your comment. The current map can take the even number of arguments (e.g. 2, 4, 6, 8 ...) due to a pair of key and map. We can determine {{map(1.0, '2', 3.0, '4') or map(1.0, '2')}} should be behave as currently. How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}? > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452404#comment-16452404 ] Kazuaki Ishizaki commented on SPARK-23933: -- Thank you for your comment. The current map can take an even number of arguments (e.g. 2, 4, 6, 8, ...) since the arguments come in key/value pairs. We can decide that {{map(1.0, '2', 3.0, '4')}} or {{map(1.0, '2')}} should behave as they currently do. How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}? > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450262#comment-16450262 ] Kazuaki Ishizaki commented on SPARK-23933: -- cc [~smilegator] > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)
[ https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446166#comment-16446166 ] Kazuaki Ishizaki commented on SPARK-10399: -- https://issues.apache.org/jira/browse/SPARK-23879 is the following JIRA entry. > Off Heap Memory Access for non-JVM libraries (C++) > -- > > Key: SPARK-10399 > URL: https://issues.apache.org/jira/browse/SPARK-10399 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Paul Weiss >Priority: Major > > *Summary* > Provide direct off-heap memory access to an external non-JVM program such as > a c++ library within the Spark running JVM/executor. As Spark moves to > storing all data into off heap memory it makes sense to provide access points > to the memory for non-JVM programs. > > *Assumptions* > * Zero copies will be made during the call into non-JVM library > * Access into non-JVM libraries will be accomplished via JNI > * A generic JNI interface will be created so that developers will not need to > deal with the raw JNI call > * C++ will be the initial target non-JVM use case > * memory management will remain on the JVM/Spark side > * the API from C++ will be similar to dataframes as much as feasible and NOT > require expert knowledge of JNI > * Data organization and layout will support complex (multi-type, nested, > etc.) types > > *Design* > * Initially Spark JVM -> non-JVM will be supported > * Creating an embedded JVM with Spark running from a non-JVM program is > initially out of scope > > *Technical* > * GetDirectBufferAddress is the JNI call used to access byte buffer without > copy -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
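Purely as an illustration of the JVM-side shape such an interface might take, here is a rough sketch. Everything below is hypothetical: the library name and the processColumn native method do not exist anywhere; the zero-copy access would happen in the C++ implementation, which obtains the raw pointer via the JNI GetDirectBufferAddress call mentioned above.

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

object OffHeapJniSketch {
  // Hypothetical native entry point, implemented by a C++ library that calls
  // GetDirectBufferAddress(buf) to read the same off-heap memory without a copy.
  @native def processColumn(buf: ByteBuffer, numValues: Int): Double

  def main(args: Array[String]): Unit = {
    // System.loadLibrary("nativecolumn")   // hypothetical library name

    // A direct (off-heap) buffer; in the scenario described above, Spark's
    // own off-heap column memory would be wrapped rather than allocated here.
    val buf = ByteBuffer.allocateDirect(1024 * 8).order(ByteOrder.LITTLE_ENDIAN)
    (0 until 1024).foreach(i => buf.putDouble(i * 8, i.toDouble))

    // val sum = processColumn(buf, 1024)   // callable once the native library is loaded
  }
}
{code}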
[jira] [Comment Edited] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436976#comment-16436976 ] Kazuaki Ishizaki edited comment on SPARK-23933 at 4/18/18 4:22 PM: --- [~smilegator] [~ueshin] Could you favor us? SparkSQL already uses syntax of {{map}} function for the different purpose. Even if we limit two array in the argument list, we may have conflict between this new feature and creating a map with one entry having an array for key and value. Do you have any good idea? {code} @ExpressionDescription( usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.", examples = """ Examples: > SELECT _FUNC_(1.0, '2', 3.0, '4'); {1.0:"2",3.0:"4"} """) case class CreateMap(children: Seq[Expression]) extends Expression { ... {code} was (Author: kiszk): [~smilegator] [~ueshin] Could you favor us? SparkSQL already uses syntax of {{map}} function for the similar purpose. Even if we limit two array in the argument list, we may have conflict between this new feature and creating a map with one entry having an array for key and value. Do you have any good idea? {code} @ExpressionDescription( usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.", examples = """ Examples: > SELECT _FUNC_(1.0, '2', 3.0, '4'); {1.0:"2",3.0:"4"} """) case class CreateMap(children: Seq[Expression]) extends Expression { ... {code} > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441268#comment-16441268 ] Kazuaki Ishizaki commented on SPARK-23933: -- ping [~smilegator] > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438885#comment-16438885 ] Kazuaki Ishizaki commented on SPARK-23986: -- While I also checked it with branch-2.3, it works well without any exception. > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Priority: Major > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I realized that the > generated {{doConsume}} method is responsible of the exception. > Indeed, {{avg}} calls several times > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. > The 1st time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12}}. We then have 2 parameter name > conflicts in the generated code: {{agg_expr_11}} and {{agg_expr_12}}. > Appending the 'id' in s"$fullName$id" to generate unique term name is source > of conflict. Maybe simply using undersoce can solve this issue : > $fullName_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438818#comment-16438818 ] Kazuaki Ishizaki edited comment on SPARK-23986 at 4/15/18 7:36 PM: --- Thank for reporting an issue with deep dive. When I run this repro with the latest master, it works well without an exception. When I checked the generated code, I cannot find variables {{agg_expr_[21|31|41|51|61]}}. I will check it with branch-2.3 tomorrow. Would it be possible to attach the log file of the generated code? was (Author: kiszk): Thank for reporting an issue with deep dive. When I run this repro with the latest master, it works well without an exception. When I checked the generated code, I cannot find variables {{agg_expr_[21|31|41|51|61]}}. Would it be possible to attach the log file of the generated code? > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Priority: Major > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I realized that the > generated {{doConsume}} method is responsible of the exception. > Indeed, {{avg}} calls several times > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. > The 1st time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12}}. We then have 2 parameter name > conflicts in the generated code: {{agg_expr_11}} and {{agg_expr_12}}. > Appending the 'id' in s"$fullName$id" to generate unique term name is source > of conflict. Maybe simply using undersoce can solve this issue : > $fullName_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining
[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438818#comment-16438818 ] Kazuaki Ishizaki commented on SPARK-23986: -- Thank for reporting an issue with deep dive. When I run this repro with the latest master, it works well without an exception. When I checked the generated code, I cannot find variables {{agg_expr_[21|31|41|51|61]}}. Would it be possible to attach the log file of the generated code? > CompileException when using too many avg aggregation after joining > -- > > Key: SPARK-23986 > URL: https://issues.apache.org/jira/browse/SPARK-23986 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Michel Davit >Priority: Major > > Considering the following code: > {code:java} > val df1: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6))) > .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6") > val df2: DataFrame = sparkSession.sparkContext > .makeRDD(Seq((0, "val1", "val2"))) > .toDF("key", "dummy1", "dummy2") > val agg = df1 > .join(df2, df1("key") === df2("key"), "leftouter") > .groupBy(df1("key")) > .agg( > avg("col2").as("avg2"), > avg("col3").as("avg3"), > avg("col4").as("avg4"), > avg("col1").as("avg1"), > avg("col5").as("avg5"), > avg("col6").as("avg6") > ) > val head = agg.take(1) > {code} > This logs the following exception: > {code:java} > ERROR CodeGenerator: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 467, Column 28: Redefinition of parameter "agg_expr_11" > {code} > I am not a spark expert but after investigation, I realized that the > generated {{doConsume}} method is responsible of the exception. > Indeed, {{avg}} calls several times > {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}}. > The 1st time with the 'avg' Expr and a second time for the base aggregation > Expr (count and sum). > The problem comes from the generation of parameters in CodeGenerator: > {code:java} > /** >* Returns a term name that is unique within this instance of a > `CodegenContext`. >*/ > def freshName(name: String): String = synchronized { > val fullName = if (freshNamePrefix == "") { > name > } else { > s"${freshNamePrefix}_$name" > } > if (freshNameIds.contains(fullName)) { > val id = freshNameIds(fullName) > freshNameIds(fullName) = id + 1 > s"$fullName$id" > } else { > freshNameIds += fullName -> 1 > fullName > } > } > {code} > The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call. > The second call is made with {{agg_expr_[1..12]}} and generates the > following names: > {{agg_expr_[11|21|31|41|51|61|11|12}}. We then have 2 parameter name > conflicts in the generated code: {{agg_expr_11}} and {{agg_expr_12}}. > Appending the 'id' in s"$fullName$id" to generate unique term name is source > of conflict. Maybe simply using undersoce can solve this issue : > $fullName_$id" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
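The collision can be reproduced outside Spark with a few lines that mimic the freshName logic quoted above; the simulated call sequence (6 names in the first pass, 12 in the second) follows the description in this ticket.

{code:scala}
import scala.collection.mutable

object FreshNameCollisionSketch {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  // Same shape as the freshName method quoted above, minus the prefix handling.
  def freshName(name: String): String = {
    if (freshNameIds.contains(name)) {
      val id = freshNameIds(name)
      freshNameIds(name) = id + 1
      s"$name$id"                       // id appended directly to the name
    } else {
      freshNameIds += name -> 1
      name
    }
  }

  def main(args: Array[String]): Unit = {
    (1 to 6).foreach(i => freshName(s"agg_expr_$i"))             // first doConsume call
    val second = (1 to 12).map(i => freshName(s"agg_expr_$i"))   // second doConsume call
    println(second.mkString(", "))
    // agg_expr_11, agg_expr_21, ..., agg_expr_61, agg_expr_7, ..., agg_expr_11, agg_expr_12
    // "agg_expr_11" is produced twice: once as agg_expr_1 with id 1 appended,
    // once as the genuinely new name agg_expr_11 -- the redefinition reported above.
  }
}
{code}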
[jira] [Created] (SPARK-23976) UTF8String.concat() or ByteArray.concat() may allocate shorter structure.
Kazuaki Ishizaki created SPARK-23976: Summary: UTF8String.concat() or ByteArray.concat() may allocate shorter structure. Key: SPARK-23976 URL: https://issues.apache.org/jira/browse/SPARK-23976 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki When the three inputs have lengths `0x7FFF_FF00`, `0x7FFF_FF00`, and `0xE00`, the current algorithm allocates the result structure with length 0x1000 due to integer sum overflow. We should detect the overflow. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
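A minimal sketch of the failing length computation and one way to detect it; the lengths are the ones from the description, and Math.addExact is just one possible overflow check.

{code:scala}
object ConcatLengthOverflowSketch {
  def main(args: Array[String]): Unit = {
    val lengths = Seq(0x7FFFFF00, 0x7FFFFF00, 0xE00)

    // Current behaviour: the Int sum wraps around, so a far-too-small
    // result buffer would be allocated for the concatenation.
    val wrapped: Int = lengths.sum
    println(f"wrapped total length = 0x$wrapped%08X")

    // With an overflow check, the bad allocation can be refused instead.
    val checked: Either[String, Int] =
      try Right(lengths.foldLeft(0)((acc, len) => Math.addExact(acc, len)))
      catch {
        case _: ArithmeticException =>
          Left("total length exceeds Int.MaxValue; refusing to allocate")
      }
    println(checked)
  }
}
{code}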
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436976#comment-16436976 ] Kazuaki Ishizaki commented on SPARK-23933: -- [~smilegator] [~ueshin] Could you give us your thoughts? Spark SQL already uses the {{map}} function syntax for a similar purpose. Even if we limit the argument list to two arrays, we may have a conflict between this new feature and creating a map with one entry whose key and value are both arrays. Do you have any good ideas? {code} @ExpressionDescription( usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the given key/value pairs.", examples = """ Examples: > SELECT _FUNC_(1.0, '2', 3.0, '4'); {1.0:"2",3.0:"4"} """) case class CreateMap(children: Seq[Expression]) extends Expression { ... {code} > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
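A small illustration of the conflict, assuming a local session; the first query reflects the current behaviour, while the second, Presto-style form is the proposed one and does not run on Spark today.

{code:scala}
import org.apache.spark.sql.SparkSession

object MapSyntaxAmbiguitySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("map-ambiguity").getOrCreate()

    // Today: arguments are key/value pairs, so two array arguments build a
    // single-entry map whose key and value are both arrays.
    spark.sql("SELECT map(array(1, 3), array(2, 4))").show(false)
    // one row: a single entry with key [1, 3] and value [2, 4]

    // Proposed (hypothetical here): the same two arrays would instead be
    // zipped into {1 -> 2, 3 -> 4}, which is exactly where the two readings clash.
    // SELECT map(ARRAY[1, 3], ARRAY[2, 4]);   -- Presto syntax, not valid Spark SQL

    spark.stop()
  }
}
{code}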
[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>
[ https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436136#comment-16436136 ] Kazuaki Ishizaki commented on SPARK-23933: -- I will work on this, thank you. > High-order function: map(array, array) → map<K,V> > --- > > Key: SPARK-23933 > URL: https://issues.apache.org/jira/browse/SPARK-23933 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/map.html > Returns a map created using the given key/value arrays. > {noformat} > SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4} > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23915) High-order function: array_except(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435666#comment-16435666 ] Kazuaki Ishizaki commented on SPARK-23915: -- I will work on this, thanks. > High-order function: array_except(x, y) → array > --- > > Key: SPARK-23915 > URL: https://issues.apache.org/jira/browse/SPARK-23915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of elements in x but not in y, without duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23914) High-order function: array_union(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434494#comment-16434494 ] Kazuaki Ishizaki commented on SPARK-23914: -- I will work on this, thank you. > High-order function: array_union(x, y) → array > -- > > Key: SPARK-23914 > URL: https://issues.apache.org/jira/browse/SPARK-23914 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the union of x and y, without duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23913) High-order function: array_intersect(x, y) → array
[ https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434491#comment-16434491 ] Kazuaki Ishizaki commented on SPARK-23913: -- I will work on this, thank you. > High-order function: array_intersect(x, y) → array > -- > > Key: SPARK-23913 > URL: https://issues.apache.org/jira/browse/SPARK-23913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns an array of the elements in the intersection of x and y, without > duplicates. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar
[ https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431795#comment-16431795 ] Kazuaki Ishizaki commented on SPARK-23916: -- Sorry for my mistake regarding a PR with the wrong JIRA number. > High-order function: array_join(x, delimiter, null_replacement) → varchar > - > > Key: SPARK-23916 > URL: https://issues.apache.org/jira/browse/SPARK-23916 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Concatenates the elements of the given array using the delimiter and an > optional string to replace nulls. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23923) High-order function: cardinality(x) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431009#comment-16431009 ] Kazuaki Ishizaki edited comment on SPARK-23923 at 4/9/18 6:36 PM: -- I will work on this. was (Author: kiszk): I am working on this. > High-order function: cardinality(x) → bigint > > > Key: SPARK-23923 > URL: https://issues.apache.org/jira/browse/SPARK-23923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html and > https://prestodb.io/docs/current/functions/map.html. > Returns the cardinality (size) of the array/map x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23921) High-order function: array_sort(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431037#comment-16431037 ] Kazuaki Ishizaki commented on SPARK-23921: -- I am working on this. > High-order function: array_sort(x) → array > -- > > Key: SPARK-23921 > URL: https://issues.apache.org/jira/browse/SPARK-23921 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Sorts and returns the array x. The elements of x must be orderable. Null > elements will be placed at the end of the returned array. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23919) High-order function: array_position(x, element) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431004#comment-16431004 ] Kazuaki Ishizaki edited comment on SPARK-23919 at 4/9/18 6:19 PM: -- I will work on this. was (Author: kiszk): I am working on this. > High-order function: array_position(x, element) → bigint > > > Key: SPARK-23919 > URL: https://issues.apache.org/jira/browse/SPARK-23919 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the position of the first occurrence of the element in array x (or 0 > if not found). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23923) High-order function: cardinality(x) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431009#comment-16431009 ] Kazuaki Ishizaki commented on SPARK-23923: -- I am working on this. > High-order function: cardinality(x) → bigint > > > Key: SPARK-23923 > URL: https://issues.apache.org/jira/browse/SPARK-23923 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html and > https://prestodb.io/docs/current/functions/map.html. > Returns the cardinality (size) of the array/map x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23919) High-order function: array_position(x, element) → bigint
[ https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431004#comment-16431004 ] Kazuaki Ishizaki commented on SPARK-23919: -- I am working on this. > High-order function: array_position(x, element) → bigint > > > Key: SPARK-23919 > URL: https://issues.apache.org/jira/browse/SPARK-23919 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Returns the position of the first occurrence of the element in array x (or 0 > if not found). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23924) High-order function: element_at
[ https://issues.apache.org/jira/browse/SPARK-23924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431002#comment-16431002 ] Kazuaki Ishizaki commented on SPARK-23924: -- I will work on this. > High-order function: element_at > --- > > Key: SPARK-23924 > URL: https://issues.apache.org/jira/browse/SPARK-23924 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html and > https://prestodb.io/docs/current/functions/map.html > * element_at(array, index) → E > Returns element of array at given index. If index > 0, this function provides > the same functionality as the SQL-standard subscript operator ([]). If index > < 0, element_at accesses elements from the last to the first. > * element_at(map<K, V>, key) → V > Returns value for given key, or NULL if the key is not contained in the map. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
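For reference, a minimal Scala sketch of the {{element_at}} semantics described above (1-based index, negative index counting from the end, NULL for a missing map key). It is purely illustrative and not the Spark implementation.
{code}
object ElementAtSketch {
  def elementAt[T](arr: Seq[T], index: Int): T = {
    require(index != 0, "index must not be 0")
    if (index > 0) arr(index - 1)   // same behaviour as the SQL subscript operator []
    else arr(arr.length + index)    // index < 0: count from the last element
  }

  def elementAt[K, V](m: Map[K, V], key: K): Option[V] =
    m.get(key)                      // None plays the role of SQL NULL

  def main(args: Array[String]): Unit = {
    println(elementAt(Seq(10, 20, 30), 1))   // 10
    println(elementAt(Seq(10, 20, 30), -1))  // 30
    println(elementAt(Map("a" -> 1), "b"))   // None
  }
}
{code}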
[jira] [Created] (SPARK-23893) Possible overflow in long = int * int
Kazuaki Ishizaki created SPARK-23893: Summary: Possible overflow in long = int * int Key: SPARK-23893 URL: https://issues.apache.org/jira/browse/SPARK-23893 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki Performing `int * int` and then casting the result to `long` may cause overflow if the MSB of the multiplication result is `1`. In other words, the result would be negative due to sign extension. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
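A small, self-contained Scala example of the pitfall (the values are illustrative and the code is not taken from Spark):
{code}
object IntMulOverflow {
  def main(args: Array[String]): Unit = {
    val a: Int = 1 << 16   // 65536
    val b: Int = 1 << 15   // 32768

    // The multiplication happens in 32-bit int arithmetic first; only the already
    // truncated (and sign-extended) value is widened to long:
    val wrong: Long = a * b
    println(wrong)         // -2147483648, negative due to sign extension

    // Widening one operand first performs the multiplication in 64 bits:
    val right: Long = a.toLong * b
    println(right)         // 2147483648
  }
}
{code}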
[jira] [Updated] (SPARK-23892) Improve coverage and fix lint error in UTF8String-related Suite
[ https://issues.apache.org/jira/browse/SPARK-23892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-23892: - Description: The following code in {{UTF8StringSuite}} makes no sense. {code} assertTrue(s1.startsWith(s1)); assertTrue(s1.endsWith(s1)); {code} The code {{if (length <= 0) ""}} in {{UTF8StringPropertyCheckSuite}} makes no sense {code} test("lpad, rpad") { def padding(origin: String, pad: String, length: Int, isLPad: Boolean): String = { if (length <= 0) return "" if (length <= origin.length) { if (length <= 0) "" else origin.substring(0, length) } else { ... {code} The previous change in {{UTF8StringSuite}} broke the lint-java check. was: The following code in {{UTF8StringSuite}} makes no sense. {code} assertTrue(s1.startsWith(s1)); assertTrue(s1.endsWith(s1)); {code} {code} test("lpad, rpad") { def padding(origin: String, pad: String, length: Int, isLPad: Boolean): String = { if (length <= 0) return "" if (length <= origin.length) { if (length <= 0) "" else origin.substring(0, length) } else { ... {code} The previous change broke the lint-java check. > Improve coverage and fix lint error in UTF8String-related Suite > --- > > Key: SPARK-23892 > URL: https://issues.apache.org/jira/browse/SPARK-23892 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > The following code in {{UTF8StringSuite}} makes no sense. > {code} > assertTrue(s1.startsWith(s1)); > assertTrue(s1.endsWith(s1)); > {code} > The code {{if (length <= 0) ""}} in {{UTF8StringPropertyCheckSuite}} makes no > sense > {code} > test("lpad, rpad") { > def padding(origin: String, pad: String, length: Int, isLPad: Boolean): > String = { > if (length <= 0) return "" > if (length <= origin.length) { > if (length <= 0) "" else origin.substring(0, length) > } else { >... > {code} > The previous change in {{UTF8StringSuite}} broke the lint-java check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
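As an aside, a hedged sketch of what the cleanup could look like; this is not the actual SPARK-23892 patch, only an illustration that the inner {{if (length <= 0) ""}} branch is unreachable because the first line of the helper already returns {{""}} for {{length <= 0}}:
{code}
// Illustrative sketch only, not the actual patch. The elided padding branch from the
// quoted test is kept elided (???), exactly as in the original suite excerpt.
def padding(origin: String, pad: String, length: Int, isLPad: Boolean): String = {
  if (length <= 0) return ""
  if (length <= origin.length) {
    origin.substring(0, length)  // the length <= 0 case is already handled above
  } else {
    // ... (pad on the left or right with `pad`, as in the original suite)
    ???
  }
}
{code}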
[jira] [Updated] (SPARK-23892) Improve coverage and fix lint error in UTF8String-related Suite
[ https://issues.apache.org/jira/browse/SPARK-23892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-23892: - Description: The following code in {{UTF8StringSuite}} makes no sense. {code} assertTrue(s1.startsWith(s1)); assertTrue(s1.endsWith(s1)); {code} {code} test("lpad, rpad") { def padding(origin: String, pad: String, length: Int, isLPad: Boolean): String = { if (length <= 0) return "" if (length <= origin.length) { if (length <= 0) "" else origin.substring(0, length) } else { ... {code} The previous change broke the lint-java check. was: The following code in {{UTF8StringSuite}} makes no sense. {code} assertTrue(s1.startsWith(s1)); assertTrue(s1.endsWith(s1)); {code} The previous change broke the lint-java check. > Improve coverage and fix lint error in UTF8String-related Suite > --- > > Key: SPARK-23892 > URL: https://issues.apache.org/jira/browse/SPARK-23892 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > The following code in {{UTF8StringSuite}} makes no sense. > {code} > assertTrue(s1.startsWith(s1)); > assertTrue(s1.endsWith(s1)); > {code} > {code} > test("lpad, rpad") { > def padding(origin: String, pad: String, length: Int, isLPad: Boolean): > String = { > if (length <= 0) return "" > if (length <= origin.length) { > if (length <= 0) "" else origin.substring(0, length) > } else { >... > {code} > The previous change broke the lint-java check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23892) Improve coverage and fix lint error in UTF8String-related Suite
[ https://issues.apache.org/jira/browse/SPARK-23892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-23892: - Summary: Improve coverage and fix lint error in UTF8String-related Suite (was: Improve coverage and fix lint error in UTF8StringSuite) > Improve coverage and fix lint error in UTF8String-related Suite > --- > > Key: SPARK-23892 > URL: https://issues.apache.org/jira/browse/SPARK-23892 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Minor > > The following code in {{UTF8StringSuite}} makes no sense. > {code} > assertTrue(s1.startsWith(s1)); > assertTrue(s1.endsWith(s1)); > {code} > The previous change broke the lint-java check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23892) Improve coverage and fix lint error in UTF8StringSuite
Kazuaki Ishizaki created SPARK-23892: Summary: Improve coverage and fix lint error in UTF8StringSuite Key: SPARK-23892 URL: https://issues.apache.org/jira/browse/SPARK-23892 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki The following code in {{UTF8StringSuite}} makes no sense. {code} assertTrue(s1.startsWith(s1)); assertTrue(s1.endsWith(s1)); {code} The previous change broke the lint-java check. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23882) Is UTF8StringSuite.writeToOutputStreamUnderflow() supported?
Kazuaki Ishizaki created SPARK-23882: Summary: Is UTF8StringSuite.writeToOutputStreamUnderflow() supported? Key: SPARK-23882 URL: https://issues.apache.org/jira/browse/SPARK-23882 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.0 Reporter: Kazuaki Ishizaki The unit test {{UTF8StringSuite.writeToOutputStreamUnderflow()}} accesses the metadata area of a Java byte array object, which {{Platform.BYTE_ARRAY_OFFSET}} reserves. Is this test valid? Is this test necessary for the Spark implementation? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23762) UTF8StringBuilder uses MemoryBlock
[ https://issues.apache.org/jira/browse/SPARK-23762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-23762: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-23879 > UTF8StringBuilder uses MemoryBlock > -- > > Key: SPARK-23762 > URL: https://issues.apache.org/jira/browse/SPARK-23762 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > This JIRA entry tries to use {{MemoryBlock}} in {{UTF8StringBuilder}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org