[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25728:
-
External issue ID: 
https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>    Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2018-10-14 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25728:


 Summary: SPIP: Structured Intermediate Representation (Tungsten 
IR) for generating Java code
 Key: SPARK-25728
 URL: https://issues.apache.org/jira/browse/SPARK-25728
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (ARROW-3476) [Java] mvn test in memory fails on a big-endian platform

2018-10-09 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created ARROW-3476:
---

 Summary: [Java] mvn test in memory fails on a big-endian platform
 Key: ARROW-3476
 URL: https://issues.apache.org/jira/browse/ARROW-3476
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Kazuaki Ishizaki


On a big-endian platform, {{mvn test}} in the {{java/memory}} module fails due to 
an assertion.
In the {{TestEndianess.testLittleEndian}} test suite, the assertion fires while 
allocating a {{RootAllocator}}.

{code}
$ uname -a
Linux ppc64be.novalocal 4.5.7-300.fc24.ppc64 #1 SMP Fri Jun 10 20:29:32 UTC 
2016 ppc64 ppc64 ppc64 GNU/Linux
$ arch  
ppc64
$ cd java/memory
$ mvn test
[INFO] Scanning for projects...
[INFO] 
[INFO] 
[INFO] Building Arrow Memory 0.12.0-SNAPSHOT
[INFO] 
[INFO] 
...
[INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.082 s 
- in org.apache.arrow.memory.TestAccountant
[INFO] Running org.apache.arrow.memory.TestLowCostIdentityHashMap
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.012 s 
- in org.apache.arrow.memory.TestLowCostIdentityHashMap
[INFO] Running org.apache.arrow.memory.TestBaseAllocator
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.746 s 
<<< FAILURE! - in org.apache.arrow.memory.TestEndianess
[ERROR] testLittleEndian(org.apache.arrow.memory.TestEndianess)  Time elapsed: 
0.313 s  <<< ERROR!
java.lang.ExceptionInInitializerError
at 
org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31)
Caused by: java.lang.IllegalStateException: Arrow only runs on LittleEndian 
systems.
at 
org.apache.arrow.memory.TestEndianess.testLittleEndian(TestEndianess.java:31)

[ERROR] Tests run: 22, Failures: 0, Errors: 21, Skipped: 1, Time elapsed: 0.055 
s <<< FAILURE! - in org.apache.arrow.memory.TestBaseAllocator
...
{code}
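
For reference, a minimal standalone sketch (not Arrow's own check) of inspecting the 
JVM's native byte order, which is what distinguishes this ppc64 box from the 
little-endian platforms the assertion expects:

{code}
import java.nio.ByteOrder

// Print the platform byte order; on the ppc64 machine above this reports
// BIG_ENDIAN, which is why the "Arrow only runs on LittleEndian systems"
// assertion fires while the RootAllocator is being set up.
object EndianCheck {
  def main(args: Array[String]): Unit = {
    val order = ByteOrder.nativeOrder()
    println(s"native byte order: $order")
    println(s"little endian: ${order == ByteOrder.LITTLE_ENDIAN}")
  }
}
{code}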



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)



[jira] [Resolved] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs

2018-10-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-25497.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

> limit operation within whole stage codegen should not consume all the inputs
> 
>
> Key: SPARK-25497
> URL: https://issues.apache.org/jira/browse/SPARK-25497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> This issue was discovered during https://github.com/apache/spark/pull/21738 . 
> It turns out that limit is not whole-stage-codegened correctly and always 
> consumes all the inputs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25497) limit operation within whole stage codegen should not consume all the inputs

2018-10-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reassigned SPARK-25497:


Assignee: Wenchen Fan

> limit operation within whole stage codegen should not consume all the inputs
> 
>
> Key: SPARK-25497
> URL: https://issues.apache.org/jira/browse/SPARK-25497
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
> This issue was discovered during https://github.com/apache/spark/pull/21738 . 
> It turns out that limit is not whole-stage-codegened correctly and always 
> consumes all the inputs



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344
 ] 

Kazuaki Ishizaki edited comment on SPARK-25538 at 10/1/18 5:21 PM:
---

This test case does not print {{63}} when run against the master branch.

{code}
  test("test2") {
val df = spark.read.parquet("file:///SPARK-25538-repro")
val c1 = df.distinct.count
val c2 = df.sort("col_0").distinct.count
val c3 = df.withColumnRenamed("col_0", "new").distinct.count
val c0 = df.count
print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}


was (Author: kiszk):
This test case does not print {{63}}.

{code}
  test("test2") {
val df = spark.read.parquet("file:///SPARK-25538-repro")
val c1 = df.distinct.count
val c2 = df.sort("col_0").distinct.count
val c3 = df.withColumnRenamed("col_0", "new").distinct.count
val c0 = df.count
print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-10-01 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634344#comment-16634344
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

This test case does not print {{63}}.

{code}
  test("test2") {
val df = spark.read.parquet("file:///SPARK-25538-repro")
val c1 = df.distinct.count
val c2 = df.sort("col_0").distinct.count
val c3 = df.withColumnRenamed("col_0", "new").distinct.count
val c0 = df.count
print(s"c1=$c1, c2=$c2, c3=$c3, c0=$c0\n")
  }

c1=64, c2=73, c3=64, c0=123
{code}

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Blocker
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-30 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633568#comment-16633568
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

Thank you. I will check it tonight in Japan.

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
> Attachments: SPARK-25538-repro.tgz
>
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-28 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16631605#comment-16631605
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

Thank you for uploading the schema. Even after looking at it, I am still not sure 
about the cause of this problem.
I would appreciate it if you could find input data that reproduces the 
problem.

> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25538) incorrect row counts after distinct()

2018-09-26 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629281#comment-16629281
 ] 

Kazuaki Ishizaki commented on SPARK-25538:
--

Hi [~Steven Rand], would it be possible to share the schema of this DataFrame?


> incorrect row counts after distinct()
> -
>
> Key: SPARK-25538
> URL: https://issues.apache.org/jira/browse/SPARK-25538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Reproduced on a Centos7 VM and from source in Intellij 
> on OS X.
>Reporter: Steven Rand
>Priority: Major
>  Labels: correctness
>
> It appears that {{df.distinct.count}} can return incorrect values after 
> SPARK-23713. It's possible that other operations are affected as well; 
> {{distinct}} just happens to be the one that we noticed. I believe that this 
> issue was introduced by SPARK-23713 because I can't reproduce it until that 
> commit, and I've been able to reproduce it after that commit as well as with 
> {{tags/v2.4.0-rc1}}. 
> Below are example spark-shell sessions to illustrate the problem. 
> Unfortunately the data used in these examples can't be uploaded to this Jira 
> ticket. I'll try to create test data which also reproduces the issue, and 
> will upload that if I'm able to do so.
> Example from Spark 2.3.1, which behaves correctly:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 115
> {code}
> Example from Spark 2.4.0-rc1, which returns different output:
> {code}
> scala> val df = spark.read.parquet("hdfs:///data")
> df: org.apache.spark.sql.DataFrame = []
> scala> df.count
> res0: Long = 123
> scala> df.distinct.count
> res1: Long = 116
> scala> df.sort("col_0").distinct.count
> res2: Long = 123
> scala> df.withColumnRenamed("col_0", "newName").distinct.count
> res3: Long = 115
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25487) Refactor PrimitiveArrayBenchmark

2018-09-21 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-25487.
--
   Resolution: Fixed
 Assignee: Chenxiao Mao
Fix Version/s: 2.5.0

Issue resolved by pull request 22497
https://github.com/apache/spark/pull/22497

> Refactor PrimitiveArrayBenchmark
> 
>
> Key: SPARK-25487
> URL: https://issues.apache.org/jira/browse/SPARK-25487
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Chenxiao Mao
>Assignee: Chenxiao Mao
>Priority: Major
> Fix For: 2.5.0
>
>
> Refactor PrimitiveArrayBenchmark to use a main method and print the output to a 
> separate file.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25432) Consider if using standard getOrCreate from PySpark into JVM SparkSession would simplify code

2018-09-19 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621416#comment-16621416
 ] 

Kazuaki Ishizaki commented on SPARK-25432:
--

nit: description seems to be in {{environment}} now. 

> Consider if using standard getOrCreate from PySpark into JVM SparkSession 
> would simplify code
> -
>
> Key: SPARK-25432
> URL: https://issues.apache.org/jira/browse/SPARK-25432
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: As we saw in 
> [https://github.com/apache/spark/pull/22295/files] the logic can get a bit 
> out of sync. It _might_ make sense to try and simplify this so there's less 
> duplicated logic in Python & Scala around session set up.
>Reporter: holdenk
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance

2018-09-19 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25437:
-
Comment: was deleted

(was: Is such a feature for major release, not for maintenance release?)

> Using OpenHashMap replace HashMap improve Encoder Performance
> -
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25437) Using OpenHashMap replace HashMap improve Encoder Performance

2018-09-16 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617102#comment-16617102
 ] 

Kazuaki Ishizaki commented on SPARK-25437:
--

Is such a feature intended for a major release rather than a maintenance release?

> Using OpenHashMap replace HashMap improve Encoder Performance
> -
>
> Key: SPARK-25437
> URL: https://issues.apache.org/jira/browse/SPARK-25437
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: wangjiaochun
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25444) Refactor GenArrayData.genCodeToCreateArrayData() method

2018-09-16 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25444:


 Summary: Refactor GenArrayData.genCodeToCreateArrayData() method
 Key: SPARK-25444
 URL: https://issues.apache.org/jira/browse/SPARK-25444
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.5.0
Reporter: Kazuaki Ishizaki


{{GenArrayData.genCodeToCreateArrayData()}} generates Java code that creates a 
temporary Java array in order to build an {{ArrayData}}. The temporary array can be 
eliminated by using {{ArrayData.createArrayData}}.
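
A hand-written, conceptual sketch of the current pattern (hypothetical values; the 
real code is emitted by the code generator): a temporary array is filled first and 
only then wrapped into an {{ArrayData}}, and the proposal is to skip that 
intermediate array.

{code}
import org.apache.spark.sql.catalyst.util.GenericArrayData

// Illustration only: today's generated code fills a temporary Java array ...
val tmp = new Array[Any](3)
tmp(0) = 1
tmp(1) = 2
tmp(2) = 3
// ... and then wraps it into an ArrayData. The refactoring aims to have the
// generated code write into the ArrayData allocation directly instead.
val arrayData = new GenericArrayData(tmp)
{code}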



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2018-09-12 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611717#comment-16611717
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

In {{branch-2.4}}, we still see a performance degradation compared to running 
without whole-stage codegen:
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
SPARK-20184:         Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
------------------------------------------------------------------------------
codegen = T                2915 / 3204          0.0   2915001883.0        1.0X
codegen = F                1178 / 1368          0.0   1178020462.0        2.5X
{code}
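
A rough sketch of how such a comparison can be reproduced by toggling the 
whole-stage codegen flag (the query and the {{aggtable}} view are placeholders; the 
numbers above come from Spark's benchmark harness):

{code}
import org.apache.spark.sql.SparkSession

// Time one query with whole-stage codegen switched on or off; assumes an
// existing SparkSession `spark` and a registered view named "aggtable".
def timeQuery(spark: SparkSession, wholeStage: Boolean): Long = {
  spark.conf.set("spark.sql.codegen.wholeStage", wholeStage.toString)
  val start = System.nanoTime()
  spark.sql("SELECT sum(COUNTER_57), DIM_1 FROM aggtable GROUP BY DIM_1").collect()
  (System.nanoTime() - start) / 1000000  // elapsed milliseconds
}
{code}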
 

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>Priority: Major
>
> The performance of the following SQL gets much worse in Spark 2.x in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis I think this is related to the huge Java method (a method of 
> about a thousand lines) generated by codegen.
> If I set -XX:-DontCompileHugeMethods, the performance gets much better (about 7s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611633#comment-16611633
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

[~cloud_fan] The PR in this JIRA entry proposes two fixes:
 # Read data in a table cache directly from columnar storage
 # Generate code to build a table cache

We already implemented 1, but we have not implemented 2 yet. Let us address 2 
in the next release.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20184) performance regression for complex/long sql when enable whole stage codegen

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611502#comment-16611502
 ] 

Kazuaki Ishizaki commented on SPARK-20184:
--

Although I created another JIRA, 
https://issues.apache.org/jira/browse/SPARK-20479, there is no PR yet. Let me check 
the performance in the 2.4 branch.

> performance regression for complex/long sql when enable whole stage codegen
> ---
>
> Key: SPARK-20184
> URL: https://issues.apache.org/jira/browse/SPARK-20184
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.1.0
>Reporter: Fei Wang
>Priority: Major
>
> The performance of the following SQL gets much worse in Spark 2.x in contrast 
> with codegen off.
> SELECT
>sum(COUNTER_57) 
> ,sum(COUNTER_71) 
> ,sum(COUNTER_3)  
> ,sum(COUNTER_70) 
> ,sum(COUNTER_66) 
> ,sum(COUNTER_75) 
> ,sum(COUNTER_69) 
> ,sum(COUNTER_55) 
> ,sum(COUNTER_63) 
> ,sum(COUNTER_68) 
> ,sum(COUNTER_56) 
> ,sum(COUNTER_37) 
> ,sum(COUNTER_51) 
> ,sum(COUNTER_42) 
> ,sum(COUNTER_43) 
> ,sum(COUNTER_1)  
> ,sum(COUNTER_76) 
> ,sum(COUNTER_54) 
> ,sum(COUNTER_44) 
> ,sum(COUNTER_46) 
> ,DIM_1 
> ,DIM_2 
>   ,DIM_3
> FROM aggtable group by DIM_1, DIM_2, DIM_3 limit 100;
> The number of rows in aggtable is about 3500.
> whole stage codegen on (spark.sql.codegen.wholeStage = true): 40s
> whole stage codegen off (spark.sql.codegen.wholeStage = false): 6s
> After some analysis I think this is related to the huge Java method (a method of 
> about a thousand lines) generated by codegen.
> If I set -XX:-DontCompileHugeMethods, the performance gets much better (about 7s).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16196) Optimize in-memory scan performance using ColumnarBatches

2018-09-11 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-16196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16611494#comment-16611494
 ] 

Kazuaki Ishizaki commented on SPARK-16196:
--

I see. I will check this.

> Optimize in-memory scan performance using ColumnarBatches
> -
>
> Key: SPARK-16196
> URL: https://issues.apache.org/jira/browse/SPARK-16196
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
>
> A simple benchmark such as the following reveals inefficiencies in the 
> existing in-memory scan implementation:
> {code}
> spark.range(N)
>   .selectExpr("id", "floor(rand() * 1) as k")
>   .createOrReplaceTempView("test")
> val ds = spark.sql("select count(k), count(id) from test").cache()
> ds.collect()
> ds.collect()
> {code}
> There are many reasons why caching is slow. The biggest is that compression 
> takes a long time. The second is that there are a lot of virtual function 
> calls in this hot code path since the rows are processed using iterators. 
> Further, the rows are converted to and from ByteBuffers, which are slow to 
> read in general.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25388) checkEvaluation may miss incorrect nullable of DataType in the result

2018-09-09 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25388:


 Summary: checkEvaluation may miss incorrect nullable of DataType 
in the result
 Key: SPARK-25388
 URL: https://issues.apache.org/jira/browse/SPARK-25388
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.0.0
Reporter: Kazuaki Ishizaki


The current {{checkEvaluation}} may miss an incorrect nullable flag of the result 
{{DataType}} in {{checkEvaluationWithUnsafeProjection}}.
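
As a toy illustration (not the test helper itself) of the kind of mismatch a 
value-only comparison would miss, using only the public 
{{org.apache.spark.sql.types}} classes:

{code}
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// Two array types that can hold identical values but differ only in the
// declared element nullability; comparing result values alone cannot
// distinguish them, so a wrong containsNull flag would go unnoticed.
val declared = ArrayType(IntegerType, containsNull = false)
val produced = ArrayType(IntegerType, containsNull = true)
assert(declared != produced)
{code}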



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-05 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16604120#comment-16604120
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

While investigating this issue, I realized that the Java bytecode size of a method 
can affect performance. I guess that this issue is related to method inlining; 
however, I have not found the root cause yet.

[~mgaido] Would it be possible for you to submit a PR to fix this issue?

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method

2018-09-04 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25338:


 Summary: Several tests miss calling super.afterAll() in their 
afterAll() method
 Key: SPARK-25338
 URL: https://issues.apache.org/jira/browse/SPARK-25338
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


The following tests under {{external}} may not call {{super.afterAll()}} in 
their {{afterAll()}} method.

{code}
external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala
external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala
{code}
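
The usual fix pattern (a sketch, not the exact code of those suites) is to make 
sure the parent trait's {{afterAll()}} still runs even when the suite's own 
cleanup throws:

{code}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ExampleStreamSuite extends FunSuite with BeforeAndAfterAll {
  override def afterAll(): Unit = {
    try {
      // release suite-specific resources here (embedded brokers, temp dirs, ...)
    } finally {
      super.afterAll()  // always propagate so shared state is cleaned up
    }
  }
}
{code}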



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark JIRA tags clarification and management

2018-09-04 Thread Kazuaki Ishizaki
Of course, we would like to eliminate all of the following tags:

"flanky" or "flankytest"

Kazuaki Ishizaki



From:   Hyukjin Kwon 
To: dev 
Cc: Xiao Li , Wenchen Fan 
Date:   2018/09/04 14:20
Subject:Re: Spark JIRA tags clarification and management



Thanks, Reynold.

+Adding Xiao and Wenchen who I saw often used tags.

Would you have some tags you think we should document more?

On Tue, Sep 4, 2018 at 9:27 AM, Reynold Xin wrote:
The most common ones we do are:

releasenotes

correctness



On Mon, Sep 3, 2018 at 6:23 PM Hyukjin Kwon  wrote:
Thanks, Felix and Reynold. Would you guys mind if I ask this to anyone who 
uses the tags frequently? Frankly, I don't use the tags often ..

On Tue, Sep 4, 2018 at 2:04 AM, Felix Cheung wrote:
+1 good idea.
There are a few for organizing but some also are critical to the release 
process, like rel note. Would be good to clarify.


From: Reynold Xin 
Sent: Sunday, September 2, 2018 11:50 PM
To: Hyukjin Kwon
Cc: dev
Subject: Re: Spark JIRA tags clarification and management 
 
It would be great to document the common ones.

On Sun, Sep 2, 2018 at 11:49 PM Hyukjin Kwon  wrote:
Hi all, 

I lately noticed tags are often used to classify JIRAs. I was thinking we 
better explicitly document what tags are used and explain which tag means 
what. For instance, we documented "Contributing to JIRA Maintenance" at 
https://spark.apache.org/contributing.html before (thanks, Sean Owen) - 
this helps me a lot to managing JIRAs, and they are good standards for, at 
least, me to take an action.

It doesn't necessarily mean we should clarify everything but it might be 
good to document tags used often.

We can leave this for committer's scope as well, if that's preferred - I 
don't have a strong opinion on this. My point is, can we clarify this in 
the contributing guide so that we can reduce the maintenance cost?





[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602567#comment-16602567
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

I confirmed this performance difference even after adding a warm-up. Let me 
investigate further.

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602506#comment-16602506
 ] 

Kazuaki Ishizaki commented on SPARK-25317:
--

Let me run this on 2.3 and master.
One question: this benchmark does not have a warm-up loop. In other words, the 
measured time may include execution in the interpreter, too. Is this 
behavior intentional?
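
For example, the measurement could be preceded by a warm-up phase, roughly like 
this (a sketch only, reusing the {{hashUTF8String}} helper from the description):

{code}
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String

val str = UTF8String.fromString("b" * 10001)

// warm-up: give the JIT a chance to compile the hot method before timing
for (_ <- 0 until 1000) Murmur3_x86_32.hashUTF8String(str, 0)

val numIter = 10
val start = System.nanoTime
for (_ <- 0 until numIter) Murmur3_x86_32.hashUTF8String(str, 0)
println(s"avg ${(System.nanoTime - start) / 1000 / numIter} us per call")
{code}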

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating hash code for UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25317) MemoryBlock performance regression

2018-09-03 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25317:
-
Description: 
There is a performance regression when calculating hash code for UTF8String:
{code:java}
  test("hashing") {
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String
val hasher = new Murmur3_x86_32(0)
val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val start = System.nanoTime
for (i <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
  }
{code}
To run this test in 2.3, we need to add
{code:java}
public static int hashUTF8String(UTF8String str, int seed) {
return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
str.numBytes(), seed);
  }
{code}
to `Murmur3_x86_32`

In my laptop, the result for master vs 2.3 is: 120 us vs 40 us

  was:
There is a performance regression when calculating hash code for UTF8String:

{code}
  test("hashing") {
import org.apache.spark.unsafe.hash.Murmur3_x86_32
import org.apache.spark.unsafe.types.UTF8String
val hasher = new Murmur3_x86_32(0)
val str = UTF8String.fromString("b" * 10001)
val numIter = 10
val start = System.nanoTime
for (i <- 0 until numIter) {
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
  Murmur3_x86_32.hashUTF8String(str, 0)
}
val duration = (System.nanoTime() - start) / 1000 / numIter
println(s"duration $duration us")
  }
{code}

To run this test in 2.3, we need to add
{code}
public static int hashUTF8String(UTF8String str, int seed) {
return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
str.numBytes(), seed);
  }
{code}
to `Murmur3_x86_32`

In my laptop, the result for master vs 2.3 is: 120 us vs 40 us


> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> eThere is a performan

[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException

2018-09-02 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25310:
-
Description: 
Invoking the {{ArraysOverlap}} function with a non-nullable array type throws the 
following error in the code generation phase.

{code:java}
Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, 
Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an 
rvalue
java.util.concurrent.ExecutionException: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 56, 
Column 11: failed to compile: org.codehaus.commons.compiler.CompileException: 
File 'generated.java', Line 56, Column 11: Expression "isNull_0" is not an 
rvalue
at 
com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
at 
com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
at 
com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
at 
com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
at 
com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
at 
com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
at 
com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1305)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:48)
at 
org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:32)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1260)
{code}
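
A minimal sketch of how this expression might be exercised from a Spark 2.4 shell 
(assuming an active {{SparkSession}} named {{spark}}); the failure itself surfaces 
during code generation as shown above:

{code}
import org.apache.spark.sql.functions.{array, arrays_overlap, lit}

// Arrays built from literals give the non-nullable input case described above.
spark.range(1)
  .select(arrays_overlap(array(lit(1), lit(2), lit(3)),
                         array(lit(4), lit(5), lit(3))))
  .show()
{code}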

> ArraysOverlap may throw a CompileException
> --
>
> Key: SPARK-25310
> URL: https://issues.apache.org/jira/browse/SPARK-25310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Invoking the {{ArraysOverlap}} function with a non-nullable array type throws the 
> following error in the code generation phase.
> {code:java}
> Code generation of arrays_overlap([1,2,3], [4,5,3]) failed:
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: Expression "isNull_0" is not an rvalue
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 56, Column 11: Expression "isNull_0" is not an rvalue
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:293)
>   at 
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>   at 
> com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
>   at 
> com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2410)
>   at 
> com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2380)
>   at 
> com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
>   at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spar

[jira] [Updated] (SPARK-25310) ArraysOverlap may throw a CompileException

2018-09-02 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25310:
-
Summary: ArraysOverlap may throw a CompileException  (was: ArraysOverlap 
throws an Exception)

> ArraysOverlap may throw a CompileException
> --
>
> Key: SPARK-25310
> URL: https://issues.apache.org/jira/browse/SPARK-25310
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25310) ArraysOverlap throws an Exception

2018-09-02 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25310:


 Summary: ArraysOverlap throws an Exception
 Key: SPARK-25310
 URL: https://issues.apache.org/jira/browse/SPARK-25310
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25178) Directly ship the StructType objects of the keySchema / valueSchema for xxxHashMapGenerator

2018-08-22 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25178:
-
Summary: Directly ship the StructType objects of the keySchema / 
valueSchema for xxxHashMapGenerator  (was: Use dummy name for 
xxxHashMapGenerator key/value schema field)

> Directly ship the StructType objects of the keySchema / valueSchema for 
> xxxHashMapGenerator
> ---
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719
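
To make the proposal concrete, here is a small stand-alone sketch of the two ways the key/value 
schema could be built (the helper names are purely illustrative, not the generator's actual API): 
using the attribute's own name captures that string in the generated source, while a dummy name, 
or directly shipping the {{StructType}} object, avoids it.

{code:scala}
import org.apache.spark.sql.types._

// Illustrative helpers only, not Spark internals.
def schemaFromAttributeNames(keys: Seq[(String, DataType)]): StructType =      // old approach: embeds key.name
  StructType(keys.map { case (name, dt) => StructField(name, dt) })

def schemaWithDummyNames(keyTypes: Seq[DataType]): StructType =                // proposed: stable dummy names
  StructType(keyTypes.zipWithIndex.map { case (dt, i) => StructField(s"key_$i", dt) })
{code}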



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25178) Use dummy name for xxxHashMapGenerator key/value schema field

2018-08-21 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16587897#comment-16587897
 ] 

Kazuaki Ishizaki commented on SPARK-25178:
--

[~rednaxelafx] Thank you for opening a JIRA entry :)
[~smilegator] I can take this.

> Use dummy name for xxxHashMapGenerator key/value schema field
> -
>
> Key: SPARK-25178
> URL: https://issues.apache.org/jira/browse/SPARK-25178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> Following SPARK-18952 and SPARK-22273, this ticket proposes to change the 
> generated field name of the keySchema / valueSchema to a dummy name instead 
> of using {{key.name}}.
> In previous discussion from SPARK-18952's PR [1], it was already suggested 
> that the field names were being used, so it's not worth capturing the strings 
> as reference objects here. Josh suggested merging the original fix as-is due 
> to backportability / pickability concerns. Now that we're coming up to a new 
> release, this can be revisited.
> [1]: https://github.com/apache/spark/pull/16361#issuecomment-270253719



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25036:
-
Description: 
When compiling with sbt, the following errors occur:

There are -two- three types:
1. {{ExprValue.isNull}} is compared with an unexpected type.
2. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.
3. A {{discarding unmoored doc comment}} warning is reported.

The first one is the most serious since it may also lead to incorrect generated code in 
Spark 2.3.
{code:java}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
Boolean = (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn] (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
 match may not be exhaustive.
[error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
ArrayData()), (_, _)
[error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
 match may not be exhaustive.
[error] It would fail on the following inputs: NewFunctionSpec(_, None, 
Some(_)), NewFunctionSpec(_, Some(_), None)
[error] [warn] newFunction match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely always compare unequal
[error] [warn] if (eval.isNull != "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]  if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn] if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
 match may not be exhaustive.
[error] It would fail on the following input: Schema((x: 
org.apache.spark.sql.types.DataType forSome x not in 
org.apache.spark.sql.types.StructType), _)
[error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
[error] [warn] 
{code}


{code:java}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:410:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala:441:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
...
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala:440:
 discarding unmoored doc comment
[error] [warn] /**
[error] [warn] 
{code}

  was:
When compiling with sbt, the following errors occur:

There are two types:
1. {{ExprValue.isNull}} is compared with unexpected type.
1. {{match may not be exhaustive}} is detected at {{match}}

The first one is more serious since it may also generate incorrect code in 
Spark 2.3.

{code}

[jira] [Comment Edited] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575137#comment-16575137
 ] 

Kazuaki Ishizaki edited comment on SPARK-25036 at 8/9/18 5:05 PM:
--

Another type of compilation error was found. I added the log to the description.


was (Author: kiszk):
Another type of compilation error is found

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>    Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with an unexpected type.
> 2. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   

[jira] [Reopened] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reopened SPARK-25036:
--

Another type of compilation error was found.

> Scala 2.12 issues: Compilation error with sbt
> -
>
> Key: SPARK-25036
> URL: https://issues.apache.org/jira/browse/SPARK-25036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Kazuaki Ishizaki
>    Assignee: Kazuaki Ishizaki
>Priority: Major
> Fix For: 2.4.0
>
>
> When compiling with sbt, the following errors occur:
> There are two types:
> 1. {{ExprValue.isNull}} is compared with an unexpected type.
> 2. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.
> The first one is more serious since it may also generate incorrect code in 
> Spark 2.3.
> {code}
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
> Boolean = (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (NumericValueInterval(_, _), 
> _), (_, NumericValueInterval(_, _)), (_, _)
> [error] [warn] (r1, r2) match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
> ArrayData()), (_, _)
> [error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
>  match may not be exhaustive.
> [error] It would fail on the following inputs: NewFunctionSpec(_, None, 
> Some(_)), NewFunctionSpec(_, Some(_), None)
> [error] [warn] newFunction match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely always compare unequal
> [error] [warn] if (eval.isNull != "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]  if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn] if (eval.isNull == "true") {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
>  match may not be exhaustive.
> [error] It would fail on the following input: Schema((x: 
> org.apache.spark.sql.types.DataType forSome x not in 
> org.apache.spark.sql.types.StructType), _)
> [error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
> match {
> [error] [warn] 
> [error] [warn] 
> /home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
>  org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
> unrelated: they will most likely never compare equal
> [error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
> [error] [warn] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25059) Exception while executing an action on DataFrame that read Json

2018-08-09 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575129#comment-16575129
 ] 

Kazuaki Ishizaki commented on SPARK-25059:
--

Thank you for reporting the issue. Could you please try this with Spark 2.3?
The community extensively investigated and fixed this class of codegen issues 
in Spark 2.3.
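
For that check, a minimal sketch of the reported pattern (the paths are placeholders; an existing 
SparkSession named {{spark}} is assumed):

{code:scala}
// With schema inference over many JSON files the inferred schema can become very wide;
// on 2.2 any action then compiles an unsafe projection whose method can exceed the
// JVM's 64 KB limit, as in the stack trace below.
val paths: Seq[String] = Seq("/data/json/part-0001.json", "/data/json/part-0002.json")
val df = spark.read.option("header", true).option("inferSchema", true).json(paths: _*)
df.count()
{code}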

> Exception while executing an action on DataFrame that read Json
> ---
>
> Key: SPARK-25059
> URL: https://issues.apache.org/jira/browse/SPARK-25059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.2.0
> Environment: AWS EMR 5.8.0 
> Spark 2.2.0 
>  
>Reporter: Kunal Goswami
>Priority: Major
>  Labels: Spark-SQL
>
> When I try to read ~9600 Json files using
> {noformat}
> val test = spark.read.option("header", true).option("inferSchema", 
> true).json(paths: _*) {noformat}
>  
> Any action on the above created data frame results in: 
> {noformat}
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "apply2_1$(Lorg/apache/spark/sql/catalyst/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V"
>  of class "org.apache.spark.sql.catalyst.expressions.Generat[73/1850]
> pecificUnsafeProjection" grows beyond 64 KB
>   at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:949)
>   at org.codehaus.janino.CodeContext.write(CodeContext.java:839)
>   at org.codehaus.janino.UnitCompiler.writeOpcode(UnitCompiler.java:11081)
>   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4546)
>   at org.codehaus.janino.UnitCompiler.access$7500(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3774)
>   at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:3762)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3762)
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4933)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3180)
>   at org.codehaus.janino.UnitCompiler.access$5000(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3151)
>   at 
> org.codehaus.janino.UnitCompiler$9.visitMethodInvocation(UnitCompiler.java:3139)
>   at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4328)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112)
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2220)
>   at org.codehaus.janino.UnitCompiler.access$1800(UnitCompiler.java:206)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1378)
>   at 
> org.codehaus.janino.UnitCompiler$6.visitIfStatement(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$IfStatement.accept(Java.java:2621)
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370)
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450)
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1436)
>   at org.codehaus.janino.UnitCompiler.access$1600(UnitCompiler.java:206)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1376)
>   at org.codehaus.janino.UnitCompiler$6.visitBlock(UnitCompiler.java:1370)
>   at org.codehaus.janino.Java$Block.accept(Java.java:2471)
>   at org.codehaus.janino.

[jira] [Updated] (SPARK-25041) genjavadoc-plugin_0.10 is not found with sbt in scala-2.12

2018-08-07 Thread Kazuaki Ishizaki (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-25041:
-
Summary: genjavadoc-plugin_0.10 is not found with sbt in scala-2.12  (was: 
genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12)

> genjavadoc-plugin_0.10 is not found with sbt in scala-2.12
> --
>
> Key: SPARK-25041
> URL: https://issues.apache.org/jira/browse/SPARK-25041
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> When the master is built with sbt in scala-2.12, the following error occurs:
> {code}
> [warn]module not found: 
> com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10
> [warn]  public: tried
> [warn]   
> https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
> [warn]  Maven2 Local: tried
> [warn]   
> file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
> [warn]  local: tried
> [warn]   
> /gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml
> [info] Resolving jline#jline;2.14.3 ...
> [warn]::
> [warn]::  UNRESOLVED DEPENDENCIES ::
> [warn]::
> [warn]:: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not 
> found
> [warn]::
> [warn] 
> [warn]Note: Unresolved dependencies path:
> [warn]com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 
> (/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118)
> [warn]  +- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT
> sbt.ResolveException: unresolved dependency: 
> com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found
>   at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320)
>   at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
>   at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
>   at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
>   at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
>   at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
>   at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
>   at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
>   at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
>   at 
> xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
>   at 
> xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
>   at xsbt.boot.Using$.withResource(Using.scala:10)
>   at xsbt.boot.Using$.apply(Using.scala:9)
>   at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
>   at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
>   at xsbt.boot.Locks$.apply0(Locks.scala:31)
>   at xsbt.boot.Locks$.apply(Locks.scala:28)
>   at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
>   at sbt.IvySbt.withIvy(Ivy.scala:128)
>   at sbt.IvySbt.withIvy(Ivy.scala:125)
>   at sbt.IvySbt$Module.withModule(Ivy.scala:156)
>   at sbt.IvyActions$.updateEither(IvyActions.scala:168)
>   at 
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555)
>   at 
> sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551)
>   at 
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586)
>   at 
> sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584)
>   at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
>   at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589)
>   at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583)
>   at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
>   at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606)
>   at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533)
>   at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485)
>   at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
>   at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
>   at sbt.std.Transform$$anon$4.work(System.scala:63)
>   at 
> sbt.Execute$$anonfun

[jira] [Created] (SPARK-25041) genjavadoc-plugin_2.12.6 is not found with sbt in scala-2.12

2018-08-07 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25041:


 Summary: genjavadoc-plugin_2.12.6 is not found with sbt in 
scala-2.12
 Key: SPARK-25041
 URL: https://issues.apache.org/jira/browse/SPARK-25041
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


When the master is built with sbt in scala-2.12, the following error occurs:

{code}
[warn]  module not found: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10
[warn]  public: tried
[warn]   
https://repo1.maven.org/maven2/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
[warn]  Maven2 Local: tried
[warn]   
file:/gsa/jpngsa/home/i/s/ishizaki/.m2/repository/com/typesafe/genjavadoc/genjavadoc-plugin_2.12.6/0.10/genjavadoc-plugin_2.12.6-0.10.pom
[warn]  local: tried
[warn]   
/gsa/jpngsa/home/i/s/ishizaki/.ivy2/local/com.typesafe.genjavadoc/genjavadoc-plugin_2.12.6/0.10/ivys/ivy.xml
[info] Resolving jline#jline;2.14.3 ...
[warn]  ::
[warn]  ::  UNRESOLVED DEPENDENCIES ::
[warn]  ::
[warn]  :: com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found
[warn]  ::
[warn] 
[warn]  Note: Unresolved dependencies path:
[warn]  com.typesafe.genjavadoc:genjavadoc-plugin_2.12.6:0.10 
(/home/ishizaki/Spark/PR/scala212/spark/project/SparkBuild.scala#L118)
[warn]+- org.apache.spark:spark-tags_2.12:2.4.0-SNAPSHOT
sbt.ResolveException: unresolved dependency: 
com.typesafe.genjavadoc#genjavadoc-plugin_2.12.6;0.10: not found
at sbt.IvyActions$.sbt$IvyActions$$resolve(IvyActions.scala:320)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:191)
at sbt.IvyActions$$anonfun$updateEither$1.apply(IvyActions.scala:168)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$Module$$anonfun$withModule$1.apply(Ivy.scala:156)
at sbt.IvySbt$$anonfun$withIvy$1.apply(Ivy.scala:133)
at sbt.IvySbt.sbt$IvySbt$$action$1(Ivy.scala:57)
at sbt.IvySbt$$anon$4.call(Ivy.scala:65)
at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:93)
at 
xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:78)
at 
xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:97)
at xsbt.boot.Using$.withResource(Using.scala:10)
at xsbt.boot.Using$.apply(Using.scala:9)
at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:58)
at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:48)
at xsbt.boot.Locks$.apply0(Locks.scala:31)
at xsbt.boot.Locks$.apply(Locks.scala:28)
at sbt.IvySbt.withDefaultLogger(Ivy.scala:65)
at sbt.IvySbt.withIvy(Ivy.scala:128)
at sbt.IvySbt.withIvy(Ivy.scala:125)
at sbt.IvySbt$Module.withModule(Ivy.scala:156)
at sbt.IvyActions$.updateEither(IvyActions.scala:168)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1555)
at 
sbt.Classpaths$$anonfun$sbt$Classpaths$$work$1$1.apply(Defaults.scala:1551)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1586)
at 
sbt.Classpaths$$anonfun$doWork$1$1$$anonfun$122.apply(Defaults.scala:1584)
at sbt.Tracked$$anonfun$lastOutput$1.apply(Tracked.scala:37)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1589)
at sbt.Classpaths$$anonfun$doWork$1$1.apply(Defaults.scala:1583)
at sbt.Tracked$$anonfun$inputChanged$1.apply(Tracked.scala:60)
at sbt.Classpaths$.cachedUpdate(Defaults.scala:1606)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1533)
at sbt.Classpaths$$anonfun$updateTask$1.apply(Defaults.scala:1485)
at scala.Function1$$anonfun$compose$1.apply(Function1.scala:47)
at sbt.$tilde$greater$$anonfun$$u2219$1.apply(TypeFunctions.scala:40)
at sbt.std.Transform$$anon$4.work(System.scala:63)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at 
sbt.Execute$$anonfun$submit$1$$anonfun$apply$1.apply(Execute.scala:228)
at sbt.ErrorHandling$.wideConvert(ErrorHandling.scala:17)
at sbt.Execute.work(Execute.scala:237)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at sbt.Execute$$anonfun$submit$1.apply(Execute.scala:228)
at 
sbt.ConcurrentRestrictions$$anon$4$$anonfun$1.apply(ConcurrentRestrictions.scala:159)
at sbt.CompletionService$$anon$2.call(CompletionService.scala:28)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511
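
For context, the dependency that fails to resolve is declared as a compiler plugin with a full 
cross-version, so sbt appends the complete Scala version ({{2.12.6}}) to the artifact name and 
therefore needs {{genjavadoc-plugin_2.12.6}} to exist for the chosen genjavadoc release. A 
simplified sketch of that declaration shape (not the verbatim SparkBuild.scala line):

{code:scala}
// "0.10" is the genjavadoc version that is not published for Scala 2.12.6,
// which produces the "not found" error above.
libraryDependencies += compilerPlugin(
  "com.typesafe.genjavadoc" % "genjavadoc-plugin" % "0.10" cross CrossVersion.full)
{code}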

[jira] [Created] (SPARK-25036) Scala 2.12 issues: Compilation error with sbt

2018-08-06 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25036:


 Summary: Scala 2.12 issues: Compilation error with sbt
 Key: SPARK-25036
 URL: https://issues.apache.org/jira/browse/SPARK-25036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.4.0
Reporter: Kazuaki Ishizaki


When compiling with sbt, the following errors occur:

There are two types:
1. {{ExprValue.isNull}} is compared with an unexpected type.
2. A {{match may not be exhaustive}} warning is reported at a {{match}} expression.

The first one is more serious since it may also lead to incorrect generated code in 
Spark 2.3.

{code}
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:63:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn]   def isIntersected(r1: ValueInterval, r2: ValueInterval): 
Boolean = (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ValueInterval.scala:79:
 match may not be exhaustive.
[error] It would fail on the following inputs: (NumericValueInterval(_, _), _), 
(_, NumericValueInterval(_, _)), (_, _)
[error] [warn] (r1, r2) match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala:67:
 match may not be exhaustive.
[error] It would fail on the following inputs: (ArrayType(_, _), _), (_, 
ArrayData()), (_, _)
[error] [warn] (endpointsExpression.dataType, endpointsExpression.eval()) 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala:470:
 match may not be exhaustive.
[error] It would fail on the following inputs: NewFunctionSpec(_, None, 
Some(_)), NewFunctionSpec(_, Some(_), None)
[error] [warn] newFunction match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:94:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely always compare unequal
[error] [warn] if (eval.isNull != "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:126:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]  if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:133:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn] if (eval.isNull == "true") {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala:709:
 match may not be exhaustive.
[error] It would fail on the following input: Schema((x: 
org.apache.spark.sql.types.DataType forSome x not in 
org.apache.spark.sql.types.StructType), _)
[error] [warn]   def attributesFor[T: TypeTag]: Seq[Attribute] = schemaFor[T] 
match {
[error] [warn] 
[error] [warn] 
/home/ishizaki/Spark/PR/scala212/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala:90:
 org.apache.spark.sql.catalyst.expressions.codegen.ExprValue and String are 
unrelated: they will most likely never compare equal
[error] [warn]   if (inputs.map(_.isNull).forall(_ == "false")) {
[error] [warn] 
{code}
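
To make the first (serious) category concrete, here is a stand-alone illustration; the case class 
below is only a stand-in for the codegen {{ExprValue}} and is not Spark code. The comparison still 
compiles under Scala 2.12 (with the lint warning quoted above), but it can never be true, so the 
guarded branch silently misbehaves:

{code:scala}
final case class ExprValue(code: String) // stand-in for the real codegen ExprValue

object IsNullComparison {
  def main(args: Array[String]): Unit = {
    val isNull = ExprValue("true")
    // Compiles, but ExprValue and String are unrelated types, so this is always false:
    println(isNull == "true")      // false
    // The intended check has to inspect the underlying code string (or a typed literal):
    println(isNull.code == "true") // true
  }
}
{code}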



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-06 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16570440#comment-16570440
 ] 

Kazuaki Ishizaki commented on SPARK-25029:
--

[~srowen][~skonto] Thank you for your investigations while I am setting up a 
scala-2.12 environment (I still get compilation errors with scala-2.12 using 
sbt).

I understand the situation now. It is related to {{default}} methods: we may have to 
update the method lookup algorithm in Janino to take {{default}} methods into account.



> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. The first is that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that only a handful of tests fail for this 
> reason, it's unlikely to be a systemic problem.
>  
> The other error is curiouser. Janino fails to compile generated code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:4070)
> at org.codehaus

[jira] [Commented] (SPARK-25029) Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods ..." errors

2018-08-05 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16569621#comment-16569621
 ] 

Kazuaki Ishizaki commented on SPARK-25029:
--

[~srowen] I see. The following parts of the generated code call the method. I will 
look into it. My first feeling is that the problem may be in the Scala collection 
library or in the Catalyst Java code generator.

{code}
...
/* 146 */ final int length_1 = MapObjects_loopValue140.size();
...
/* 315 */ final int length_0 = MapObjects_loopValue140.size();
...
{code}
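
For reference, a minimal sketch that reaches the failing path of the {{encode/decode for seq of 
string}} test quoted below (assuming a Scala 2.12 build of the 2.4.0 snapshot and an existing 
SparkSession named {{spark}}):

{code:scala}
import spark.implicits._

// The deserializer generated for a Seq[String] element calls .size() on a Scala
// collection, which is where Janino's method lookup reports the duplicate method.
val ds = spark.createDataset(Seq(List("abc", "xyz")))
ds.collect()
{code}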


> Scala 2.12 issues: TaskNotSerializable and Janino "Two non-abstract methods 
> ..." errors
> ---
>
> Key: SPARK-25029
> URL: https://issues.apache.org/jira/browse/SPARK-25029
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> We actually still have some test failures in the Scala 2.12 build. There seem 
> to be two types. The first is that some tests fail with "TaskNotSerializable" 
> because some code construct now captures a reference to scalatest's 
> AssertionHelper. Example:
> {code:java}
> - LegacyAccumulatorWrapper with AccumulatorParam that has no equals/hashCode 
> *** FAILED *** java.io.NotSerializableException: 
> org.scalatest.Assertions$AssertionsHelper Serialization stack: - object not 
> serializable (class: org.scalatest.Assertions$AssertionsHelper, value: 
> org.scalatest.Assertions$AssertionsHelper@3bc5fc8f){code}
> These seem generally easy to fix by tweaking the test code. It's not clear if 
> something about closure cleaning in 2.12 could be improved to detect this 
> situation automatically; given that only a handful of tests fail for this 
> reason, it's unlikely to be a systemic problem.
>  
> The other error is curiouser. Janino fails to compile generated code in many 
> cases with errors like:
> {code:java}
> - encode/decode for seq of string: List(abc, xyz) *** FAILED ***
> java.lang.RuntimeException: Error while encoding: 
> org.codehaus.janino.InternalCompilerException: failed to compile: 
> org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
> Two non-abstract methods "public int scala.collection.TraversableOnce.size()" 
> have the same parameter types, declaring type and return type{code}
>  
> I include the full generated code that failed in one case below. There is no 
> {{size()}} in the generated code. It's got to be down to some difference in 
> Scala 2.12, potentially even a Janino problem.
>  
> {code:java}
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Two non-abstract methods "public int 
> scala.collection.TraversableOnce.size()" have the same parameter types, 
> declaring type and return type
> at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
> at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
> at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
> at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
> at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
> at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1342)
> ... 30 more
> Caused by: org.codehaus.janino.InternalCompilerException: Two non-abstract 
> methods "public int scala.collection.TraversableOnce.size()" have the same 
> parameter types, declaring type and return type
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:9112)
> at 
> org.codehaus.janino.UnitCompiler.findMostSpecificIInvocable(UnitCompiler.java:)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8770)
> at org.codehaus.janino.UnitCompiler.findIMethod(UnitCompiler.java:8672)
> at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4737)
> at org.codehaus.janino.UnitCompiler.access$8300(UnitCompiler.java:212)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4097)
> at 
> org.codehaus.janino.UnitCompiler$12.visitMethodInvocation(UnitCompiler.java:4070)
> at org.codehaus.janino.Java$MethodInvocation.accept(Java.java:4902)
> at org.codehaus.janino.UnitCompiler.compi

[jira] [Created] (SPARK-24962) refactor CodeGenerator.createUnsafeArray

2018-07-29 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24962:


 Summary: refactor CodeGenerator.createUnsafeArray
 Key: SPARK-24962
 URL: https://issues.apache.org/jira/browse/SPARK-24962
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


{{CodeGenerator.createUnsafeArray()}} generates code for allocating 
{{UnsafeArrayData}}. This method could be generalized to generate code for allocating 
either {{UnsafeArrayData}} or {{GenericArrayData}}.
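
A rough sketch of the intended generalization (the helper name and the emitted snippets are 
illustrative only, not Spark's internal API): one helper decides which array implementation the 
generated Java allocates.

{code:scala}
// Illustrative only: returns the Java source fragment that the generated code would
// use to allocate the array, either the unsafe or the generic implementation.
def genArrayAllocation(arrayName: String, numElements: String, useUnsafe: Boolean): String =
  if (useUnsafe) {
    s"UnsafeArrayData $arrayName = new UnsafeArrayData(); // real codegen also sizes and points to a backing buffer"
  } else {
    s"ArrayData $arrayName = new GenericArrayData(new Object[$numElements]);"
  }
{code}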



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16560600#comment-16560600
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

[~ericfchang] Thank you very much for your suggestion. As a first step, I 
created [a PR|https://github.com/apache/spark/pull/21905] to upgrade Maven.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24956) Upgrade maven from 3.3.9 to 3.5.4

2018-07-27 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24956:


 Summary: Upgrade maven from 3.3.9 to 3.5.4
 Key: SPARK-24956
 URL: https://issues.apache.org/jira/browse/SPARK-24956
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


Maven 3.3.9 is pretty old. It would be good to upgrade it to the latest release.

As suggested in SPARK-24895, the current Maven version runs into problems with some 
plugins.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559987#comment-16559987
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

I see. Thank you very much. As a first step, I will make a PR to upgrade 
Maven.

BTW, I am not sure how to confirm that the Maven central repo works well for now.

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24895) Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559974#comment-16559974
 ] 

Kazuaki Ishizaki commented on SPARK-24895:
--

[~yhuai] Thank you.

BTW, how can I re-enable spotbugs without this problem? Do you have any 
suggestions? cc: [~hyukjin.kwon]

> Spark 2.4.0 Snapshot artifacts has broken metadata due to mismatched filenames
> --
>
> Key: SPARK-24895
> URL: https://issues.apache.org/jira/browse/SPARK-24895
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Eric Chang
>Assignee: Eric Chang
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.4.0 has Maven build errors because artifacts uploaded to the Apache Maven 
> repo have mismatched filenames:
> {noformat}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce 
> (enforce-banned-dependencies) on project spark_2.4: Execution 
> enforce-banned-dependencies of goal 
> org.apache.maven.plugins:maven-enforcer-plugin:1.4.1:enforce failed: 
> org.apache.maven.shared.dependency.graph.DependencyGraphBuilderException: 
> Could not resolve following dependencies: 
> [org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT (compile), 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT (compile)]: Could not 
> resolve dependencies for project com.databricks:spark_2.4:pom:1: The 
> following artifacts could not be resolved: 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-network-shuffle_2.11:jar:2.4.0-SNAPSHOT, 
> org.apache.spark:spark-sketch_2.11:jar:2.4.0-SNAPSHOT: Could not find 
> artifact 
> org.apache.spark:spark-mllib-local_2.11:jar:2.4.0-20180723.232411-177 in 
> apache-snapshots ([https://repository.apache.org/snapshots/]) -> [Help 1]
> {noformat}
>  
> If you check the artifact metadata you will see the pom and jar files are 
> 2.4.0-20180723.232411-177 instead of 2.4.0-20180723.232410-177:
> {code:xml}
> <metadata>
>   <groupId>org.apache.spark</groupId>
>   <artifactId>spark-mllib-local_2.11</artifactId>
>   <version>2.4.0-SNAPSHOT</version>
>   <versioning>
>     <snapshot>
>       <timestamp>20180723.232411</timestamp>
>       <buildNumber>177</buildNumber>
>     </snapshot>
>     <lastUpdated>20180723232411</lastUpdated>
>     <snapshotVersions>
>       <snapshotVersion>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <extension>pom</extension>
>         <value>2.4.0-20180723.232411-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>tests</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>       <snapshotVersion>
>         <classifier>test-sources</classifier>
>         <extension>jar</extension>
>         <value>2.4.0-20180723.232410-177</value>
>         <updated>20180723232411</updated>
>       </snapshotVersion>
>     </snapshotVersions>
>   </versioning>
> </metadata>
> {code}
>  
> This behavior is very similar to this issue: 
> https://issues.apache.org/jira/browse/MDEPLOY-221
> Since 2.3.0 snapshots work with the same maven 3.3.9 version and maven deploy 
> 2.8.2 plugin, it is highly possible that we introduced a new plugin that 
> causes this. 
> The most recent addition is the spot-bugs plugin, which is known to have 
> incompatibilities with other plugins: 
> [https://github.com/spotbugs/spotbugs-maven-plugin/issues/21]
> We may want to try building without it to sanity check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24925) input bytesRead metrics fluctuate from time to time

2018-07-27 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16559972#comment-16559972
 ] 

Kazuaki Ishizaki commented on SPARK-24925:
--

Do we need a new test case, or is there an existing test case that covers this PR?

> input bytesRead metrics fluctuate from time to time
> ---
>
> Key: SPARK-24925
> URL: https://issues.apache.org/jira/browse/SPARK-24925
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: yucai
>Priority: Major
> Attachments: bytesRead.gif
>
>
> The input bytesRead metric fluctuates from time to time; it is worse when 
> pushdown is enabled.
> Query
> {code:java}
> CREATE TABLE dev AS
> SELECT
> ...
> FROM lstg_item cold, lstg_item_vrtn v
> WHERE cold.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> AND v.auct_end_dt = CAST(SUBSTR('2018-03-18 00:00:00',1,10) AS DATE)
> ...
> {code}
> Issue
> See attached bytesRead.gif, input bytesRead shows 48GB, 52GB, 51GB, 50GB, 
> 54GB, 53GB ... 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24841) Memory leak in converting spark dataframe to pandas dataframe

2018-07-22 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16552231#comment-16552231
 ] 

Kazuaki Ishizaki commented on SPARK-24841:
--

Thank you for reporting an issue with heap profiling. Would it be possible to 
post a standalone program that can reproduce this problem?

> Memory leak in converting spark dataframe to pandas dataframe
> -
>
> Key: SPARK-24841
> URL: https://issues.apache.org/jira/browse/SPARK-24841
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: Running PySpark in standalone mode
>Reporter: Piyush Seth
>Priority: Minor
>
> I am running a continuous running application using PySpark. In one of the 
> operations I have to convert PySpark data frame to Pandas data frame using 
> toPandas API  on pyspark driver. After running for a while I am getting 
> "java.lang.OutOfMemoryError: GC overhead limit exceeded" error.
> I tried running this in a loop and could see that the heap memory is 
> increasing continuously. When I ran jmap for the first time I had the 
> following top rows:
>  num #instances #bytes  class name
> --
>    1:  1757  411477568  [J
> {color:#FF}   *2:    124188  266323152  [C*{color}
>    3:    167219   46821320  org.apache.spark.status.TaskDataWrapper
>    4: 69683   27159536  [B
>    5:    359278    8622672  java.lang.Long
>    6:    221808    7097856  
> java.util.concurrent.ConcurrentHashMap$Node
>    7:    283771    6810504  scala.collection.immutable.$colon$colon
> After running several iterations I had the following
>  num #instances #bytes  class name
> --
> {color:#FF}   *1:    110760 3439887928  [C*{color}
>    2:   698  411429088  [J
>    3:    238096   6880  org.apache.spark.status.TaskDataWrapper
>    4: 68819   24050520  [B
>    5:    498308   11959392  java.lang.Long
>    6:    292741    9367712  
> java.util.concurrent.ConcurrentHashMap$Node
>    7:    282878    6789072  scala.collection.immutable.$colon$colon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24754) Minhash integer overflow

2018-07-07 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535657#comment-16535657
 ] 

Kazuaki Ishizaki commented on SPARK-24754:
--

In the test cases, we would appreciate it if you could compare the values against 
those produced by other implementations.

> Minhash integer overflow
> 
>
> Key: SPARK-24754
> URL: https://issues.apache.org/jira/browse/SPARK-24754
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Jiayuan Ma
>Priority: Minor
>
> Hash computation in MinHashLSHModel has integer overflow bug.
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [SPARK ML] Minhash integer overflow

2018-07-07 Thread Kazuaki Ishizaki
Of course, the hash value can simply be negative. My point was that it should be 
the result of a computation without overflow.

When I checked another implementation, it performs computations with int.
https://github.com/ALShum/MinHashLSH/blob/master/LSH.java#L89

Copying @jiayuan: did you compare the hash value generated by Spark 
with the one generated by other implementations?

Regards,
Kazuaki Ishizaki



From:   Sean Owen 
To: jiayuanm 
Cc: dev@spark.apache.org
Date:   2018/07/07 15:46
Subject:Re: [SPARK ML] Minhash integer overflow



I think it probably still does its job; the hash value can just be 
negative. It is likely to be very slightly biased, though. Because the 
intent doesn't seem to be to allow the overflow, it's worth changing to use 
longs for the calculation. 

On Fri, Jul 6, 2018, 8:36 PM jiayuanm  wrote:
Hi everyone,

I was playing around with LSH/Minhash module from spark ml module. I 
noticed
that hash computation is done with Int (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69
).
Since "a" and "b" are from a uniform distribution of [1,
MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
it's likely for the multiplication to cause Int overflow with a large 
sparse
input vector.

I wonder if this is a bug or intended. If it's a bug, one way to fix it is
to compute hashes with Long and insert a couple of mod
MinHashLSH.HASH_PRIME. Because MinHashLSH.HASH_PRIME is chosen to be 
smaller
than sqrt(2^63 - 1), this won't overflow 64-bit integer. Another option is
to use BigInteger.
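
For illustration, here is a minimal, self-contained sketch of the Long-based
fix (the hash form and the prime value are stand-ins based on this thread,
not the actual MinHashLSH internals):

object MinHashOverflowSketch {
  // stand-in for MinHashLSH.HASH_PRIME: a large prime below sqrt(2^63 - 1)
  val HashPrime = 2038074743L

  // Int arithmetic: (1 + i) * a can exceed Int.MaxValue, so the result may be negative
  def hashInt(a: Int, b: Int, indices: Array[Int]): Int =
    indices.map(i => ((1 + i) * a + b) % HashPrime.toInt).min

  // Long arithmetic: promote before multiplying, then reduce mod the prime
  def hashLong(a: Int, b: Int, indices: Array[Int]): Long =
    indices.map(i => ((1L + i) * a + b) % HashPrime).min

  def main(args: Array[String]): Unit = {
    val indices = Array(7, 1000000000)        // non-zero indices of a large sparse vector
    println(hashInt(2000000000, 1, indices))  // overflows and can print a negative value
    println(hashLong(2000000000, 1, indices)) // always lands in [0, HashPrime)
  }
}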

Let me know what you think.

Thanks,
Jiayuan





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org





Re: [SPARK ML] Minhash integer overflow

2018-07-06 Thread Kazuaki Ishizaki
Thank you for reporting this issue. I think this is a bug regarding 
integer overflow. IMHO, it would be good to compute hashes with Long.

Would it be possible to create a JIRA entry?  Do you want to submit a pull 
request, too?

Regards,
Kazuaki Ishizaki



From:   jiayuanm 
To: dev@spark.apache.org
Date:   2018/07/07 10:36
Subject:[SPARK ML] Minhash integer overflow



Hi everyone,

I was playing around with LSH/Minhash module from spark ml module. I 
noticed
that hash computation is done with Int (see
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/MinHashLSH.scala#L69
).
Since "a" and "b" are from a uniform distribution of [1,
MinHashLSH.HASH_PRIME] and MinHashLSH.HASH_PRIME is close to Int.MaxValue,
it's likely for the multiplication to cause Int overflow with a large 
sparse
input vector.

I wonder if this is a bug or intended. If it's a bug, one way to fix it is
to compute hashes with Long and insert a couple of mod
MinHashLSH.HASH_PRIME. Because MinHashLSH.HASH_PRIME is chosen to be 
smaller
than sqrt(2^63 - 1), this won't overflow 64-bit integer. Another option is
to use BigInteger.

Let me know what you think.

Thanks,
Jiayuan





--
Sent from: 
http://apache-spark-developers-list.1001551.n3.nabble.com/


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org







[jira] [Commented] (SPARK-24579) SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks

2018-07-02 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16530640#comment-16530640
 ] 

Kazuaki Ishizaki commented on SPARK-24579:
--

I cannot see comments on the doc, either.

> SPIP: Standardize Optimized Data Exchange between Spark and DL/AI frameworks
> 
>
> Key: SPARK-24579
> URL: https://issues.apache.org/jira/browse/SPARK-24579
> Project: Spark
>  Issue Type: Epic
>  Components: ML, PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen
> Attachments: [SPARK-24579] SPIP_ Standardize Optimized Data Exchange 
> between Apache Spark and DL%2FAI Frameworks .pdf
>
>
> (see attached SPIP pdf for more details)
> At the crossroads of big data and AI, we see both the success of Apache Spark 
> as a unified
> analytics engine and the rise of AI frameworks like TensorFlow and Apache 
> MXNet (incubating).
> Both big data and AI are indispensable components to drive business 
> innovation and there have
> been multiple attempts from both communities to bring them together.
> We saw efforts from AI community to implement data solutions for AI 
> frameworks like tf.data and tf.Transform. However, with 50+ data sources and 
> built-in SQL, DataFrames, and Streaming features, Spark remains the community 
> choice for big data. This is why we saw many efforts to integrate DL/AI 
> frameworks with Spark to leverage its power, for example, TFRecords data 
> source for Spark, TensorFlowOnSpark, TensorFrames, etc. As part of Project 
> Hydrogen, this SPIP takes a different angle at Spark + AI unification.
> None of the integrations are possible without exchanging data between Spark 
> and external DL/AI frameworks. And the performance matters. However, there 
> doesn’t exist a standard way to exchange data and hence implementation and 
> performance optimization fall into pieces. For example, TensorFlowOnSpark 
> uses Hadoop InputFormat/OutputFormat for TensorFlow’s TFRecords to load and 
> save data and pass the RDD records to TensorFlow in Python. And TensorFrames 
> converts Spark DataFrames Rows to/from TensorFlow Tensors using TensorFlow’s 
> Java API. How can we reduce the complexity?
> The proposal here is to standardize the data exchange interface (or format) 
> between Spark and DL/AI frameworks and optimize data conversion from/to this 
> interface.  So DL/AI frameworks can leverage Spark to load data virtually 
> from anywhere without spending extra effort building complex data solutions, 
> like reading features from a production data warehouse or streaming model 
> inference. Spark users can use DL/AI frameworks without learning specific 
> data APIs implemented there. And developers from both sides can work on 
> performance optimizations independently given the interface itself doesn’t 
> introduce big overhead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
If it is difficult to create a small standalone program, another 
approach is to attach everything (i.e. configuration, data, program, 
console output, log, history server data, etc.).
As for the log, the community would recommend the info-level log with 
"spark.sql.codegen.logging.maxLines=2147483647". The log has to include 
all of the generated Java methods.

The community may take more time to address this problem than it would 
with a small program.

Best Regards,
Kazuaki Ishizaki



From:   Aakash Basu 
To:     Kazuaki Ishizaki 
Cc: vaquar khan , Eyal Zituny 
, user 
Date:   2018/06/21 01:29
Subject:Re: [Help] Codegen Stage grows beyond 64 KB



Hi Kazuaki,

It would be really difficult to produce a small stand-alone program to reproduce this 
problem because I'm running a big pipeline of feature engineering 
where I derive a lot of variables based on the present ones, which 
explodes the size of the table by many folds. Then, when I do any kind of 
join, this error shoots up.

I tried with wholeStage.codegen=false, but that errors out the entire 
program rather than running it with less-optimized code.

Any suggestion on how I can proceed towards a JIRA entry for this?

Thanks,
Aakash.

On Wed, Jun 20, 2018 at 9:41 PM, Kazuaki Ishizaki  
wrote:
Spark 2.3 tries to split large generated Java methods into smaller methods 
where possible. However, as this report shows, there may still be places that 
generate a large method.

Would it be possible to create a JIRA entry with a small standalone 
program that can reproduce this problem? It would be very helpful for the 
community in addressing this problem.

Best regards,
Kazuaki Ishizaki



From:vaquar khan 
To:Eyal Zituny 
Cc:Aakash Basu , user <
user@spark.apache.org>
Date:2018/06/18 01:57
Subject:Re: [Help] Codegen Stage grows beyond 64 KB




Totally agreed with Eyal.

The problem is that when the Java programs Catalyst generates from DataFrame 
and Dataset programs are compiled into Java bytecode, the bytecode of a 
single method must not reach 64 KB; exceeding this limit of the Java class 
file format is what raises the exception.

To avoid this exception, Spark's solution is to have Catalyst split any 
generated method whose bytecode is likely to exceed 64 KB into multiple 
smaller methods.

Use persist() or some other logical separation point in the pipeline.
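
As a rough illustration of that idea, a minimal sketch (the data and column
names are made up; persist() simply gives the planner a materialization
point so the downstream join starts from a fresh stage):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("split-pipeline").getOrCreate()
import spark.implicits._

// stand-in for a wide, feature-engineered DataFrame
val features = (1 to 1000).toDF("id")
  .withColumn("f1", $"id" * 2)
  .withColumn("f2", $"id" + 7)
  .persist()       // cut the pipeline here
features.count()   // force materialization before the join

val labels = (1 to 1000).toDF("id").withColumn("label", lit(1))
val joined = features.join(labels, "id")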

Regards,
Vaquar khan 

On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny  
wrote:
Hi Akash,
Such errors might appear in large Spark pipelines; the root cause is a 
64 KB JVM limitation on the bytecode size of a single method.
The reason your job isn't failing in the end is Spark's fallback: if codegen 
fails, the Spark compiler will try to create the flow 
without codegen (less optimized).
If you do not want to see this error, you can either disable codegen 
using the flag spark.sql.codegen.wholeStage = "false", 
or you can try to split your complex pipeline into several Spark flows if 
possible
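
(For reference, a minimal sketch of the two ways that flag is usually set;
the session built here is only an example:)

import org.apache.spark.sql.SparkSession

// set it when the session is created ...
val spark = SparkSession.builder()
  .appName("no-wholestage-codegen")
  .config("spark.sql.codegen.wholeStage", "false")
  .getOrCreate()

// ... or flip it on an existing session
spark.conf.set("spark.sql.codegen.wholeStage", "false")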

hope that helps

Eyal

On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu  
wrote:
Hi,

I already went through it; that's one use case. I have a complex and very 
big pipeline of multiple jobs under one Spark session. I don't see how 
to solve this, as it is happening with the Logistic Regression and Random 
Forest models, which I'm just using from the Spark ML package rather than 
doing anything by myself.

Thanks,
Aakash.

On Sun 17 Jun, 2018, 8:21 AM vaquar khan,  wrote:
Hi Akash,

Please check stackoverflow.

https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe


Regards,
Vaquar khan

On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu  
wrote:
Hi guys,

I'm getting an error when I'm doing feature engineering on 30+ columns to create 
about 200+ columns. It is not failing the job, but the ERROR shows. I want 
to know how I can avoid this.

Spark - 2.3.1
Python - 3.6

Cluster Config -
1 Master - 32 GB RAM, 16 Cores
4 Slaves - 16 GB RAM, 8 Cores


Input data - 8 partitions of parquet file with snappy compression.

My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077
--num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 
5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf 
spark.driver.maxResultSize=2G --conf 
"spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf 
spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py 
> /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt


Stack-Trace below -

ERROR CodeGenerator:91 - failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
Code of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.Genera

Re: [Help] Codegen Stage grows beyond 64 KB

2018-06-20 Thread Kazuaki Ishizaki
Spark 2.3 tries to split large generated Java methods into smaller methods 
where possible. However, as this report shows, there may still be places that 
generate a large method.

Would it be possible to create a JIRA entry with a small standalone 
program that can reproduce this problem? It would be very helpful for the 
community in addressing this problem.

Best regards,
Kazuaki Ishizaki



From:   vaquar khan 
To: Eyal Zituny 
Cc: Aakash Basu , user 

Date:   2018/06/18 01:57
Subject:Re: [Help] Codegen Stage grows beyond 64 KB



Totally agreed with Eyal.

The problem is that when the Java programs Catalyst generates from DataFrame 
and Dataset programs are compiled into Java bytecode, the bytecode of a 
single method must not reach 64 KB; exceeding this limit of the Java class 
file format is what raises the exception.

To avoid this exception, Spark's solution is to have Catalyst split any 
generated method whose bytecode is likely to exceed 64 KB into multiple 
smaller methods.

Use persist() or some other logical separation point in the pipeline.

Regards,
Vaquar khan 

On Sun, Jun 17, 2018 at 5:25 AM, Eyal Zituny  
wrote:
Hi Akash,
Such errors might appear in large Spark pipelines; the root cause is a 
64 KB JVM limitation on the bytecode size of a single method.
The reason your job isn't failing in the end is Spark's fallback: if codegen 
fails, the Spark compiler will try to create the flow 
without codegen (less optimized).
If you do not want to see this error, you can either disable codegen 
using the flag spark.sql.codegen.wholeStage = "false", 
or you can try to split your complex pipeline into several Spark flows if 
possible

hope that helps

Eyal

On Sun, Jun 17, 2018 at 8:16 AM, Aakash Basu  
wrote:
Hi,

I already went through it; that's one use case. I have a complex and very 
big pipeline of multiple jobs under one Spark session. I don't see how 
to solve this, as it is happening with the Logistic Regression and Random 
Forest models, which I'm just using from the Spark ML package rather than 
doing anything by myself.

Thanks,
Aakash.

On Sun 17 Jun, 2018, 8:21 AM vaquar khan,  wrote:
Hi Akash,

Please check stackoverflow.

https://stackoverflow.com/questions/41098953/codegen-grows-beyond-64-kb-error-when-normalizing-large-pyspark-dataframe

Regards,
Vaquar khan

On Sat, Jun 16, 2018 at 3:27 PM, Aakash Basu  
wrote:
Hi guys,

I'm getting an error when I'm doing feature engineering on 30+ columns to create 
about 200+ columns. It is not failing the job, but the ERROR shows. I want 
to know how I can avoid this.

Spark - 2.3.1
Python - 3.6

Cluster Config -
1 Master - 32 GB RAM, 16 Cores
4 Slaves - 16 GB RAM, 8 Cores


Input data - 8 partitions of parquet file with snappy compression.

My Spark-Submit -> spark-submit --master spark://192.168.60.20:7077 
--num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 
5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf 
spark.driver.maxResultSize=2G --conf 
"spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf 
spark.scheduler.listenerbus.eventqueue.capacity=2 --conf 
spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py 
> /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt

Stack-Trace below -

ERROR CodeGenerator:91 - failed to compile: 
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
Code of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426"
 
grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": 
Code of method "processNext()V" of class 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426"
 
grows beyond 64 KB
at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
at 
org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
at 
org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
at 
org.spark_project.guava.cache.LocalCache$LoadingValueRefer

[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-19 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16517097#comment-16517097
 ] 

Kazuaki Ishizaki commented on SPARK-24498:
--

[~maropu] thank you, let us use this as a starting point.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24529) Add spotbugs into maven build process

2018-06-12 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16509569#comment-16509569
 ] 

Kazuaki Ishizaki commented on SPARK-24529:
--

I am working on this

> Add spotbugs into maven build process
> -
>
> Key: SPARK-24529
> URL: https://issues.apache.org/jira/browse/SPARK-24529
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> We will enable a Java bytecode check tool 
> [spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at 
> multiplication. Due to a limitation of the tool, some other checks will be enabled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24529) Add spotbugs into maven build process

2018-06-12 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24529:


 Summary: Add spotbugs into maven build process
 Key: SPARK-24529
 URL: https://issues.apache.org/jira/browse/SPARK-24529
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


We will enable a Java bytecode check tool 
[spotbugs|https://spotbugs.github.io/] to avoid possible integer overflow at 
multiplication. Due to a limitation of the tool, some other checks will be enabled as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24498) Add JDK compiler for runtime codegen

2018-06-08 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16506239#comment-16506239
 ] 

Kazuaki Ishizaki commented on SPARK-24498:
--

Hi
[~smilegator] Definitely, I am interested in this task. I will investigate this 
issue.

> Add JDK compiler for runtime codegen
> 
>
> Key: SPARK-24498
> URL: https://issues.apache.org/jira/browse/SPARK-24498
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> In some cases, JDK compiler can generate smaller bytecode and take less time 
> in compilation compared to Janino. However, in some cases, Janino is better. 
> We should support both for our runtime codegen. Janino will still be our 
> default runtime codegen compiler. 
> See the related JIRAs in DRILL: 
> - https://issues.apache.org/jira/browse/DRILL-1155
> - https://issues.apache.org/jira/browse/DRILL-4778
> - https://issues.apache.org/jira/browse/DRILL-5696



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns

2018-06-07 Thread Kazuaki Ishizaki (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16504845#comment-16504845
 ] 

Kazuaki Ishizaki commented on SPARK-24486:
--

Thank you for reporting a problem.
Could you please let us know which value is shown for each of the three results in 
`sum(...)`?

> Slow performance reading ArrayType columns
> --
>
> Key: SPARK-24486
> URL: https://issues.apache.org/jira/browse/SPARK-24486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Luca Canali
>Priority: Minor
>
> We have found an issue of slow performance in one of our applications when 
> running on Spark 2.3.0 (the same workload does not have a performance issue 
> on Spark 2.2.1). We suspect a regression in the area of handling columns of 
> ArrayType. I have built a simplified test case showing a manifestation of the 
> issue to help with troubleshooting:
>  
>  
> {code:java}
> // prepare test data
> val stringListValues=Range(1,3).mkString(",")
> sql(s"select 1 as myid, Array($stringListValues) as myarray from 
> range(2)").repartition(1).write.parquet("file:///tmp/deleteme1")
> // run test
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show(){code}
> Performance measurements:
>  
> On a desktop-size test system, the test runs in about 2 sec using Spark 2.2.1 
> (runtime goes down to subsecond in subsequent runs) and takes close to 20 sec 
> on Spark 2.3.0
>  
> Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 
> only 2 records are read by this workload, while on Spark 2.3.0 all rows in 
> the file are read, which appears anomalous.
> Example:
> {code:java}
> bin/spark-shell --master local[*] --driver-memory 2g --packages 
> ch.cern.sparkmeasure:spark-measure_2.11:0.11
> val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) 
> stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
> {code}
>  
>  
> Selected metrics from Spark 2.3.0 run:
>  
> {noformat}
> elapsedTime => 17849 (18 s)
> sum(numTasks) => 11
> sum(recordsRead) => 2
> sum(bytesRead) => 1136448171 (1083.0 MB){noformat}
>  
>  
> From Spark 2.2.1 run:
>  
> {noformat}
> elapsedTime => 1329 (1 s)
> sum(numTasks) => 2
> sum(recordsRead) => 2
> sum(bytesRead) => 269162610 (256.0 MB)
> {noformat}
>  
> Note: Using Spark built from master (as I write this, June 7th 2018) shows 
> the same behavior as found in Spark 2.3.0
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Strange codegen error for SortMergeJoin in Spark 2.2.1

2018-06-07 Thread Kazuaki Ishizaki
Thank you for reporting a problem.
Would it be possible to create a JIRA entry with a small program that can 
reproduce this problem?

Best Regards,
Kazuaki Ishizaki



From:   Rico Bergmann 
To: "user@spark.apache.org" 
Date:   2018/06/05 19:58
Subject:Strange codegen error for SortMergeJoin in Spark 2.2.1



Hi!
I get a strange error when executing a complex SQL-query involving 4 
tables that are left-outer-joined:
Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 37, Column 18: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 37, Column 18: No applicable constructor/method found for actual 
parameters "int"; candidates are: 
"org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(org.apache.spark.memory.TaskMemoryManager,
 
org.apache.spark.storage.BlockManager, 
org.apache.spark.serializer.SerializerManager, 
org.apache.spark.TaskContext, int, long, int, int)", 
"org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(int, 
int)"

...

/* 037 */ smj_matches = new 
org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray(2147483647);


The same query works with Spark 2.2.0.
I checked the Spark source code and saw that in 
ExternalAppendOnlyUnsafeRowArray a second int was introduced into the 
constructor in 2.2.1.
But looking at the code generation part of SortMergeJoinExec:
// A list to hold all matched rows from right side.
val matches = ctx.freshName("matches")
val clsName = classOf[ExternalAppendOnlyUnsafeRowArray].getName

val spillThreshold = getSpillThreshold
val inMemoryThreshold = getInMemoryThreshold

ctx.addMutableState(clsName, matches,
  s"$matches = new $clsName($inMemoryThreshold, $spillThreshold);")


it should get 2 parameters, not just one.

Maybe someone has an idea?

Best,
Rico.




[jira] [Created] (SPARK-24452) long = int*int or long = int+int may cause overflow.

2018-06-01 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24452:


 Summary: long = int*int or long = int+int may cause overflow.
 Key: SPARK-24452
 URL: https://issues.apache.org/jira/browse/SPARK-24452
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


The following assignments may cause an overflow on the right-hand side. As a result, the 
value assigned may be negative.
{code:java}
long = int*int
long = int+int{code}
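
A minimal Scala illustration of the pattern (the variable names and values are 
only examples):
{code:scala}
val numRows: Int = 3000000
val bytesPerRow: Int = 1024

val wrong: Long = numRows * bytesPerRow          // Int * Int overflows before widening to Long
val right: Long = numRows.toLong * bytesPerRow   // promote one operand first

println(wrong)  // -1222967296
println(right)  // 3072000000
{code}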




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24323) Java lint errors

2018-05-19 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24323:


 Summary: Java lint errors
 Key: SPARK-24323
 URL: https://issues.apache.org/jira/browse/SPARK-24323
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


The following errors occur when running lint-java:
{code:java}
[ERROR] 
src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartition.java:[39] 
(sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] 
src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[26]
 (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] 
src/main/java/org/apache/spark/sql/sources/v2/reader/InputPartitionReader.java:[30]
 (sizes) LineLength: Line is longer than 100 characters (found 104).
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480482#comment-16480482
 ] 

Kazuaki Ishizaki commented on SPARK-24314:
--

I am working on this.

> interpreted element_at or GetMapValue does not work for complex types
> -
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> The same reason as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reopened SPARK-24314:
--

> interpreted element_at or GetMapValue does not work for complex types
> -
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> The same reason as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-24314:
-
Summary: interpreted element_at or GetMapValue does not work for complex 
types  (was: interpreted array_position does not work for complex types)

> interpreted element_at or GetMapValue does not work for complex types
> -
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> The same reason as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24314) interpreted array_position does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-24314.
--
Resolution: Duplicate

> interpreted array_position does not work for complex types
> --
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> The same reason as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24314) interpreted array_position does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24314:


 Summary: interpreted array_position does not work for complex types
 Key: SPARK-24314
 URL: https://issues.apache.org/jira/browse/SPARK-24314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


The same reason as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24273) Failure while using .checkpoint method

2018-05-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16475941#comment-16475941
 ] 

Kazuaki Ishizaki commented on SPARK-24273:
--

Thank you for reporting this issue.
Would it be possible to attach a standalone program that can reproduce this 
issue?

> Failure while using .checkpoint method
> --
>
> Key: SPARK-24273
> URL: https://issues.apache.org/jira/browse/SPARK-24273
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.0
>Reporter: Jami Malikzade
>Priority: Major
>
> We are getting following error:
> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 416, AWS 
> Service: Amazon S3, AWS Request ID: 
> tx14126-005ae9bfd9-9ed9ac2-default, AWS Error Code: 
> InvalidRange, AWS Error Message: null, S3 Extended Request ID: 
> 9ed9ac2-default-default"
> when we use checkpoint method as below.
> val streamBucketDF = streamPacketDeltaDF
>  .filter('timeDelta > maxGap && 'timeDelta <= 3)
>  .withColumn("bucket", when('timeDelta <= mediumGap, "medium")
>  .otherwise("large")
>  )
>  .checkpoint()
> Do you have an idea how to prevent the invalid range header from being sent, or how it 
> can be worked around or fixed?
> Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24220) java.lang.NullPointerException at org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)

2018-05-11 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16472195#comment-16472195
 ] 

Kazuaki Ishizaki commented on SPARK-24220:
--

Thank you for reporting this issue. Would it be possible to post a standalone 
reproducible program? This program seems to connect to an external database or 
something through {{DriverManager.getConnection(adminUrl)}}.

> java.lang.NullPointerException at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
> 
>
> Key: SPARK-24220
> URL: https://issues.apache.org/jira/browse/SPARK-24220
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.2.0
>Reporter: joy-m
>Priority: Major
>
> def getInputStream(rows:Iterator[Row]): PipedInputStream ={
>  printMem("before gen string")
>  val pipedOutputStream = new PipedOutputStream()
>  (new Thread() {
>  override def run(){
>  if(rows == null){
>  logError("rows is null==>")
>  }else{
>  println(s"record-start-${rows.length}")
>  try {
>  while (rows.hasNext) {
>  val row = rows.next()
>  println(row)
>  val str = row.mkString("\001") + "\r\n"
>  println(str)
>  pipedOutputStream.write(str.getBytes(StandardCharsets.UTF_8))
>  }
>  println("record-end-")
>  pipedOutputStream.close()
>  } catch {
>  case ex:Exception =>
>  ex.printStackTrace()
>  }
>  }
>  }
>  }).start()
>  println("pipedInPutStream--")
>  val pipedInPutStream = new PipedInputStream()
>  pipedInPutStream.connect(pipedOutputStream)
>  println("pipedInPutStream--- conn---")
>  printMem("after gen string")
>  pipedInPutStream
> }
> resDf.coalesce(15).foreachPartition(rows=>{
>  if(rows == null){
>  logError("rows is null=>")
>  }else{
>  val copyCmd = s"COPY ${tableName} FROM STDIN with DELIMITER as '\001' NULL 
> as 'null string'"
>  var con: Connection = null
>  try {
>  con = DriverManager.getConnection(adminUrl)
>  val copyManager = new CopyManager(con.asInstanceOf[BaseConnection])
>  val start = System.currentTimeMillis()
>  var count: Long = 0
>  var copyCount: Long = 0
>  println("before copyManager=>")
>  copyCount += copyManager.copyIn(copyCmd, getInputStream(rows))
>  println("after copyManager=>")
>  val finish = System.currentTimeMillis()
>  println("copyCount:" + copyCount + " count:" + count + " time(s):" + (finish 
> - start) / 1000)
>  con.close()
>  } catch {
>  case ex:Exception =>
>  ex.printStackTrace()
>  println(s"copyIn error!${ex.toString}")
>  } finally {
>  try {
>  if (con != null) {
>  con.close()
>  }
>  } catch {
>  case ex:SQLException =>
>  ex.printStackTrace()
>  println(s"copyIn error!${ex.toString}")
>  }
>  }
>  }
>  
> 18/05/09 13:31:30 ERROR util.SparkUncaughtExceptionHandler: Uncaught 
> exception in thread Thread[Thread-4,5,main]
> java.lang.NullPointerException
>  at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.(UnsafeExternalRowSorter.java:83)
>  at org.apache.spark.sql.execution.SortExec.createSorter(SortExec.scala:87)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.init(Unknown
>  Source)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:392)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8.apply(WholeStageCodegenExec.scala:389)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
>  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
>

Re: SparkR test failures in PR builder

2018-05-02 Thread Kazuaki Ishizaki
I am not familiar with SparkR or CRAN. However, I remember that we had a 
similar situation before.

Here is a great piece of work from that time. Having just revisited this PR, I 
think we have a similar situation (i.e. a format error) again.
https://github.com/apache/spark/pull/20005

Any other comments are appreciated.

Regards,
Kazuaki Ishizaki



From:   Joseph Bradley <jos...@databricks.com>
To: dev <dev@spark.apache.org>
Cc: Hossein Falaki <hoss...@databricks.com>
Date:   2018/05/03 07:31
Subject:SparkR test failures in PR builder



Hi all,

Does anyone know why the PR builder keeps failing on SparkR's CRAN 
checks?  I've seen this in a lot of unrelated PRs.  E.g.: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90065/console

Hossein spotted this line:
```
* checking CRAN incoming feasibility ...Error in 
.check_package_CRAN_incoming(pkgdir) : 
  dims [product 24] do not match the length of object [0]
```
and suggested that it could be CRAN flakiness.  I'm not familiar with 
CRAN, but do others have thoughts about how to fix this?

Thanks!
Joseph

-- 
Joseph Bradley
Software Engineer - Machine Learning
Databricks, Inc.





[jira] [Commented] (SPARK-24119) Add interpreted execution to SortPrefix expression

2018-04-29 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16458297#comment-16458297
 ] 

Kazuaki Ishizaki commented on SPARK-24119:
--

It seems to make sense.

It would be good to set this JIRA as a subtask of SPARK-23580.

> Add interpreted execution to SortPrefix expression
> --
>
> Key: SPARK-24119
> URL: https://issues.apache.org/jira/browse/SPARK-24119
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> [~hvanhovell] [~kiszk]
> I noticed SortPrefix did not support interpreted execution when I was testing 
> the PR for SPARK-24043. Somehow it was not covered by the umbrella Jira for 
> adding interpreted execution (SPARK-23580)
> Since I had to implement interpreted execution for SortPrefix to complete 
> testing, I am creating this Jira. If there's no good reason why eval wasn't 
> implemented, I will make the PR in a few days.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-25 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452404#comment-16452404
 ] 

Kazuaki Ishizaki edited comment on SPARK-23933 at 4/25/18 6:48 PM:
---

Thank you for your comment.
The current map can take an even number of arguments (e.g. 2, 4, 6, 8, ...) since 
the arguments come in key/value pairs.
We can decide that {{map(1.0, '2', 3.0, '4')}} or {{map(1.0, '2')}} should behave 
as it currently does.

How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}? Or How about 
{{CreateMap(Seq(CreateArray(sSeq.map(Literal(\_))), 
CreateArray(iSeq.map(Literal(\_)}}?




was (Author: kiszk):
Thank you for your comment.
The current map can take the even number of arguments (e.g. 2, 4, 6, 8 ...) due 
to a pair of key and map.
We can determine {{map(1.0, '2', 3.0, '4') or map(1.0, '2')}} should be behave 
as currently.

How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}?



> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-25 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16452404#comment-16452404
 ] 

Kazuaki Ishizaki commented on SPARK-23933:
--

Thank you for your comment.
The current map can take an even number of arguments (e.g. 2, 4, 6, 8, ...) since 
the arguments come in key/value pairs.
We can decide that {{map(1.0, '2', 3.0, '4')}} or {{map(1.0, '2')}} should behave 
as it currently does.

How about {{map(ARRAY [1, 2], ARRAY ["a", "b"])}}?



> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-24 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450262#comment-16450262
 ] 

Kazuaki Ishizaki commented on SPARK-23933:
--

cc [~smilegator]

> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10399) Off Heap Memory Access for non-JVM libraries (C++)

2018-04-20 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446166#comment-16446166
 ] 

Kazuaki Ishizaki commented on SPARK-10399:
--

https://issues.apache.org/jira/browse/SPARK-23879 is the follow-up JIRA entry.

> Off Heap Memory Access for non-JVM libraries (C++)
> --
>
> Key: SPARK-10399
> URL: https://issues.apache.org/jira/browse/SPARK-10399
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Paul Weiss
>Priority: Major
>
> *Summary*
> Provide direct off-heap memory access to an external non-JVM program such as 
> a c++ library within the Spark running JVM/executor.  As Spark moves to 
> storing all data into off heap memory it makes sense to provide access points 
> to the memory for non-JVM programs.
> 
> *Assumptions*
> * Zero copies will be made during the call into non-JVM library
> * Access into non-JVM libraries will be accomplished via JNI
> * A generic JNI interface will be created so that developers will not need to 
> deal with the raw JNI call
> * C++ will be the initial target non-JVM use case
> * memory management will remain on the JVM/Spark side
> * the API from C++ will be similar to dataframes as much as feasible and NOT 
> require expert knowledge of JNI
> * Data organization and layout will support complex (multi-type, nested, 
> etc.) types
> 
> *Design*
> * Initially Spark JVM -> non-JVM will be supported 
> * Creating an embedded JVM with Spark running from a non-JVM program is 
> initially out of scope
> 
> *Technical*
> * GetDirectBufferAddress is the JNI call used to access byte buffer without 
> copy



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-18 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436976#comment-16436976
 ] 

Kazuaki Ishizaki edited comment on SPARK-23933 at 4/18/18 4:22 PM:
---

[~smilegator] [~ueshin] Could you advise us?
Spark SQL already uses the {{map}} function syntax for a different purpose.

Even if we limit the argument list to two arrays, we may have a conflict between 
this new feature and creating a map with a single entry whose key and value are both 
arrays. Do you have any good ideas?


{code}
@ExpressionDescription(
  usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the 
given key/value pairs.",
  examples = """
Examples:
  > SELECT _FUNC_(1.0, '2', 3.0, '4');
   {1.0:"2",3.0:"4"}
  """)
case class CreateMap(children: Seq[Expression]) extends Expression {
...
{code}


was (Author: kiszk):
[~smilegator] [~ueshin] Could you favor us?
SparkSQL already uses syntax of {{map}} function for the similar purpose.

Even if we limit two array in the argument list, we may have conflict between 
this new feature and creating a map with one entry having an array for key and 
value. Do you have any good idea?


{code}
@ExpressionDescription(
  usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the 
given key/value pairs.",
  examples = """
Examples:
  > SELECT _FUNC_(1.0, '2', 3.0, '4');
   {1.0:"2",3.0:"4"}
  """)
case class CreateMap(children: Seq[Expression]) extends Expression {
...
{code}

> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-17 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441268#comment-16441268
 ] 

Kazuaki Ishizaki commented on SPARK-23933:
--

ping [~smilegator]

> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438885#comment-16438885
 ] 

Kazuaki Ishizaki commented on SPARK-23986:
--

I also checked it with branch-2.3, and it works well without any exception.

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Priority: Major
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} several times: 
> the 1st time with the 'avg' Expr and a second time for the base aggregation 
> Expr (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
>  {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have 2 parameter name 
> conflicts in the generated code: {{agg_expr_11}} and {{agg_expr_12}}.
> Appending the 'id' in s"$fullName$id" to generate a unique term name is the source 
> of the conflict. Maybe simply using an underscore can solve this issue: 
> s"${fullName}_$id"
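
For illustration, a standalone sketch of the naming clash described above (a
simplified {{freshName}}, not the actual CodeGenerator code):
{code:scala}
import scala.collection.mutable

val freshNameIds = mutable.Map.empty[String, Int]

def freshName(name: String): String =
  if (freshNameIds.contains(name)) {
    val id = freshNameIds(name)
    freshNameIds(name) = id + 1
    s"$name$id"                      // appends the id with no separator
  } else {
    freshNameIds(name) = 1
    name
  }

println(freshName("agg_expr_1"))   // "agg_expr_1"   (1st doConsume pass)
println(freshName("agg_expr_1"))   // "agg_expr_11"  (2nd pass, id appended)
println(freshName("agg_expr_11"))  // "agg_expr_11"  (fresh name, collides with the previous one)
{code}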



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438818#comment-16438818
 ] 

Kazuaki Ishizaki edited comment on SPARK-23986 at 4/15/18 7:36 PM:
---

Thanks for reporting this issue with such a deep dive.

When I ran this repro with the latest master, it worked well without an 
exception. When I checked the generated code, I could not find the variables 
{{agg_expr_[21|31|41|51|61]}}. I will check it with branch-2.3 tomorrow.
Would it be possible to attach the log file of the generated code?


was (Author: kiszk):
Thank for reporting an issue with deep dive.

When I run this repro with the latest master, it works well without an 
exception. When I checked the generated code, I cannot find variables 
{{agg_expr_[21|31|41|51|61]}}. 
Would it be possible to attach the log file of the generated code?

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Priority: Major
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the first time with the 'avg' Expr and a second time for the 
> base aggregation Exprs (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
> {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have 2 parameter name 
> conflicts in the generated code: {{agg_expr_11}} and {{agg_expr_12}}.
> Appending the 'id' in s"$fullName$id" to generate a unique term name is the 
> source of the conflict. Maybe simply separating them with an underscore can 
> solve this issue: s"${fullName}_$id"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438818#comment-16438818
 ] 

Kazuaki Ishizaki commented on SPARK-23986:
--

Thanks for reporting this issue with such a deep dive.

When I run this repro with the latest master, it works without an exception. 
When I check the generated code, I cannot find the variables 
{{agg_expr_[21|31|41|51|61]}}.
Would it be possible to attach a log file of the generated code?

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Michel Davit
>Priority: Major
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
> avg("col2").as("avg2"),
> avg("col3").as("avg3"),
> avg("col4").as("avg4"),
> avg("col1").as("avg1"),
> avg("col5").as("avg5"),
> avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the first time with the 'avg' Expr and a second time for the 
> base aggregation Exprs (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>* Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>*/
>   def freshName(name: String): String = synchronized {
> val fullName = if (freshNamePrefix == "") {
>   name
> } else {
>   s"${freshNamePrefix}_$name"
> }
> if (freshNameIds.contains(fullName)) {
>   val id = freshNameIds(fullName)
>   freshNameIds(fullName) = id + 1
>   s"$fullName$id"
> } else {
>   freshNameIds += fullName -> 1
>   fullName
> }
>   }
> {code}
> The {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
> {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have 2 parameter name 
> conflicts in the generated code: {{agg_expr_11}} and {{agg_expr_12}}.
> Appending the 'id' in s"$fullName$id" to generate a unique term name is the 
> source of the conflict. Maybe simply separating them with an underscore can 
> solve this issue: s"${fullName}_$id"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23976) UTF8String.concat() or ByteArray.concat() may allocate shorter structure.

2018-04-13 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23976:


 Summary: UTF8String.concat() or ByteArray.concat() may allocate 
shorter structure.
 Key: SPARK-23976
 URL: https://issues.apache.org/jira/browse/SPARK-23976
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


When the three inputs have `0x7FFF_FF00`, `0x7FFF_FF00`, and `0xE00`, the 
current algorithm allocates the result structure with a length of 0x1000 due to 
integer sum overflow.

We should detect overflow.
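
A small sketch of the kind of guard this asks for (illustrative only, not the 
actual {{UTF8String}}/{{ByteArray}} code): summing the lengths with plain 
{{int}} arithmetic silently wraps around, while {{Math.addExact}} surfaces the 
overflow so the allocation can be refused.
{code:scala}
object ConcatLengthCheckDemo {
  def main(args: Array[String]): Unit = {
    val lengths = Seq(0x7FFFFF00, 0x7FFFFF00, 0xE00)

    // Plain Int addition wraps around, producing a tiny total length.
    // (The real code may add header bytes, so the exact wrapped value differs.)
    val wrapped: Int = lengths.sum
    println(f"wrapped total = 0x$wrapped%08X")

    // Math.addExact throws ArithmeticException on overflow, so the caller can
    // fail fast instead of allocating a structure that is too short.
    try {
      val checked = lengths.foldLeft(0)((acc, len) => Math.addExact(acc, len))
      println(s"checked total = $checked")
    } catch {
      case _: ArithmeticException =>
        println("total length overflows Int; refuse to allocate")
    }
  }
}
{code}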




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-13 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436976#comment-16436976
 ] 

Kazuaki Ishizaki commented on SPARK-23933:
--

[~smilegator] [~ueshin] Could you advise us?
Spark SQL already uses the {{map}} function syntax for a similar purpose.

Even if we limit the argument list to two arrays, we may have a conflict between 
this new feature and creating a map with one entry whose key and value are 
arrays (see the sketch after the code block below). Do you have any good ideas?


{code}
@ExpressionDescription(
  usage = "_FUNC_(key0, value0, key1, value1, ...) - Creates a map with the 
given key/value pairs.",
  examples = """
Examples:
  > SELECT _FUNC_(1.0, '2', 3.0, '4');
   {1.0:"2",3.0:"4"}
  """)
case class CreateMap(children: Seq[Expression]) extends Expression {
...
{code}
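
A small plain-Scala sketch of the two readings (not Spark code; the names are 
only illustrative):
{code:scala}
object MapArgumentAmbiguityDemo {
  def main(args: Array[String]): Unit = {
    val keys   = Seq(1, 3)
    val values = Seq(2, 4)

    // Reading A: the existing variadic CreateMap, given exactly two arguments,
    // builds a one-entry map whose key and value are the arrays themselves.
    val oneEntry: Map[Seq[Int], Seq[Int]] = Map(keys -> values)

    // Reading B: the proposed Presto-style map(array, array) zips keys with values.
    val zipped: Map[Int, Int] = keys.zip(values).toMap

    println(oneEntry) // Map(List(1, 3) -> List(2, 4))
    println(zipped)   // Map(1 -> 2, 3 -> 4)
  }
}
{code}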

> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23933) High-order function: map(array, array) → map<K,V>

2018-04-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16436136#comment-16436136
 ] 

Kazuaki Ishizaki commented on SPARK-23933:
--

I will work on this, thank you.

> High-order function: map(array, array) → map<K,V>
> ---
>
> Key: SPARK-23933
> URL: https://issues.apache.org/jira/browse/SPARK-23933
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/map.html
> Returns a map created using the given key/value arrays.
> {noformat}
> SELECT map(ARRAY[1,3], ARRAY[2,4]); -- {1 -> 2, 3 -> 4}
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23915) High-order function: array_except(x, y) → array

2018-04-12 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16435666#comment-16435666
 ] 

Kazuaki Ishizaki commented on SPARK-23915:
--

I will work on this, thanks.

> High-order function: array_except(x, y) → array
> ---
>
> Key: SPARK-23915
> URL: https://issues.apache.org/jira/browse/SPARK-23915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of elements in x but not in y, without duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23914) High-order function: array_union(x, y) → array

2018-04-11 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434494#comment-16434494
 ] 

Kazuaki Ishizaki commented on SPARK-23914:
--

I will work on this, thank you.

> High-order function: array_union(x, y) → array
> --
>
> Key: SPARK-23914
> URL: https://issues.apache.org/jira/browse/SPARK-23914
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the union of x and y, without duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23913) High-order function: array_intersect(x, y) → array

2018-04-11 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16434491#comment-16434491
 ] 

Kazuaki Ishizaki commented on SPARK-23913:
--

I will work on this, thank you.

> High-order function: array_intersect(x, y) → array
> --
>
> Key: SPARK-23913
> URL: https://issues.apache.org/jira/browse/SPARK-23913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns an array of the elements in the intersection of x and y, without 
> duplicates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23916) High-order function: array_join(x, delimiter, null_replacement) → varchar

2018-04-10 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431795#comment-16431795
 ] 

Kazuaki Ishizaki commented on SPARK-23916:
--

Sorry for my mistake regarding a PR with the wrong JIRA number.

> High-order function: array_join(x, delimiter, null_replacement) → varchar
> -
>
> Key: SPARK-23916
> URL: https://issues.apache.org/jira/browse/SPARK-23916
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Concatenates the elements of the given array using the delimiter and an 
> optional string to replace nulls.
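
A plain-Scala reading of this description (illustrative only; it assumes null 
elements are simply dropped when no replacement string is given, following the 
Presto reference above):
{code:scala}
object ArrayJoinDemo {
  def arrayJoin(xs: Seq[String], delimiter: String,
                nullReplacement: Option[String] = None): String =
    xs.flatMap {
      case null => nullReplacement  // drop nulls unless a replacement is supplied
      case s    => Some(s)
    }.mkString(delimiter)

  def main(args: Array[String]): Unit = {
    val xs = Seq("a", null, "c")
    println(arrayJoin(xs, ","))              // a,c
    println(arrayJoin(xs, ",", Some("N/A"))) // a,N/A,c
  }
}
{code}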



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23923) High-order function: cardinality(x) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431009#comment-16431009
 ] 

Kazuaki Ishizaki edited comment on SPARK-23923 at 4/9/18 6:36 PM:
--

I will work on this.


was (Author: kiszk):
I am working on this.

> High-order function: cardinality(x) → bigint
> 
>
> Key: SPARK-23923
> URL: https://issues.apache.org/jira/browse/SPARK-23923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and  
> https://prestodb.io/docs/current/functions/map.html.
> Returns the cardinality (size) of the array/map x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23921) High-order function: array_sort(x) → array

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431037#comment-16431037
 ] 

Kazuaki Ishizaki commented on SPARK-23921:
--

I am working on this.

> High-order function: array_sort(x) → array
> --
>
> Key: SPARK-23921
> URL: https://issues.apache.org/jira/browse/SPARK-23921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Sorts and returns the array x. The elements of x must be orderable. Null 
> elements will be placed at the end of the returned array.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431004#comment-16431004
 ] 

Kazuaki Ishizaki edited comment on SPARK-23919 at 4/9/18 6:19 PM:
--

I will work on this.


was (Author: kiszk):
I am working on this.

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23923) High-order function: cardinality(x) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431009#comment-16431009
 ] 

Kazuaki Ishizaki commented on SPARK-23923:
--

I am working on this.

> High-order function: cardinality(x) → bigint
> 
>
> Key: SPARK-23923
> URL: https://issues.apache.org/jira/browse/SPARK-23923
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and  
> https://prestodb.io/docs/current/functions/map.html.
> Returns the cardinality (size) of the array/map x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23919) High-order function: array_position(x, element) → bigint

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431004#comment-16431004
 ] 

Kazuaki Ishizaki commented on SPARK-23919:
--

I am working on this.

> High-order function: array_position(x, element) → bigint
> 
>
> Key: SPARK-23919
> URL: https://issues.apache.org/jira/browse/SPARK-23919
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Returns the position of the first occurrence of the element in array x (or 0 
> if not found).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23924) High-order function: element_at

2018-04-09 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431002#comment-16431002
 ] 

Kazuaki Ishizaki commented on SPARK-23924:
--

I will work on this.

> High-order function: element_at
> ---
>
> Key: SPARK-23924
> URL: https://issues.apache.org/jira/browse/SPARK-23924
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html and 
> https://prestodb.io/docs/current/functions/map.html 
> * element_at(array, index) → E
> Returns element of array at given index. If index > 0, this function provides 
> the same functionality as the SQL-standard subscript operator ([]). If index 
> < 0, element_at accesses elements from the last to the first.
> * element_at(map<K, V>, key) → V
> Returns value for given key, or NULL if the key is not contained in the map.
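
A plain-Scala sketch of the indexing rule described above (illustrative only, 
not the Spark implementation; the map case reduces to an ordinary lookup):
{code:scala}
object ElementAtDemo {
  // index > 0 counts 1-based from the front; index < 0 counts from the back.
  def elementAt[E](arr: IndexedSeq[E], index: Int): E =
    if (index > 0) arr(index - 1) else arr(arr.length + index)

  def main(args: Array[String]): Unit = {
    val xs = Vector(10, 20, 30)
    println(elementAt(xs, 1))   // 10
    println(elementAt(xs, -1))  // 30 (the last element)
    // element_at(map, key) corresponds to map.get(key); None plays the role of NULL.
    println(Map("a" -> 1).get("b")) // None
  }
}
{code}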



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23893) Possible overflow in long = int * int

2018-04-07 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23893:


 Summary: Possible overflow in long = int * int
 Key: SPARK-23893
 URL: https://issues.apache.org/jira/browse/SPARK-23893
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


Performing `int * int` and then casting the result to `long` may cause overflow 
if the MSB of the 32-bit multiplication result is `1`. In other words, the 
result becomes negative due to sign extension.
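
A minimal illustration with made-up values (not taken from Spark code):
{code:scala}
object IntMulOverflowDemo {
  def main(args: Array[String]): Unit = {
    val rows: Int = 65536
    val rowSize: Int = 40000

    // The multiplication is done in 32-bit arithmetic first; the overflowed,
    // negative Int is then sign-extended when it is widened to Long.
    val wrong: Long = rows * rowSize
    println(wrong)  // -1673527296

    // Widening one operand first keeps the whole multiplication in 64 bits.
    val right: Long = rows.toLong * rowSize
    println(right)  // 2621440000
  }
}
{code}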



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23892) Improve coverage and fix lint error in UTF8String-related Suite

2018-04-06 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23892:
-
Description: 
The following code in {{UTF8StringSuite}} makes no sense.
{code}
assertTrue(s1.startsWith(s1));
assertTrue(s1.endsWith(s1));
{code}

The code {{if (length <= 0) ""}} in {{UTF8StringPropertyCheckSuite}} makes no 
sense.
{code}
  test("lpad, rpad") {
def padding(origin: String, pad: String, length: Int, isLPad: Boolean): 
String = {
  if (length <= 0) return ""
  if (length <= origin.length) {
if (length <= 0) "" else origin.substring(0, length)
  } else {
   ...
{code}

The previous change in {{UTF8StringSuite}} broke the lint-java check.

  was:
The following code in {{UTF8StringSuite}} makes no sense.
{code}
assertTrue(s1.startsWith(s1));
assertTrue(s1.endsWith(s1));
{code}

{code}
  test("lpad, rpad") {
def padding(origin: String, pad: String, length: Int, isLPad: Boolean): 
String = {
  if (length <= 0) return ""
  if (length <= origin.length) {
if (length <= 0) "" else origin.substring(0, length)
  } else {
   ...
{code}

The previous change broke the lint-java check.


> Improve coverage and fix lint error in UTF8String-related Suite
> ---
>
> Key: SPARK-23892
> URL: https://issues.apache.org/jira/browse/SPARK-23892
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> The following code in {{UTF8StringSuite}} makes no sense.
> {code}
> assertTrue(s1.startsWith(s1));
> assertTrue(s1.endsWith(s1));
> {code}
> The code {{if (length <= 0) ""}} in {{UTF8StringPropertyCheckSuite}} makes no 
> sense.
> {code}
>   test("lpad, rpad") {
> def padding(origin: String, pad: String, length: Int, isLPad: Boolean): 
> String = {
>   if (length <= 0) return ""
>   if (length <= origin.length) {
> if (length <= 0) "" else origin.substring(0, length)
>   } else {
>...
> {code}
> The previous change in {{UTF8StringSuite}} broke the lint-java check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23892) Improve coverage and fix lint error in UTF8String-related Suite

2018-04-06 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23892:
-
Description: 
The following code in {{UTF8StringSuite}} makes no sense.
{code}
assertTrue(s1.startsWith(s1));
assertTrue(s1.endsWith(s1));
{code}

{code}
  test("lpad, rpad") {
def padding(origin: String, pad: String, length: Int, isLPad: Boolean): 
String = {
  if (length <= 0) return ""
  if (length <= origin.length) {
if (length <= 0) "" else origin.substring(0, length)
  } else {
   ...
{code}

The previous change broke the lint-java check.

  was:
The following code in {{UTF8StringSuite}} makes no sense.
{code}
assertTrue(s1.startsWith(s1));
assertTrue(s1.endsWith(s1));
{code}

The previous change broke the lint-java check.


> Improve coverage and fix lint error in UTF8String-related Suite
> ---
>
> Key: SPARK-23892
> URL: https://issues.apache.org/jira/browse/SPARK-23892
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>    Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> The following code in {{UTF8StringSuite}} makes no sense.
> {code}
> assertTrue(s1.startsWith(s1));
> assertTrue(s1.endsWith(s1));
> {code}
> {code}
>   test("lpad, rpad") {
> def padding(origin: String, pad: String, length: Int, isLPad: Boolean): 
> String = {
>   if (length <= 0) return ""
>   if (length <= origin.length) {
> if (length <= 0) "" else origin.substring(0, length)
>   } else {
>...
> {code}
> The previous change broke the lint-java check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23892) Improve coverage and fix lint error in UTF8String-related Suite

2018-04-06 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23892:
-
Summary: Improve coverage and fix lint error in UTF8String-related Suite  
(was: Improve coverage and fix lint error in UTF8StringSuite)

> Improve coverage and fix lint error in UTF8String-related Suite
> ---
>
> Key: SPARK-23892
> URL: https://issues.apache.org/jira/browse/SPARK-23892
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> The following code in {{UTF8StringSuite}} makes no sense.
> {code}
> assertTrue(s1.startsWith(s1));
> assertTrue(s1.endsWith(s1));
> {code}
> The previous change broke the lint-java check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23892) Improve coverage and fix lint error in UTF8StringSuite

2018-04-06 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23892:


 Summary: Improve coverage and fix lint error in UTF8StringSuite
 Key: SPARK-23892
 URL: https://issues.apache.org/jira/browse/SPARK-23892
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


The following code in {{UTF8StringSuite}} makes no sense.
{code}
assertTrue(s1.startsWith(s1));
assertTrue(s1.endsWith(s1));
{code}

The previous change broke the lint-java check.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23882) Is UTF8StringSuite.writeToOutputStreamUnderflow() supported?

2018-04-06 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23882:


 Summary: Is UTF8StringSuite.writeToOutputStreamUnderflow() 
supported?
 Key: SPARK-23882
 URL: https://issues.apache.org/jira/browse/SPARK-23882
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


The unit test {{UTF8StringSuite.writeToOutputStreamUnderflow()}} accesses the 
metadata of a Java byte array object in the region that 
{{Platform.BYTE_ARRAY_OFFSET}} reserves.
Is this test valid? Is this test necessary for the Spark implementation?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23762) UTF8StringBuilder uses MemoryBlock

2018-04-05 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23762:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-23879

> UTF8StringBuilder uses MemoryBlock
> --
>
> Key: SPARK-23762
> URL: https://issues.apache.org/jira/browse/SPARK-23762
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry tries to use {{MemoryBlock}} in {{UTF8StringBuilder}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


