[jira] [Resolved] (SPARK-28695) Make Kafka source more robust with CaseInsensitiveMap

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28695.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25418
[https://github.com/apache/spark/pull/25418]

> Make Kafka source more robust with CaseInsensitiveMap
> -
>
> Key: SPARK-28695
> URL: https://issues.apache.org/jira/browse/SPARK-28695
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-28163 fixed a bug, and during the analysis we concluded it would be
> more robust to use CaseInsensitiveMap inside the Kafka source. That way,
> fewer lower/upper-case problems would arise in the future.
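A rough sketch of the idea (not the actual patch; the option names and values below are only examples), assuming the catalyst CaseInsensitiveMap helper: wrapping the user-supplied options once makes every later lookup insensitive to how the key was spelled.

{code:scala}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Options as a user might pass them, with mixed casing (example values only).
val userOptions = Map(
  "kafka.bootstrap.SERVERS" -> "host:9092",
  "startingOffsets" -> "earliest")

// Wrap once; every subsequent lookup ignores key casing.
val params = CaseInsensitiveMap(userOptions)

assert(params.get("kafka.bootstrap.servers").contains("host:9092"))
assert(params.get("STARTINGOFFSETS").contains("earliest"))
{code}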






[jira] [Assigned] (SPARK-28695) Make Kafka source more robust with CaseInsensitiveMap

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28695:
---

Assignee: Gabor Somogyi

> Make Kafka source more robust with CaseInsensitiveMap
> -
>
> Key: SPARK-28695
> URL: https://issues.apache.org/jira/browse/SPARK-28695
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Minor
>
> SPARK-28163 fixed a bug, and during the analysis we concluded it would be
> more robust to use CaseInsensitiveMap inside the Kafka source. That way,
> fewer lower/upper-case problems would arise in the future.






[jira] [Commented] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11

2019-08-14 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907821#comment-16907821
 ] 

Hyukjin Kwon commented on SPARK-28735:
--

Let me take a look tomorrow in KST.

> MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails 
> on JDK11
> -
>
> Key: SPARK-28735
> URL: https://issues.apache.org/jira/browse/SPARK-28735
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` 
> failed.
> {code}
> $ build/sbt -Phadoop-3.2 test:package
> $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' 
> --python-executables python
> ...
> ==
> FAIL: test_raw_and_probability_prediction 
> (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
> --
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py",
>  line 89, in test_raw_and_probability_prediction
> self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1E-4))
> AssertionError: False is not true
> {code}
> {code:python}
> class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
> def test_raw_and_probability_prediction(self):
> data_path = "data/mllib/sample_multiclass_classification_data.txt"
> df = self.spark.read.format("libsvm").load(data_path)
> mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
>  blockSize=128, seed=123)
> model = mlp.fit(df)
> test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 
> 0.25, 0.25))]).toDF()
> result = model.transform(test).head()
> expected_prediction = 2.0
> expected_probability = [0.0, 0.0, 1.0]
>   expected_rawPrediction = [-11.6081922998, -8.15827998691, 
> 22.17757045]
>   self.assertTrue(result.prediction, expected_prediction)
>   self.assertTrue(np.allclose(result.probability, 
> expected_probability, atol=1E-4))
>   self.assertTrue(np.allclose(result.rawPrediction, 
> expected_rawPrediction, atol=1E-4))
>   # self.assertTrue(np.allclose(result.rawPrediction, 
> expected_rawPrediction, atol=1E-4))
> {code}






[jira] [Created] (SPARK-28740) Add support for building with bloop

2019-08-14 Thread holdenk (JIRA)
holdenk created SPARK-28740:
---

 Summary: Add support for building with bloop
 Key: SPARK-28740
 URL: https://issues.apache.org/jira/browse/SPARK-28740
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: holdenk


bloop can, in theory, build Scala faster. However, the JAR layout is a little
different when you try to run the tests. It would be useful if we updated our
test JAR discovery to work with bloop.

Before working on this, check whether bloop itself has changed to work with
Spark.






[jira] [Resolved] (SPARK-28666) Support the V2SessionCatalog in saveAsTable

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28666.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25402
[https://github.com/apache/spark/pull/25402]

> Support the V2SessionCatalog in saveAsTable
> ---
>
> Key: SPARK-28666
> URL: https://issues.apache.org/jira/browse/SPARK-28666
> Project: Spark
>  Issue Type: Planned Work
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Blocker
> Fix For: 3.0.0
>
>
> We need to support the V2SessionCatalog in the old saveAsTable code paths so 
> that V2 DataSources can leverage the old DataFrameWriter code path.
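As a hedged sketch of the user-facing path this targets (the format name and table name below are made up, not real sources), the intent is that a DataFrameWriter call like the following resolves its table through the V2SessionCatalog when the format is a V2 source:

{code:scala}
val df = spark.range(10).toDF("id")

df.write
  .format("exampleV2Source")          // hypothetical DataSource V2 format
  .mode("overwrite")
  .saveAsTable("default.example_tbl") // should go through the V2SessionCatalog path
{code}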






[jira] [Assigned] (SPARK-28666) Support the V2SessionCatalog in saveAsTable

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28666:
---

Assignee: Burak Yavuz

> Support the V2SessionCatalog in saveAsTable
> ---
>
> Key: SPARK-28666
> URL: https://issues.apache.org/jira/browse/SPARK-28666
> Project: Spark
>  Issue Type: Planned Work
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Blocker
>
> We need to support the V2SessionCatalog in the old saveAsTable code paths so 
> that V2 DataSources can leverage the old DataFrameWriter code path.






[jira] [Assigned] (SPARK-28351) Support DELETE in DataSource V2

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-28351:
---

Assignee: Xianyin Xin

> Support DELETE in DataSource V2
> ---
>
> Key: SPARK-28351
> URL: https://issues.apache.org/jira/browse/SPARK-28351
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
>Priority: Major
>
> This ticket adds DELETE support for V2 data sources.
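For illustration only (a sketch of the user-facing shape, with a made-up table name; the exact syntax is defined by the ticket's PR), the feature lets a SQL DELETE run against a table backed by a V2 source that supports deletes:

{code:scala}
// Hypothetical V2-backed table "testcat.db.events".
spark.sql("DELETE FROM testcat.db.events WHERE ts < '2019-01-01'")
{code}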






[jira] [Resolved] (SPARK-28351) Support DELETE in DataSource V2

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28351.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25115
[https://github.com/apache/spark/pull/25115]

> Support DELETE in DataSource V2
> ---
>
> Key: SPARK-28351
> URL: https://issues.apache.org/jira/browse/SPARK-28351
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xianyin Xin
>Assignee: Xianyin Xin
>Priority: Major
> Fix For: 3.0.0
>
>
> This ticket adds DELETE support for V2 data sources.






[jira] [Resolved] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap

2019-08-14 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28203.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 25002
[https://github.com/apache/spark/pull/25002]

> PythonRDD should respect SparkContext's conf when passing user confMap
> --
>
> Key: SPARK-28203
> URL: https://issues.apache.org/jira/browse/SPARK-28203
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.3
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
> Fix For: 3.0.0
>
>
> PythonRDD has several APIs which accept user configs from the Python side. The
> parameter is called confAsMap and it is intended to be merged with the RDD's
> Hadoop configuration.
>  However, confAsMap is first mapped to a Configuration and then merged into
> SparkContext's Hadoop configuration. The mapped Configuration will load
> default key values from core-default.xml etc., which may have been updated in
> SparkContext's Hadoop configuration. The default values will override the
> updated values in the merge process.
> I will submit a PR to fix this.
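A minimal sketch of the merge-order problem described above (simplified, not the actual PythonRDD code; the config keys are only examples): a freshly constructed Hadoop Configuration is pre-populated with defaults such as core-default.xml, so copying all of its entries on top of the SparkContext's Hadoop configuration can silently undo values that were already overridden there.

{code:scala}
import org.apache.hadoop.conf.Configuration

// Pretend SparkContext's Hadoop configuration already carries a user override.
val scHadoopConf = new Configuration()            // loads core-default.xml etc.
scHadoopConf.set("io.file.buffer.size", "131072") // user-tuned value

// confAsMap arriving from the Python side, mapped to a *new* Configuration:
val confAsMap = Map("my.extra.key" -> "v")        // example key only
val mapped = new Configuration()                  // also loads the defaults
confAsMap.foreach { case (k, v) => mapped.set(k, v) }

// Merging by copying every entry of `mapped` (defaults included) clobbers the override:
mapped.iterator().forEachRemaining(e => scHadoopConf.set(e.getKey, e.getValue))
// scHadoopConf.get("io.file.buffer.size") is now the default again, not 131072.
{code}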






[jira] [Assigned] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap

2019-08-14 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-28203:


Assignee: Xianjin YE

> PythonRDD should respect SparkContext's conf when passing user confMap
> --
>
> Key: SPARK-28203
> URL: https://issues.apache.org/jira/browse/SPARK-28203
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.3
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Minor
>
> PythonRDD has several APIs which accept user configs from the Python side. The
> parameter is called confAsMap and it is intended to be merged with the RDD's
> Hadoop configuration.
>  However, confAsMap is first mapped to a Configuration and then merged into
> SparkContext's Hadoop configuration. The mapped Configuration will load
> default key values from core-default.xml etc., which may have been updated in
> SparkContext's Hadoop configuration. The default values will override the
> updated values in the merge process.
> I will submit a PR to fix this.






[jira] [Created] (SPARK-28739) Add a simple cost check for Adaptive Query Execution

2019-08-14 Thread Maryann Xue (JIRA)
Maryann Xue created SPARK-28739:
---

 Summary: Add a simple cost check for Adaptive Query Execution
 Key: SPARK-28739
 URL: https://issues.apache.org/jira/browse/SPARK-28739
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maryann Xue


Add a mechanism to compare the costs of the plans before and after
re-optimization in Adaptive Query Execution.
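A minimal sketch of what such a check could look like (the helper names are illustrative, not the actual Spark implementation): compute a simple cost for both physical plans, e.g. the number of shuffle exchanges, and keep the re-optimized plan only if it is not more expensive.

{code:scala}
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Illustrative cost model: count shuffle exchanges in a physical plan.
def shuffleCount(plan: SparkPlan): Int =
  plan.collect { case s: ShuffleExchangeExec => s }.size

// Keep the re-optimized plan only if it does not add shuffles.
def choosePlan(before: SparkPlan, after: SparkPlan): SparkPlan =
  if (shuffleCount(after) <= shuffleCount(before)) after else before
{code}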






[jira] [Updated] (SPARK-28723) Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28723:
--
Parent Issue: SPARK-28684  (was: SPARK-24417)

> Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
> -
>
> Key: SPARK-28723
> URL: https://issues.apache.org/jira/browse/SPARK-28723
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Yuming Wang
>Priority: Major
>







[jira] [Updated] (SPARK-28710) [UDF] create or replace permanent function does not clear the jar in class path

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28710:
--
Description: 
{code}
 0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS 
'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 
'hdfs://hacluster/user/AddDoublesUDF.jar';
+-+
| Result  |
+-+
+-+
No rows selected (0.216 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> create or replace function 
addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
'hdfs://hacluster/user/Multiply.jar';
+-+
| Result  |
+-+
+-+
No rows selected (0.292 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3);
INFO  : Added 
[/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
class path
INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
INFO  : Added 
[/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
class path
INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
Error: org.apache.spark.sql.AnalysisException: Can not load class 
'com.huawei.bigdata.hive.example.udf.multiply' when registering the function 
'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 
(state=,code=0)
{code}

  was:
 0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS 
'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 
'hdfs://hacluster/user/AddDoublesUDF.jar';
+-+
| Result  |
+-+
+-+
No rows selected (0.216 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> create or replace function 
addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
'hdfs://hacluster/user/Multiply.jar';
+-+
| Result  |
+-+
+-+
No rows selected (0.292 seconds)
0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3);
INFO  : Added 
[/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
class path
INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
INFO  : Added 
[/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
class path
INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
Error: org.apache.spark.sql.AnalysisException: Can not load class 
'com.huawei.bigdata.hive.example.udf.multiply' when registering the function 
'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 
(state=,code=0)



> [UDF] create or replace permanent function does not clear the jar in class 
> path
> ---
>
> Key: SPARK-28710
> URL: https://issues.apache.org/jira/browse/SPARK-28710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
>  0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS 
> 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 
> 'hdfs://hacluster/user/AddDoublesUDF.jar';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.216 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> create or replace function 
> addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.292 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3);
> INFO  : Added 
> [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
> class path
> INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
> INFO  : Added 
> [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
> class path
> INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
> Error: org.apache.spark.sql.AnalysisException: Can not load class 
> 'com.huawei.bigdata.hive.example.udf.multiply' when registering the function 
> 'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 
> (state=,code=0)
> {code}






[jira] [Commented] (SPARK-28710) [UDF] create or replace permanent function does not clear the jar in class path

2019-08-14 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907681#comment-16907681
 ] 

Dongjoon Hyun commented on SPARK-28710:
---

Thank you for reporting this, [~abhishek.akg].
Thank you for pinging me, [~sandeep.katta2007]. I'll review your PR.

> [UDF] create or replace permanent function does not clear the jar in class 
> path
> ---
>
> Key: SPARK-28710
> URL: https://issues.apache.org/jira/browse/SPARK-28710
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
>  0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS 
> 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 
> 'hdfs://hacluster/user/AddDoublesUDF.jar';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.216 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> create or replace function 
> addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 
> 'hdfs://hacluster/user/Multiply.jar';
> +-+
> | Result  |
> +-+
> +-+
> No rows selected (0.292 seconds)
> 0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3);
> INFO  : Added 
> [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
> class path
> INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
> INFO  : Added 
> [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to 
> class path
> INFO  : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar]
> Error: org.apache.spark.sql.AnalysisException: Can not load class 
> 'com.huawei.bigdata.hive.example.udf.multiply' when registering the function 
> 'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 
> (state=,code=0)






[jira] [Comment Edited] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API

2019-08-14 Thread Joseph Cooper (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642
 ] 

Joseph Cooper edited comment on SPARK-28738 at 8/14/19 10:43 PM:
-

I think I see why commitAsync might not support this. During the polling loop
for offset commits, if a higher offset is encountered, a lesser one is skipped
and the metadata for that one won't get committed.

It seems like commitSync would work: just add a function with a parameter for
the metadata, and then call the underlying consumer's commitSync method.
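A rough sketch of what such a function could look like (hypothetical, not an existing Spark API; it assumes access to the underlying KafkaConsumer):

{code:scala}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

// Hypothetical helper: synchronously commit the given ranges, attaching metadata.
def commitSyncWithMetadata(consumer: KafkaConsumer[_, _],
                           offsetRanges: Array[OffsetRange],
                           metadata: String): Unit = {
  val toCommit = offsetRanges.map { range =>
    new TopicPartition(range.topic, range.partition) ->
      new OffsetAndMetadata(range.untilOffset, metadata)
  }.toMap.asJava
  consumer.commitSync(toCommit)
}
{code}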


was (Author: jrciii):
I think I see why commitAsync might not support this. During the polling loop
for offset commits, if a higher offset is encountered, a lesser one is skipped
and the metadata for that one won't get committed.

> Add ability to include metadata in CanCommitOffsets API
> ---
>
> Key: SPARK-28738
> URL: https://issues.apache.org/jira/browse/SPARK-28738
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.4
>Reporter: Joseph Cooper
>Priority: Major
>
> It is possible to commit metadata with an offset to Kafka. Currently, the 
> CanCommitOffsets API does not expose this functionality. See 
> [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]
>  
> We could add a commitSync function which commits an offset right away and 
> accepts metadata.






[jira] [Updated] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API

2019-08-14 Thread Joseph Cooper (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Cooper updated SPARK-28738:
--
Description: 
It is possible to commit metadata with an offset to Kafka. Currently, the 
CanCommitOffsets API does not expose this functionality. See 
[https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]

 

We could add a commitSync function which commits an offset right away and 
accepts metadata.

  was:
It is possible to commit metadata with an offset to Kafka. Currently, the 
CanCommitOffsets API does not expose this functionality. See 
[https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]

 

We could make the commit queue take (OffsetRange, String) instead of just 
OffsetRange and copy the two existing commitAsync functions and make them take 
Array[(OffsetRange, String)].


> Add ability to include metadata in CanCommitOffsets API
> ---
>
> Key: SPARK-28738
> URL: https://issues.apache.org/jira/browse/SPARK-28738
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.4
>Reporter: Joseph Cooper
>Priority: Major
>
> It is possible to commit metadata with an offset to Kafka. Currently, the 
> CanCommitOffsets API does not expose this functionality. See 
> [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]
>  
> We could add a commitSync function which commits an offset right away and 
> accepts metadata.






[jira] [Resolved] (SPARK-28110) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28110.
---
Resolution: Duplicate

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-28110
> URL: https://issues.apache.org/jira/browse/SPARK-28110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm
> working on; I haven't completely confirmed it yet, but I wanted to report it
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on JDK 11, I
> immediately get errors from IsolatedClientLoader that it can't load anything
> in java.sql, e.g.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)
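A sketch of the suggested relaxation (illustrative only, not the actual IsolatedClientLoader code): treat every class in the JDK's java.* namespace as shared, so loading always delegates to the parent class loader instead of only doing so for java.lang and java.net.

{code:scala}
// Illustrative predicate only; the real isShared() check has more cases.
def isSharedClass(name: String): Boolean =
  name.startsWith("java.")
{code}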






[jira] [Commented] (SPARK-28110) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-08-14 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907671#comment-16907671
 ] 

Dongjoon Hyun commented on SPARK-28110:
---

Yes, it does. Although SPARK-28723 provides a solution for the Hadoop-3.2/Hive 2.3.6
profile, I believe we can close this as `Superseded by` SPARK-28723. I'll
resolve this one.

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-28110
> URL: https://issues.apache.org/jira/browse/SPARK-28110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm
> working on; I haven't completely confirmed it yet, but I wanted to report it
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on JDK 11, I
> immediately get errors from IsolatedClientLoader that it can't load anything
> in java.sql, e.g.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)






[jira] [Commented] (SPARK-28701) add java11 support for spark pull request builds

2019-08-14 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907654#comment-16907654
 ] 

Dongjoon Hyun commented on SPARK-28701:
---

Thanks, [~shaneknapp]! :D

> add java11 support for spark pull request builds
> 
>
> Key: SPARK-28701
> URL: https://issues.apache.org/jira/browse/SPARK-28701
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, jenkins
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> from https://github.com/apache/spark/pull/25405
> add a PRB subject check for [test-java11] and update JAVA_HOME env var to 
> point to /usr/java/jdk-11.0.1






[jira] [Comment Edited] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API

2019-08-14 Thread Joseph Cooper (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642
 ] 

Joseph Cooper edited comment on SPARK-28738 at 8/14/19 9:52 PM:


I think I see why commitAsync might not support this. During the polling loop
for offset commits, if a higher offset is encountered, a lesser one is skipped
and the metadata for that one won't get committed.


was (Author: jrciii):
I think I see why commitAsync might not support this. During the polling loop
for offset commits, if a higher offset is encountered, a lesser one might get
skipped and the metadata for that one won't get committed.

At least I think that is the spirit of the commitAll function, but it doesn't
seem to make sense. Once that map is full, won't there be no more polling of
the queue?

[https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L303]

Maybe I'm missing something.

> Add ability to include metadata in CanCommitOffsets API
> ---
>
> Key: SPARK-28738
> URL: https://issues.apache.org/jira/browse/SPARK-28738
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.4
>Reporter: Joseph Cooper
>Priority: Major
>
> It is possible to commit metadata with an offset to Kafka. Currently, the 
> CanCommitOffsets API does not expose this functionality. See 
> [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]
>  
> We could make the commit queue take (OffsetRange, String) instead of just 
> OffsetRange and copy the two existing commitAsync functions and make them 
> take Array[(OffsetRange, String)].






[jira] [Comment Edited] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API

2019-08-14 Thread Joseph Cooper (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642
 ] 

Joseph Cooper edited comment on SPARK-28738 at 8/14/19 9:42 PM:


I think I see why commitAsync might not support this. During the polling loop
for offset commits, if a higher offset is encountered, a lesser one might get
skipped and the metadata for that one won't get committed.

At least I think that is the spirit of the commitAll function, but it doesn't
seem to make sense. Once that map is full, won't there be no more polling of
the queue?

[https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L303]

Maybe I'm missing something.


was (Author: jrciii):
I think I see why commitAsync might not support this. During the polling loop
for offset commits, if a higher offset is encountered, a lesser one might get
skipped and the metadata for that one won't get committed.

> Add ability to include metadata in CanCommitOffsets API
> ---
>
> Key: SPARK-28738
> URL: https://issues.apache.org/jira/browse/SPARK-28738
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.4
>Reporter: Joseph Cooper
>Priority: Major
>
> It is possible to commit metadata with an offset to Kafka. Currently, the 
> CanCommitOffsets API does not expose this functionality. See 
> [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]
>  
> We could make the commit queue take (OffsetRange, String) instead of just 
> OffsetRange and copy the two existing commitAsync functions and make them 
> take Array[(OffsetRange, String)].






[jira] [Reopened] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name

2019-08-14 Thread Franck Tago (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franck Tago reopened SPARK-23519:
-

OK Spark Community,

I am sorry for being a pest about this, but I am re-opening this JIRA because I
really believe it should be addressed.

Right now I do not have any way of satisfying my customer's requirement.

My current use case is the following.

My customer can provide any custom Hive query. I am oblivious to the actual
content of the query, and parsing the query is not an option.

All I know is the number of fields projected from the customer query and the
types of those fields.

I do not know the names of the fields projected from the custom query.

What I currently do with Spark SQL is run a query of the form:

Create view view_name

> Create View Commands Fails with  The view output (col1,col1) contains 
> duplicate column name
> ---
>
> Key: SPARK-23519
> URL: https://issues.apache.org/jira/browse/SPARK-23519
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.2.1
>Reporter: Franck Tago
>Priority: Major
>  Labels: bulk-closed
> Attachments: image-2018-05-10-10-48-57-259.png
>
>
> 1- create and populate a hive table  . I did this in a hive cli session .[ 
> not that this matters ]
> create table  atable (col1 int) ;
> insert  into atable values (10 ) , (100)  ;
> 2. create a view from the table.  
> [These actions were performed from a spark shell ]
> spark.sql("create view  default.aview  (int1 , int2 ) as select  col1 , col1 
> from atable ")
>  java.lang.AssertionError: assertion failed: The view output (col1,col1) 
> contains duplicate column name.
>  at scala.Predef$.assert(Predef.scala:170)
>  at 
> org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632)
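For reference, a minimal way to hit the assertion and a possible aliasing workaround (hedged; inferred from the error above, not re-verified against 2.2.1):

{code:scala}
// Fails: the query output (col1, col1) contains duplicate column names.
spark.sql("create view default.aview (int1, int2) as select col1, col1 from atable")

// Giving each projected column a distinct alias avoids duplicate output names.
spark.sql("create view default.aview (int1, int2) as select col1 as c1, col1 as c2 from atable")
{code}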






[jira] [Commented] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API

2019-08-14 Thread Joseph Cooper (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642
 ] 

Joseph Cooper commented on SPARK-28738:
---

I think I see why commitAsync might not support this. During the polling loop
for offset commits, if a higher offset is encountered, a lesser one might get
skipped and the metadata for that one won't get committed.

> Add ability to include metadata in CanCommitOffsets API
> ---
>
> Key: SPARK-28738
> URL: https://issues.apache.org/jira/browse/SPARK-28738
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.4.4
>Reporter: Joseph Cooper
>Priority: Major
>
> It is possible to commit metadata with an offset to Kafka. Currently, the 
> CanCommitOffsets API does not expose this functionality. See 
> [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]
>  
> We could make the commit queue take (OffsetRange, String) instead of just 
> OffsetRange and copy the two existing commitAsync functions and make them 
> take Array[(OffsetRange, String)].






[jira] [Commented] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907635#comment-16907635
 ] 

Dongjoon Hyun commented on SPARK-28728:
---

[~Fokko].
Thank you for making a JIRA and PR, but the Apache Spark community has [the
following guideline|https://spark.apache.org/contributing.html]. Please don't
set the `Fix Version/s` field next time.
{code}
Do not set the following fields: Fix Version. This is assigned by committers 
only when resolved.
{code}

> Bump Jackson Databind to 2.9.9.3
> 
>
> Key: SPARK-28728
> URL: https://issues.apache.org/jira/browse/SPARK-28728
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Fokko Driesprong
>Priority: Major
>
> Needs to be upgraded due to issues.






[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28728:
--
Fix Version/s: (was: 2.4.4)
   (was: 3.0.0)

> Bump Jackson Databind to 2.9.9.3
> 
>
> Key: SPARK-28728
> URL: https://issues.apache.org/jira/browse/SPARK-28728
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Fokko Driesprong
>Priority: Major
>
> Needs to be upgraded due to issues.






[jira] [Resolved] (SPARK-28721) Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and Executors

2019-08-14 Thread Patrick Clay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Clay resolved SPARK-28721.
--
Resolution: Duplicate

Ah sorry I didn't search carefully enough for a duplicate

> Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and 
> Executors
> ---
>
> Key: SPARK-28721
> URL: https://issues.apache.org/jira/browse/SPARK-28721
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Patrick Clay
>Priority: Minor
>
> This does not seem to affect 2.4.0.
> To repro:
>  # Download pristine Spark 2.4.3 binary
>  # Edit pi.py to not call spark.stop()
>  # ./bin/docker-image-tool.sh -r MY_IMAGE -t MY_TAG build push
>  # spark-submit --master k8s://IP --deploy-mode cluster --conf 
> spark.kubernetes.driver.pod.name=spark-driver --conf 
> spark.kubernetes.container.image=MY_IMAGE:MY_TAG 
> file:/opt/spark/examples/src/main/python/pi.py
> The driver runs successfully and Python exits but the Driver and Executor 
> JVMs and Pods remain up.
>  
> I realize that explicitly calling spark.stop() is always best practice, but 
> since this does not repro in 2.4.0 it seems like a regression.






[jira] [Created] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API

2019-08-14 Thread Joseph Cooper (JIRA)
Joseph Cooper created SPARK-28738:
-

 Summary: Add ability to include metadata in CanCommitOffsets API
 Key: SPARK-28738
 URL: https://issues.apache.org/jira/browse/SPARK-28738
 Project: Spark
  Issue Type: New Feature
  Components: DStreams
Affects Versions: 2.4.4
Reporter: Joseph Cooper


It is possible to commit metadata with an offset to Kafka. Currently, the 
CanCommitOffsets API does not expose this functionality. See 
[https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300]

 

We could make the commit queue take (OffsetRange, String) instead of just 
OffsetRange and copy the two existing commitAsync functions and make them take 
Array[(OffsetRange, String)].






[jira] [Updated] (SPARK-28683) Upgrade Scala to 2.12.10

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28683:
--
Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-24417)

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade to 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1543097847070&to=1564631199344&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]






[jira] [Commented] (SPARK-28683) Upgrade Scala to 2.12.10

2019-08-14 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907619#comment-16907619
 ] 

Dongjoon Hyun commented on SPARK-28683:
---

I got it. Yes, this is a nice-to-have. I'll move this out of this umbrella.

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade to 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1543097847070&to=1564631199344&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]






[jira] [Created] (SPARK-28737) Update jersey to 2.27+ (2.29)

2019-08-14 Thread Sean Owen (JIRA)
Sean Owen created SPARK-28737:
-

 Summary: Update jersey to 2.27+ (2.29)
 Key: SPARK-28737
 URL: https://issues.apache.org/jira/browse/SPARK-28737
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Sean Owen


Looks like we might need to update Jersey after all, from recent JDK 11 
testing: 
{code}
Caused by: java.lang.IllegalArgumentException
at jersey.repackaged.org.objectweb.asm.ClassReader.<init>(ClassReader.java:170)
at jersey.repackaged.org.objectweb.asm.ClassReader.<init>(ClassReader.java:153)
at jersey.repackaged.org.objectweb.asm.ClassReader.<init>(ClassReader.java:424)
at org.glassfish.jersey.server.internal.scanning.AnnotationAcceptingListener.process(AnnotationAcceptingListener.java:170)
{code}

It looks like 2.27+ may solve the issue, so worth trying 2.29. 
I'm not 100% sure this is an issue as the JDK 11 testing process is still 
undergoing change, but will work on it to see how viable it is anyway, as it 
may be worthwhile to update for 3.0 in any event.






[jira] [Commented] (SPARK-28683) Upgrade Scala to 2.12.10

2019-08-14 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907606#comment-16907606
 ] 

Sean Owen commented on SPARK-28683:
---

I think we can detach this from the JDK 11 umbrella. This doesn't appear to be 
strictly necessary for JDK 11 in Spark.

> Upgrade Scala to 2.12.10
> 
>
> Key: SPARK-28683
> URL: https://issues.apache.org/jira/browse/SPARK-28683
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 
> and found that 2.12.9 has a serious bug, 
> https://github.com/scala/bug/issues/11665 *
> We will skip 2.12.9 and try to upgrade to 2.12.10 directly in this PR.
> h3. Highlights (2.12.9)
>  * Faster compiler: [5–10% faster since 
> 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1543097847070&to=1564631199344&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench%40scalabench%40],
>  thanks to many optimizations (mostly by Jason Zaugg and Diego E. 
> Alonso-Blas: kudos!)
>  * Improved compatibility with JDK 11, 12, and 13
>  * Experimental support for build pipelining and outline type checking
> [https://github.com/scala/scala/releases/tag/v2.12.9]






[jira] [Commented] (SPARK-28110) on JDK11, IsolatedClientLoader must be able to load java.sql classes

2019-08-14 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907604#comment-16907604
 ] 

Sean Owen commented on SPARK-28110:
---

Is this one still an issue? I think this is a clone of an older issue that was 
mostly resolved, and the rest is I think a subset of the Hive update?

> on JDK11, IsolatedClientLoader must be able to load java.sql classes
> 
>
> Key: SPARK-28110
> URL: https://issues.apache.org/jira/browse/SPARK-28110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Imran Rashid
>Priority: Major
>
> This might be very specific to my fork & a kind of weird system setup I'm
> working on; I haven't completely confirmed it yet, but I wanted to report it
> anyway in case anybody else sees this.
> When I try to do anything which touches the metastore on JDK 11, I
> immediately get errors from IsolatedClientLoader that it can't load anything
> in java.sql, e.g.
> {noformat}
> scala> spark.sql("show tables").show()
> java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: 
> java/sql/SQLTransientException when creating Hive client using classpath: 
> file:/home/systest/jdk-11.0.2/, ...
> ...
> Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException
>   at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230)
>   at 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
> {noformat}
> After a bit of debugging, I also discovered that the {{rootClassLoader}} is 
> {{null}} in {{IsolatedClientLoader}}.  I think this would work if either 
> {{rootClassLoader}} could load those classes, or if {{isShared()}} was 
> changed to allow any class starting with "java."  (I'm not sure why it only 
> allows "java.lang" and "java.net" currently.)






[jira] [Updated] (SPARK-28736) pyspark.mllib.clustering fails on JDK11

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28736:
--
Description: 
Build Spark and run PySpark UT with JDK11.

{code}
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.mllib.clustering' --python-executables 
python
...
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", 
line 386, in __main__.GaussianMixtureModel
Failed example:
abs(softPredicted[0] - 1.0) < 0.001
Expected:
True
Got:
False
**
File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", 
line 388, in __main__.GaussianMixtureModel
Failed example:
abs(softPredicted[1] - 0.0) < 0.001
Expected:
True
Got:
False
**
   2 of  31 in __main__.GaussianMixtureModel
***Test Failed*** 2 failures.
{code}

  was:
Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` 
failed.

{code}
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' 
--python-executables python
...
==
FAIL: test_raw_and_probability_prediction 
(pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
--
Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py",
 line 89, in test_raw_and_probability_prediction
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
atol=1E-4))
AssertionError: False is not true
{code}

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}


> pyspark.mllib.clustering fails on JDK11
> ---
>
> Key: SPARK-28736
> URL: https://issues.apache.org/jira/browse/SPARK-28736
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Build Spark and run PySpark UT with JDK11.
> {code}
> $ build/sbt -Phadoop-3.2 test:package
> $ python/run-tests --testnames 'pyspark.mllib.clustering' 
> --python-executables python
> ...
> File 
> "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", 
> line 386, in __main__.GaussianMixtureModel
> Failed example:
> abs(softPredicted[0] - 1.0) < 0.001
> Expected:
> True
> Got:
> False
> **
> File 
> "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", 
> line 388, in __main__.GaussianMixtureModel
> Failed example:
> abs(softPredicted[1] - 0.0) < 0.001
> Expected:
> True
> Got:
> False
> **
>2 of  31 in __main__.GaussianMixtureModel
> ***Test Failed*** 2 failures.
> {code}






[jira] [Created] (SPARK-28736) pyspark.mllib.clustering fails on JDK11

2019-08-14 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28736:
-

 Summary: pyspark.mllib.clustering fails on JDK11
 Key: SPARK-28736
 URL: https://issues.apache.org/jira/browse/SPARK-28736
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` 
failed.

{code}
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' 
--python-executables python
...
==
FAIL: test_raw_and_probability_prediction 
(pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
--
Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py",
 line 89, in test_raw_and_probability_prediction
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
atol=1E-4))
AssertionError: False is not true
{code}

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28735:
--
Description: 
Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` 
failed.

{code}
$ build/sbt -Phadoop-3.2 test:package
$ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' 
--python-executables python
...
==
FAIL: test_raw_and_probability_prediction 
(pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
--
Traceback (most recent call last):
  File 
"/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py",
 line 89, in test_raw_and_probability_prediction
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
atol=1E-4))
AssertionError: False is not true
{code}

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}

  was:
Build Spark with JDK11 and run `python/run-tests --testnames 
'pyspark.ml.tests.test_algorithms' --python-executables python`. The last 
commented `assertTrue` failed.

- 
https://github.com/apache/spark/pull/25443/commits/593a154813880fb13e3091043d809e0c00e57bc5

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}


> MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails 
> on JDK11
> -
>
> Key: SPARK-28735
> URL: https://issues.apache.org/jira/browse/SPARK-28735
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` 
> failed.
> {code}
> $ build/sbt -Phadoop-3.2 test:package
> $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' 
> --python-executables python
> ...
> ==
> FAIL: test_raw_and_probability_prediction 
> (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest)
> --
> Traceback (most recent call last):
>   File 
> "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py",
>  line 89, in test_raw_and_probability_prediction
> self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, 
> atol=1E-4))
> AssertionError: False is not true
> {code}
> {code:python}
> class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
> def test_raw_and_probability_prediction(se

[jira] [Updated] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28735:
--
Description: 
Build Spark with JDK11 and run `python/run-tests --testnames 
'pyspark.ml.tests.test_algorithms' --python-executables python`. The last 
commented `assertTrue` failed.

- 
https://github.com/apache/spark/pull/25443/commits/593a154813880fb13e3091043d809e0c00e57bc5

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}

  was:
Build Spark with JDK11 and run `python/run-tests --testnames 
'pyspark.ml.tests.test_algorithms' --python-executables python`. The last 
commented `assertTrue` failed.

- 593a154813880fb13e3091043d809e0c00e57bc5

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}


> MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails 
> on JDK11
> -
>
> Key: SPARK-28735
> URL: https://issues.apache.org/jira/browse/SPARK-28735
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Build Spark with JDK11 and run `python/run-tests --testnames 
> 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last 
> commented `assertTrue` failed.
> - 
> https://github.com/apache/spark/pull/25443/commits/593a154813880fb13e3091043d809e0c00e57bc5
> {code:python}
> class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
> def test_raw_and_probability_prediction(self):
> data_path = "data/mllib/sample_multiclass_classification_data.txt"
> df = self.spark.read.format("libsvm").load(data_path)
> mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
>  blockSize=128, seed=123)
> model = mlp.fit(df)
> test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 
> 0.25, 0.25))]).toDF()
> result = model.transform(test).head()
> expected_prediction = 2.0
> expected_probability = [0.0, 0.0, 1.0]
>   expected_rawPrediction = [-11.6081922998, -8.15827998691, 
> 22.17757045]
>   self.assertTrue(result.prediction, expected_prediction)
>   self.assertTrue(np.allclose(result.probability, 
> expected_probability, atol=1E-4))
>   self.assertTrue(np.allclose(result.rawPrediction, 
> expected_rawPrediction, atol=1E-4))
>   # self.assertTrue(np.allclose(result.rawPrediction, 
> expected_rawPrediction, atol=1E-4))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---

[jira] [Updated] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28735:
--
Description: 
Build Spark with JDK11 and run `python/run-tests --testnames 
'pyspark.ml.tests.test_algorithms' --python-executables python`. The last 
commented `assertTrue` failed.

- 593a154813880fb13e3091043d809e0c00e57bc5

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}

  was:

{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}


> MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails 
> on JDK11
> -
>
> Key: SPARK-28735
> URL: https://issues.apache.org/jira/browse/SPARK-28735
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Build Spark with JDK11 and run `python/run-tests --testnames 
> 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last 
> commented `assertTrue` failed.
> - 593a154813880fb13e3091043d809e0c00e57bc5
> {code:python}
> class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
> def test_raw_and_probability_prediction(self):
> data_path = "data/mllib/sample_multiclass_classification_data.txt"
> df = self.spark.read.format("libsvm").load(data_path)
> mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
>  blockSize=128, seed=123)
> model = mlp.fit(df)
> test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 
> 0.25, 0.25))]).toDF()
> result = model.transform(test).head()
> expected_prediction = 2.0
> expected_probability = [0.0, 0.0, 1.0]
>   expected_rawPrediction = [-11.6081922998, -8.15827998691, 
> 22.17757045]
>   self.assertTrue(result.prediction, expected_prediction)
>   self.assertTrue(np.allclose(result.probability, 
> expected_probability, atol=1E-4))
>   self.assertTrue(np.allclose(result.rawPrediction, 
> expected_rawPrediction, atol=1E-4))
>   # self.assertTrue(np.allclose(result.rawPrediction, 
> expected_rawPrediction, atol=1E-4))
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11

2019-08-14 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-28735:
-

 Summary: 
MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on 
JDK11
 Key: SPARK-28735
 URL: https://issues.apache.org/jira/browse/SPARK-28735
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun



{code:python}
class MultilayerPerceptronClassifierTest(SparkSessionTestCase):
def test_raw_and_probability_prediction(self):
data_path = "data/mllib/sample_multiclass_classification_data.txt"
df = self.spark.read.format("libsvm").load(data_path)
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3],
 blockSize=128, seed=123)
model = mlp.fit(df)
test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 
0.25))]).toDF()
result = model.transform(test).head()
expected_prediction = 2.0
expected_probability = [0.0, 0.0, 1.0]
expected_rawPrediction = [-11.6081922998, -8.15827998691, 
22.17757045]
self.assertTrue(result.prediction, expected_prediction)
self.assertTrue(np.allclose(result.probability, 
expected_probability, atol=1E-4))
self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
# self.assertTrue(np.allclose(result.rawPrediction, 
expected_rawPrediction, atol=1E-4))
{code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27361) YARN support for GPU-aware scheduling

2019-08-14 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907551#comment-16907551
 ] 

Parth Gandhi commented on SPARK-27361:
--

Makes sense, thank you.

> YARN support for GPU-aware scheduling
> -
>
> Key: SPARK-27361
> URL: https://issues.apache.org/jira/browse/SPARK-27361
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Design and implement YARN support for GPU-aware scheduling:
>  * User can request GPU resources at Spark application level.
>  * How the Spark executor discovers GPU's when run on YARN
>  * Integrate with YARN 3.2 GPU support.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28734) Create a table of content in the left hand side bar for SQL doc.

2019-08-14 Thread Dilip Biswal (JIRA)
Dilip Biswal created SPARK-28734:


 Summary: Create a table of content in the left hand side bar for 
SQL doc.
 Key: SPARK-28734
 URL: https://issues.apache.org/jira/browse/SPARK-28734
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 2.4.3
Reporter: Dilip Biswal






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28721) Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and Executors

2019-08-14 Thread Patrick Clay (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907522#comment-16907522
 ] 

Patrick Clay commented on SPARK-28721:
--

I confirmed this affects 2.4.1, and re-confirmed that it does not affect 2.4.0.

> Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and 
> Executors
> ---
>
> Key: SPARK-28721
> URL: https://issues.apache.org/jira/browse/SPARK-28721
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Patrick Clay
>Priority: Minor
>
> This does not seem to affect 2.4.0.
> To repro:
>  # Download pristine Spark 2.4.3 binary
>  # Edit pi.py to not call spark.stop()
>  # ./bin/docker-image-tool.sh -r MY_IMAGE -t MY_TAG build push
>  # spark-submit --master k8s://IP --deploy-mode cluster --conf 
> spark.kubernetes.driver.pod.name=spark-driver --conf 
> spark.kubernetes.container.image=MY_IMAGE:MY_TAG 
> file:/opt/spark/examples/src/main/python/pi.py
> The driver runs successfully and Python exits but the Driver and Executor 
> JVMs and Pods remain up.
>  
> I realize that explicitly calling spark.stop() is always best practice, but 
> since this does not repro in 2.4.0 it seems like a regression.
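
For anyone hitting this before a fix lands, a minimal PySpark sketch of the
workaround the report implies: stop the session explicitly, e.g. in a finally
block (the job body below is just a placeholder).

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi").getOrCreate()
try:
    pass  # actual job logic goes here
finally:
    # Without this explicit stop, 2.4.1+ cluster-mode PySpark on Kubernetes
    # leaves the driver and executor pods running, per the report above.
    spark.stop()
{code}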



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28721) Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and Executors

2019-08-14 Thread Patrick Clay (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Clay updated SPARK-28721:
-
Affects Version/s: 2.4.1

> Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and 
> Executors
> ---
>
> Key: SPARK-28721
> URL: https://issues.apache.org/jira/browse/SPARK-28721
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 2.4.1, 2.4.3
>Reporter: Patrick Clay
>Priority: Minor
>
> This does not seem to affect 2.4.0.
> To repro:
>  # Download pristine Spark 2.4.3 binary
>  # Edit pi.py to not call spark.stop()
>  # ./bin/docker-image-tool.sh -r MY_IMAGE -t MY_TAG build push
>  # spark-submit --master k8s://IP --deploy-mode cluster --conf 
> spark.kubernetes.driver.pod.name=spark-driver --conf 
> spark.kubernetes.container.image=MY_IMAGE:MY_TAG 
> file:/opt/spark/examples/src/main/python/pi.py
> The driver runs successfully and Python exits but the Driver and Executor 
> JVMs and Pods remain up.
>  
> I realize that explicitly calling spark.stop() is always best practice, but 
> since this does not repro in 2.4.0 it seems like a regression.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type

2019-08-14 Thread YoungGyu Chun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907514#comment-16907514
 ] 

YoungGyu Chun commented on SPARK-27931:
---

I'll be working on this

> Accept 'on' and 'off' as input for boolean data type
> 
>
> Key: SPARK-27931
> URL: https://issues.apache.org/jira/browse/SPARK-27931
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> This ticket contains three things:
>  1. Accept 'on' and 'off' as input for boolean data type
> {code:sql}
> SELECT cast('no' as boolean) AS false;
> SELECT cast('off' as boolean) AS false;
> {code}
> 2. Accept unique prefixes thereof:
> {code:sql}
> SELECT cast('of' as boolean) AS false;
> SELECT cast('fal' as boolean) AS false;
> {code}
> 3. Trim the string when cast to boolean type
> {code:sql}
> SELECT cast('true   ' as boolean) AS true;
> SELECT cast(' FALSE' as boolean) AS true;
> {code}
> More details:
>  [https://www.postgresql.org/docs/devel/datatype-boolean.html]
>  
> [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25]
>  
> [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48]
>  
> [https://github.com/postgres/postgres/commit/9729c9360886bee7feddc6a1124b0742de4b9f3d]
> Other DBs:
>  [http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html]
>  [https://my.vertica.com/docs/5.0/HTML/Master/2983.htm]
>  
> [https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27361) YARN support for GPU-aware scheduling

2019-08-14 Thread Thomas Graves (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907431#comment-16907431
 ] 

Thomas Graves commented on SPARK-27361:
---

Since that was done prior to this feature, I think it's OK to leave it alone. It 
worked just fine on its own to purely request that YARN allocate the resources 
(with no integration with the Spark scheduler, etc.). We did modify how that 
worked in https://issues.apache.org/jira/browse/SPARK-27959, so that one should 
be linked here, I think.

> YARN support for GPU-aware scheduling
> -
>
> Key: SPARK-27361
> URL: https://issues.apache.org/jira/browse/SPARK-27361
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Design and implement YARN support for GPU-aware scheduling:
>  * User can request GPU resources at Spark application level.
>  * How the Spark executor discovers GPU's when run on YARN
>  * Integrate with YARN 3.2 GPU support.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28701) add java11 support for spark pull request builds

2019-08-14 Thread shane knapp (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907394#comment-16907394
 ] 

shane knapp commented on SPARK-28701:
-

[~dongjoon] whoops!  i just fixed that build...

also, i'm hoping to get the [test-java11] flag working fully and merged in the 
next day or so...


> add java11 support for spark pull request builds
> 
>
> Key: SPARK-28701
> URL: https://issues.apache.org/jira/browse/SPARK-28701
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, jenkins
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
>
> from https://github.com/apache/spark/pull/25405
> add a PRB subject check for [test-java11] and update JAVA_HOME env var to 
> point to /usr/java/jdk-11.0.1



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27361) YARN support for GPU-aware scheduling

2019-08-14 Thread Parth Gandhi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907391#comment-16907391
 ] 

Parth Gandhi commented on SPARK-27361:
--

[~tgraves], I was just wondering whether 
https://issues.apache.org/jira/browse/SPARK-20327 should be a sub task in this 
Jira in order to have all components for YARN support for GPU scheduling under 
one umbrella. Thank you.

> YARN support for GPU-aware scheduling
> -
>
> Key: SPARK-27361
> URL: https://issues.apache.org/jira/browse/SPARK-27361
> Project: Spark
>  Issue Type: Story
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> Design and implement YARN support for GPU-aware scheduling:
>  * User can request GPU resources at Spark application level.
>  * How the Spark executor discovers GPU's when run on YARN
>  * Integrate with YARN 3.2 GPU support.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28687) Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`

2019-08-14 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28687.
---
Resolution: Fixed

Issue resolved by pull request 25408
[https://github.com/apache/spark/pull/25408]

> Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`
> 
>
> Key: SPARK-28687
> URL: https://issues.apache.org/jira/browse/SPARK-28687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, we support these fields for EXTRACT: CENTURY, MILLENNIUM, DECADE, 
> YEAR, QUARTER, MONTH, WEEK, DAY, DAYOFWEEK, HOUR, MINUTE, SECOND, DOW, 
> ISODOW, DOY.
> We also need to support: EPOCH, MICROSECONDS, MILLISECONDS, TIMEZONE, 
> TIMEZONE_M, TIMEZONE_H, ISOYEAR.
> https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT
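
A quick PySpark sketch of the syntax this change targets (field names as listed 
in the ticket; availability of individual fields may still change before the 
final 3.0.0 release):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative only: exercises the newly added extract() fields.
spark.sql("""
    SELECT
      EXTRACT(EPOCH        FROM TIMESTAMP '2019-08-14 00:00:00')   AS epoch,
      EXTRACT(ISOYEAR      FROM DATE '2019-08-14')                 AS isoyear,
      EXTRACT(MILLISECONDS FROM TIMESTAMP '2019-08-14 00:00:01.5') AS millis,
      EXTRACT(MICROSECONDS FROM TIMESTAMP '2019-08-14 00:00:01.5') AS micros
""").show()
{code}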



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27739) df.persist should save stats from optimized plan

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27739.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24623
[https://github.com/apache/spark/pull/24623]

> df.persist should save stats from optimized plan
> 
>
> Key: SPARK-27739
> URL: https://issues.apache.org/jira/browse/SPARK-27739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
> Fix For: 3.0.0
>
>
> CacheManager.cacheQuery passes the stats for `planToCache` to 
> InMemoryRelation. Since the plan has not been optimized, the stats are 
> inaccurate because project and filter have not been applied. I'd suggest 
> passing the stats from the optimized plan.
> {code:java}
> class CacheManager extends Logging {
> ...
>   def cacheQuery(
>   query: Dataset[_],
>   tableName: Option[String] = None,
>   storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
> val planToCache = query.logicalPlan
> if (lookupCachedData(planToCache).nonEmpty) {
>   logWarning("Asked to cache already cached data.")
> } else {
>   val sparkSession = query.sparkSession
>   val inMemoryRelation = InMemoryRelation(
> sparkSession.sessionState.conf.useCompression,
> sparkSession.sessionState.conf.columnBatchSize, storageLevel,
> sparkSession.sessionState.executePlan(planToCache).executedPlan,
> tableName,
> planToCache)  <==
> ...
> }
> object InMemoryRelation {
>   def apply(
>   useCompression: Boolean,
>   batchSize: Int,
>   storageLevel: StorageLevel,
>   child: SparkPlan,
>   tableName: Option[String],
>   logicalPlan: LogicalPlan): InMemoryRelation = {
> val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
> storageLevel, child, tableName)
> val relation = new InMemoryRelation(child.output, cacheBuilder, 
> logicalPlan.outputOrdering)
> relation.statsOfPlanToCache = logicalPlan.stats   <==
> relation
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28733) DataFrameReader of Spark not able to recognize the very first quote character, while custom unicode quote character is used

2019-08-14 Thread Mrinal Bhattacherjee (JIRA)
Mrinal Bhattacherjee created SPARK-28733:


 Summary: DataFrameReader of Spark not able to recognize the very 
first quote character, while custom unicode quote character is used
 Key: SPARK-28733
 URL: https://issues.apache.org/jira/browse/SPARK-28733
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.2
Reporter: Mrinal Bhattacherjee


I have encountered some strange behaviour recently while reading a CSV file 
using the DataFrameReader of the org.apache.spark.sql package (Spark version 
2.3.2). Here is my Spark read code snippet.

_{color:#d04437}val sepChar = "\u00C7" // Ç{color}_
_{color:#d04437}val quoteChar = "\u1E1C" // Ḝ{color}_
_{color:#d04437}val escapeChar = "\u1E1D" // ḝ{color}_
_{color:#d04437}val inputCsvFile = 
"\\input_ab.csv"{color}_
 
_{color:#d04437}val readDF = sparkSession.read.option("sep", sepChar){color}_
 _{color:#d04437}.option("encoding", encoding.toUpperCase){color}_
 _{color:#d04437}.option("quote", quoteChar){color}_
 _{color:#d04437}.option("escape", escapeChar){color}_
 _{color:#d04437}.option("header", "false"){color}_
 _{color:#d04437}.option("multiLine", "true"){color}_
 _{color:#d04437}.csv(inputCsvFile){color}_
 _{color:#d04437}readDF.cache(){color}_
 _{color:#d04437}readDF.show(20, false){color}_

Due to some awful data, I'm forced to use Unicode characters as the separator, 
quote, and escape characters instead of the default ones. Below is my sample 
input data.

{color:#33}*Ḝ1ḜÇḜsmithḜÇḜ5Ḝ*{color}
{color:#33}*Ḝ2ḜÇḜdousonḜÇḜ6Ḝ*{color}
{color:#33}*Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ*{color}

Here Ç is the field separator, Ḝ is the quote character, and all field values 
are wrapped with this custom quote character.

The problem I'm getting is that the first occurrence of the quote character is 
somehow not recognized by Spark. I tried characters other than Unicode, like 
` ~ X (the letter X just as a test scenario), and even the default quote (") as 
well. It works fine in all of those scenarios; the issue appears only when a 
Unicode character is used as the quote character. The first occurrence of the 
Unicode quote character comes through as a non-printable character ��, hence 
the closing quote character of the first field of the first record gets 
included in the data.

Here is the output of df.show():

+----+------------+-----+
|id  |name        |class|
+----+------------+-----+
|��1Ḝ|smith       |5    |
|2   |douson      |6    |
|3   |sr,tendulkar|10   |
+----+------------+-----+

It happens only for the first field of the very first record. The other quote 
characters in this file are read as expected without any issues. When I keep an 
extra empty record at the top of the file, i.e. simply a newline (\n) as the 
very first line, the issue doesn't occur, and that empty row is not even 
counted as an empty record in the df, so my problem gets solved. But this 
manipulation cannot be done in production, so it is still an issue worth 
worrying about.

I feel this is a bug. If it is not, kindly let me know how to process such a 
file without hitting this issue; otherwise, kindly provide a fix at the 
earliest. Thanks in advance.

Best Regards,
Mrinal



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing t

2019-08-14 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alix Métivier updated SPARK-28732:
--
Description: 
I am using the agg function on a dataset, and I want to count the number of 
rows per group. I would like to store the result of this count in an integer, 
but it fails with this output:

[ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 89, Column 53: No applicable constructor/method found 
for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
"java.lang.Integer(java.lang.String)"

Here is line 89 together with a few surrounding lines for context:

/* 085 */ long value13 = i.getLong(5);
 /* 086 */ argValue4 = value13;
 /* 087 */
 /* 088 */
 /* 089 */ final java.lang.Integer value12 = false ? null : new 
java.lang.Integer(argValue4);

 

As per the Integer documentation, there is no constructor taking a long, which 
is why the generated code fails.

 

Here is my code : 

org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1
 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
 org.apache.spark.sql.functions.col("o_year").as("o_yearN"))
 .agg(org.apache.spark.sql.functions.count("n_name").as("countN"),
 .as(org.apache.spark.sql.Encoders.bean(row2Struct.class));

The row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int.

If countN is a Long, the code above won't fail.

If it is an Int, it works in 1.6 and 2.0, but fails on version 2.1+.

 

  was:
I am using agg function on a dataset, and i want to count the number of lines 
upon grouping columns. I would like to store the result of this count in an 
integer, but it fails with this output : 

[ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 89, Column 53: No applicable constructor/method found 
for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
"java.lang.Integer(java.lang.String)"

Here is the line 89 and a few others to understand :

/* 085 */ long value13 = i.getLong(5);
 /* 086 */ argValue4 = value13;
 /* 087 */
 /* 088 */
 /* 089 */ final java.lang.Integer value12 = false ? null : new 
java.lang.Integer(argValue4);

 

As per Integer documentation, there is not constructor for the type Long, so 
this is why the generated code fails.

 

Here is my code : 

org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1
 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
 org.apache.spark.sql.functions.col("o_year").as("o_yearN"))
 .agg(org.apache.spark.sql.functions.count("n_name").as("countN"),
 .as(org.apache.spark.sql.Encoders.bean(row2Struct.class));

row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int

If countN is a Long, code above wont fail

If it is a Long, it works in 1.6 and 2.0, but fails on version 2.1+

 


> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java' when storing the result of a count aggregation in an integer
> ---
>
> Key: SPARK-28732
> URL: https://issues.apache.org/jira/browse/SPARK-28732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Alix Métivier
>Priority: Blocker
>
> I am using agg function on a dataset, and i want to count the number of lines 
> upon grouping columns. I would like to store the result of this count in an 
> integer, but it fails with this output : 
> [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 89, Column 53: No applicable constructor/method found 
> for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
> "java.lang.Integer(java.lang.String)"
> Here is the line 89 and a few others to understand :
> /* 085 */ long value13 = i.getLong(5);
>  /* 086 */ argValue4 = value13;
>  /* 087 */
>  /* 088 */
>  /* 089 */ final java.lang.Integer value12 = false ? null : new 
> java.lang.Integer(argValue4);
>  
> As per Integer documentation, there is not constructor for the type Long, so 
> this is why the generated code fails.
>  
> Here is my code : 
> org.apache.spark.sql.Dataset ds_row2 = 
> ds_conntAggregateRow_1_Out_1
>  .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
>  org.apache.spark.sql.fun

[jira] [Updated] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing t

2019-08-14 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alix Métivier updated SPARK-28732:
--
Description: 
I am using agg function on a dataset, and i want to count the number of lines 
upon grouping columns. I would like to store the result of this count in an 
integer, but it fails with this output : 

[ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 89, Column 53: No applicable constructor/method found 
for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
"java.lang.Integer(java.lang.String)"

Here is the line 89 and a few others to understand :

/* 085 */ long value13 = i.getLong(5);
 /* 086 */ argValue4 = value13;
 /* 087 */
 /* 088 */
 /* 089 */ final java.lang.Integer value12 = false ? null : new 
java.lang.Integer(argValue4);

 

As per Integer documentation, there is not constructor for the type Long, so 
this is why the generated code fails.

 

Here is my code : 

org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1
 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
 org.apache.spark.sql.functions.col("o_year").as("o_yearN"))
 .agg(org.apache.spark.sql.functions.count("n_name").as("countN"),
 .as(org.apache.spark.sql.Encoders.bean(row2Struct.class));

row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int

If countN is a Long, code above wont fail

If it is a Long, it works in 1.6 and 2.0, but fails on version 2.1+

 

  was:
I am using agg function on a dataset, and i want to count the number of lines 
upon grouping columns. I would like to store the result of this count in an 
integer, but it fails with this output : 

[ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 89, Column 53: No applicable constructor/method found 
for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
"java.lang.Integer(java.lang.String)"

Here is the line 89 and a few others to understand :

/* 085 */ long value13 = i.getLong(5);
/* 086 */ argValue4 = value13;
/* 087 */
/* 088 */
/* 089 */ final java.lang.Integer value12 = false ? null : new 
java.lang.Integer(argValue4);

 

As per Integer documentation, there is not constructor for the type Long, so 
this is why the generated code fails.

 

Here is my code : 

org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1
 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
 org.apache.spark.sql.functions.col("o_year").as("o_yearN"))
 .agg(org.apache.spark.sql.functions.count("n_name").as("countN"),
 .as(org.apache.spark.sql.Encoders.bean(row2Struct.class));

row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int

If countN is a Long, code above wont fail

If it is a Long, is works in 1.6 and 2.0, but fails on version 2.1+

 


> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java' when storing the result of a count aggregation in an integer
> ---
>
> Key: SPARK-28732
> URL: https://issues.apache.org/jira/browse/SPARK-28732
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0
>Reporter: Alix Métivier
>Priority: Blocker
>
> I am using agg function on a dataset, and i want to count the number of lines 
> upon grouping columns. I would like to store the result of this count in an 
> integer, but it fails with this output : 
> [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
> failed to compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 89, Column 53: No applicable constructor/method found 
> for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
> "java.lang.Integer(java.lang.String)"
> Here is the line 89 and a few others to understand :
> /* 085 */ long value13 = i.getLong(5);
>  /* 086 */ argValue4 = value13;
>  /* 087 */
>  /* 088 */
>  /* 089 */ final java.lang.Integer value12 = false ? null : new 
> java.lang.Integer(argValue4);
>  
> As per Integer documentation, there is not constructor for the type Long, so 
> this is why the generated code fails.
>  
> Here is my code : 
> org.apache.spark.sql.Dataset ds_row2 = 
> ds_conntAggregateRow_1_Out_1
>  .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
>  org.apache.spark.sql.functio

[jira] [Created] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing t

2019-08-14 Thread JIRA
Alix Métivier created SPARK-28732:
-

 Summary: 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' 
when storing the result of a count aggregation in an integer
 Key: SPARK-28732
 URL: https://issues.apache.org/jira/browse/SPARK-28732
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0
Reporter: Alix Métivier


I am using agg function on a dataset, and i want to count the number of lines 
upon grouping columns. I would like to store the result of this count in an 
integer, but it fails with this output : 

[ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - 
failed to compile: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 89, Column 53: No applicable constructor/method found 
for actual parameters "long"; candidates are: "java.lang.Integer(int)", 
"java.lang.Integer(java.lang.String)"

Here is the line 89 and a few others to understand :

/* 085 */ long value13 = i.getLong(5);
/* 086 */ argValue4 = value13;
/* 087 */
/* 088 */
/* 089 */ final java.lang.Integer value12 = false ? null : new 
java.lang.Integer(argValue4);

 

As per Integer documentation, there is not constructor for the type Long, so 
this is why the generated code fails.

 

Here is my code : 

org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1
 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"),
 org.apache.spark.sql.functions.col("o_year").as("o_yearN"))
 .agg(org.apache.spark.sql.functions.count("n_name").as("countN"),
 .as(org.apache.spark.sql.Encoders.bean(row2Struct.class));

row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int

If countN is a Long, code above wont fail

If it is a Long, is works in 1.6 and 2.0, but fails on version 2.1+

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28731) Support limit on recursive queries

2019-08-14 Thread Peter Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peter Toth updated SPARK-28731:
---
Description: Recursive queries should support LIMIT and stop recursion once 
the required number of rows has been reached.  (was: PostgreSQL does support 
recursive view syntax:
{noformat}
CREATE RECURSIVE VIEW nums (n) AS
  VALUES (1)
  UNION ALL
  SELECT n+1 FROM nums WHERE n < 5
{noformat})

> Support limit on recursive queries
> --
>
> Key: SPARK-28731
> URL: https://issues.apache.org/jira/browse/SPARK-28731
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Peter Toth
>Priority: Minor
>
> Recursive queries should support LIMIT and stop recursion once the required 
> number of rows has been reached.
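
Until the SQL-side support lands, a rough PySpark sketch of the intended 
semantics (purely illustrative: it emulates the recursion with a driver-side 
loop instead of the proposed planner support, and the limit value is arbitrary):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

limit = 5
result = spark.range(1, 2).toDF("n")   # anchor term: VALUES (1)
step = result
while result.count() < limit:          # stop expanding once enough rows exist
    step = step.select((F.col("n") + 1).alias("n"))   # recursive term: SELECT n + 1
    result = result.union(step)
result.limit(limit).show()
{code}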



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28731) Support limit on recursive queries

2019-08-14 Thread Peter Toth (JIRA)
Peter Toth created SPARK-28731:
--

 Summary: Support limit on recursive queries
 Key: SPARK-28731
 URL: https://issues.apache.org/jira/browse/SPARK-28731
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Peter Toth


PostgreSQL does support recursive view syntax:
{noformat}
CREATE RECURSIVE VIEW nums (n) AS
  VALUES (1)
  UNION ALL
  SELECT n+1 FROM nums WHERE n < 5
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28730) Configurable type coercion policy for table insertion

2019-08-14 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-28730:
--

 Summary: Configurable type coercion policy for table insertion
 Key: SPARK-28730
 URL: https://issues.apache.org/jira/browse/SPARK-28730
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


After all the discussions in the dev list: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
 
Here I propose that we can make the store assignment rules in the analyzer 
configurable, and the behavior of V1 and V2 should be consistent.
When inserting a value into a column with a different data type, Spark will 
perform type coercion. After this PR, we support 2 policies for the type 
coercion rules: 
legacy and strict. 
1. With legacy policy, Spark allows casting any value to any data type and null 
result is returned when the conversion is invalid. The legacy policy is the 
only behavior in Spark 2.x and it is compatible with Hive. 
2. With strict policy, Spark doesn't allow any possible precision loss or data 
truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not 
allowed.

To ensure backward compatibility with existing queries, the default store 
assignment policy is "legacy".
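
A small PySpark sketch of how the proposal would look from the user side (the 
config key spark.sql.storeAssignmentPolicy is an assumption of this sketch; the 
ticket only says the policy should be configurable):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical config key -- the final name may differ.
spark.conf.set("spark.sql.storeAssignmentPolicy", "strict")

spark.sql("CREATE TABLE t (i INT) USING parquet")
# Under the strict policy this insert should be rejected at analysis time,
# because DOUBLE -> INT risks precision loss; under the legacy policy the value
# would simply be cast (possibly yielding null for invalid conversions).
spark.sql("INSERT INTO t VALUES (CAST(1.5 AS DOUBLE))")
{code}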



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27739) df.persist should save stats from optimized plan

2019-08-14 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-27739:
---

Assignee: John Zhuge

> df.persist should save stats from optimized plan
> 
>
> Key: SPARK-27739
> URL: https://issues.apache.org/jira/browse/SPARK-27739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Minor
>
> CacheManager.cacheQuery passes the stats for `planToCache` to 
> InMemoryRelation. Since the plan has not been optimized, the stats are 
> inaccurate because project and filter have not been applied. I'd suggest 
> passing the stats from the optimized plan.
> {code:java}
> class CacheManager extends Logging {
> ...
>   def cacheQuery(
>   query: Dataset[_],
>   tableName: Option[String] = None,
>   storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
> val planToCache = query.logicalPlan
> if (lookupCachedData(planToCache).nonEmpty) {
>   logWarning("Asked to cache already cached data.")
> } else {
>   val sparkSession = query.sparkSession
>   val inMemoryRelation = InMemoryRelation(
> sparkSession.sessionState.conf.useCompression,
> sparkSession.sessionState.conf.columnBatchSize, storageLevel,
> sparkSession.sessionState.executePlan(planToCache).executedPlan,
> tableName,
> planToCache)  <==
> ...
> }
> object InMemoryRelation {
>   def apply(
>   useCompression: Boolean,
>   batchSize: Int,
>   storageLevel: StorageLevel,
>   child: SparkPlan,
>   tableName: Option[String],
>   logicalPlan: LogicalPlan): InMemoryRelation = {
> val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
> storageLevel, child, tableName)
> val relation = new InMemoryRelation(child.output, cacheBuilder, 
> logicalPlan.outputOrdering)
> relation.statsOfPlanToCache = logicalPlan.stats   <==
> relation
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-28728:
-
Description: (was: Due to CVE's: 
https://www.cvedetails.com/vulnerability-list/vendor_id-15866/product_id-42991/version_id-238179/opec-1/Fasterxml-Jackson-databind-2.9.0.html)

> Bump Jackson Databind to 2.9.9.3
> 
>
> Key: SPARK-28728
> URL: https://issues.apache.org/jira/browse/SPARK-28728
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Fokko Driesprong (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fokko Driesprong updated SPARK-28728:
-
Description: Needs to be upgraded due to issues.

> Bump Jackson Databind to 2.9.9.3
> 
>
> Key: SPARK-28728
> URL: https://issues.apache.org/jira/browse/SPARK-28728
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Fokko Driesprong
>Priority: Major
> Fix For: 2.4.4, 3.0.0
>
>
> Needs to be upgraded due to issues.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28726) Spark with DynamicAllocation always got connect rest by peers

2019-08-14 Thread angerszhu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907131#comment-16907131
 ] 

angerszhu commented on SPARK-28726:
---

[~ajithshetty] it also happens with higher timeouts

> Spark with DynamicAllocation always got connect rest by peers
> -
>
> Key: SPARK-28726
> URL: https://issues.apache.org/jira/browse/SPARK-28726
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When using Spark with dynamic allocation, we set the idle time to 5s.
> We always get netty exceptions: 'Connection reset by peer'.
>  
> I suspect that it's because the 5s idle time we set is too small: by the time 
> the BlockManager makes the netty I/O call, the executor has already been 
> removed because of the timeout,
> but the driver's BlockManager is not notified in time.
> {code:java}
> 19/08/14 00:00:46 WARN 
> org.apache.spark.network.server.TransportChannelHandler: "Exception in 
> connection from /host:port"
> java.io.IOException: Connection reset by peer
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>  at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>  at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
>  at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: 
> "Error trying to remove broadcast 67 from block manager BlockManagerId(967, 
> host, port, None)"
> java.io.IOException: Connection reset by peer
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>  at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>  at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
>  at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 INFO org.apache.spark.ContextCleaner: "Cleaned accumulator 
> 162174"
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMaster: "Failed 
> to remove shuffle 22 - Connection reset by peer"
> java.io.IOException: Connection reset by peer
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39){code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28726) Spark with DynamicAllocation always got connect rest by peers

2019-08-14 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907112#comment-16907112
 ] 

Ajith S commented on SPARK-28726:
-

As I see it, this is the driver trying to clean up RDDs, broadcasts, etc. on the 
expiring executor while the executor has already gone down, which is why such 
exceptions are only logged as warnings. Does the issue occur with higher timeouts too?
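
For reference, a minimal sketch of the configuration in question, assuming a plain 
SparkSession-based job; the config keys are standard Spark settings, but the 60s 
value is only illustrative, not a recommendation from this issue:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Illustrative values only: 5s is the idle timeout reported in the issue,
// 60s is simply a more forgiving value to test the question above.
val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // external shuffle service is required for dynamic allocation
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // instead of the reported 5s

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}

With a larger idle timeout the executor stays registered longer, so the driver-side 
cleanup described above is less likely to race against the executor shutdown.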

> Spark with DynamicAllocation always got connect rest by peers
> -
>
> Key: SPARK-28726
> URL: https://issues.apache.org/jira/browse/SPARK-28726
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When using Spark with dynamic allocation, we set the executor idle timeout to 5s.
> We always get netty exceptions of the form 'Connection reset by peer'.
>
> I suspect the 5s idle timeout is too small: by the time the BlockManager issues
> a netty IO call, the executor has already been removed because of the timeout,
> but the driver's BlockManager has not been notified in time.
> {code:java}
> 19/08/14 00:00:46 WARN 
> org.apache.spark.network.server.TransportChannelHandler: "Exception in 
> connection from /host:port"
> java.io.IOException: Connection reset by peer
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>  at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>  at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
>  at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: 
> "Error trying to remove broadcast 67 from block manager BlockManagerId(967, 
> host, port, None)"
> java.io.IOException: Connection reset by peer
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>  at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
>  at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
>  at 
> io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
>  at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
>  at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
>  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
>  at 
> io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>  at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 INFO org.apache.spark.ContextCleaner: "Cleaned accumulator 
> 162174"
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMaster: "Failed 
> to remove shuffle 22 - Connection reset by peer"
> java.io.IOException: Connection reset by peer
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39){code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28729) Comparison between DecimalType and StringType may lead to wrong results

2019-08-14 Thread ShuMing Li (JIRA)
ShuMing Li created SPARK-28729:
--

 Summary: Comparison between  DecimalType and StringType may lead 
to wrong results
 Key: SPARK-28729
 URL: https://issues.apache.org/jira/browse/SPARK-28729
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: ShuMing Li


 
{code:java}
desc test_table;
a int NULL
b string NULL
dt string NULL
hh string NULL
# Partition Information
# col_name data_type comment
dt string NULL
hh string NULL

select dt from test_table where dt=201908010023825200017638;
201908010023825200017638
201908010023825200017638
201908010023825200016558
{code}
In the SQL above, column `dt` is of string type. When users forget to quote the 
literal in the query, Spark returns wrong results.

In the `TypeCoercion` class, both DecimalType and StringType are cast to 
`DoubleType` when a DecimalType is compared with a StringType, which may not be 
safe because of precision loss or truncation.
{code:java}
val findCommonTypeForBinaryComparison: (DataType, DataType) => Option[DataType] = {

  // There is no proper decimal type we can pick,
  // using double type is the best we can do.
  // See SPARK-22469 for details.
  case (n: DecimalType, s: StringType) => Some(DoubleType)
  case (s: StringType, n: DecimalType) => Some(DoubleType)

  ...
}
{code}
However, I cannot find a good solution to avoid this: maybe just throw an 
exception when precision loss occurs, or add a config to control the behaviour?
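
A minimal spark-shell sketch of the effect, using the partition values from the 
example above (no table is needed, only the cast behaviour is shown): a double 
carries roughly 16 significant decimal digits, so once both sides are cast the two 
distinct partition values collapse to the same number, while quoting the literal 
keeps the comparison on strings.

{code:scala}
// The two partition values differ only in their trailing digits, which lie
// beyond the ~16 significant decimal digits a double can represent.
val d1 = "201908010023825200017638".toDouble
val d2 = "201908010023825200016558".toDouble
println(d1 == d2) // true: after the implicit cast the predicate cannot tell them apart

// Quoting the literal keeps the comparison in string space and matches only
// the intended partition:
//   SELECT dt FROM test_table WHERE dt = '201908010023825200017638'
{code}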



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2019-08-14 Thread Gabor Somogyi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907016#comment-16907016
 ] 

Gabor Somogyi commented on SPARK-28367:
---

It has turned out that a new API on the Kafka side is needed for a clean 
solution. The discussion has been initiated; I'm actively tracking the progress 
and intend to create a new PR when it's available.

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API, poll(long), which never returns and stays 
> in a live lock if the metadata is never updated (for instance when the broker 
> disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment
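
For context, a minimal sketch of the deprecated poll(long) call referenced in the 
description and its bounded successor poll(Duration) (KIP-266, Kafka 2.0). The 
broker address, topic, and group id are placeholders, and this is not the new API 
the comment above refers to:

{code:scala}
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

// Placeholder consumer configuration.
val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "poll-timeout-test")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("topic"))

// Deprecated poll(long): the timeout does not cover the metadata fetch, so the
// call can block indefinitely when the broker is unreachable at consumer creation.
// val records = consumer.poll(0L)

// poll(Duration): the entire call, including metadata updates, is bounded.
val records = consumer.poll(Duration.ofSeconds(10))
{code}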



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28728) Bump Jackson Databind to 2.9.9.3

2019-08-14 Thread Fokko Driesprong (JIRA)
Fokko Driesprong created SPARK-28728:


 Summary: Bump Jackson Databind to 2.9.9.3
 Key: SPARK-28728
 URL: https://issues.apache.org/jira/browse/SPARK-28728
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Affects Versions: 2.4.3
Reporter: Fokko Driesprong
 Fix For: 2.4.4, 3.0.0


Due to CVE's: 
https://www.cvedetails.com/vulnerability-list/vendor_id-15866/product_id-42991/version_id-238179/opec-1/Fasterxml-Jackson-databind-2.9.0.html
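
For downstream users who need the patched artifact before a Spark release, a 
hedged build.sbt-style sketch of pinning the dependency; Spark itself manages the 
version in its Maven pom, so this is not the change made by this issue:

{code:scala}
// build.sbt fragment: force the patched jackson-databind on the application classpath.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.9.3"
{code}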



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28727) Request for partial least square (PLS) regression model

2019-08-14 Thread Nikunj (JIRA)
Nikunj created SPARK-28727:
--

 Summary: Request for partial least square (PLS) regression model
 Key: SPARK-28727
 URL: https://issues.apache.org/jira/browse/SPARK-28727
 Project: Spark
  Issue Type: New Feature
  Components: ML, SparkR
Affects Versions: 2.4.3
 Environment: I am using Windows 10, Spark v2.3.2
Reporter: Nikunj


Hi.

Is there any development going on with regard to a PLS (partial least squares) 
regression model, or is there a plan for one in the near future? The application 
I am developing needs a PLS model, as it is mandatory in that particular industry. 
I am using sparklyr and have started a bit of the implementation, but was wondering 
if something is already in the pipeline.

Thanks.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28726) Spark with DynamicAllocation always got connect rest by peers

2019-08-14 Thread angerszhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-28726:
--
Description: 
When using Spark with dynamic allocation, we set the executor idle timeout to 5s.

We always get netty exceptions of the form 'Connection reset by peer'.

I suspect the 5s idle timeout is too small: by the time the BlockManager issues a 
netty IO call, the executor has already been removed because of the timeout, but 
the driver's BlockManager has not been notified in time.
{code:java}

19/08/14 00:00:46 WARN org.apache.spark.network.server.TransportChannelHandler: 
"Exception in connection from /host:port"
java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
 at sun.nio.ch.IOUtil.read(IOUtil.java:192)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
 at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
--
19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: 
"Error trying to remove broadcast 67 from block manager BlockManagerId(967, 
host, port, None)"
java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
 at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
 at sun.nio.ch.IOUtil.read(IOUtil.java:192)
 at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
 at 
io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
 at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
 at 
io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
 at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
 at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
 at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
 at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
 at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
 at 
io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
 at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
--
19/08/14 00:00:46 INFO org.apache.spark.ContextCleaner: "Cleaned accumulator 
162174"
19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMaster: "Failed to 
remove shuffle 22 - Connection reset by peer"
java.io.IOException: Connection reset by peer
 at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
 at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39){code}

> Spark with DynamicAllocation always got connect rest by peers
> -
>
> Key: SPARK-28726
> URL: https://issues.apache.org/jira/browse/SPARK-28726
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When using Spark with dynamic allocation, we set the executor idle timeout to 5s.
> We always get netty exceptions of the form 'Connection reset by peer'.
>
> I suspect the 5s idle timeout is too small: by the time the BlockManager issues
> a netty IO call, the executor has already been removed because of the timeout,
> but the driver's BlockManager has not been notified in time.
> {code:java}
> 19/08/14 00:00:46 WARN 
> org.apache.spark.network.server.TransportChannelHandler: "Exception in 
> connection from /host:port"
> java.io.IOException: Connection reset by peer
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
>  at sun.nio.ch.IOUtil.read(IOUtil.java:192)
>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
>  at 
> io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledU