[jira] [Resolved] (SPARK-28695) Make Kafka source more robust with CaseInsensitiveMap
[ https://issues.apache.org/jira/browse/SPARK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28695. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25418 [https://github.com/apache/spark/pull/25418] > Make Kafka source more robust with CaseInsensitiveMap > - > > Key: SPARK-28695 > URL: https://issues.apache.org/jira/browse/SPARK-28695 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > Fix For: 3.0.0 > > > SPARK-28163 fixed a bug and during the analysis we've concluded it would be > more robust to use CaseInsensitiveMap inside the Kafka source. This way, fewer > lower/upper-case problems would arise in the future. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
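For illustration, a minimal sketch of what the change enables, using Spark's org.apache.spark.sql.catalyst.util.CaseInsensitiveMap (the option keys below are made up): wrapping the user-supplied options makes every lookup case-insensitive, so the source no longer depends on callers spelling option keys in any particular case.

{code:scala}
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// User-supplied options may arrive with arbitrary key casing.
val raw = Map("kafka.Bootstrap.Servers" -> "broker:9092", "STARTINGOFFSETS" -> "earliest")

// CaseInsensitiveMap behaves like a Map[String, String] whose key
// lookups ignore case, so both spellings resolve to the same entries.
val opts = CaseInsensitiveMap(raw)
assert(opts.get("kafka.bootstrap.servers").contains("broker:9092"))
assert(opts("startingOffsets") == "earliest")
{code}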
[jira] [Assigned] (SPARK-28695) Make Kafka source more robust with CaseInsensitiveMap
[ https://issues.apache.org/jira/browse/SPARK-28695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28695: --- Assignee: Gabor Somogyi > Make Kafka source more robust with CaseInsensitiveMap > - > > Key: SPARK-28695 > URL: https://issues.apache.org/jira/browse/SPARK-28695 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Minor > > SPARK-28163 fixed a bug and during the analysis we've concluded it would be > more robust to use CaseInsensitiveMap inside the Kafka source. This way, fewer > lower/upper-case problems would arise in the future. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11
[ https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907821#comment-16907821 ] Hyukjin Kwon commented on SPARK-28735: -- Let me take a look tomorrow in KST. > MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails > on JDK11 > - > > Key: SPARK-28735 > URL: https://issues.apache.org/jira/browse/SPARK-28735 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` > failed. > {code} > $ build/sbt -Phadoop-3.2 test:package > $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' > --python-executables python > ... > == > FAIL: test_raw_and_probability_prediction > (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest) > -- > Traceback (most recent call last): > File > "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", > line 89, in test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1E-4)) > AssertionError: False is not true > {code} > {code:python} > class MultilayerPerceptronClassifierTest(SparkSessionTestCase): > def test_raw_and_probability_prediction(self): > data_path = "data/mllib/sample_multiclass_classification_data.txt" > df = self.spark.read.format("libsvm").load(data_path) > mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], > blockSize=128, seed=123) > model = mlp.fit(df) > test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, > 0.25, 0.25))]).toDF() > result = model.transform(test).head() > expected_prediction = 2.0 > expected_probability = [0.0, 0.0, 1.0] > expected_rawPrediction = [-11.6081922998, -8.15827998691, > 22.17757045] > self.assertTrue(result.prediction, expected_prediction) > self.assertTrue(np.allclose(result.probability, > expected_probability, atol=1E-4)) > self.assertTrue(np.allclose(result.rawPrediction, > expected_rawPrediction, atol=1E-4)) > # self.assertTrue(np.allclose(result.rawPrediction, > expected_rawPrediction, atol=1E-4)) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
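For reference, np.allclose(a, b, atol=1E-4) passes only when every element pair satisfies |a - b| <= atol + rtol * |b|, with NumPy's default rtol of 1e-5, so a rawPrediction component that drifts by much more than 1e-4 under JDK11 is enough to trip the test. A Scala rendering of that check, assuming NumPy's documented formula:

{code:scala}
// Elementwise tolerance test equivalent to np.allclose(a, b, atol=1e-4),
// assuming NumPy's documented formula and its default rtol = 1e-5.
def allClose(a: Array[Double], b: Array[Double],
             atol: Double = 1e-4, rtol: Double = 1e-5): Boolean =
  a.length == b.length &&
    a.zip(b).forall { case (x, y) => math.abs(x - y) <= atol + rtol * math.abs(y) }
{code}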
[jira] [Created] (SPARK-28740) Add support for building with bloop
holdenk created SPARK-28740: --- Summary: Add support for building with bloop Key: SPARK-28740 URL: https://issues.apache.org/jira/browse/SPARK-28740 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: holdenk Bloop can, in theory, build Scala faster. However, the JAR layout is a little different when you try to run the tests. It would be useful if we updated our test JAR discovery to work with bloop. Before working on this, check whether bloop itself has changed to work with Spark. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28666) Support the V2SessionCatalog in saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-28666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28666. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25402 [https://github.com/apache/spark/pull/25402] > Support the V2SessionCatalog in saveAsTable > --- > > Key: SPARK-28666 > URL: https://issues.apache.org/jira/browse/SPARK-28666 > Project: Spark > Issue Type: Planned Work > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Blocker > Fix For: 3.0.0 > > > We need to support the V2SessionCatalog in the old saveAsTable code paths so > that V2 DataSources can leverage the old DataFrameWriter code path. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
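For context, the user-facing path this covers, as a sketch (assuming a SparkSession named spark; the format name is a hypothetical DataSourceV2 implementation, not a built-in source):

{code:scala}
// Writing through the old DataFrameWriter.saveAsTable path; with this
// change, a V2 source should resolve via the V2SessionCatalog.
spark.range(10)
  .write
  .format("com.example.MyV2Source") // assumption: some DataSourceV2 class
  .mode("overwrite")
  .saveAsTable("demo_table")
{code}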
[jira] [Assigned] (SPARK-28666) Support the V2SessionCatalog in saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-28666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28666: --- Assignee: Burak Yavuz > Support the V2SessionCatalog in saveAsTable > --- > > Key: SPARK-28666 > URL: https://issues.apache.org/jira/browse/SPARK-28666 > Project: Spark > Issue Type: Planned Work > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Assignee: Burak Yavuz >Priority: Blocker > > We need to support the V2SessionCatalog in the old saveAsTable code paths so > that V2 DataSources can leverage the old DataFrameWriter code path. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28351) Support DELETE in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-28351: --- Assignee: Xianyin Xin > Support DELETE in DataSource V2 > --- > > Key: SPARK-28351 > URL: https://issues.apache.org/jira/browse/SPARK-28351 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > > This ticket adds DELETE support for V2 data sources. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28351) Support DELETE in DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-28351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-28351. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25115 [https://github.com/apache/spark/pull/25115] > Support DELETE in DataSource V2 > --- > > Key: SPARK-28351 > URL: https://issues.apache.org/jira/browse/SPARK-28351 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Xianyin Xin >Assignee: Xianyin Xin >Priority: Major > Fix For: 3.0.0 > > > This ticket adds DELETE support for V2 data sources. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
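What this enables at the SQL surface, sketched with illustrative identifiers (the target table's catalog and source must implement the V2 delete support):

{code:scala}
// DELETE delegated to a V2 data source; names are illustrative only.
spark.sql("DELETE FROM testcat.ns.events WHERE event_date < '2019-01-01'")
{code}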
[jira] [Resolved] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap
[ https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28203. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 25002 [https://github.com/apache/spark/pull/25002] > PythonRDD should respect SparkContext's conf when passing user confMap > -- > > Key: SPARK-28203 > URL: https://issues.apache.org/jira/browse/SPARK-28203 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.3 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > Fix For: 3.0.0 > > > PythonRDD has several APIs which accept user configs from the Python side. The > parameter is called confAsMap, and it is intended to be merged with the RDD's Hadoop > configuration. > However, the confAsMap is first mapped to a Configuration and then merged into > SparkContext's Hadoop configuration. The mapped Configuration will load > default key values from core-default.xml etc., which may have been updated in > SparkContext's Hadoop configuration. The default values will then override the updated > values in the merge process. > I will submit a PR to fix this. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
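A self-contained sketch of the merge-order problem described above (the Hadoop key is real; the merge loop is a simplification, not the actual PythonRDD code):

{code:scala}
import org.apache.hadoop.conf.Configuration

// Simulate SparkContext's hadoop conf carrying a user override.
val scHadoopConf = new Configuration()
scHadoopConf.set("io.file.buffer.size", "131072") // user tuned this

// confAsMap mapped to a *fresh* Configuration: the fresh instance loads
// core-default.xml, including the default io.file.buffer.size = 4096.
val fromPython = new Configuration()
fromPython.set("my.user.key", "v")

// Copying every entry (defaults included) clobbers the user's override.
fromPython.iterator().forEachRemaining { e =>
  scHadoopConf.set(e.getKey, e.getValue)
}
println(scHadoopConf.get("io.file.buffer.size")) // 4096, not 131072
{code}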
[jira] [Assigned] (SPARK-28203) PythonRDD should respect SparkContext's conf when passing user confMap
[ https://issues.apache.org/jira/browse/SPARK-28203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-28203: Assignee: Xianjin YE > PythonRDD should respect SparkContext's conf when passing user confMap > -- > > Key: SPARK-28203 > URL: https://issues.apache.org/jira/browse/SPARK-28203 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.3 >Reporter: Xianjin YE >Assignee: Xianjin YE >Priority: Minor > > PythonRDD has several APIs which accept user configs from the Python side. The > parameter is called confAsMap, and it is intended to be merged with the RDD's Hadoop > configuration. > However, the confAsMap is first mapped to a Configuration and then merged into > SparkContext's Hadoop configuration. The mapped Configuration will load > default key values from core-default.xml etc., which may have been updated in > SparkContext's Hadoop configuration. The default values will then override the updated > values in the merge process. > I will submit a PR to fix this. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28739) Add a simple cost check for Adaptive Query Execution
Maryann Xue created SPARK-28739: --- Summary: Add a simple cost check for Adaptive Query Execution Key: SPARK-28739 URL: https://issues.apache.org/jira/browse/SPARK-28739 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maryann Xue Add a mechanism to compare the costs of the before and after plans of re-optimization in Adaptive Query Execution. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
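One plausible shape for such a check, purely as an illustration and not the evaluator the ticket will actually add: count the shuffle exchanges in each physical plan and keep the re-optimized plan only when it is no worse.

{code:scala}
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Illustrative cost: the number of shuffle exchanges in a physical plan.
def shuffleCount(plan: SparkPlan): Int =
  plan.collect { case s: ShuffleExchangeExec => s }.size

// Accept the re-optimized plan only if it does not add shuffles.
def acceptNewPlan(before: SparkPlan, after: SparkPlan): Boolean =
  shuffleCount(after) <= shuffleCount(before)
{code}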
[jira] [Updated] (SPARK-28723) Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile
[ https://issues.apache.org/jira/browse/SPARK-28723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28723: -- Parent Issue: SPARK-28684 (was: SPARK-24417) > Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile > - > > Key: SPARK-28723 > URL: https://issues.apache.org/jira/browse/SPARK-28723 > Project: Spark > Issue Type: Sub-task > Components: Build, SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Yuming Wang >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28710) [UDF] create or replace permanent function does not clear the jar in class path
[ https://issues.apache.org/jira/browse/SPARK-28710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28710: -- Description: {code} 0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 'hdfs://hacluster/user/AddDoublesUDF.jar'; +-+ | Result | +-+ +-+ No rows selected (0.216 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> create or replace function addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 'hdfs://hacluster/user/Multiply.jar'; +-+ | Result | +-+ +-+ No rows selected (0.292 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3); INFO : Added [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to class path INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] INFO : Added [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to class path INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] Error: org.apache.spark.sql.AnalysisException: Can not load class 'com.huawei.bigdata.hive.example.udf.multiply' when registering the function 'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 (state=,code=0) {code} was: 0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar 'hdfs://hacluster/user/AddDoublesUDF.jar'; +-+ | Result | +-+ +-+ No rows selected (0.216 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> create or replace function addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar 'hdfs://hacluster/user/Multiply.jar'; +-+ | Result | +-+ +-+ No rows selected (0.292 seconds) 0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3); INFO : Added [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to class path INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] INFO : Added [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to class path INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] Error: org.apache.spark.sql.AnalysisException: Can not load class 'com.huawei.bigdata.hive.example.udf.multiply' when registering the function 'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 (state=,code=0) > [UDF] create or replace permanent function does not clear the jar in class > path > --- > > Key: SPARK-28710 > URL: https://issues.apache.org/jira/browse/SPARK-28710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > {code} > 0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS > 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar > 'hdfs://hacluster/user/AddDoublesUDF.jar'; > +-+ > | Result | > +-+ > +-+ > No rows selected (0.216 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> create or replace function > addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar > 'hdfs://hacluster/user/Multiply.jar'; > +-+ > | Result | > +-+ > +-+ > No rows selected (0.292 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3); > INFO : Added > [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to > class path > INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] > INFO : Added > [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to > class path
> INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] > Error: org.apache.spark.sql.AnalysisException: Can not load class > 'com.huawei.bigdata.hive.example.udf.multiply' when registering the function > 'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 > (state=,code=0) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28710) [UDF] create or replace permanent function does not clear the jar in class path
[ https://issues.apache.org/jira/browse/SPARK-28710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907681#comment-16907681 ] Dongjoon Hyun commented on SPARK-28710: --- Thank you for reporting this, [~abhishek.akg]. Thank you for pinging me, [~sandeep.katta2007]. I'll review your PR. > [UDF] create or replace permanent function does not clear the jar in class > path > --- > > Key: SPARK-28710 > URL: https://issues.apache.org/jira/browse/SPARK-28710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: ABHISHEK KUMAR GUPTA >Priority: Major > > 0: jdbc:hive2://10.18.19.208:23040/default> create function addDoubles AS > 'com.huawei.bigdata.hive.example.udf.AddDoublesUDF' using jar > 'hdfs://hacluster/user/AddDoublesUDF.jar'; > +-+ > | Result | > +-+ > +-+ > No rows selected (0.216 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> create or replace function > addDoubles AS 'com.huawei.bigdata.hive.example.udf.multiply' using jar > 'hdfs://hacluster/user/Multiply.jar'; > +-+ > | Result | > +-+ > +-+ > No rows selected (0.292 seconds) > 0: jdbc:hive2://10.18.19.208:23040/default> select addDoubles(3,3); > INFO : Added > [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to > class path > INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] > INFO : Added > [/tmp/8f3d7e87-469e-45e9-b5d1-7c714c5e0183_resources/AddDoublesUDF.jar] to > class path > INFO : Added resources: [hdfs://hacluster/user/AddDoublesUDF.jar] > Error: org.apache.spark.sql.AnalysisException: Can not load class > 'com.huawei.bigdata.hive.example.udf.multiply' when registering the function > 'default.addDoubles', please make sure it is on the classpath; line 1 pos 7 > (state=,code=0) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API
[ https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642 ] Joseph Cooper edited comment on SPARK-28738 at 8/14/19 10:43 PM: - I think I see why commitAsync might not support this. During the polling loop for offset commits, if a higher offset is encountered, a lesser one is skipped and the metadata for that one won't get committed. Seems like commitSync would work: just add a function with a parameter for the metadata, and then call the underlying consumer's commitSync method. was (Author: jrciii): I think I see why commitAsync might not support this. During the polling loop for offset commits, if a higher offset is encountered, a lesser one is skipped and the metadata for that one won't get committed. > Add ability to include metadata in CanCommitOffsets API > --- > > Key: SPARK-28738 > URL: https://issues.apache.org/jira/browse/SPARK-28738 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Joseph Cooper >Priority: Major > > It is possible to commit metadata with an offset to Kafka. Currently, the > CanCommitOffsets API does not expose this functionality. See > [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] > > We could add a commitSync function which commits an offset right away and > accepts metadata. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
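The draining behavior the comment refers to, in simplified form (adapted from DirectKafkaInputDStream.commitAll; the variable names here are made up): because the map is keyed by TopicPartition and keeps only the highest untilOffset, a smaller range polled later is dropped, and any metadata attached to it would be lost with it.

{code:scala}
import java.util.concurrent.ConcurrentLinkedQueue
import org.apache.kafka.clients.consumer.OffsetAndMetadata
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

val commitQueue = new ConcurrentLinkedQueue[OffsetRange]()
val m = new java.util.HashMap[TopicPartition, OffsetAndMetadata]()
var range = commitQueue.poll()
while (range != null) {
  val existing = m.get(range.topicPartition())
  // Only the max untilOffset per partition survives the drain.
  if (existing == null || range.untilOffset > existing.offset()) {
    m.put(range.topicPartition(), new OffsetAndMetadata(range.untilOffset))
  }
  range = commitQueue.poll()
}
{code}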
[jira] [Updated] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API
[ https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph Cooper updated SPARK-28738: -- Description: It is possible to commit metadata with an offset to Kafka. Currently, the CanCommitOffsets API does not expose this functionality. See [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] We could add a commitSync function which commits an offset right away and accepts metadata. was: It is possible to commit metadata with an offset to Kafka. Currently, the CanCommitOffsets API does not expose this functionality. See [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] We could make the commit queue take (OffsetRange, String) instead of just OffsetRange and copy the two existing commitAsync functions and make them take Array[(OffsetRange, String)]. > Add ability to include metadata in CanCommitOffsets API > --- > > Key: SPARK-28738 > URL: https://issues.apache.org/jira/browse/SPARK-28738 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Joseph Cooper >Priority: Major > > It is possible to commit metadata with an offset to Kafka. Currently, the > CanCommitOffsets API does not expose this functionality. See > [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] > > We could add a commitSync function which commits an offset right away and > accepts metadata. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
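A sketch of the proposed addition (the signature is hypothetical, not a final API): commit each range synchronously together with its metadata string through the consumer's existing commitSync(Map) overload.

{code:scala}
import org.apache.kafka.clients.consumer.{Consumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

// Hypothetical helper: commit offsets plus metadata in one synchronous call.
def commitSync(consumer: Consumer[_, _],
               offsets: Array[(OffsetRange, String)]): Unit = {
  val toCommit = new java.util.HashMap[TopicPartition, OffsetAndMetadata]()
  offsets.foreach { case (range, meta) =>
    toCommit.put(range.topicPartition(), new OffsetAndMetadata(range.untilOffset, meta))
  }
  consumer.commitSync(toCommit)
}
{code}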
[jira] [Resolved] (SPARK-28110) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-28110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28110. --- Resolution: Duplicate > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-28110 > URL: https://issues.apache.org/jira/browse/SPARK-28110 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql. eg. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}. I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28110) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-28110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907671#comment-16907671 ] Dongjoon Hyun commented on SPARK-28110: --- Yes, it does. Although SPARK-28723 provides a solution for the Hadoop-3.2/Hive 2.3.6 profile, I believe we can close this as `Superseded by` SPARK-28723. I'll resolve this one. > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-28110 > URL: https://issues.apache.org/jira/browse/SPARK-28110 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql. e.g. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}. I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
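The check under discussion, sketched in simplified form (the real IsolatedClientLoader.isShared also shares scala.*, Spark classes, and more; this only illustrates the proposed widening from "java.lang."/"java.net." to all of "java."):

{code:scala}
// Simplified sketch of the sharing predicate; not the actual method body.
def isShared(name: String): Boolean =
  name.startsWith("java.") ||   // was: "java.lang." || "java.net."
  name.startsWith("scala.")
{code}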
[jira] [Commented] (SPARK-28701) add java11 support for spark pull request builds
[ https://issues.apache.org/jira/browse/SPARK-28701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907654#comment-16907654 ] Dongjoon Hyun commented on SPARK-28701: --- Thanks, [~shaneknapp]! :D > add java11 support for spark pull request builds > > > Key: SPARK-28701 > URL: https://issues.apache.org/jira/browse/SPARK-28701 > Project: Spark > Issue Type: Improvement > Components: Build, jenkins >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > from https://github.com/apache/spark/pull/25405 > add a PRB subject check for [test-java11] and update JAVA_HOME env var to > point to /usr/java/jdk-11.0.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API
[ https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642 ] Joseph Cooper edited comment on SPARK-28738 at 8/14/19 9:52 PM: I think I see why commitAsync might not support this. During the polling loop for offset commits, if a higher offset is encountered, a lesser one is skipped and the metadata for that one won't get committed. was (Author: jrciii): I think I see why commitAsync might not support this. During the polling loop for offset commits, if a higher offset is encountered, a lesser one might get skipped and the metadata for that one won't get committed. At least I think that is the spirit of the commitAll function, but it doesn't seem to make sense. Once that map is full, won't there be no more polling of the queue? [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L303] Maybe I'm missing something > Add ability to include metadata in CanCommitOffsets API > --- > > Key: SPARK-28738 > URL: https://issues.apache.org/jira/browse/SPARK-28738 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Joseph Cooper >Priority: Major > > It is possible to commit metadata with an offset to Kafka. Currently, the > CanCommitOffsets API does not expose this functionality. See > [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] > > We could make the commit queue take (OffsetRange, String) instead of just > OffsetRange and copy the two existing commitAsync functions and make them > take Array[(OffsetRange, String)]. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API
[ https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642 ] Joseph Cooper edited comment on SPARK-28738 at 8/14/19 9:42 PM: I think I see why commitAsync might not support this. During the polling loop for offset commits, if a higher offset is encountered, a lesser one might get skipped and the metadata for that one won't get committed. At least I think that is the spirit of the commitAll function, but it doesn't seem to make sense. Once that map is full, won't there be no more polling of the queue? [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L303] Maybe I'm missing something was (Author: jrciii): I think I see why commitAsync might not support this. During the polling loop for offset commits, if a higher offset is encountered, a lesser one might get skipped and the metadata for that one won't get committed. > Add ability to include metadata in CanCommitOffsets API > --- > > Key: SPARK-28738 > URL: https://issues.apache.org/jira/browse/SPARK-28738 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Joseph Cooper >Priority: Major > > It is possible to commit metadata with an offset to Kafka. Currently, the > CanCommitOffsets API does not expose this functionality. See > [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] > > We could make the commit queue take (OffsetRange, String) instead of just > OffsetRange and copy the two existing commitAsync functions and make them > take Array[(OffsetRange, String)]. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-23519) Create View Commands Fails with The view output (col1,col1) contains duplicate column name
[ https://issues.apache.org/jira/browse/SPARK-23519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Franck Tago reopened SPARK-23519: - OK Spark community, I am sorry for being a pest about this, but I am re-opening this Jira because I really believe that this should be addressed. Right now I do not have any way of satisfying my customer's requirement. My current use case is the following. My customer can provide any custom Hive query. I am oblivious to the actual content of the query, and parsing the query is not an option. All I know is the number of fields projected from the custom query and the types of those fields. I do not know the names of the fields projected from the custom query. What I currently do with Spark SQL is run a query of the form: Create view view_name > Create View Commands Fails with The view output (col1,col1) contains > duplicate column name > --- > > Key: SPARK-23519 > URL: https://issues.apache.org/jira/browse/SPARK-23519 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.2.1 >Reporter: Franck Tago >Priority: Major > Labels: bulk-closed > Attachments: image-2018-05-10-10-48-57-259.png > > > 1- create and populate a hive table . I did this in a hive cli session .[ not that this matters ] > create table atable (col1 int) ; > insert into atable values (10 ) , (100) ; > 2. create a view from the table. > [These actions were performed from a spark shell ] > spark.sql("create view default.aview (int1 , int2 ) as select col1 , col1 > from atable ") > java.lang.AssertionError: assertion failed: The view output (col1,col1) > contains duplicate column name. > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.sql.execution.command.ViewHelper$.generateViewProperties(views.scala:361) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:236) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:174) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67) > at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68) > at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:632) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API
[ https://issues.apache.org/jira/browse/SPARK-28738?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907642#comment-16907642 ] Joseph Cooper commented on SPARK-28738: --- I think I see why commitAsync might not support this. During the polling loop for offset commits, if a higher offset is encountered, a lesser one might get skipped and the metadata for that one won't get committed. > Add ability to include metadata in CanCommitOffsets API > --- > > Key: SPARK-28738 > URL: https://issues.apache.org/jira/browse/SPARK-28738 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.4.4 >Reporter: Joseph Cooper >Priority: Major > > It is possible to commit metadata with an offset to Kafka. Currently, the > CanCommitOffsets API does not expose this functionality. See > [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] > > We could make the commit queue take (OffsetRange, String) instead of just > OffsetRange and copy the two existing commitAsync functions and make them > take Array[(OffsetRange, String)]. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28728) Bump Jackson Databind to 2.9.9.3
[ https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907635#comment-16907635 ] Dongjoon Hyun commented on SPARK-28728: --- [~Fokko], thank you for making a JIRA and PR, but the Apache Spark community has [the following guideline|https://spark.apache.org/contributing.html]. Please don't set the `Fix Versions` next time. {code} Do not set the following fields: Fix Version. This is assigned by committers only when resolved. {code} > Bump Jackson Databind to 2.9.9.3 > > > Key: SPARK-28728 > URL: https://issues.apache.org/jira/browse/SPARK-28728 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Fokko Driesprong >Priority: Major > > Needs to be upgraded due to issues. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3
[ https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28728: -- Fix Version/s: (was: 2.4.4) (was: 3.0.0) > Bump Jackson Databind to 2.9.9.3 > > > Key: SPARK-28728 > URL: https://issues.apache.org/jira/browse/SPARK-28728 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Fokko Driesprong >Priority: Major > > Needs to be upgraded due to issues. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28721) Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and Executors
[ https://issues.apache.org/jira/browse/SPARK-28721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Clay resolved SPARK-28721. -- Resolution: Duplicate Ah sorry I didn't search carefully enough for a duplicate > Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and > Executors > --- > > Key: SPARK-28721 > URL: https://issues.apache.org/jira/browse/SPARK-28721 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.1, 2.4.3 >Reporter: Patrick Clay >Priority: Minor > > This does not seem to affect 2.4.0. > To repro: > # Download pristine Spark 2.4.3 binary > # Edit pi.py to not call spark.stop() > # ./bin/docker-image-tool.sh -r MY_IMAGE -t MY_TAG build push > # spark-submit --master k8s://IP --deploy-mode cluster --conf > spark.kubernetes.driver.pod.name=spark-driver --conf > spark.kubernetes.container.image=MY_IMAGE:MY_TAG > file:/opt/spark/examples/src/main/python/pi.py > The driver runs successfully and Python exits but the Driver and Executor > JVMs and Pods remain up. > > I realize that explicitly calling spark.stop() is always best practice, but > since this does not repro in 2.4.0 it seems like a regression. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
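The best practice the report alludes to, as a sketch in Scala terms (PySpark is analogous): stop the session in a finally block so the driver and executor pods are released even if the job body throws.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pi").getOrCreate()
try {
  // ... job logic ...
} finally {
  spark.stop() // releases driver and executor resources explicitly
}
{code}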
[jira] [Created] (SPARK-28738) Add ability to include metadata in CanCommitOffsets API
Joseph Cooper created SPARK-28738: - Summary: Add ability to include metadata in CanCommitOffsets API Key: SPARK-28738 URL: https://issues.apache.org/jira/browse/SPARK-28738 Project: Spark Issue Type: New Feature Components: DStreams Affects Versions: 2.4.4 Reporter: Joseph Cooper It is possible to commit metadata with an offset to Kafka. Currently, the CanCommitOffsets API does not expose this functionality. See [https://github.com/apache/spark/blob/017919b636fd3ce43ccf5ec57f1c1842aa2130db/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/DirectKafkaInputDStream.scala#L300] We could make the commit queue take (OffsetRange, String) instead of just OffsetRange and copy the two existing commitAsync functions and make them take Array[(OffsetRange, String)]. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28683) Upgrade Scala to 2.12.10
[ https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28683: -- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-24417) > Upgrade Scala to 2.12.10 > > > Key: SPARK-28683 > URL: https://issues.apache.org/jira/browse/SPARK-28683 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 > and found that 2.12.9 has a serious bug, > https://github.com/scala/bug/issues/11665 * > We will skip 2.12.9 and try to upgrade 2.12.10 directly in this PR. > h3. Highlights (2.12.9) > * Faster compiler: [5–10% faster since > 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1543097847070&to=1564631199344&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench%40scalabench%40], > thanks to many optimizations (mostly by Jason Zaugg and Diego E. > Alonso-Blas: kudos!) > * Improved compatibility with JDK 11, 12, and 13 > * Experimental support for build pipelining and outline type checking > [https://github.com/scala/scala/releases/tag/v2.12.9] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28683) Upgrade Scala to 2.12.10
[ https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907619#comment-16907619 ] Dongjoon Hyun commented on SPARK-28683: --- I got it. Yes. This is a nice-to-have. I'll move this out of this umbrella. > Upgrade Scala to 2.12.10 > > > Key: SPARK-28683 > URL: https://issues.apache.org/jira/browse/SPARK-28683 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 > and found that 2.12.9 has a serious bug, > https://github.com/scala/bug/issues/11665 * > We will skip 2.12.9 and try to upgrade 2.12.10 directly in this PR. > h3. Highlights (2.12.9) > * Faster compiler: [5–10% faster since > 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1543097847070&to=1564631199344&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench%40scalabench%40], > thanks to many optimizations (mostly by Jason Zaugg and Diego E. > Alonso-Blas: kudos!) > * Improved compatibility with JDK 11, 12, and 13 > * Experimental support for build pipelining and outline type checking > [https://github.com/scala/scala/releases/tag/v2.12.9] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28737) Update jersey to 2.27+ (2.29)
Sean Owen created SPARK-28737: - Summary: Update jersey to 2.27+ (2.29) Key: SPARK-28737 URL: https://issues.apache.org/jira/browse/SPARK-28737 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Sean Owen Looks like we might need to update Jersey after all, from recent JDK 11 testing: {code} Caused by: java.lang.IllegalArgumentException at jersey.repackaged.org.objectweb.asm.ClassReader.<init>(ClassReader.java:170) at jersey.repackaged.org.objectweb.asm.ClassReader.<init>(ClassReader.java:153) at jersey.repackaged.org.objectweb.asm.ClassReader.<init>(ClassReader.java:424) at org.glassfish.jersey.server.internal.scanning.AnnotationAcceptingListener.process(AnnotationAcceptingListener.java:170) {code} It looks like 2.27+ may solve the issue, so worth trying 2.29. I'm not 100% sure this is an issue as the JDK 11 testing process is still undergoing change, but will work on it to see how viable it is anyway, as it may be worthwhile to update for 3.0 in any event. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
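For background on the failure mode: ASM's ClassReader constructor throws IllegalArgumentException when a class file's major version is newer than that ASM release understands, and JDK 11 emits major version 55. A quick way to inspect a class file's version bytes (the path is illustrative):

{code:scala}
import java.nio.file.{Files, Paths}

// Bytes 6-7 of a class file hold the major version (55 for JDK 11).
val bytes = Files.readAllBytes(Paths.get("Foo.class")) // illustrative path
val major = ((bytes(6) & 0xff) << 8) | (bytes(7) & 0xff)
println(s"class file major version: $major")
{code}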
[jira] [Commented] (SPARK-28683) Upgrade Scala to 2.12.10
[ https://issues.apache.org/jira/browse/SPARK-28683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907606#comment-16907606 ] Sean Owen commented on SPARK-28683: --- I think we can detach this from the JDK 11 umbrella. This doesn't appear to be strictly necessary for JDK 11 in Spark. > Upgrade Scala to 2.12.10 > > > Key: SPARK-28683 > URL: https://issues.apache.org/jira/browse/SPARK-28683 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > *Note that we tested 2.12.9 via https://github.com/apache/spark/pull/25404 > and found that 2.12.9 has a serious bug, > https://github.com/scala/bug/issues/11665 * > We will skip 2.12.9 and try to upgrade 2.12.10 directly in this PR. > h3. Highlights (2.12.9) > * Faster compiler: [5–10% faster since > 2.12.8|https://scala-ci.typesafe.com/grafana/dashboard/db/scala-benchmark?orgId=1&from=1543097847070&to=1564631199344&var-branch=2.12.x&var-source=All&var-bench=HotScalacBenchmark.compile&var-host=scalabench%40scalabench%40], > thanks to many optimizations (mostly by Jason Zaugg and Diego E. > Alonso-Blas: kudos!) > * Improved compatibility with JDK 11, 12, and 13 > * Experimental support for build pipelining and outline type checking > [https://github.com/scala/scala/releases/tag/v2.12.9] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28110) on JDK11, IsolatedClientLoader must be able to load java.sql classes
[ https://issues.apache.org/jira/browse/SPARK-28110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907604#comment-16907604 ] Sean Owen commented on SPARK-28110: --- Is this one still an issue? I think this is a clone of an older issue that was mostly resolved, and the rest is I think a subset of the Hive update? > on JDK11, IsolatedClientLoader must be able to load java.sql classes > > > Key: SPARK-28110 > URL: https://issues.apache.org/jira/browse/SPARK-28110 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Imran Rashid >Priority: Major > > This might be very specific to my fork & a kind of weird system setup I'm > working on, I haven't completely confirmed yet, but I wanted to report it > anyway in case anybody else sees this. > When I try to do anything which touches the metastore on java11, I > immediately get errors from IsolatedClientLoader that it can't load anything > in java.sql. eg. > {noformat} > scala> spark.sql("show tables").show() > java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError: > java/sql/SQLTransientException when creating Hive client using classpath: > file:/home/systest/jdk-11.0.2/, ... > ... > Caused by: java.lang.ClassNotFoundException: java.sql.SQLTransientException > at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:471) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:588) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.doLoadClass(IsolatedClientLoader.scala:230) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1.loadClass(IsolatedClientLoader.scala:219) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521) > {noformat} > After a bit of debugging, I also discovered that the {{rootClassLoader}} is > {{null}} in {{IsolatedClientLoader}}. I think this would work if either > {{rootClassLoader}} could load those classes, or if {{isShared()}} was > changed to allow any class starting with "java." (I'm not sure why it only > allows "java.lang" and "java.net" currently.) -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28736) pyspark.mllib.clustering fails on JDK11
[ https://issues.apache.org/jira/browse/SPARK-28736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28736: -- Description: Build Spark and run PySpark UT with JDK11. {code} $ build/sbt -Phadoop-3.2 test:package $ python/run-tests --testnames 'pyspark.mllib.clustering' --python-executables python ... File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 386, in __main__.GaussianMixtureModel Failed example: abs(softPredicted[0] - 1.0) < 0.001 Expected: True Got: False ** File "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", line 388, in __main__.GaussianMixtureModel Failed example: abs(softPredicted[1] - 0.0) < 0.001 Expected: True Got: False ** 2 of 31 in __main__.GaussianMixtureModel ***Test Failed*** 2 failures. {code} was: Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` failed. {code} $ build/sbt -Phadoop-3.2 test:package $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python ... == FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest) -- Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) AssertionError: False is not true {code} {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} > pyspark.mllib.clustering fails on JDK11 > --- > > Key: SPARK-28736 > URL: https://issues.apache.org/jira/browse/SPARK-28736 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Build Spark and run PySpark UT with JDK11. > {code} > $ build/sbt -Phadoop-3.2 test:package > $ python/run-tests --testnames 'pyspark.mllib.clustering' > --python-executables python > ... > File > "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", > line 386, in __main__.GaussianMixtureModel > Failed example: > abs(softPredicted[0] - 1.0) < 0.001 > Expected: > True > Got: > False > ** > File > "/Users/dongjoon/APACHE/spark-master/python/pyspark/mllib/clustering.py", > line 388, in __main__.GaussianMixtureModel > Failed example: > abs(softPredicted[1] - 0.0) < 0.001 > Expected: > True > Got: > False > ** >2 of 31 in __main__.GaussianMixtureModel > ***Test Failed*** 2 failures.
> {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28736) pyspark.mllib.clustering fails on JDK11
Dongjoon Hyun created SPARK-28736: - Summary: pyspark.mllib.clustering fails on JDK11 Key: SPARK-28736 URL: https://issues.apache.org/jira/browse/SPARK-28736 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.0.0 Reporter: Dongjoon Hyun Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` failed. {code} $ build/sbt -Phadoop-3.2 test:package $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python ... == FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest) -- Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) AssertionError: False is not true {code} {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11
[ https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28735: -- Description: Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` failed. {code} $ build/sbt -Phadoop-3.2 test:package $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python ... == FAIL: test_raw_and_probability_prediction (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest) -- Traceback (most recent call last): File "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", line 89, in test_raw_and_probability_prediction self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) AssertionError: False is not true {code} {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} was: Build Spark with JDK11 and run `python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last commented `assertTrue` failed. - https://github.com/apache/spark/pull/25443/commits/593a154813880fb13e3091043d809e0c00e57bc5 {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} > MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails > on JDK11 > - > > Key: SPARK-28735 > URL: https://issues.apache.org/jira/browse/SPARK-28735 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Build Spark and run PySpark UT with JDK11. The last commented `assertTrue` > failed. > {code} > $ build/sbt -Phadoop-3.2 test:package > $ python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' > --python-executables python > ...
> == > FAIL: test_raw_and_probability_prediction > (pyspark.ml.tests.test_algorithms.MultilayerPerceptronClassifierTest) > -- > Traceback (most recent call last): > File > "/Users/dongjoon/APACHE/spark-master/python/pyspark/ml/tests/test_algorithms.py", > line 89, in test_raw_and_probability_prediction > self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, > atol=1E-4)) > AssertionError: False is not true > {code} > {code:python} > class MultilayerPerceptronClassifierTest(SparkSessionTestCase): > def test_raw_and_probability_prediction(se
[jira] [Updated] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11
[ https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28735: -- Description: Build Spark with JDK11 and run `python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last commented `assertTrue` failed. - https://github.com/apache/spark/pull/25443/commits/593a154813880fb13e3091043d809e0c00e57bc5 {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} was: Build Spark with JDK11 and run `python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last commented `assertTrue` failed. - 593a154813880fb13e3091043d809e0c00e57bc5 {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} > MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails > on JDK11 > - > > Key: SPARK-28735 > URL: https://issues.apache.org/jira/browse/SPARK-28735 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Build Spark with JDK11 and run `python/run-tests --testnames > 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last > commented `assertTrue` failed. 
> - > https://github.com/apache/spark/pull/25443/commits/593a154813880fb13e3091043d809e0c00e57bc5 > {code:python} > class MultilayerPerceptronClassifierTest(SparkSessionTestCase): > def test_raw_and_probability_prediction(self): > data_path = "data/mllib/sample_multiclass_classification_data.txt" > df = self.spark.read.format("libsvm").load(data_path) > mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], > blockSize=128, seed=123) > model = mlp.fit(df) > test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, > 0.25, 0.25))]).toDF() > result = model.transform(test).head() > expected_prediction = 2.0 > expected_probability = [0.0, 0.0, 1.0] > expected_rawPrediction = [-11.6081922998, -8.15827998691, > 22.17757045] > self.assertTrue(result.prediction, expected_prediction) > self.assertTrue(np.allclose(result.probability, > expected_probability, atol=1E-4)) > self.assertTrue(np.allclose(result.rawPrediction, > expected_rawPrediction, atol=1E-4)) > # self.assertTrue(np.allclose(result.rawPrediction, > expected_rawPrediction, atol=1E-4)) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) ---
[jira] [Updated] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11
[ https://issues.apache.org/jira/browse/SPARK-28735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-28735: -- Description: Build Spark with JDK11 and run `python/run-tests --testnames 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last commented `assertTrue` failed. - 593a154813880fb13e3091043d809e0c00e57bc5 {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} was: {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} > MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails > on JDK11 > - > > Key: SPARK-28735 > URL: https://issues.apache.org/jira/browse/SPARK-28735 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > Build Spark with JDK11 and run `python/run-tests --testnames > 'pyspark.ml.tests.test_algorithms' --python-executables python`. The last > commented `assertTrue` failed. 
> - 593a154813880fb13e3091043d809e0c00e57bc5 > {code:python} > class MultilayerPerceptronClassifierTest(SparkSessionTestCase): > def test_raw_and_probability_prediction(self): > data_path = "data/mllib/sample_multiclass_classification_data.txt" > df = self.spark.read.format("libsvm").load(data_path) > mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], > blockSize=128, seed=123) > model = mlp.fit(df) > test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, > 0.25, 0.25))]).toDF() > result = model.transform(test).head() > expected_prediction = 2.0 > expected_probability = [0.0, 0.0, 1.0] > expected_rawPrediction = [-11.6081922998, -8.15827998691, > 22.17757045] > self.assertTrue(result.prediction, expected_prediction) > self.assertTrue(np.allclose(result.probability, > expected_probability, atol=1E-4)) > self.assertTrue(np.allclose(result.rawPrediction, > expected_rawPrediction, atol=1E-4)) > # self.assertTrue(np.allclose(result.rawPrediction, > expected_rawPrediction, atol=1E-4)) > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28735) MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11
Dongjoon Hyun created SPARK-28735: - Summary: MultilayerPerceptronClassifierTest.test_raw_and_probability_prediction fails on JDK11 Key: SPARK-28735 URL: https://issues.apache.org/jira/browse/SPARK-28735 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 3.0.0 Reporter: Dongjoon Hyun {code:python} class MultilayerPerceptronClassifierTest(SparkSessionTestCase): def test_raw_and_probability_prediction(self): data_path = "data/mllib/sample_multiclass_classification_data.txt" df = self.spark.read.format("libsvm").load(data_path) mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[4, 5, 4, 3], blockSize=128, seed=123) model = mlp.fit(df) test = self.sc.parallelize([Row(features=Vectors.dense(0.1, 0.1, 0.25, 0.25))]).toDF() result = model.transform(test).head() expected_prediction = 2.0 expected_probability = [0.0, 0.0, 1.0] expected_rawPrediction = [-11.6081922998, -8.15827998691, 22.17757045] self.assertTrue(result.prediction, expected_prediction) self.assertTrue(np.allclose(result.probability, expected_probability, atol=1E-4)) self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) # self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1E-4)) {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
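A side note on the test itself: unittest's assertTrue(expr, msg) treats its second argument as a failure message rather than an expected value, so the prediction check above never actually compares values. A minimal standalone sketch (hypothetical numbers, numpy assumed available) of a value check plus a tolerance check:
{code:python}
import unittest

import numpy as np


class ToleranceExample(unittest.TestCase):
    def test_value_and_tolerance(self):
        prediction = 2.0
        raw = np.array([-11.6082, -8.1583, 22.1776])
        expected_raw = np.array([-11.6081922998, -8.15827998691, 22.17757045])
        # assertEqual compares values; assertTrue(prediction, expected) would
        # pass for any non-zero prediction, since the 2nd arg is only a message.
        self.assertEqual(prediction, 2.0)
        # Vector comparison with an absolute tolerance, as the failing test does.
        self.assertTrue(np.allclose(raw, expected_raw, atol=1e-4))


if __name__ == "__main__":
    unittest.main()
{code}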
[jira] [Commented] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907551#comment-16907551 ] Parth Gandhi commented on SPARK-27361: -- Makes sense, thank you. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * How the Spark executor discovers GPU's when run on YARN > * Integrate with YARN 3.2 GPU support. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28734) Create a table of contents in the left-hand sidebar for the SQL doc.
Dilip Biswal created SPARK-28734: Summary: Create a table of contents in the left-hand sidebar for the SQL doc. Key: SPARK-28734 URL: https://issues.apache.org/jira/browse/SPARK-28734 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 2.4.3 Reporter: Dilip Biswal -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28721) Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and Executors
[ https://issues.apache.org/jira/browse/SPARK-28721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907522#comment-16907522 ] Patrick Clay commented on SPARK-28721: -- I confirmed this affects 2.4.1, and re-confirmed that it does not affect 2.4.0. > Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and > Executors > --- > > Key: SPARK-28721 > URL: https://issues.apache.org/jira/browse/SPARK-28721 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.1, 2.4.3 >Reporter: Patrick Clay >Priority: Minor > > This does not seem to affect 2.4.0. > To repro: > # Download pristine Spark 2.4.3 binary > # Edit pi.py to not call spark.stop() > # ./bin/docker-image-tool.sh -r MY_IMAGE -t MY_TAG build push > # spark-submit --master k8s://IP --deploy-mode cluster --conf > spark.kubernetes.driver.pod.name=spark-driver --conf > spark.kubernetes.container.image=MY_IMAGE:MY_TAG > file:/opt/spark/examples/src/main/python/pi.py > The driver runs successfully and Python exits but the Driver and Executor > JVMs and Pods remain up. > > I realize that explicitly calling spark.stop() is always best practice, but > since this does not repro in 2.4.0 it seems like a regression. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28721) Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and Executors
[ https://issues.apache.org/jira/browse/SPARK-28721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Clay updated SPARK-28721: - Affects Version/s: 2.4.1 > Failing to stop SparkSession in K8S cluster mode PySpark leaks Driver and > Executors > --- > > Key: SPARK-28721 > URL: https://issues.apache.org/jira/browse/SPARK-28721 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 2.4.1, 2.4.3 >Reporter: Patrick Clay >Priority: Minor > > This does not seem to affect 2.4.0. > To repro: > # Download pristine Spark 2.4.3 binary > # Edit pi.py to not call spark.stop() > # ./bin/docker-image-tool.sh -r MY_IMAGE -t MY_TAG build push > # spark-submit --master k8s://IP --deploy-mode cluster --conf > spark.kubernetes.driver.pod.name=spark-driver --conf > spark.kubernetes.container.image=MY_IMAGE:MY_TAG > file:/opt/spark/examples/src/main/python/pi.py > The driver runs successfully and Python exits but the Driver and Executor > JVMs and Pods remain up. > > I realize that explicitly calling spark.stop() is always best practice, but > since this does not repro in 2.4.0 it seems like a regression. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
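Until the regression is tracked down, a minimal sketch of the explicit-stop pattern mentioned above (a try/finally around the job body) keeps the pods from leaking even when the application code raises:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-with-explicit-stop").getOrCreate()
try:
    # Placeholder for the actual job, e.g. the pi.py computation.
    count = spark.sparkContext.parallelize(range(1000)).count()
    print(count)
finally:
    # Without this, 2.4.1+ in K8s cluster mode leaves driver/executor pods up.
    spark.stop()
{code}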
[jira] [Commented] (SPARK-27931) Accept 'on' and 'off' as input for boolean data type
[ https://issues.apache.org/jira/browse/SPARK-27931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907514#comment-16907514 ] YoungGyu Chun commented on SPARK-27931: --- I'll be working on this > Accept 'on' and 'off' as input for boolean data type > > > Key: SPARK-27931 > URL: https://issues.apache.org/jira/browse/SPARK-27931 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > This ticket contains three things: > 1. Accept 'on' and 'off' as input for boolean data type > {code:sql} > SELECT cast('no' as boolean) AS false; > SELECT cast('off' as boolean) AS false; > {code} > 2. Accept unique prefixes thereof: > {code:sql} > SELECT cast('of' as boolean) AS false; > SELECT cast('fal' as boolean) AS false; > {code} > 3. Trim the string when cast to boolean type > {code:sql} > SELECT cast('true ' as boolean) AS true; > SELECT cast(' FALSE' as boolean) AS true; > {code} > More details: > [https://www.postgresql.org/docs/devel/datatype-boolean.html] > > [https://github.com/postgres/postgres/blob/REL_12_BETA1/src/backend/utils/adt/bool.c#L25] > > [https://github.com/postgres/postgres/commit/05a7db05826c5eb68173b6d7ef1553c19322ef48] > > [https://github.com/postgres/postgres/commit/9729c9360886bee7feddc6a1124b0742de4b9f3d] > Other DBs: > [http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html] > [https://my.vertica.com/docs/5.0/HTML/Master/2983.htm] > > [https://github.com/prestosql/presto/blob/b845cd66da3eb1fcece50efba83ea12bc40afbaa/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L108-L138] -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
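For reference, a plain-Python sketch of the PostgreSQL rules the ticket describes (not Spark's implementation): trim the input, then accept any unique prefix of the known literals:
{code:python}
TRUE_WORDS = ("true", "yes", "on", "1")
FALSE_WORDS = ("false", "no", "off", "0")


def parse_bool(s: str) -> bool:
    v = s.strip().lower()
    if not v:
        raise ValueError("empty string is not a valid boolean")
    # 'on' and 'off' share the prefix 'o', so a bare 'o' stays ambiguous.
    true_hits = [w for w in TRUE_WORDS if w.startswith(v)]
    false_hits = [w for w in FALSE_WORDS if w.startswith(v)]
    if true_hits and not false_hits:
        return True
    if false_hits and not true_hits:
        return False
    raise ValueError(f"invalid boolean literal: {s!r}")


assert parse_bool("true ") is True and parse_bool(" FALSE") is False
assert parse_bool("of") is False and parse_bool("fal") is False
{code}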
[jira] [Commented] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907431#comment-16907431 ] Thomas Graves commented on SPARK-27361: --- Since that was done prior to this feature, I think it's OK to leave it alone. It worked just fine on its own to purely request that YARN allocate the resources (with no integration with the Spark scheduler, etc.). We did modify how that worked in https://issues.apache.org/jira/browse/SPARK-27959, so that one should be linked here, I think. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * How the Spark executor discovers GPU's when run on YARN > * Integrate with YARN 3.2 GPU support. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
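For context, the application-level GPU request ends up expressed through resource confs in Spark 3.0; a hedged PySpark sketch (conf names per the GPU-aware scheduling work, discovery script path is a placeholder):
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gpu-on-yarn")
         .master("yarn")
         # Each executor asks YARN for 2 GPUs; each task needs 1.
         .config("spark.executor.resource.gpu.amount", "2")
         .config("spark.task.resource.gpu.amount", "1")
         # Script the executor runs to discover the addresses of its GPUs.
         .config("spark.executor.resource.gpu.discoveryScript", "/opt/getGpus.sh")
         .getOrCreate())
{code}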
[jira] [Commented] (SPARK-28701) add java11 support for spark pull request builds
[ https://issues.apache.org/jira/browse/SPARK-28701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907394#comment-16907394 ] shane knapp commented on SPARK-28701: - [~dongjoon] whoops! i just fixed that build... also, i'm hoping to get the [test-java11] flag working fully and merged in the next day or so... > add java11 support for spark pull request builds > > > Key: SPARK-28701 > URL: https://issues.apache.org/jira/browse/SPARK-28701 > Project: Spark > Issue Type: Improvement > Components: Build, jenkins >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > > from https://github.com/apache/spark/pull/25405 > add a PRB subject check for [test-java11] and update JAVA_HOME env var to > point to /usr/java/jdk-11.0.1 -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27361) YARN support for GPU-aware scheduling
[ https://issues.apache.org/jira/browse/SPARK-27361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907391#comment-16907391 ] Parth Gandhi commented on SPARK-27361: -- [~tgraves], I was just wondering whether https://issues.apache.org/jira/browse/SPARK-20327 should be a sub-task of this Jira, in order to have all components of YARN support for GPU scheduling under one umbrella. Thank you. > YARN support for GPU-aware scheduling > - > > Key: SPARK-27361 > URL: https://issues.apache.org/jira/browse/SPARK-27361 > Project: Spark > Issue Type: Story > Components: YARN >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > Fix For: 3.0.0 > > > Design and implement YARN support for GPU-aware scheduling: > * User can request GPU resources at Spark application level. > * How the Spark executor discovers GPU's when run on YARN > * Integrate with YARN 3.2 GPU support. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28687) Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`
[ https://issues.apache.org/jira/browse/SPARK-28687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-28687. --- Resolution: Fixed Issue resolved by pull request 25408 [https://github.com/apache/spark/pull/25408] > Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()` > > > Key: SPARK-28687 > URL: https://issues.apache.org/jira/browse/SPARK-28687 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, we support these fields for EXTRACT: CENTURY, MILLENNIUM, DECADE, > YEAR, QUARTER, MONTH, WEEK, DAY, DAYOFWEEK, HOUR, MINUTE, SECOND, DOW, > ISODOW, DOY. > We also need to support: EPOCH, MICROSECONDS, MILLISECONDS, TIMEZONE, > TIMEZONE_M, TIMEZONE_H, ISOYEAR. > https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
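The semantics of the new fields follow PostgreSQL; a plain-Python sketch of what EPOCH and MICROSECONDS mean (EXTRACT in Spark 3.0 should agree with these numbers):
{code:python}
from datetime import datetime, timezone

ts = datetime(2019, 8, 15, 12, 30, 45, 123456, tzinfo=timezone.utc)

# EPOCH: seconds (with fraction) since 1970-01-01 00:00:00 UTC.
epoch = ts.timestamp()
# MICROSECONDS: the seconds field, including fractional parts, times 1_000_000.
microseconds = ts.second * 1_000_000 + ts.microsecond

print(epoch)         # 1565872245.123456
print(microseconds)  # 45123456
{code}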
[jira] [Resolved] (SPARK-27739) df.persist should save stats from optimized plan
[ https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-27739. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24623 [https://github.com/apache/spark/pull/24623] > df.persist should save stats from optimized plan > > > Key: SPARK-27739 > URL: https://issues.apache.org/jira/browse/SPARK-27739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Minor > Fix For: 3.0.0 > > > CacheManager.cacheQuery passes the stats for `planToCache` to > InMemoryRelation. Since the plan has not been optimized, the stats is > inaccurate because project and filter have not been applied. I'd suggest > passing the stats from the optimized plan. > {code:java} > class CacheManager extends Logging { > ... > def cacheQuery( > query: Dataset[_], > tableName: Option[String] = None, > storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { > val planToCache = query.logicalPlan > if (lookupCachedData(planToCache).nonEmpty) { > logWarning("Asked to cache already cached data.") > } else { > val sparkSession = query.sparkSession > val inMemoryRelation = InMemoryRelation( > sparkSession.sessionState.conf.useCompression, > sparkSession.sessionState.conf.columnBatchSize, storageLevel, > sparkSession.sessionState.executePlan(planToCache).executedPlan, > tableName, > planToCache) <== > ... > } > object InMemoryRelation { > def apply( > useCompression: Boolean, > batchSize: Int, > storageLevel: StorageLevel, > child: SparkPlan, > tableName: Option[String], > logicalPlan: LogicalPlan): InMemoryRelation = { > val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, > storageLevel, child, tableName) > val relation = new InMemoryRelation(child.output, cacheBuilder, > logicalPlan.outputOrdering) > relation.statsOfPlanToCache = logicalPlan.stats <== > relation > } > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28733) DataFrameReader of Spark not able to recognize the very first quote character when a custom Unicode quote character is used
Mrinal Bhattacherjee created SPARK-28733: Summary: DataFrameReader of Spark not able to recognize the very first quote character when a custom Unicode quote character is used Key: SPARK-28733 URL: https://issues.apache.org/jira/browse/SPARK-28733 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.2 Reporter: Mrinal Bhattacherjee I have encountered a strange behaviour recently while reading a CSV file using the DataFrameReader of the org.apache.spark.sql package (Spark version 2.3.2). Here is my Spark read code snippet:
{code}
val sepChar = "\u00C7" // Ç
val quoteChar = "\u1E1C" // Ḝ
val escapeChar = "\u1E1D" // ḝ
val inputCsvFile = "\\input_ab.csv"
val readDF = sparkSession.read.option("sep", sepChar)
  .option("encoding", encoding.toUpperCase)
  .option("quote", quoteChar)
  .option("escape", escapeChar)
  .option("header", "false")
  .option("multiLine", "true")
  .csv(inputCsvFile)
readDF.cache()
readDF.show(20, false)
{code}
Due to some awful data, I'm forced to use Unicode characters as the separator, quote and escape characters instead of the default ones. Below is my input sample data:
{code}
Ḝ1ḜÇḜsmithḜÇḜ5Ḝ
Ḝ2ḜÇḜdousonḜÇḜ6Ḝ
Ḝ3ḜÇḜsr,tendulkarḜÇḜ10Ḝ
{code}
Here Ç is the field separator, Ḝ is the quote character, and all field values are wrapped in this custom quote character. The problem I'm getting is that the very first occurrence of the quote character is somehow not recognized by Spark. I tried characters other than Unicode, like ` ~ X (the letter X just as a test), and even the default quote ("); it works fine in all scenarios except when a Unicode character is used as the quote. The first occurrence of the Unicode quote character comes through as a non-printable character ��, so the closing quote of the first field of the first record ends up included in the data. Here is the output of df.show:
{code}
+----+------------+-----+
|id  |name        |class|
+----+------------+-----+
|��1Ḝ|smith       |5    |
|2   |douson      |6    |
|3   |sr,tendulkar|10   |
+----+------------+-----+
{code}
It happens only for the first field of the very first record; the other quote characters in this file are read as expected without any issue. When I keep an extra empty record at the top of the file, i.e. simply a newline (\n) as the very first line, the issue doesn't occur, and that empty row is not counted as an empty record in the df either, so my problem is worked around. But this manipulation cannot be done in production, so it remains an issue. I feel this is a bug. If it is not, kindly let me know how to process such data without hitting this issue; otherwise kindly provide a fix at the earliest. Thanks in advance. Best Regards, Mrinal -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
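A PySpark transcription of the reporter's reader configuration, for anyone trying to reproduce (file path is a placeholder):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unicode-quote-repro").getOrCreate()

read_df = (spark.read
           .option("sep", "\u00C7")     # Ç field separator
           .option("quote", "\u1E1C")   # Ḝ quote character
           .option("escape", "\u1E1D")  # ḝ escape character
           .option("encoding", "UTF-8")
           .option("header", "false")
           .option("multiLine", "true")
           .csv("input_ab.csv"))
read_df.show(20, False)
{code}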
[jira] [Updated] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing t
[ https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alix Métivier updated SPARK-28732: -- Description: I am using agg function on a dataset, and i want to count the number of lines upon grouping columns. I would like to store the result of this count in an integer, but it fails with this output : [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 89, Column 53: No applicable constructor/method found for actual parameters "long"; candidates are: "java.lang.Integer(int)", "java.lang.Integer(java.lang.String)" Here is the line 89 and a few others to understand : /* 085 */ long value13 = i.getLong(5); /* 086 */ argValue4 = value13; /* 087 */ /* 088 */ /* 089 */ final java.lang.Integer value12 = false ? null : new java.lang.Integer(argValue4); As per Integer documentation, there is not constructor for the type Long, so this is why the generated code fails. Here is my code : org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), org.apache.spark.sql.functions.col("o_year").as("o_yearN")) .agg(org.apache.spark.sql.functions.count("n_name").as("countN"), .as(org.apache.spark.sql.Encoders.bean(row2Struct.class)); row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int If countN is a Long, code above wont fail If it is an Int, it works in 1.6 and 2.0, but fails on version 2.1+ was: I am using agg function on a dataset, and i want to count the number of lines upon grouping columns. I would like to store the result of this count in an integer, but it fails with this output : [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 89, Column 53: No applicable constructor/method found for actual parameters "long"; candidates are: "java.lang.Integer(int)", "java.lang.Integer(java.lang.String)" Here is the line 89 and a few others to understand : /* 085 */ long value13 = i.getLong(5); /* 086 */ argValue4 = value13; /* 087 */ /* 088 */ /* 089 */ final java.lang.Integer value12 = false ? null : new java.lang.Integer(argValue4); As per Integer documentation, there is not constructor for the type Long, so this is why the generated code fails. Here is my code : org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), org.apache.spark.sql.functions.col("o_year").as("o_yearN")) .agg(org.apache.spark.sql.functions.count("n_name").as("countN"), .as(org.apache.spark.sql.Encoders.bean(row2Struct.class)); row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int If countN is a Long, code above wont fail If it is a Long, it works in 1.6 and 2.0, but fails on version 2.1+ > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java' when storing the result of a count aggregation in an integer > --- > > Key: SPARK-28732 > URL: https://issues.apache.org/jira/browse/SPARK-28732 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Alix Métivier >Priority: Blocker > > I am using agg function on a dataset, and i want to count the number of lines > upon grouping columns. 
I would like to store the result of this count in an > integer, but it fails with this output : > [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 89, Column 53: No applicable constructor/method found > for actual parameters "long"; candidates are: "java.lang.Integer(int)", > "java.lang.Integer(java.lang.String)" > Here is the line 89 and a few others to understand : > /* 085 */ long value13 = i.getLong(5); > /* 086 */ argValue4 = value13; > /* 087 */ > /* 088 */ > /* 089 */ final java.lang.Integer value12 = false ? null : new > java.lang.Integer(argValue4); > > As per Integer documentation, there is not constructor for the type Long, so > this is why the generated code fails. > > Here is my code : > org.apache.spark.sql.Dataset ds_row2 = > ds_conntAggregateRow_1_Out_1 > .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), > org.apache.spark.sql.fun
[jira] [Updated] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing t
[ https://issues.apache.org/jira/browse/SPARK-28732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alix Métivier updated SPARK-28732: -- Description: I am using agg function on a dataset, and i want to count the number of lines upon grouping columns. I would like to store the result of this count in an integer, but it fails with this output : [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 89, Column 53: No applicable constructor/method found for actual parameters "long"; candidates are: "java.lang.Integer(int)", "java.lang.Integer(java.lang.String)" Here is the line 89 and a few others to understand : /* 085 */ long value13 = i.getLong(5); /* 086 */ argValue4 = value13; /* 087 */ /* 088 */ /* 089 */ final java.lang.Integer value12 = false ? null : new java.lang.Integer(argValue4); As per Integer documentation, there is not constructor for the type Long, so this is why the generated code fails. Here is my code : org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), org.apache.spark.sql.functions.col("o_year").as("o_yearN")) .agg(org.apache.spark.sql.functions.count("n_name").as("countN"), .as(org.apache.spark.sql.Encoders.bean(row2Struct.class)); row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int If countN is a Long, code above wont fail If it is a Long, it works in 1.6 and 2.0, but fails on version 2.1+ was: I am using agg function on a dataset, and i want to count the number of lines upon grouping columns. I would like to store the result of this count in an integer, but it fails with this output : [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 89, Column 53: No applicable constructor/method found for actual parameters "long"; candidates are: "java.lang.Integer(int)", "java.lang.Integer(java.lang.String)" Here is the line 89 and a few others to understand : /* 085 */ long value13 = i.getLong(5); /* 086 */ argValue4 = value13; /* 087 */ /* 088 */ /* 089 */ final java.lang.Integer value12 = false ? null : new java.lang.Integer(argValue4); As per Integer documentation, there is not constructor for the type Long, so this is why the generated code fails. Here is my code : org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1 .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), org.apache.spark.sql.functions.col("o_year").as("o_yearN")) .agg(org.apache.spark.sql.functions.count("n_name").as("countN"), .as(org.apache.spark.sql.Encoders.bean(row2Struct.class)); row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int If countN is a Long, code above wont fail If it is a Long, is works in 1.6 and 2.0, but fails on version 2.1+ > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to > compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java' when storing the result of a count aggregation in an integer > --- > > Key: SPARK-28732 > URL: https://issues.apache.org/jira/browse/SPARK-28732 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0, 2.3.0, 2.4.0 >Reporter: Alix Métivier >Priority: Blocker > > I am using agg function on a dataset, and i want to count the number of lines > upon grouping columns. 
I would like to store the result of this count in an > integer, but it fails with this output : > [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - > failed to compile: org.codehaus.commons.compiler.CompileException: File > 'generated.java', Line 89, Column 53: No applicable constructor/method found > for actual parameters "long"; candidates are: "java.lang.Integer(int)", > "java.lang.Integer(java.lang.String)" > Here is the line 89 and a few others to understand : > /* 085 */ long value13 = i.getLong(5); > /* 086 */ argValue4 = value13; > /* 087 */ > /* 088 */ > /* 089 */ final java.lang.Integer value12 = false ? null : new > java.lang.Integer(argValue4); > > As per Integer documentation, there is not constructor for the type Long, so > this is why the generated code fails. > > Here is my code : > org.apache.spark.sql.Dataset ds_row2 = > ds_conntAggregateRow_1_Out_1 > .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), > org.apache.spark.sql.functio
[jira] [Created] (SPARK-28732) org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing t
Alix Métivier created SPARK-28732: - Summary: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java' when storing the result of a count aggregation in an integer Key: SPARK-28732 URL: https://issues.apache.org/jira/browse/SPARK-28732 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.0, 2.2.0, 2.1.0 Reporter: Alix Métivier I am using the agg function on a dataset, and I want to count the number of rows for each combination of the grouping columns. I would like to store the result of this count in an integer, but it fails with this output: [ERROR]: org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator - failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 89, Column 53: No applicable constructor/method found for actual parameters "long"; candidates are: "java.lang.Integer(int)", "java.lang.Integer(java.lang.String)" Here is line 89 with a few surrounding lines for context:
/* 085 */ long value13 = i.getLong(5);
/* 086 */ argValue4 = value13;
/* 087 */
/* 088 */
/* 089 */ final java.lang.Integer value12 = false ? null : new java.lang.Integer(argValue4);
As per the Integer documentation, there is no constructor taking a long, which is why the generated code fails. Here is my code:
org.apache.spark.sql.Dataset ds_row2 = ds_conntAggregateRow_1_Out_1
    .groupBy(org.apache.spark.sql.functions.col("n_name").as("n_nameN"), org.apache.spark.sql.functions.col("o_year").as("o_yearN"))
    .agg(org.apache.spark.sql.functions.count("n_name").as("countN"))
    .as(org.apache.spark.sql.Encoders.bean(row2Struct.class));
The row2Struct class is composed of n_nameN: String, o_yearN: String, countN: Int. If countN is a Long, the code above won't fail; as an Int it works in 1.6 and 2.0, but fails on version 2.1+. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
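The root of the error is that count() produces a 64-bit LongType column, so the generated code has a long in hand where the bean declares an Integer. A quick PySpark check of the aggregated schema (illustrative names):
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("count-is-long").getOrCreate()
df = spark.createDataFrame(
    [("smith", "1998"), ("smith", "1999")], ["n_name", "o_year"])

agg = df.groupBy("n_name").agg(F.count("n_name").alias("countN"))
agg.printSchema()  # countN: long -- so the bean field should be a Long, not an Int
{code}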
[jira] [Updated] (SPARK-28731) Support limit on recursive queries
[ https://issues.apache.org/jira/browse/SPARK-28731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peter Toth updated SPARK-28731: --- Description: Recursive queries should support LIMIT and stop recursion once the required number of rows has been reached. (was: PostgreSQL does support recursive view syntax: {noformat} CREATE RECURSIVE VIEW nums (n) AS VALUES (1) UNION ALL SELECT n+1 FROM nums WHERE n < 5 {noformat}) > Support limit on recursive queries > -- > > Key: SPARK-28731 > URL: https://issues.apache.org/jira/browse/SPARK-28731 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Peter Toth >Priority: Minor > > Recursive queries should support LIMIT and stop recursion once the required > number of rows has been reached. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
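A plain-Python sketch of the requested semantics (not Spark internals): each iteration unions in a new batch of rows, and recursion stops as soon as the limit is reached:
{code:python}
def recursive_query(seed_rows, step, limit):
    """Emit the seed rows, then repeatedly apply the recursive step (the
    UNION ALL arm), stopping once `limit` rows have been produced."""
    result, frontier = [], list(seed_rows)
    while frontier and len(result) < limit:
        result.extend(frontier)
        frontier = step(frontier)
    return result[:limit]


# The PostgreSQL-style nums example: start at 1, add 1 while n < 5.
rows = recursive_query([1], lambda rs: [n + 1 for n in rs if n + 1 <= 5], limit=3)
print(rows)  # [1, 2, 3] -- recursion stopped by the limit, not the predicate
{code}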
[jira] [Created] (SPARK-28731) Support limit on recursive queries
Peter Toth created SPARK-28731: -- Summary: Support limit on recursive queries Key: SPARK-28731 URL: https://issues.apache.org/jira/browse/SPARK-28731 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Peter Toth PostgreSQL does support recursive view syntax: {noformat} CREATE RECURSIVE VIEW nums (n) AS VALUES (1) UNION ALL SELECT n+1 FROM nums WHERE n < 5 {noformat} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28730) Configurable type coercion policy for table insertion
Gengliang Wang created SPARK-28730: -- Summary: Configurable type coercion policy for table insertion Key: SPARK-28730 URL: https://issues.apache.org/jira/browse/SPARK-28730 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang After all the discussions on the dev list (http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562), I propose making the store assignment rules in the analyzer configurable, with consistent behavior between V1 and V2. When inserting a value into a column with a different data type, Spark performs type coercion. After this PR, we support two policies for the type coercion rules: legacy and strict. 1. With the legacy policy, Spark allows casting any value to any data type, and a null result is returned when the conversion is invalid. The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive. 2. With the strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. converting `double` to `int` or `decimal` to `double` is not allowed. To ensure backward compatibility with existing queries, the default store assignment policy is "legacy". -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
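A hedged sketch of the proposed behavior (the conf name below is the one this work eventually shipped as in Spark 3.0; treat it as an assumption here):
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("store-assignment").getOrCreate()
spark.sql("CREATE TABLE t (i INT) USING parquet")

# Legacy policy: any cast is allowed; invalid conversions become null/truncated.
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
spark.sql("INSERT INTO t VALUES (12.3)")  # silently truncates to 12

# Strict policy: possible precision loss is rejected at analysis time.
spark.conf.set("spark.sql.storeAssignmentPolicy", "STRICT")
spark.sql("INSERT INTO t VALUES (12.3)")  # raises AnalysisException
{code}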
[jira] [Assigned] (SPARK-27739) df.persist should save stats from optimized plan
[ https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-27739: --- Assignee: John Zhuge > df.persist should save stats from optimized plan > > > Key: SPARK-27739 > URL: https://issues.apache.org/jira/browse/SPARK-27739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Minor > > CacheManager.cacheQuery passes the stats for `planToCache` to > InMemoryRelation. Since the plan has not been optimized, the stats is > inaccurate because project and filter have not been applied. I'd suggest > passing the stats from the optimized plan. > {code:java} > class CacheManager extends Logging { > ... > def cacheQuery( > query: Dataset[_], > tableName: Option[String] = None, > storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { > val planToCache = query.logicalPlan > if (lookupCachedData(planToCache).nonEmpty) { > logWarning("Asked to cache already cached data.") > } else { > val sparkSession = query.sparkSession > val inMemoryRelation = InMemoryRelation( > sparkSession.sessionState.conf.useCompression, > sparkSession.sessionState.conf.columnBatchSize, storageLevel, > sparkSession.sessionState.executePlan(planToCache).executedPlan, > tableName, > planToCache) <== > ... > } > object InMemoryRelation { > def apply( > useCompression: Boolean, > batchSize: Int, > storageLevel: StorageLevel, > child: SparkPlan, > tableName: Option[String], > logicalPlan: LogicalPlan): InMemoryRelation = { > val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, > storageLevel, child, tableName) > val relation = new InMemoryRelation(child.output, cacheBuilder, > logicalPlan.outputOrdering) > relation.statsOfPlanToCache = logicalPlan.stats <== > relation > } > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3
[ https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated SPARK-28728: - Description: (was: Due to CVE's: https://www.cvedetails.com/vulnerability-list/vendor_id-15866/product_id-42991/version_id-238179/opec-1/Fasterxml-Jackson-databind-2.9.0.html) > Bump Jackson Databind to 2.9.9.3 > > > Key: SPARK-28728 > URL: https://issues.apache.org/jira/browse/SPARK-28728 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Fokko Driesprong >Priority: Major > Fix For: 2.4.4, 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28728) Bump Jackson Databind to 2.9.9.3
[ https://issues.apache.org/jira/browse/SPARK-28728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fokko Driesprong updated SPARK-28728: - Description: Needs to be upgraded due to issues. > Bump Jackson Databind to 2.9.9.3 > > > Key: SPARK-28728 > URL: https://issues.apache.org/jira/browse/SPARK-28728 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Fokko Driesprong >Priority: Major > Fix For: 2.4.4, 3.0.0 > > > Needs to be upgraded due to issues. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28726) Spark with DynamicAllocation always gets 'Connection reset by peer'
[ https://issues.apache.org/jira/browse/SPARK-28726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907131#comment-16907131 ] angerszhu commented on SPARK-28726: --- [~ajithshetty] It also happens with higher timeouts.
> Spark with DynamicAllocation always gets 'Connection reset by peer'
> -
>
> Key: SPARK-28726
> URL: https://issues.apache.org/jira/browse/SPARK-28726
> Project: Spark
> Issue Type: Wish
> Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When using Spark with dynamic allocation, we set the idle time to 5s.
> We always get netty 'Connection reset by peer' exceptions.
>
> I suspect this is because the 5s idle time is too small: when the BlockManager issues a netty IO call, the executor may already have been removed because of the timeout,
> but the driver's BlockManager has not been notified in time.
> {code:java}
> 19/08/14 00:00:46 WARN org.apache.spark.network.server.TransportChannelHandler: "Exception in connection from /host:port"
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: "Error trying to remove broadcast 67 from block manager BlockManagerId(967, host, port, None)"
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 INFO org.apache.spark.ContextCleaner: "Cleaned accumulator 162174"
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMaster: "Failed to remove shuffle 22 - Connection reset by peer"
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39){code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28726) Spark with DynamicAllocation always gets 'Connection reset by peer'
[ https://issues.apache.org/jira/browse/SPARK-28726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907112#comment-16907112 ] Ajith S commented on SPARK-28726: - As I see it, this is the driver trying to clean up RDDs, broadcasts, etc. from the expiring executor while the executor has already gone down, which is why such exceptions are logged as warnings. Does the issue occur with higher timeouts too?
> Spark with DynamicAllocation always gets 'Connection reset by peer'
> -
>
> Key: SPARK-28726
> URL: https://issues.apache.org/jira/browse/SPARK-28726
> Project: Spark
> Issue Type: Wish
> Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: angerszhu
>Priority: Major
>
> When using Spark with dynamic allocation, we set the idle time to 5s.
> We always get netty 'Connection reset by peer' exceptions.
>
> I suspect this is because the 5s idle time is too small: when the BlockManager issues a netty IO call, the executor may already have been removed because of the timeout,
> but the driver's BlockManager has not been notified in time.
> {code:java}
> 19/08/14 00:00:46 WARN org.apache.spark.network.server.TransportChannelHandler: "Exception in connection from /host:port"
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: "Error trying to remove broadcast 67 from block manager BlockManagerId(967, host, port, None)"
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
> at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)
> at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)
> at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)
> at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
> at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> --
> 19/08/14 00:00:46 INFO org.apache.spark.ContextCleaner: "Cleaned accumulator 162174"
> 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMaster: "Failed to remove shuffle 22 - Connection reset by peer"
> java.io.IOException: Connection reset by peer
> at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39){code}
-- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
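A hedged mitigation sketch for the setup described above: keep the idle timeout closer to its 60s default (the reporter used 5s) so executors are not torn down while the driver still holds references to their blocks:
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("dynamic-allocation")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")
         # 5s is aggressive; the 60s default gives cleanup RPCs time to land.
         .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
         .getOrCreate())
{code}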
[jira] [Created] (SPARK-28729) Comparison between DecimalType and StringType may lead to wrong results
ShuMing Li created SPARK-28729: -- Summary: Comparison between DecimalType and StringType may lead to wrong results Key: SPARK-28729 URL: https://issues.apache.org/jira/browse/SPARK-28729 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: ShuMing Li
{code:java}
desc test_table;
a int NULL
b string NULL
dt string NULL
hh string NULL
# Partition Information
# col_name data_type comment
dt string NULL
hh string NULL

select dt from test_table where dt=201908010023825200017638;
201908010023825200017638
201908010023825200017638
201908010023825200016558
{code}
In the SQL above, column `dt` is of string type. When users forget to add quotes in the query, Spark returns wrong results. In the `TypeCoercion` class, both sides are cast to `DoubleType` when a DecimalType is compared with a StringType, which may not be safe because of precision loss or truncation:
{code:java}
val findCommonTypeForBinaryComparison: (DataType, DataType) => Option[DataType] = {
  // There is no proper decimal type we can pick,
  // using double type is the best we can do.
  // See SPARK-22469 for details.
  case (n: DecimalType, s: StringType) => Some(DoubleType)
  case (s: StringType, n: DecimalType) => Some(DoubleType)
  ...
}
{code}
However, I cannot find a good solution to avoid this: maybe just throw an exception on precision loss, or add a config to avoid this? -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
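The wrong results fall directly out of double rounding; a plain-Python check with the reporter's keys shows the two distinct 24-digit values collapsing to the same double:
{code:python}
a = float("201908010023825200017638")
b = float("201908010023825200016558")

# A 64-bit double carries only ~15-16 significant decimal digits, so both
# 24-digit keys round to the same value and the unquoted predicate matches
# the ...16558 row as well.
print(a == b)      # True
print(f"{a:.0f}")  # the shared rounded value
{code}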
[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated
[ https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16907016#comment-16907016 ] Gabor Somogyi commented on SPARK-28367: --- It has turned out that a new API on the Kafka side is needed for a clean solution. The discussion has been initiated; I'm actively tracking the progress and intend to create a new PR when it's available. > Kafka connector infinite wait because metadata never updated > > > Key: SPARK-28367 > URL: https://issues.apache.org/jira/browse/SPARK-28367 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.3, 2.2.3, 2.3.3, 3.0.0, 2.4.3 >Reporter: Gabor Somogyi >Priority: Critical > > Spark uses an old and deprecated API named poll(long) which never returns and > stays in a live lock if metadata is not updated (for instance when the broker > disappears at consumer creation). > I've created a small standalone application to test it and the alternatives: > https://github.com/gaborgsomogyi/kafka-get-assignment -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
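The shape of the eventual fix is a bounded wait instead of poll(long); a plain-Python sketch of the pattern (the consumer object here is a hypothetical stand-in, not a specific Kafka client API):
{code:python}
import time


def wait_for_assignment(consumer, deadline_s=30.0, step_s=1.0):
    """Poll with a per-call timeout plus an overall deadline, so a vanished
    broker surfaces as an error instead of an infinite wait."""
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        consumer.poll(step_s)          # bounded wait, unlike poll(long)
        if consumer.assignment():      # metadata arrived, partitions assigned
            return consumer.assignment()
    raise TimeoutError("no partition assignment before the deadline")
{code}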
[jira] [Created] (SPARK-28728) Bump Jackson Databind to 2.9.9.3
Fokko Driesprong created SPARK-28728: Summary: Bump Jackson Databind to 2.9.9.3 Key: SPARK-28728 URL: https://issues.apache.org/jira/browse/SPARK-28728 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 2.4.3 Reporter: Fokko Driesprong Fix For: 2.4.4, 3.0.0 Due to CVEs: https://www.cvedetails.com/vulnerability-list/vendor_id-15866/product_id-42991/version_id-238179/opec-1/Fasterxml-Jackson-databind-2.9.0.html -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
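For downstream builds that cannot pick up the bumped release immediately, one option is to pin the patched artifact themselves; a minimal sbt sketch (Spark itself manages the version through its Maven pom, so this override is purely illustrative):
{code:scala}
// build.sbt: force the patched jackson-databind ahead of any transitively
// resolved vulnerable 2.9.x version.
dependencyOverrides += "com.fasterxml.jackson.core" % "jackson-databind" % "2.9.9.3"
{code}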
[jira] [Created] (SPARK-28727) Request for partial least squares (PLS) regression model
Nikunj created SPARK-28727: -- Summary: Request for partial least squares (PLS) regression model Key: SPARK-28727 URL: https://issues.apache.org/jira/browse/SPARK-28727 Project: Spark Issue Type: New Feature Components: ML, SparkR Affects Versions: 2.4.3 Environment: I am using Windows 10, Spark v2.3.2 Reporter: Nikunj Hi. Is there any development going on with regard to a PLS model? Or is there a plan for one in the near future? The application I am developing needs a PLS model, as it is mandatory in that particular industry. I am using sparklyr and have started a bit of the implementation, but was wondering if something is already in the pipeline. Thanks. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28726) Spark with DynamicAllocation always gets "Connection reset by peer"
[ https://issues.apache.org/jira/browse/SPARK-28726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu updated SPARK-28726: -- Description: When using Spark with dynamic allocation, we set the executor idle time to 5s. We always get netty "Connection reset by peer" exceptions. I suspect the 5s idle time is too small: by the time the BlockManager makes a netty I/O call, the executor has already been removed because of the timeout, but the driver's BlockManager has not been notified in time. {code:java} 19/08/14 00:00:46 WARN org.apache.spark.network.server.TransportChannelHandler: "Exception in connection from /host:port" java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) -- 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMasterEndpoint: "Error trying to remove broadcast 67 from block manager BlockManagerId(967, host, port, None)" java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) at sun.nio.ch.IOUtil.read(IOUtil.java:192) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106) at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138) -- 19/08/14 00:00:46 INFO org.apache.spark.ContextCleaner: "Cleaned accumulator 162174" 19/08/14 00:00:46 WARN org.apache.spark.storage.BlockManagerMaster: "Failed to remove shuffle 22 - Connection reset by peer" java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcherImpl.read0(Native Method) at 
sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39){code} > Spark with DynamicAllocation always gets "Connection reset by peer" > - > > Key: SPARK-28726 > URL: https://issues.apache.org/jira/browse/SPARK-28726 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: angerszhu >Priority: Major > > When using Spark with dynamic allocation, we set the executor idle time to 5s. > We always get netty "Connection reset by peer" exceptions. > > I suspect the 5s idle time is too small: by the time the BlockManager makes a > netty I/O call, the executor has already been removed because of the timeout, > but the driver's BlockManager has not been notified in time. > {code:java} > 19/08/14 00:00:46 WARN > org.apache.spark.network.server.TransportChannelHandler: "Exception in > connection from /host:port" > java.io.IOException: Connection reset by peer > at sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) > at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) > at sun.nio.ch.IOUtil.read(IOUtil.java:192) > at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) > at > io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledU
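For anyone trying to reproduce the race between executor removal and in-flight BlockManager calls, a minimal sketch of the configuration the report describes (the 5s idle timeout mirrors the report; the session setup itself is illustrative):
{code:scala}
import org.apache.spark.sql.SparkSession

// With a 5s idle timeout, an executor can be torn down while the driver's
// BlockManager still holds a connection to it, which then surfaces as
// "Connection reset by peer" during broadcast/shuffle cleanup.
val spark = SparkSession.builder()
  .appName("dyn-alloc-repro")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // required by dynamic allocation
  .config("spark.dynamicAllocation.executorIdleTimeout", "5s")
  .getOrCreate()
{code}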