[jira] [Resolved] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34371. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31478 [https://github.com/apache/spark/pull/31478] > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
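For context, a minimal sketch of the test-suite split this ticket describes, assuming the usual Spark test pattern of toggling `spark.sql.sources.useV1SourceList` between "parquet" (DSv1) and "" (DSv2). The suite, test, and config names below are illustrative assumptions, not the actual code from PR 31478:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical names; the shared trait holds the rebasing tests once.
trait ParquetRebaseDatetimeSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("roundtrip an ancient date through parquet") {
    withTempPath { path =>
      withSQLConf("spark.sql.legacy.parquet.datetimeRebaseModeInWrite" -> "LEGACY") {
        Seq("1001-01-01").toDF("s").selectExpr("CAST(s AS DATE) AS d")
          .write.parquet(path.getAbsolutePath)
      }
      checkAnswer(
        spark.read.parquet(path.getAbsolutePath),
        Row(java.sql.Date.valueOf("1001-01-01")))
    }
  }
}

// The same tests run against the DSv1 parquet implementation...
class ParquetRebaseDatetimeV1Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "parquet")
}

// ...and against DSv2.
class ParquetRebaseDatetimeV2Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "")
}
{code}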
[jira] [Assigned] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34371: --- Assignee: Maxim Gekk > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34374: Assignee: (was: Apache Spark) > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:scala} > map.keys > {code} > For values: > *before* > {code:scala} > map.map(_._2) > {code} > *after* > {code:scala} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34374: Assignee: Apache Spark > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:scala} > map.keys > {code} > For values: > *before* > {code:scala} > map.map(_._2) > {code} > *after* > {code:scala} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279401#comment-17279401 ] Apache Spark commented on SPARK-34374: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/31484 > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:scala} > map.keys > {code} > For values: > *before* > {code:scala} > map.map(_._2) > {code} > *after* > {code:scala} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279398#comment-17279398 ] Apache Spark commented on SPARK-33434: -- User 'Eric-Lemmon' has created a pull request for this issue: https://github.com/apache/spark/pull/31483 > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33434: Assignee: (was: Apache Spark) > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33434: Assignee: Apache Spark > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279397#comment-17279397 ] Apache Spark commented on SPARK-33434: -- User 'Eric-Lemmon' has created a pull request for this issue: https://github.com/apache/spark/pull/31483 > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
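For reference, {{conf.isModifiable()}} also exists in the Scala API ({{org.apache.spark.sql.RuntimeConfig.isModifiable}}, likewise added in SPARK-24761), and PySpark mirrors its behavior. A quick Scala illustration of what it reports:

{code:scala}
import org.apache.spark.sql.SparkSession

object IsModifiableDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()

    // Runtime SQL confs can be changed on a live session.
    println(spark.conf.isModifiable("spark.sql.shuffle.partitions")) // true

    // Static SQL confs and core Spark confs are fixed once the session exists.
    println(spark.conf.isModifiable("spark.sql.warehouse.dir")) // false
    println(spark.conf.isModifiable("spark.executor.cores"))    // false

    spark.stop()
  }
}
{code}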
[jira] [Created] (SPARK-34374) Use standard methods to extract keys or values from a Map.
Yang Jie created SPARK-34374: Summary: Use standard methods to extract keys or values from a Map. Key: SPARK-34374 URL: https://issues.apache.org/jira/browse/SPARK-34374 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yang Jie For keys: *before* {code:scala} map.map(_._1) {code} *after* {code:scala} map.keys {code} For values: *before* {code:scala} map.map(_._2) {code} *after* {code:scala} map.values {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
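The refactor is behavior-preserving; a small self-contained Scala illustration (the map contents here are just example data):

{code:scala}
object MapAccessDemo {
  def main(args: Array[String]): Unit = {
    val conf = Map("spark.app.name" -> "demo", "spark.master" -> "local[2]")

    // Before: deconstructs every (key, value) tuple just to keep one side.
    val keysBefore   = conf.map(_._1)
    val valuesBefore = conf.map(_._2)

    // After: the standard accessors state the intent directly.
    val keysAfter   = conf.keys
    val valuesAfter = conf.values

    assert(keysBefore.toSet == keysAfter.toSet)     // same keys
    assert(valuesBefore.toSet == valuesAfter.toSet) // same values
  }
}
{code}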
[jira] [Commented] (SPARK-34346) io.file.buffer.size set by spark.buffer.size will be overridden by hive-site.xml and may cause perf regression
[ https://issues.apache.org/jira/browse/SPARK-34346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279367#comment-17279367 ] Apache Spark commented on SPARK-34346: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31482 > io.file.buffer.size set by spark.buffer.size will be overridden by hive-site.xml > and may cause perf regression > - > > Key: SPARK-34346 > URL: https://issues.apache.org/jira/browse/SPARK-34346 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.1.1 >Reporter: Kent Yao >Priority: Blocker > > In many real-world cases, when interacting with the hive catalog through Spark > SQL, users may just share the `hive-site.xml` for their hive jobs and make a > copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop > configurations, we use `spark.buffer.size(65536)` to reset > `io.file.buffer.size(4096)`. But when we load the hive-site.xml, we ignore > this behavior and reset `io.file.buffer.size` again according to > `hive-site.xml`. > 1. The configuration priority for setting Hadoop and Hive config here is not > right; the order should be `spark > spark.hive > > spark.hadoop > hive > hadoop` > 2. This breaks the `spark.buffer.size` config's behavior for tuning the IO > performance w/ HDFS if there is an existing `io.file.buffer.size` in > hive-site.xml -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
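To make the intended precedence concrete, a hedged sketch (the object and method names are hypothetical, not Spark's actual code) of assembling a Hadoop Configuration so that hive-site.xml can no longer clobber Spark's own setting:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Hypothetical sketch of the priority `spark > spark.hive > spark.hadoop > hive > hadoop`.
object HadoopConfPrecedence {
  def newHadoopConf(sparkConf: SparkConf, hiveSite: Configuration): Configuration = {
    val hadoopConf = new Configuration() // Hadoop defaults, io.file.buffer.size = 4096

    // Apply hive-site.xml early, so anything Spark-level can still win.
    hiveSite.iterator().forEachRemaining(e => hadoopConf.set(e.getKey, e.getValue))

    // spark.hadoop.* and spark.hive.* entries would be applied here (omitted).

    // Spark's own knob goes last and therefore always wins.
    hadoopConf.set("io.file.buffer.size", sparkConf.get("spark.buffer.size", "65536"))
    hadoopConf
  }
}
{code}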
[jira] [Resolved] (SPARK-34343) Add missing test for some non-array types in PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-34343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-34343. -- Fix Version/s: 3.2.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/31456 > Add missing test for some non-array types in PostgreSQL > --- > > Key: SPARK-34343 > URL: https://issues.apache.org/jira/browse/SPARK-34343 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > PostgresIntegrationSuite tests some non-array types for PostgreSQL but tests > for some types are missing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26399) Add new stage-level REST APIs and parameters
[ https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Hu updated SPARK-26399: --- Description: Add the peak values for the metrics to the stages REST API. Also add a new executorSummary REST API, which will return executor summary metrics for a specified stage: {code:java} curl http://<server>:18080/api/v1/applicationsexecutorMetricsSummary{code} Add parameters to the stages REST API to specify: * filtering for task status, and returning tasks that match (for example, FAILED tasks). * task metric quantiles, and adding the task summary if specified * executor metric quantiles, and adding the executor summary if specified Note that the above description is too brief to be clear. [~angerszhuuu] and [~ron8hu] discussed a generic and consistent way for endpoint /application/\{app-id}/stages. It can be: /application/\{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|false]&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING] where * query parameter details=true is to show the detailed task information within each stage. The default value is details=false; * query parameter status can select those stages with the specified status. When the status parameter is not specified, a list of all stages is generated. * query parameter withSummaries=true is to show both task summary information and executor summary information in percentile distribution. The default value is withSummaries=false. * query parameter taskStatus is to show only those tasks with the specified status within their corresponding stages. This parameter can be set when details=true (i.e. this parameter will be ignored when details=false). was: Add the peak values for the metrics to the stages REST API. Also add a new executorSummary REST API, which will return executor summary metrics for a specified stage: {code:java} curl http://<server>:18080/api/v1/applicationsexecutorMetricsSummary{code} Add parameters to the stages REST API to specify: * filtering for task status, and returning tasks that match (for example, FAILED tasks). * task metric quantiles, and adding the task summary if specified * executor metric quantiles, and adding the executor summary if specified Note that the above description is too brief to be clear. [~angerszhuuu] and [~ron8hu] discussed a generic and consistent way for endpoint /application/\{app-id}/stages. It can be: /application/\{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|false]&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING] where * query parameter details=true is to show the detailed task information within each stage. The default value is details=false; * query parameter status can select those stages with the specified status. When the status parameter is not specified, a list of all stages is generated. * query parameter withSummaries=true is to show both task summary information and executor summary information in percentile distribution. The default value is withSummaries=false. * query parameter taskStatus is to show only those tasks with the specified status within their corresponding stages. This parameter will be set when details=true (i.e. this parameter will be ignored when details=false). 
> Add new stage-level REST APIs and parameters > > > Key: SPARK-26399 > URL: https://issues.apache.org/jira/browse/SPARK-26399 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Priority: Major > Attachments: executorMetricsSummary.json, > lispark230_restapi_ex2_stages_failedTasks.json, > lispark230_restapi_ex2_stages_withSummaries.json, > stage_executorSummary_image1.png > > > Add the peak values for the metrics to the stages REST API. Also add a new > executorSummary REST API, which will return executor summary metrics for a > specified stage: > {code:java} > curl http://<server>:18080/api/v1/applicationsexecutorMetricsSummary{code} > Add parameters to the stages REST API to specify: > * filtering for task status, and returning tasks that match (for example, > FAILED tasks). > * task metric quantiles, and adding the task summary if specified > * executor metric quantiles, and adding the executor summary if specified > > Note that the above description is too brief to be clear. [~angerszhuuu] and > [~ron8hu] discussed a generic and consistent way for endpoint > /application/\{app-id}/stages. It can be: > /application/\{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|f
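As a concrete (hypothetical) example of how the proposed query parameters would combine; the host and application id below are placeholders:

{code:scala}
// Hypothetical request against the proposed endpoint; host and app-id are placeholders.
object StagesRestExample {
  def main(args: Array[String]): Unit = {
    val base = "http://localhost:18080/api/v1/applications/app-20210204000000-0001/stages"
    // Failed stages only, with per-task details and task/executor percentile summaries.
    val url  = s"$base?details=true&status=FAILED&withSummaries=true&taskStatus=FAILED"
    println(scala.io.Source.fromURL(url).mkString)
  }
}
{code}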
[jira] [Resolved] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34359. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31474 [https://github.com/apache/spark/pull/31474] > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
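The description is empty, but the linked PR (31474) appears to introduce a flag named {{spark.sql.legacy.keepCommandOutputSchema}}; treating that name as an assumption, usage in spark-shell would look like:

{code:scala}
// Assumes the flag name from PR 31474; verify against your Spark version.
spark.sql("SHOW DATABASES").printSchema()  // default: one column named `namespace`

spark.conf.set("spark.sql.legacy.keepCommandOutputSchema", "true")
spark.sql("SHOW DATABASES").printSchema()  // legacy: column named `databaseName`
{code}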
[jira] [Updated] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-34359: --- Parent: SPARK-34156 Issue Type: Sub-task (was: Improvement) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34330) Literal constructor support UTFString
[ https://issues.apache.org/jira/browse/SPARK-34330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-34330. --- Resolution: Won't Fix > Literal constructor support UTFString > -- > > Key: SPARK-34330 > URL: https://issues.apache.org/jira/browse/SPARK-34330 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Literal constructor support UTFString -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32384) repartitionAndSortWithinPartitions avoid shuffle with same partitioner
[ https://issues.apache.org/jira/browse/SPARK-32384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279293#comment-17279293 ] Apache Spark commented on SPARK-32384: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31480 > repartitionAndSortWithinPartitions avoid shuffle with same partitioner > -- > > Key: SPARK-32384 > URL: https://issues.apache.org/jira/browse/SPARK-32384 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > In {{combineByKeyWithClassTag}}, there is a check that skips the shuffle if > the partitioner is the same as the RDD's current partitioner: > {code:scala} > if (self.partitioner == Some(partitioner)) { > self.mapPartitions(iter => { > val context = TaskContext.get() > new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, > context)) > }, preservesPartitioning = true) > } else { > new ShuffledRDD[K, V, C](self, partitioner) > .setSerializer(serializer) > .setAggregator(aggregator) > .setMapSideCombine(mapSideCombine) > } > {code} > > In {{repartitionAndSortWithinPartitions}}, the shuffle can be skipped in the > same way. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32384) repartitionAndSortWithinPartitions avoid shuffle with same partitioner
[ https://issues.apache.org/jira/browse/SPARK-32384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279292#comment-17279292 ] Apache Spark commented on SPARK-32384: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31480 > repartitionAndSortWithinPartitions avoid shuffle with same partitioner > -- > > Key: SPARK-32384 > URL: https://issues.apache.org/jira/browse/SPARK-32384 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > In {{combineByKeyWithClassTag}}, there is a check that skips the shuffle if > the partitioner is the same as the RDD's current partitioner: > {code:scala} > if (self.partitioner == Some(partitioner)) { > self.mapPartitions(iter => { > val context = TaskContext.get() > new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, > context)) > }, preservesPartitioning = true) > } else { > new ShuffledRDD[K, V, C](self, partitioner) > .setSerializer(serializer) > .setAggregator(aggregator) > .setMapSideCombine(mapSideCombine) > } > {code} > > In {{repartitionAndSortWithinPartitions}}, the shuffle can be skipped in the > same way. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
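A minimal sketch of applying the same short-circuit to {{repartitionAndSortWithinPartitions}}, written as a standalone helper for illustration; this is not the actual change in PR 31480:

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Sketch only: skip the shuffle when the RDD is already partitioned as requested,
// and just sort records inside each existing partition.
def repartitionAndSortWithinPartitions[K: Ordering: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], partitioner: Partitioner): RDD[(K, V)] = {
  if (rdd.partitioner.contains(partitioner)) {
    rdd.mapPartitions(_.toList.sortBy(_._1).iterator, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, V](rdd, partitioner).setKeyOrdering(implicitly[Ordering[K]])
  }
}
{code}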
[jira] [Commented] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279280#comment-17279280 ] Apache Spark commented on SPARK-34373: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31479 > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34373: Assignee: Apache Spark > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279279#comment-17279279 ] Apache Spark commented on SPARK-34373: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31479 > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34373: Assignee: (was: Apache Spark) > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
Kent Yao created SPARK-34373: Summary: HiveThriftServer2 startWithContext may hang with a race issue Key: SPARK-34373 URL: https://issues.apache.org/jira/browse/SPARK-34373 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.1.0 Reporter: Kent Yao ``` 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error occurred during acceptance of message. org.apache.thrift.transport.TTransportException: No underlying server socket. at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) at org.apache.thrift.transport.TServerTransport.acceException in thread "Thread-15" java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) at java.io.BufferedInputStream.read(BufferedInputStream.java:336) at java.io.FilterInputStream.read(FilterInputStream.java:107) at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) ``` the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
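The race is generic: if stop() can run before (or while) the serve loop starts, the loop must observe it. A hedged, generic illustration of the guard pattern, not HiveThriftServer2's actual fix in PR 31479:

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Generic illustration: make stop() visible to a serve loop that may not have
// started yet. Class and parameter names are hypothetical.
class GuardedServer(serveLoop: () => Unit, closeTransport: () => Unit) {
  private val stopped = new AtomicBoolean(false)

  def start(): Unit = {
    val t = new Thread(() => if (!stopped.get()) serveLoop())
    t.setDaemon(true)
    t.start()
  }

  def stop(): Unit =
    if (stopped.compareAndSet(false, true)) {
      closeTransport() // unblocks accept() so a running serve loop can exit
    }
}
{code}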
[jira] [Updated] (SPARK-34372) Speculation results in broken CSV files in Amazon S3
[ https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daehee Han updated SPARK-34372: --- Description: Hi, we've been experiencing some rows getting corrupted while partitioned CSV files were written to Amazon S3. Some records were found broken without any error on Spark. Digging into the root cause, we found out that Spark speculation tried to re-upload a partition that was being uploaded slowly, and ended up uploading only a part of the partition, leaving broken data in S3. Here are the stack traces we've found. There are two executors involved - A: the first executor, which tried to upload the file but took much longer than the other executors (it still succeeded), which made Spark speculation cut in and kick off another executor, B. Executor B started to upload the file too, but was interrupted during the upload (killed: another attempt succeeded), and ended up uploading only a part of the whole file. You can see in the logs that the file executor A uploaded (8461990 bytes originally) was overwritten by executor B (which uploaded only 3145728 bytes). Executor A: {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 13201) 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 10 local blocks and 460 remote blocks 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 18 ms 21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class 21/01/28 17:22:21 INFO INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 8461990 bytes 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 10. 21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20210128172219_0045_m_000426_13201 21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 13201). 
8782 bytes result sent to driver {quote} Executor B: {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 13245) 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 11 local blocks and 459 remote blocks 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 2 ms 21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in stage 45.0 (TID 13245), reason: another attempt succeeded 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 3145728 bytes 21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 0. 21/01/28 17:22:32 ERROR Utils: Aborting task com.univocity.parsers.common.TextWritingException: Error writing row. Internal state when error was thrown: recordCount=18449, recordData=[ Unknown macro: \{obfuscated} ] at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:935) at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:714) at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:84) at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:181) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245) at org.apache.spark.sql.execution.datasources.FileFo
[jira] [Created] (SPARK-34372) Speculation results in broken CSV files in Amazon S3
Daehee Han created SPARK-34372: -- Summary: Speculation results in broken CSV files in Amazon S3 Key: SPARK-34372 URL: https://issues.apache.org/jira/browse/SPARK-34372 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.4.7 Environment: Amazon EMR with AMI version 5.32.0 Reporter: Daehee Han Hi, we've been experiencing some rows getting corrupted while partitioned CSV files were written to Amazon S3. Some records were found broken without any error on Spark. Digging into the root cause, we found out that Spark speculation tried to re-upload a partition that was being uploaded slowly, and ended up uploading only a part of the partition, leaving broken data in S3. Here are the stack traces we've found. There are two executors involved - A: the first executor, which tried to upload the file but took much longer than the other executors (it still succeeded), which made Spark speculation cut in and kick off another executor, B. Executor B started to upload the file too, but was interrupted during the upload (killed: another attempt succeeded), and ended up uploading only a part of the whole file. Executor A: {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 13201) 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 10 local blocks and 460 remote blocks 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 18 ms 21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class 21/01/28 17:22:21 INFO INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 8461990 bytes 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 10. 21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20210128172219_0045_m_000426_13201 21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 13201). 
8782 bytes result sent to driver{quote} Executor B: {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 13245) 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 11 local blocks and 459 remote blocks 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 2 ms 21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in stage 45.0 (TID 13245), reason: another attempt succeeded 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 3145728 bytes 21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 0. 21/01/28 17:22:32 ERROR Utils: Aborting task com.univocity.parsers.common.TextWritingException: Error writing row. Internal state when error was thrown: recordCount=18449, recordData=[\{obfuscated}] at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:935) at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:714) at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:84) at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:181) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scal
[jira] [Updated] (SPARK-34372) Speculation results in broken CSV files in Amazon S3
[ https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daehee Han updated SPARK-34372: --- Description: Hi, we've been experiencing some rows getting corrupted while partitioned CSV files were written to Amazon S3. Some records were found broken without any error on Spark. Digging into the root cause, we found out that Spark speculation tried to re-upload a partition that was being uploaded slowly, and ended up uploading only a part of the partition, leaving broken data in S3. Here are the stack traces we've found. There are two executors involved - A: the first executor, which tried to upload the file but took much longer than the other executors (it still succeeded), which made Spark speculation cut in and kick off another executor, B. Executor B started to upload the file too, but was interrupted during the upload (killed: another attempt succeeded), and ended up uploading only a part of the whole file. Executor A: {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 13201) 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 10 local blocks and 460 remote blocks 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 18 ms 21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class 21/01/28 17:22:21 INFO INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 8461990 bytes 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 10. 21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20210128172219_0045_m_000426_13201 21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 13201). 
8782 bytes result sent to driver {quote} Executor B: {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 13245) 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 11 local blocks and 459 remote blocks 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 2 ms 21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in stage 45.0 (TID 13245), reason: another attempt succeeded 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 3145728 bytes 21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 0. 21/01/28 17:22:32 ERROR Utils: Aborting task com.univocity.parsers.common.TextWritingException: Error writing row. Internal state when error was thrown: recordCount=18449, recordData=[\\{obfuscated}] at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:935) at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:714) at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:84) at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:181) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) at org.apache.spark.util.Utils
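Pending a proper fix, the usual mitigations (assumptions on our side, not something stated in the ticket) are to disable speculation for jobs that write through a direct-output committer, or to use a committer that stages each task attempt under a unique temporary path and only promotes it on commit:

{code:scala}
import org.apache.spark.sql.SparkSession

// Mitigation sketch: avoid speculative attempts racing on the same final S3 key.
val spark = SparkSession.builder()
  .appName("csv-to-s3")
  .config("spark.speculation", "false") // no duplicate task attempts
  .getOrCreate()

// Alternatively, keep speculation but avoid direct-write committers, so each
// attempt writes to its own _temporary path and is renamed only on commit.
{code}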
[jira] [Assigned] (SPARK-34339) Expose the number of truncated paths in Utils.buildLocationMetadata()
[ https://issues.apache.org/jira/browse/SPARK-34339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34339: Assignee: Jungtaek Lim > Expose the number of truncated paths in Utils.buildLocationMetadata() > - > > Key: SPARK-34339 > URL: https://issues.apache.org/jira/browse/SPARK-34339 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > SPARK-31793 introduced Utils.buildLocationMetadata() to reduce the length of > location metadata. It effectively reduced memory usage, as only a few paths > are included (controlled by a threshold value), but there is no indication > that truncation occurred. > If only the first 2 of 5 paths fit within the threshold, > Utils.buildLocationMetadata() shows the first 2 paths, with no mention that 3 > paths were truncated; the output looks the same as if only those 2 paths > existed. This could cause confusion. > We should allow more space (like 10+ chars) to represent the fact that N > paths were truncated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34339) Expose the number of truncated paths in Utils.buildLocationMetadata()
[ https://issues.apache.org/jira/browse/SPARK-34339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34339. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31464 [https://github.com/apache/spark/pull/31464] > Expose the number of truncated paths in Utils.buildLocationMetadata() > - > > Key: SPARK-34339 > URL: https://issues.apache.org/jira/browse/SPARK-34339 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.2.0 > > > SPARK-31793 introduced Utils.buildLocationMetadata() to reduce the length of > location metadata. It effectively reduced memory usage, as only a few paths > are included (controlled by a threshold value), but there is no indication > that truncation occurred. > If only the first 2 of 5 paths fit within the threshold, > Utils.buildLocationMetadata() shows the first 2 paths, with no mention that 3 > paths were truncated; the output looks the same as if only those 2 paths > existed. This could cause confusion. > We should allow more space (like 10+ chars) to represent the fact that N > paths were truncated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
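A small sketch of the requested behavior, i.e., reserving room for an explicit truncation hint. This is illustrative only; the signature and formatting of Spark's actual Utils.buildLocationMetadata differ:

{code:scala}
// Illustrative only; not Spark's real implementation.
def buildLocationMetadata(paths: Seq[String], maxLen: Int): String = {
  val kept = scala.collection.mutable.ArrayBuffer.empty[String]
  var used = 0
  val it = paths.iterator
  var stop = false
  while (it.hasNext && !stop) {
    val p = it.next()
    if (used + p.length <= maxLen) { kept += p; used += p.length }
    else stop = true
  }
  val dropped = paths.length - kept.length
  // The requested change: say explicitly how many paths were cut off.
  val hint =
    if (dropped > 0) (if (kept.nonEmpty) ", " else "") + s"... ($dropped path(s) omitted)"
    else ""
  kept.mkString("[", ", ", hint + "]")
}

// buildLocationMetadata(Seq("/a/b", "/c/d", "/e/f/g/h"), maxLen = 10)
// => "[/a/b, /c/d, ... (1 path(s) omitted)]"
{code}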
[jira] [Assigned] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34371: Assignee: Apache Spark > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34371: Assignee: (was: Apache Spark) > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279207#comment-17279207 ] Apache Spark commented on SPARK-34371: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31478 > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via "avro.schema.url". Over time the schema evolves and a new schema is set for the table via "avro.schema.url"; when data is read from an old partition, this new evolved schema must be used. was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following there is a partitioned Hive table with Avro data. The schema is specified via "avro.schema.url". With time the schema is evolved and the new schema is set for the table "avro.schema.url" when data is read from the old p > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via "avro.schema.url". > Over time the schema evolves and a new schema is set for the table via > "avro.schema.url"; when data is read from an old partition, this new evolved > schema must be used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via "avro.schema.url". With time the schema is evolved and the new schema is set for the table "avro.schema.url" when data is read from the old p was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following there is a partitioned Hive table with Avro data. The schema is specified via https://github.com/apache/spark/pull/31133#discussion_r570561321 > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via "avro.schema.url". > With time the schema is evolved and the new schema is set for the table > "avro.schema.url" when data is read from the old p -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via https://github.com/apache/spark/pull/31133#discussion_r570561321 was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following there is a partitioned Hive table with Avro data. The schema is specified via > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via > https://github.com/apache/spark/pull/31133#discussion_r570561321 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
Maxim Gekk created SPARK-34371: -- Summary: Run datetime rebasing tests for parquet DSv1 and DSv2 Key: SPARK-34371 URL: https://issues.apache.org/jira/browse/SPARK-34371 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Extract the datetime rebasing tests from ParquetIOSuite and place them in a separate test suite so they run against both implementations, DS v1 and DS v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
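The extraction can follow Spark's usual pattern for running one suite against both implementations: a shared base suite holds the tests, and thin subclasses pin "spark.sql.sources.useV1SourceList". A sketch of that pattern follows; class and test names here are illustrative, not the ticket's final code.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Shared tests extracted from ParquetIOSuite would live in the base suite.
abstract class ParquetRebaseDatetimeSuite extends QueryTest with SharedSparkSession {
  test("rebase ancient dates/timestamps in read/write") {
    // datetime rebasing assertions go here
  }
}

// DS v1: route parquet through the v1 file source.
class ParquetRebaseDatetimeV1Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "parquet")
}

// DS v2: an empty v1 list routes parquet through the v2 implementation.
class ParquetRebaseDatetimeV2Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "")
}
{code}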
[jira] [Commented] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279156#comment-17279156 ] Attila Zsolt Piros commented on SPARK-34370: I am working on this. > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
Attila Zsolt Piros created SPARK-34370: -- Summary: Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url" Key: SPARK-34370 URL: https://issues.apache.org/jira/browse/SPARK-34370 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 2.4.0, 2.3.1, 3.1.0, 3.2.0 Reporter: Attila Zsolt Piros -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34351) Running into "Py4JJavaError" while counting to text file or list using Pyspark, Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-34351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279128#comment-17279128 ] Huseyin Elci commented on SPARK-34351: -- I searched StackOverflow for this issue but didn't find anything, and I have spent over 3 days trying to solve it. I also looked at http://spark.apache.org/community.html, which has lots of "Py4JJavaError" reports; I checked a few of the comments, but almost none of them describe the same issue, and the rest concern a different "Py4JJavaError" > Running into "Py4JJavaError" while counting to text file or list using > Pyspark, Jupyter notebook > > > Key: SPARK-34351 > URL: https://issues.apache.org/jira/browse/SPARK-34351 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: PS> python --version > *Python 3.6.8* > PS> jupyter --version > *jupyter core : 4.7.0* > *jupyter-notebook : 6.2.0* > qtconsole : 5.0.2 > ipython : 7.16.1 > ipykernel : 5.4.3 > jupyter client : 6.1.11 > jupyter lab : not installed > nbconvert : 6.0.7 > ipywidgets : 7.6.3 > nbformat : 5.1.2 > traitlets : 4.3.3 > PS > java -version > *java version "1.8.0_271"* > Java(TM) SE Runtime Environment (build 1.8.0_271-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode) > > Spark version > *spark-2.3.1-bin-hadoop2.7* >Reporter: Huseyin Elci >Priority: Major > > I run into the following error. > Any help resolving this error is greatly appreciated. > *My Code 1:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark.sql import SparkSession > from pyspark.conf import SparkConf > spark = SparkSession.builder\ > .master("local[4]")\ > .appName("WordCount_RDD")\ > .getOrCreate() > sc = spark.sparkContext > data = "D:\\05 Spark\\data\\MyArticle.txt" > story_rdd = sc.textFile(data) > story_rdd.count() > {code} > *My Code 2:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark import SparkContext > sc = SparkContext() > mylist = [1,2,2,3,5,48,98,62,14,55] > mylist_rdd = sc.parallelize(mylist) > mylist_rdd.map(lambda x: x*x) > mylist_rdd.map(lambda x: x*x).collect() > {code} > *ERROR:* > I get the same error for both code samples.
> {code:python} > --- > Py4JJavaError Traceback (most recent call last) > in > > 1 story_rdd.count() > C:\Spark\python\pyspark\rdd.py in count(self) > 1071 3 > 1072 """ > -> 1073 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > 1074 > 1075 def stats(self): > C:\Spark\python\pyspark\rdd.py in sum(self) > 1062 6.0 > 1063 """ > -> 1064 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) > 1065 > 1066 def count(self): > C:\Spark\python\pyspark\rdd.py in fold(self, zeroValue, op) > 933 # zeroValue provided to each partition is unique from the one provided > 934 # to the final reduce call > --> 935 vals = self.mapPartitions(func).collect() > 936 return reduce(op, vals, zeroValue) > 937 > C:\Spark\python\pyspark\rdd.py in collect(self) > 832 """ > 833 with SCCallSiteSync(self.context) as css: > --> 834 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > 835 return list(_load_from_socket(sock_info, self._jrdd_deserializer)) > 836 > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in > __call__(self, *args) > 1255 answer = self.gateway_client.send_command(command) > 1256 return_value = get_return_value( > -> 1257 answer, self.gateway_client, self.target_id, self.name) > 1258 > 1259 for temp_arg in temp_args: > C:\Spark\python\pyspark\sql\utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 326 raise Py4JJavaError( > 327 "An error occurred while calling {0}{1}{2}.\n". > --> 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, localhost, executor driver): org.apache.spark.SparkException: Python > worker failed to connect back. > at > org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76) > at
[jira] [Commented] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279126#comment-17279126 ] Apache Spark commented on SPARK-34369: -- User 'sririshindra' has created a pull request for this issue: https://github.com/apache/spark/pull/31477 > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. 
CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
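For readers who want to reproduce the effect, here is a runnable spark-shell sketch of the demo above. Row counts are scaled down (the counts in the ticket text are partly garbled), so the metric values will differ, but the single hot key (77) on both sides produces the same blow-up.

{code:scala}
import spark.implicits._

val df1 = spark.range(0, 200).map { x => (x % 20, 20L) }.toDF("b", "c")
val df2 = spark.range(0, 300).map { x => (77L, 20L) }.toDF("b", "c")
val df3 = spark.range(0, 200).map(x => (x + 1, x + 2)).toDF("b", "c")
val df4 = spark.range(0, 300).map(x => (77L, x + 2)).toDF("b", "c")

val df5 = df1.union(df2)
val df6 = df3.union(df4)

df5.createOrReplaceTempView("table1")
df6.createOrReplaceTempView("table2")

// The bound condition f.c > p.c keeps "number of output rows" modest, yet the
// join still examines every pair sharing key 77; the proposed "number of
// matched pairs" metric would surface that hidden work in the SQL tab.
spark.sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > p.c").count()
{code}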
[jira] [Assigned] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34369: Assignee: Apache Spark > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Assignee: Apache Spark >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. 
LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34369: Assignee: (was: Apache Spark) > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. 
LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279125#comment-17279125 ] Apache Spark commented on SPARK-34369: -- User 'sririshindra' has created a pull request for this issue: https://github.com/apache/spark/pull/31477 > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. 
CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34366) Add metric interfaces to DS v2
[ https://issues.apache.org/jira/browse/SPARK-34366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279114#comment-17279114 ] Apache Spark commented on SPARK-34366: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/31476 > Add metric interfaces to DS v2 > -- > > Key: SPARK-34366 > URL: https://issues.apache.org/jira/browse/SPARK-34366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Add a few public API changes to DS v2 so that a DS v2 scan can report metrics > to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
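For context, here is a sketch of what such metric interfaces could look like. The actual API was still under review in the linked PR at the time, so the trait and method names below are illustrative, not the final Spark API.

{code:scala}
// Driver-side description of one metric, including how per-task values are
// aggregated for display in the UI.
trait CustomMetric {
  def name(): String
  def description(): String
  def aggregateTaskMetrics(taskMetrics: Array[Long]): String
}

// A metric value emitted by one task on an executor.
trait CustomTaskMetric {
  def name(): String
  def value(): Long
}

// A DS v2 scan would declare the metrics it supports; partition readers would
// then report current values, which Spark collects alongside task metrics.
trait SupportsReportMetrics {
  def supportedCustomMetrics(): Array[CustomMetric]
}
{code}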
[jira] [Assigned] (SPARK-34366) Add metric interfaces to DS v2
[ https://issues.apache.org/jira/browse/SPARK-34366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34366: Assignee: L. C. Hsieh (was: Apache Spark) > Add metric interfaces to DS v2 > -- > > Key: SPARK-34366 > URL: https://issues.apache.org/jira/browse/SPARK-34366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Add a few public API changes to DS v2 so that a DS v2 scan can report metrics > to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34366) Add metric interfaces to DS v2
[ https://issues.apache.org/jira/browse/SPARK-34366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34366: Assignee: Apache Spark (was: L. C. Hsieh) > Add metric interfaces to DS v2 > -- > > Key: SPARK-34366 > URL: https://issues.apache.org/jira/browse/SPARK-34366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Add a few public API changes to DS v2 so that a DS v2 scan can report metrics > to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name
[ https://issues.apache.org/jira/browse/SPARK-34365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-34365: Description: When reading an Avro dataset (using the dataset's schema or by overriding it with 'avroSchema') or writing an Avro dataset with a schema provided by 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by field name. This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that this is much more sensible for default behavior, I propose that we make this behavior configurable using an {{option}} for the Avro datasource. Even at the time that SPARK-27762 was handled, there was [interest in making this behavior configurable|https://github.com/apache/spark/pull/24635#issuecomment-494205251], but it appears it went unaddressed. There is precedent for making this behavior configurable, as seen in SPARK-32864, which added this support for ORC. Besides this precedent, the behavior of Hive is to perform matching positionally ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), so this is behavior that Hadoop/Hive ecosystem users are familiar with: {quote} Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. {quote} was: When reading an Avro dataset (using the dataset's schema or by overriding it with 'avroSchema') or writing an Avro dataset with a schema provided by 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by field name. This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that this is much more sensible for default behavior, I propose that we make this behavior configurable using an {{option}} for the Avro datasource. There is precedent for making this behavior configurable, as seen in SPARK-32864, which added this support for ORC. Besides this precedent, the behavior of Hive is to perform matching positionally ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), so this is behavior that Hadoop/Hive ecosystem users are familiar with: {quote} Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. {quote} > Support configurable Avro schema field matching for positional or by-name > - > > Key: SPARK-34365 > URL: https://issues.apache.org/jira/browse/SPARK-34365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Major > > When reading an Avro dataset (using the dataset's schema or by overriding it > with 'avroSchema') or writing an Avro dataset with a schema provided by > 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by > field name. > This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at > least on the write path, we would match the schemas positionally > (a "structural" comparison). 
While I agree that this is much more sensible for > default behavior, I propose that we make this behavior configurable using an > {{option}} for the Avro datasource. Even at the time that SPARK-27762 was > handled, there was [interest in making this behavior > configurable|https://github.com/apache/spark/pull/24635#issuecomment-494205251], > but it appears it went unaddressed. > There is precedent for making this behavior configurable, as seen in > SPARK-32864, which added this support for ORC. Besides this precedent, the > behavior of Hive is to perform matching positionally > ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), > so this is behavior that Hadoop/Hive ecosystem users are familiar with: > {quote} > Hive is very forgiving about types: it will attempt to store whatever value > matches the provided column in the equivalent column position in the new > table. No matching is done on column names, for instance. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
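As an illustration of the proposal, reading with an explicit 'avroSchema' plus a positional-matching switch might look like the sketch below. The option name "positionalFieldMatching" and the path are hypothetical here, since the ticket only proposes making the behavior configurable.

{code:scala}
// Caller-supplied Avro schema; with positional matching, "first" and "second"
// would bind to the Catalyst columns by position rather than by name.
val avroSchema =
  """{"type": "record", "name": "Rec", "fields": [
    |  {"name": "first", "type": "long"},
    |  {"name": "second", "type": "string"}
    |]}""".stripMargin

val df = spark.read
  .format("avro")
  .option("avroSchema", avroSchema)
  .option("positionalFieldMatching", "true") // hypothetical option name
  .load("/tmp/events.avro")
{code}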
[jira] [Updated] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-34369: -- Description: Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that Spark has gotten deadlocked. If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (e.g. shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures. It would be much better if Spark had a metric (which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on. This would even help in cases where there was some other cause of a stuck executor (e.g. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that Spark works through, in one task. To further demonstrate the usefulness of this metric, please follow the steps below. _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", "c")_ _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ _val df5 = df1.union(df2)_ _val df6 = df3.union(df4)_ _df5.createOrReplaceTempView("table1")_ _df6.createOrReplaceTempView("table2")_ h3. InnerJoin _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. FullOuterJoin _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,099,964_ _number of matched pairs: 90,000,490,000_ h3. LeftOuterJoin _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,079,964_ _number of matched pairs: 90,000,490,000_ h3. RightOuterJoin _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,600,000_ _number of matched pairs: 90,000,490,000_ h3. LeftSemiJoin _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 36_ _number of matched pairs: 89,994,910,036_ h3. CrossJoin _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. LeftAntiJoin _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > p.c").count_ number of output rows: 499,964 number of matched pairs: 89,994,910,036 was: Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that Spark has gotten deadlocked. 
If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (e.g. shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures. It would be much better if Spark had a metric (which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on. This would even help in cases where there was some other cause of a stuck executor (e.g. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that Spark works through, in one task. To further demonstrate the usefulness of this metric, please follow the steps below. _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", "c")_ _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_
[jira] [Commented] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279098#comment-17279098 ] Srinivas Rishindra Pothireddi commented on SPARK-34369: --- I am working on this > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. 
CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34369) Track number of pairs processed out of Join
Srinivas Rishindra Pothireddi created SPARK-34369: - Summary: Track number of pairs processed out of Join Key: SPARK-34369 URL: https://issues.apache.org/jira/browse/SPARK-34369 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 3.2.0 Reporter: Srinivas Rishindra Pothireddi Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that Spark has gotten deadlocked. If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (e.g. shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures. It would be much better if Spark had a metric (which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on. This would even help in cases where there was some other cause of a stuck executor (e.g. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that Spark works through, in one task. To further demonstrate the usefulness of this metric, please follow the steps below. _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", "c")_ _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ _val df5 = df1.union(df2)_ _val df6 = df3.union(df4)_ _df5.createOrReplaceTempView("table1")_ _df6.createOrReplaceTempView("table2")_ h3. InnerJoin _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. FullOuterJoin _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,099,964_ _number of matched pairs: 90,000,490,000_ h3. LeftOuterJoin _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,079,964_ _number of matched pairs: 90,000,490,000_ h3. RightOuterJoin _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,600,000_ _number of matched pairs: 90,000,490,000_ h3. LeftSemiJoin _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 36_ _number of matched pairs: 89,994,910,036_ h3. CrossJoin _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. 
LeftAntiJoin _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > p.c").count_ number of output rows: 499,964 number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34368) Streaming implementation for metrics from Datasource v2 scan
L. C. Hsieh created SPARK-34368: --- Summary: Streaming implementation for metrics from Datasource v2 scan Key: SPARK-34368 URL: https://issues.apache.org/jira/browse/SPARK-34368 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Use the metrics interface of DS v2 to report metrics for the streaming scan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34367) Batch implementation for metrics from Datasource v2 scan
[ https://issues.apache.org/jira/browse/SPARK-34367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-34367: Description: Use the metrics interface of DS v2 to report metrics for the batch scan. > Batch implementation for metrics from Datasource v2 scan > > > Key: SPARK-34367 > URL: https://issues.apache.org/jira/browse/SPARK-34367 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Use the metrics interface of DS v2 to report metrics for the batch scan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34367) Batch implementation for metrics from Datasource v2 scan
L. C. Hsieh created SPARK-34367: --- Summary: Batch implementation for metrics from Datasource v2 scan Key: SPARK-34367 URL: https://issues.apache.org/jira/browse/SPARK-34367 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34338) Report metrics from Datasource v2 scan
[ https://issues.apache.org/jira/browse/SPARK-34338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-34338: Issue Type: Umbrella (was: Improvement) > Report metrics from Datasource v2 scan > -- > > Key: SPARK-34338 > URL: https://issues.apache.org/jira/browse/SPARK-34338 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > This is related to SPARK-34297. > In SPARK-34297, we want to add a couple of useful metrics when reading from > Kafka in SS. We need some public API changes in DS v2 to make this possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34366) Add metric interfaces to DS v2
L. C. Hsieh created SPARK-34366: --- Summary: Add metric interfaces to DS v2 Key: SPARK-34366 URL: https://issues.apache.org/jira/browse/SPARK-34366 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Add a few public API changes to DS v2 so that a DS v2 scan can report metrics to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name
Erik Krogen created SPARK-34365: --- Summary: Support configurable Avro schema field matching for positional or by-name Key: SPARK-34365 URL: https://issues.apache.org/jira/browse/SPARK-34365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: Erik Krogen When reading an Avro dataset (using the dataset's schema or by overriding it with 'avroSchema') or writing an Avro dataset with a schema provided by 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by field name. This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that this is much more sensible for default behavior, I propose that we make this behavior configurable using an {{option}} for the Avro datasource. There is precedent for making this behavior configurable, as seen in SPARK-32864, which added this support for ORC. Besides this precedent, the behavior of Hive is to perform matching positionally ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), so this is behavior that Hadoop/Hive ecosystem users are familiar with: {quote} Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name
[ https://issues.apache.org/jira/browse/SPARK-34365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279073#comment-17279073 ] Erik Krogen commented on SPARK-34365: - I plan to post a PR for this in the next few days, unless I hear pushback that this is a bad idea. > Support configurable Avro schema field matching for positional or by-name > - > > Key: SPARK-34365 > URL: https://issues.apache.org/jira/browse/SPARK-34365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Major > > When reading an Avro dataset (using the dataset's schema or by overriding it > with 'avroSchema') or writing an Avro dataset with a schema provided by > 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by > field name. > This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at > least on the write path, we would match the schemas positionally (a > "structural" comparison). While I agree that this is much more sensible for > default behavior, I propose that we make this behavior configurable using an > {{option}} for the Avro datasource. > There is precedent for making this behavior configurable, as seen in > SPARK-32864, which added this support for ORC. Besides this precedent, the > behavior of Hive is to perform matching positionally > ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), > so this is behavior that Hadoop/Hive ecosystem users are familiar with: > {quote} > Hive is very forgiving about types: it will attempt to store whatever value > matches the provided column in the equivalent column position in the new > table. No matching is done on column names, for instance. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34033) SparkR Daemon Initialization
[ https://issues.apache.org/jira/browse/SPARK-34033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Howland updated SPARK-34033: Description: Provide a way for users to initialize the sparkR daemon before it forks. I'm a contractor to Target, where we have several projects doing ML with sparkR. The changes proposed here result in weeks of compute-time saved with every run. Please see [docs/sparkr.md#daemon-initialization|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization]. was: Provide a way for users to initialize the sparkR daemon before it forks. I'm a contractor to Target, where we have several projects doing ML with sparkR. The changes proposed here result in weeks of compute-time saved with every run. (40,000 partitions) * (5 seconds to load our R libraries) * (2 calls to gapply in our app) / 60 / 60 = 111 hours. (from [docs/sparkr.md|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization]) h3. Daemon Initialization If your worker function has a lengthy initialization, and your application has lots of partitions, you may find you are spending weeks of compute time repeatedly doing something that should have taken a few seconds during daemon initialization. Every Spark executor spawns a process running an R daemon. The daemon "forks a copy" of itself whenever Spark finds work for it to do. It may be applying a predefined method such as "max", or it may be applying your worker function. SparkR::gapply arranges things so that your worker function will be called with each group. A group is the pair Key-Seq[Row]. In the absence of partitioning, the daemon will fork for every group found. With partitioning, the daemon will fork for every partition found. A partition may have several groups in it. All the initializations and library loading your worker function manages are thrown away when the fork concludes. Every fork has to be initialized. The configuration spark.r.daemonInit provides a way to avoid reloading packages every time the daemon forks by having the daemon pre-load packages. You do this by providing R code to initialize the daemon for your application. h4. Examples Suppose we want library(wow) to be pre-loaded for our workers. {{sparkR.session(spark.r.daemonInit = 'library(wow)')}} Of course, that would only work if we knew that library(wow) was on our path and available on the executor. If we have to ship the library, we can use YARN: sparkR.session( master = 'yarn', spark.r.daemonInit = '.libPaths(c("wowTarget", .libPaths())); library(wow)', spark.submit.deployMode = 'client', spark.yarn.dist.archives = 'wow.zip#wowTarget') YARN creates a directory for the new executor, unzips 'wow.zip' in some other directory, and then provides a symlink to it called ./wowTarget. When the executor starts the daemon, the daemon loads library(wow) from the newly created wowTarget. Warning: if your initialization takes longer than 10 seconds, consider increasing the configuration [spark.r.daemonTimeout](configuration.md#sparkr). > SparkR Daemon Initialization > > > Key: SPARK-34033 > URL: https://issues.apache.org/jira/browse/SPARK-34033 > Project: Spark > Issue Type: Improvement > Components: R, SparkR >Affects Versions: 3.2.0 > Environment: tested on centos 7 & spark 2.3.1 and on my mac & spark > at master >Reporter: Tom Howland >Priority: Major > Original Estimate: 0h > Remaining Estimate: 0h > > Provide a way for users to initialize the sparkR daemon before it forks. 
> I'm a contractor to Target, where we have several projects doing ML with > sparkR. The changes proposed here result in weeks of compute-time saved with > every run. > Please see > [docs/sparkr.md#daemon-initialization|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization]. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34364) Monitor disk usage and use to reject blocks when under disk pressure
Holden Karau created SPARK-34364: Summary: Monitor disk usage and use to reject blocks when under disk pressure Key: SPARK-34364 URL: https://issues.apache.org/jira/browse/SPARK-34364 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.2.0 Reporter: Holden Karau Has some limitations when combined with emptyDir on K8s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34363) Allow users to configure a maximum amount of remote shuffle block storage
Holden Karau created SPARK-34363: Summary: Allow users to configure a maximum amount of remote shuffle block storage Key: SPARK-34363 URL: https://issues.apache.org/jira/browse/SPARK-34363 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.2.0 Reporter: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths
[ https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279006#comment-17279006 ] cornel creanga commented on SPARK-34298: Thanks for the answer. In this case, wouldn't it be better to implement option a) - throw an explicit error with a meaningful message (e.g. root dirs are not supported in overwrite mode) when trying to use a root dir? Right now one will get a java.lang.IndexOutOfBoundsException and will have to dig into the Spark code in order to understand what the problem is (as happened to me). > SaveMode.Overwrite not usable when using s3a root paths > > > Key: SPARK-34298 > URL: https://issues.apache.org/jira/browse/SPARK-34298 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: cornel creanga >Priority: Minor > > SaveMode.Overwrite does not work when using paths containing just the root eg > "s3a://peakhour-report". To reproduce the issue (an s3 bucket + credentials > are needed): > val out = "s3a://peakhour-report" > val sparkContext: SparkContext = SparkContext.getOrCreate() > val someData = Seq(Row(24, "mouse")) > val someSchema = List(StructField("age", IntegerType, true), StructField("word", StringType, true)) > val someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(someSchema)) > sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey) > sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey) > sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") > sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") > someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite).save(out) > Error stacktrace: > Exception in thread "main" java.lang.IllegalArgumentException: Can not create > a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168) > at org.apache.hadoop.fs.Path.suffix(Path.java:446) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:240) > If you change out from val out = "s3a://peakhour-report" to val out = "s3a://peakhour-report/folder" the code works. > There are two problems in the actual code from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions: > a) it uses the org.apache.hadoop.fs.Path.suffix method, which doesn't work on root paths > b) it tries to delete the root folder directly (in our case the s3 bucket name) and this is prohibited (in the S3AFileSystem class) > I think that there are two choices: > a) throw an explicit error when using overwrite mode for root folders > b) fix the actual issue: don't use the Path.suffix method and change the clean-up code in InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to list the root folder content and delete the entries one by one. > I can provide a patch for both choices, assuming that they make sense. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
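A minimal sketch of choice b) above, using only the Hadoop FileSystem API (this is an illustration of the proposed clean-up strategy, not the actual Spark patch; the helper name is made up):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Path.suffix appends to the path's name, which is empty for a root path,
// so it throws "Can not create a Path from an empty string". Listing the
// directory and deleting each child avoids both that call and deleting
// the root (e.g. the S3 bucket) itself.
def clearDirectory(dir: Path, conf: Configuration): Unit = {
  val fs: FileSystem = dir.getFileSystem(conf)
  fs.listStatus(dir).foreach(status => fs.delete(status.getPath, true))
}
{code}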
[jira] [Commented] (SPARK-34337) Reject disk blocks when out of disk space
[ https://issues.apache.org/jira/browse/SPARK-34337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279000#comment-17279000 ] Holden Karau commented on SPARK-34337: -- Initially we should allow the user to configure a maximum amount of shuffle block storage. In the future we can try to use underlying FS info. > Reject disk blocks when out of disk space > - > > Key: SPARK-34337 > URL: https://issues.apache.org/jira/browse/SPARK-34337 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.1.1, 3.1.2 >Reporter: Holden Karau >Priority: Major > > Now that we have the ability to store shuffle blocks on disaggregated > storage (when configured), we should add the option to reject storing blocks > locally on an executor at a certain disk pressure threshold. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
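A sketch of what such a user-facing limit could look like as a Spark internal config entry (the config name, doc text, and default below are assumptions, not a merged setting):

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// Hypothetical setting: executors would reject new disk blocks once local
// shuffle storage exceeds this many bytes.
private[spark] val MAX_LOCAL_SHUFFLE_BYTES =
  ConfigBuilder("spark.shuffle.storage.maxLocalBytes")
    .doc("Maximum bytes of shuffle blocks kept on an executor's local disk " +
      "before further blocks are rejected and pushed to remote storage.")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefaultString("100g")
{code}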
[jira] [Assigned] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
[ https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33888: --- Assignee: Duc Hoa Nguyen (was: Apache Spark) > JDBC SQL TIME type represents incorrectly as TimestampType, it should be > physical Int in millis > --- > > Key: SPARK-33888 > URL: https://issues.apache.org/jira/browse/SPARK-33888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 3.0.0, 3.0.1 >Reporter: Duc Hoa Nguyen >Assignee: Duc Hoa Nguyen >Priority: Minor > Fix For: 3.2.0 > > > Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark > TimestampType. It should be represented as a physical int in millis: a time of day, > with no reference to a particular calendar, time zone or date, with a precision of > one millisecond, stored as the number of milliseconds after midnight, 00:00:00.000. > We encountered the issue of the Avro logical type `TimeMillis` not being > converted correctly to the Spark `Timestamp` struct type using the > `SchemaConverters`; it converts to a regular `int` instead. Reproducible by > ingesting data from a MySQL table with a column of TIME type: the Spark JDBC dataframe > will get the correct type (Timestamp), but enforcing our avro schema > (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to > apply with the following exception: > {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type > for schema of int}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
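For reference, the "physical int in millis" representation described above is just milliseconds after midnight. A minimal sketch of the conversion (illustration only, not Spark's converter):

{code:scala}
import java.sql.Time

// Milliseconds after midnight, 00:00:00.000, as described in the ticket.
def timeToMillisOfDay(t: Time): Int = {
  val localTime = t.toLocalTime            // keeps only the time-of-day portion
  (localTime.toNanoOfDay / 1000000L).toInt // nanos -> millis; max is 86399999
}

// Example: 01:02:03 -> (1*3600 + 2*60 + 3) * 1000 = 3723000 millis
assert(timeToMillisOfDay(Time.valueOf("01:02:03")) == 3723000)
{code}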
[jira] [Assigned] (SPARK-34357) Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34357: --- Assignee: Duc Hoa Nguyen > Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of > timezone > -- > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Assignee: Duc Hoa Nguyen >Priority: Minor > Fix For: 3.2.0 > > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
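One plausible reading of the "offset manipulation" mentioned above, sketched for illustration (this is not the merged patch): shift the java.sql.Time's epoch millis by the JVM default zone offset, so the UTC rendering of the resulting timestamp carries the original wall-clock time portion.

{code:scala}
import java.sql.{Time, Timestamp}
import java.util.TimeZone

// java.sql.Time holds millis since epoch interpreted in the default timezone.
// Adding the zone offset yields an instant whose UTC time-of-day equals the
// original local time portion, so the stored value no longer depends on the
// system timezone of the machine that read it.
def toTimestampWithFixedTimePortion(t: Time): Timestamp = {
  val millis = t.getTime
  new Timestamp(millis + TimeZone.getDefault.getOffset(millis))
}
{code}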
[jira] [Resolved] (SPARK-34357) Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34357. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31473 [https://github.com/apache/spark/pull/31473] > Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of > timezone > -- > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > Fix For: 3.2.0 > > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278954#comment-17278954 ] Svitlana Ponomarova commented on SPARK-34362: - According to the Google Dataproc 1.4 documentation, it supports Apache Hive 2.3.7: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] As I see from the sources in *branch-2.4*: [https://github.com/apache/spark/blob/e7acca22cd1ed9a70fabc9ca143aa06fa8573864/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L101] there is no case matching the version string "2.3", which causes the MatchError. Could this part be back-ported from *branch-3.0*? https://github.com/apache/spark/blob/06942331a7db1e6d5e6709ac7009c180c94cc7c0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L107 > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
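A simplified sketch of the branch-3.0 style of matching the comment points at (abridged; the real IsolatedClientLoader maps each case to a HiveVersion value rather than a string):

{code:scala}
// Simplified: branch-3.0 accepts the bare "2.3" alongside the full patch
// versions, so Hive 2.3.x clients no longer hit scala.MatchError: 2.3.
def hiveClientVersion(version: String): String = version match {
  case "2.3" | "2.3.0" | "2.3.1" | "2.3.2" | "2.3.3" | "2.3.4" | "2.3.5" |
       "2.3.6" | "2.3.7" => "2.3"
  case other =>
    throw new IllegalArgumentException(s"Unsupported Hive version: $other")
}
{code}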
[jira] [Comment Edited] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278938#comment-17278938 ] Svitlana Ponomarova edited comment on SPARK-34362 at 2/4/21, 4:33 PM: -- Maybe the "Components" field is wrong. Unfortunately, I don't see "Hive" in the list. was (Author: sponomarova): Maybe the "Components" field was filled in wrong. Unfortunately, I don't see "Hive" in the list. > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svitlana Ponomarova updated SPARK-34362: Description: According to the Google Dataproc 1.4 release notes: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] 2.4.7 is a supported Spark version. Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: {noformat} scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 (of class java.lang.String) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) {noformat} was: According to the Google Dataproc 1.4 release notes: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] 2.4.7 is a supported Spark version. Using spark-hive_2.11-2.4.7.jar for Hive on Dataproc 1.4 causes: {noformat} scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 (of class java.lang.String) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) {noformat} > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278938#comment-17278938 ] Svitlana Ponomarova commented on SPARK-34362: - Maybe the "Components" field was filled in wrong. Unfortunately, I don't see "Hive" in the list. > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using spark-hive_2.11-2.4.7.jar for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
Svitlana Ponomarova created SPARK-34362: --- Summary: scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4 Key: SPARK-34362 URL: https://issues.apache.org/jira/browse/SPARK-34362 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.7 Reporter: Svitlana Ponomarova According to the Google Dataproc 1.4 release notes: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] 2.4.7 is a supported Spark version. Using spark-hive_2.11-2.4.7.jar for Hive on Dataproc 1.4 causes: {noformat} scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 (of class java.lang.String) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34361) Dynamic allocation on K8s kills executors with running tasks
[ https://issues.apache.org/jira/browse/SPARK-34361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278932#comment-17278932 ] Attila Zsolt Piros commented on SPARK-34361: I am working on this. > Dynamic allocation on K8s kills executors with running tasks > > > Key: SPARK-34361 > URL: https://issues.apache.org/jira/browse/SPARK-34361 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2 >Reporter: Attila Zsolt Piros >Priority: Major > > There is a race between the executor POD allocator and the cluster scheduler > backend. > During downscaling (in dynamic allocation) we experienced a lot of new > executors killed with running tasks on them. > The pattern in the log is the following: > {noformat} > 21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new > total is 138) > ... > 21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID > 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes) > 21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests > (408,312,307). > ... > 21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on > 100.100.18.138: The executor with id 312 was deleted by a user or the > framework. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34361) Dynamic allocation on K8s kills executors with running tasks
Attila Zsolt Piros created SPARK-34361: -- Summary: Dynamic allocation on K8s kills executors with running tasks Key: SPARK-34361 URL: https://issues.apache.org/jira/browse/SPARK-34361 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.1, 3.0.0, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2 Reporter: Attila Zsolt Piros There is a race between the executor POD allocator and the cluster scheduler backend. During downscaling (in dynamic allocation) we experienced a lot of new executors killed with running tasks on them. The pattern in the log is the following: {noformat} 21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new total is 138) ... 21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes) 21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests (408,312,307). ... 21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: The executor with id 312 was deleted by a user or the framework. {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278878#comment-17278878 ] Apache Spark commented on SPARK-34360: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31475 > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
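A sketch of the API shape the description refers to. TableCatalog is actually a Java interface in org.apache.spark.sql.connector.catalog; the Scala trait name, method signature, and default body below are assumptions for illustration, not the merged change:

{code:scala}
import org.apache.spark.sql.connector.catalog.Identifier

// Hypothetical shape: a default implementation lets existing catalog
// implementations compile unchanged, while a catalog such as
// InMemoryTableCatalog can override it with real truncation for tests.
trait TruncatableTableCatalog {
  /** Remove all rows from the identified table; returns true on success. */
  def truncateTable(ident: Identifier): Boolean =
    throw new UnsupportedOperationException(
      s"Table ${ident.name()} does not support truncation")
}
{code}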
[jira] [Assigned] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34360: Assignee: Apache Spark (was: Maxim Gekk) > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34360: Assignee: Maxim Gekk (was: Apache Spark) > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278877#comment-17278877 ] Apache Spark commented on SPARK-34360: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31475 > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-34360: --- Description: Add new method `truncateTable` to the TableCatalog interface with default implementation. And implement this method in InMemoryTableCatalog. (was: Add new method `truncatePartition` in `SupportsPartitionManagement` and `truncatePartitions` in `SupportsAtomicPartitionManagement`.) > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34360) Support table truncation by v2 Table Catalogs
Maxim Gekk created SPARK-34360: -- Summary: Support table truncation by v2 Table Catalogs Key: SPARK-34360 URL: https://issues.apache.org/jira/browse/SPARK-34360 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.2.0 Add new method `truncatePartition` in `SupportsPartitionManagement` and `truncatePartitions` in `SupportsAtomicPartitionManagement`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34359: Assignee: Wenchen Fan (was: Apache Spark) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278863#comment-17278863 ] Apache Spark commented on SPARK-34359: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31474 > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34359: Assignee: Apache Spark (was: Wenchen Fan) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
Wenchen Fan created SPARK-34359: --- Summary: add a legacy config to restore the output schema of SHOW DATABASES Key: SPARK-34359 URL: https://issues.apache.org/jira/browse/SPARK-34359 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-34359: Issue Type: Improvement (was: Bug) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
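For context, legacy flags like this usually live among the SQLConf definitions. A sketch of the shape such a config could take (the key, doc text, and default below are assumptions, not necessarily the merged config):

{code:scala}
// Inside the SQLConf config definitions (sketch only):
val LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA =
  buildConf("spark.sql.legacy.keepCommandOutputSchema")
    .doc("When true, SHOW DATABASES keeps its pre-3.0 output schema " +
      "(a databaseName column instead of namespace).")
    .booleanConf
    .createWithDefault(false)
{code}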
[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths
[ https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278846#comment-17278846 ] Steve Loughran commented on SPARK-34298: root dirs are special in that they always exist. Normally apps like spark and hive don't notice this as nobody ever runs jobs which write to the base of file:// or hdfs:// ; object stores are special there. You might find the commit algorithms get a bit confused too. In which case: * fixes for the s3a committers welcome; * there is a serialized test phase in hadoop-aws where we do stuff against the root dir; * and a PR for real integration tests can go in https://github.com/hortonworks-spark/cloud-integration * anything related to the classic MR committer will be rejected out of fear of going near it; not safe for s3 anyway Otherwise: workaround is to write into a subdir. Sorry > SaveMode.Overwrite not usable when using s3a root paths > > > Key: SPARK-34298 > URL: https://issues.apache.org/jira/browse/SPARK-34298 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: cornel creanga >Priority: Minor > > SaveMode.Overwrite does not work when using paths containing just the root eg > "s3a://peakhour-report". To reproduce the issue (an s3 bucket + credentials > are needed): > val out = "s3a://peakhour-report" > val sparkContext: SparkContext = SparkContext.getOrCreate() > val someData = Seq(Row(24, "mouse")) > val someSchema = List(StructField("age", IntegerType, true), StructField("word", StringType, true)) > val someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(someSchema)) > sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey) > sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey) > sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") > sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") > someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite).save(out) > Error stacktrace: > Exception in thread "main" java.lang.IllegalArgumentException: Can not create > a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168) > at org.apache.hadoop.fs.Path.suffix(Path.java:446) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:240) > If you change out from val out = "s3a://peakhour-report" to val out = "s3a://peakhour-report/folder" the code works. > There are two problems in the actual code from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions: > a) it uses the org.apache.hadoop.fs.Path.suffix method, which doesn't work on root paths > b) it tries to delete the root folder directly (in our case the s3 bucket name) and this is prohibited (in the S3AFileSystem class) > I think that there are two choices: > a) throw an explicit error when using overwrite mode for root folders > b) fix the actual issue: don't use the Path.suffix method and change the clean-up code in InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to list the root folder content and delete the entries one by one.
[jira] [Created] (SPARK-34358) Add API for all built-in expression functions
Malthe Borch created SPARK-34358: Summary: Add API for all built-in expression functions Key: SPARK-34358 URL: https://issues.apache.org/jira/browse/SPARK-34358 Project: Spark Issue Type: Improvement Components: Java API, PySpark Affects Versions: 3.0.1 Reporter: Malthe Borch From the [SQL functions|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html] documentation: {quote}Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists. {quote} Functions such as "inline_outer" are actually commonly used, but are not currently included in the API, meaning that we lose compile-time safety for those invocations. We should implement the required function definitions for the remaining built-in functions when applicable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
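To illustrate the gap: today an unlisted function like inline_outer is only reachable through a string expression, so a typo surfaces at analysis time rather than compile time. A sketch (the typed wrapper below is hypothetical, not the existing API):

{code:scala}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.expr

// Current route: stringly-typed, no compile-time check that the function exists.
// Assumes df has an array-of-structs column named "structs".
def explodeStructs(df: DataFrame): DataFrame =
  df.select(expr("inline_outer(structs)"))

// Hypothetical typed entry point of the kind the ticket asks for
// (signature is an assumption):
def inline_outer(columnName: String): Column = expr(s"inline_outer($columnName)")
{code}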
[jira] [Updated] (SPARK-34357) Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duc Hoa Nguyen updated SPARK-34357: --- Summary: Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone (was: Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone) > Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of > timezone > -- > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34351) Running into "Py4JJavaError" while counting a text file or list using Pyspark, Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-34351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-34351. - Resolution: Invalid Please use StackOverflow or the user@spark.a.o mailing list to ask this question (as described in [http://spark.apache.org/community.html]). See you there! > Running into "Py4JJavaError" while counting a text file or list using > Pyspark, Jupyter notebook > > > Key: SPARK-34351 > URL: https://issues.apache.org/jira/browse/SPARK-34351 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: PS> python --version > *Python 3.6.8* > PS> jupyter --version > *jupyter core : 4.7.0* > *jupyter-notebook : 6.2.0* > qtconsole : 5.0.2 > ipython : 7.16.1 > ipykernel : 5.4.3 > jupyter client : 6.1.11 > jupyter lab : not installed > nbconvert : 6.0.7 > ipywidgets : 7.6.3 > nbformat : 5.1.2 > traitlets : 4.3.3 > PS> java -version > *java version "1.8.0_271"* > Java(TM) SE Runtime Environment (build 1.8.0_271-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode) > > Spark version > *spark-2.3.1-bin-hadoop2.7* >Reporter: Huseyin Elci >Priority: Major > > I run into the following error; any help resolving it is greatly appreciated. > *My Code 1:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark.sql import SparkSession > from pyspark.conf import SparkConf > spark = SparkSession.builder\ > .master("local[4]")\ > .appName("WordCount_RDD")\ > .getOrCreate() > sc = spark.sparkContext > data = "D:\\05 Spark\\data\\MyArticle.txt" > story_rdd = sc.textFile(data) > story_rdd.count() > {code} > *My Code 2:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark import SparkContext > sc = SparkContext() > mylist = [1,2,2,3,5,48,98,62,14,55] > mylist_rdd = sc.parallelize(mylist) > mylist_rdd.map(lambda x: x*x) > mylist_rdd.map(lambda x: x*x).collect() > {code} > *ERROR:* > I got the same error for both code samples.
> {code:python} > --- > Py4JJavaError Traceback (most recent call last) > in > > 1 story_rdd.count() > C:\Spark\python\pyspark\rdd.py in count(self) > 1071 3 > 1072 """ > -> 1073 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > 1074 > 1075 def stats(self): > C:\Spark\python\pyspark\rdd.py in sum(self) > 1062 6.0 > 1063 """ > -> 1064 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) > 1065 > 1066 def count(self): > C:\Spark\python\pyspark\rdd.py in fold(self, zeroValue, op) > 933 # zeroValue provided to each partition is unique from the one provided > 934 # to the final reduce call > --> 935 vals = self.mapPartitions(func).collect() > 936 return reduce(op, vals, zeroValue) > 937 > C:\Spark\python\pyspark\rdd.py in collect(self) > 832 """ > 833 with SCCallSiteSync(self.context) as css: > --> 834 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > 835 return list(_load_from_socket(sock_info, self._jrdd_deserializer)) > 836 > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in > __call__(self, *args) > 1255 answer = self.gateway_client.send_command(command) > 1256 return_value = get_return_value( > -> 1257 answer, self.gateway_client, self.target_id, self.name) > 1258 > 1259 for temp_arg in temp_args: > C:\Spark\python\pyspark\sql\utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 326 raise Py4JJavaError( > 327 "An error occurred while calling > {0} \{1} \{2} > .\n". > --> 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, localhost, executor driver): org.apache.spark.SparkException: Python > worker failed to connect back. > at > org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at > org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86) > at org.apache.spark.api.python.PythonRDD.comput
[jira] [Commented] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278776#comment-17278776 ] Apache Spark commented on SPARK-34357: -- User 'saikocat' has created a pull request for this issue: https://github.com/apache/spark/pull/31473 > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278775#comment-17278775 ] Apache Spark commented on SPARK-34357: -- User 'saikocat' has created a pull request for this issue: https://github.com/apache/spark/pull/31473 > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34357: Assignee: (was: Apache Spark) > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34357: Assignee: Apache Spark > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Assignee: Apache Spark >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
Duc Hoa Nguyen created SPARK-34357: -- Summary: Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone Key: SPARK-34357 URL: https://issues.apache.org/jira/browse/SPARK-34357 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Duc Hoa Nguyen Due to user-experience issues (confusing to Spark users - java.sql.Time uses milliseconds vs Spark's microseconds; and users lose useful functions like hour(), minute(), etc. on the column), we have decided to revert back to TimestampType, but this time we will enforce the hour to be consistent across system timezones (via offset manipulation). Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: https://github.com/apache/spark/pull/30902#discussion_r569186823 Related issues: [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278750#comment-17278750 ] Apache Spark commented on SPARK-34356: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31472 > OVR transform fix potential column conflict > --- > > Key: SPARK-34356 > URL: https://issues.apache.org/jira/browse/SPARK-34356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > {code:java} > import org.apache.spark.ml.classification._ > val df = > spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt").withColumn("probability", > lit(0.0)) > val classifier = new > LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true) > val ovr = new OneVsRest().setClassifier(classifier) > val ovrm = ovr.fit(df) > ovrm.transform(df) > java.lang.IllegalArgumentException: requirement failed: Column probability > already exists. > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33) > at > org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255) > at > org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222) > at > org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107) > at > org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198) > at > org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203) > ... 49 elided {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
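The conflict arises because each intermediate binary model writes its own probability column into a DataFrame that may already carry a user column of the same name. A sketch of the general avoidance pattern (not necessarily the merged fix): route the intermediate model's output to a uniquely named temporary column and drop it after use.

{code:scala}
import java.util.UUID
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.sql.DataFrame

// Sketch: score with a uniquely named temporary probability column so a
// pre-existing user column called "probability" is never clobbered.
// (rawPredictionCol and predictionCol would need the same treatment.)
def scoreWithTempColumn(model: LogisticRegressionModel, df: DataFrame): DataFrame = {
  val tmp = s"ovr_prob_${UUID.randomUUID().toString.takeRight(12)}"
  model.setProbabilityCol(tmp).transform(df).drop(tmp)
}
{code}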
[jira] [Assigned] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34356: Assignee: Apache Spark (was: zhengruifeng) > OVR transform fix potential column conflict > --- > > Key: SPARK-34356 > URL: https://issues.apache.org/jira/browse/SPARK-34356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > > {code:java} > import org.apache.spark.ml.classification._ > val df = > spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt").withColumn("probability", > lit(0.0)) > val classifier = new > LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true) > val ovr = new OneVsRest().setClassifier(classifier) > val ovrm = ovr.fit(df) > ovrm.transform(df) > java.lang.IllegalArgumentException: requirement failed: Column probability > already exists. > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33) > at > org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255) > at > org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222) > at > org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107) > at > org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198) > at > org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203) > ... 49 elided {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34356:
------------------------------------

    Assignee: zhengruifeng  (was: Apache Spark)

> OVR transform fix potential column conflict
> -------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278749#comment-17278749 ]

Apache Spark commented on SPARK-34356:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31472

> OVR transform fix potential column conflict
> -------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-34356:
---------------------------------
    Summary: OVR transform fix potential column conflict  (was: OVR transform avoid potential column conflict)

> OVR transform fix potential column conflict
> -------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34356) OVR transform avoid potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng reassigned SPARK-34356:
------------------------------------

    Assignee: zhengruifeng

> OVR transform avoid potential column conflict
> ---------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34356) OVR transform avoid potential column conflict
zhengruifeng created SPARK-34356:
------------------------------------

             Summary: OVR transform avoid potential column conflict
                 Key: SPARK-34356
                 URL: https://issues.apache.org/jira/browse/SPARK-34356
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.2.0
            Reporter: zhengruifeng


{code:scala}
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.lit

val df = spark.read.format("libsvm")
  .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
  .withColumn("probability", lit(0.0))
val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
val ovr = new OneVsRest().setClassifier(classifier)
val ovrm = ovr.fit(df)
ovrm.transform(df)

java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
... 49 elided
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
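The root cause is visible in the trace above: OneVsRestModel.transform folds each internal binary model's transform over the input DataFrame, and every such model appends fixed-name output columns (rawPrediction, probability, prediction) that can collide with columns the caller already has. One collision-proof pattern is to point the inner models at randomly suffixed temporary columns and drop them after use. The helper below is a hypothetical sketch of that pattern, in the spirit of the fix in https://github.com/apache/spark/pull/31472 but not the committed code; the function name and column-name scheme are invented for illustration.

{code:scala}
import java.util.UUID

import org.apache.spark.ml.classification.ProbabilisticClassificationModel
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not Spark API): run one inner model with temporary,
// collision-proof output column names, returning the transformed frame and
// the name of the temporary probability column for the caller to consume.
def transformWithTempCols[F, M <: ProbabilisticClassificationModel[F, M]](
    model: M, dataset: DataFrame): (DataFrame, String) = {
  val suffix  = UUID.randomUUID().toString.takeRight(12)
  val tmpRaw  = s"raw_$suffix"   // temporary rawPrediction column
  val tmpProb = s"prob_$suffix"  // temporary probability column
  val tmpPred = s"pred_$suffix"  // temporary prediction column
  val out = model
    .setRawPredictionCol(tmpRaw) // note: mutates the model's params in place
    .setProbabilityCol(tmpProb)
    .setPredictionCol(tmpPred)
    .transform(dataset)
  // Drop the other temporaries immediately; the caller drops tmpProb once
  // its values are consumed, leaving the input schema untouched.
  (out.drop(tmpRaw, tmpPred), tmpProb)
}
{code}

In the OVR fold, the caller would read the temporary probability column for each class, combine the per-class scores into the final prediction, and drop the temporary column before returning.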