[jira] [Commented] (SPARK-31783) Performance test on dense and sparse datasets
[ https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112818#comment-17112818 ] zhengruifeng commented on SPARK-31783: -- The test code is in 'blockify_total', and current result is the .xlsx file. > Performance test on dense and sparse datasets > - > > Key: SPARK-31783 > URL: https://issues.apache.org/jira/browse/SPARK-31783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Attachments: blockify_perf_20200521.xlsx, blockify_total > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31783) Performance test on dense and sparse datasets
[ https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-31783: - Attachment: blockify_perf_20200521.xlsx > Performance test on dense and sparse datasets > - > > Key: SPARK-31783 > URL: https://issues.apache.org/jira/browse/SPARK-31783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Attachments: blockify_perf_20200521.xlsx, blockify_total > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31783) Performance test on dense and sparse datasets
[ https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-31783: - Attachment: blockify_total > Performance test on dense and sparse datasets > - > > Key: SPARK-31783 > URL: https://issues.apache.org/jira/browse/SPARK-31783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Attachments: blockify_perf_20200521.xlsx, blockify_total > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31783) Performance test on dense and sparse datasets
[ https://issues.apache.org/jira/browse/SPARK-31783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31783: Assignee: zhengruifeng > Performance test on dense and sparse datasets > - > > Key: SPARK-31783 > URL: https://issues.apache.org/jira/browse/SPARK-31783 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31783) Performance test on dense and sparse datasets
zhengruifeng created SPARK-31783: Summary: Performance test on dense and sparse datasets Key: SPARK-31783 URL: https://issues.apache.org/jira/browse/SPARK-31783 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31782) Performance test on dense and sparse datasets
zhengruifeng created SPARK-31782: Summary: Performance test on dense and sparse datasets Key: SPARK-31782 URL: https://issues.apache.org/jira/browse/SPARK-31782 Project: Spark Issue Type: Test Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31782) Performance test on dense and sparse datasets
[ https://issues.apache.org/jira/browse/SPARK-31782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31782. -- Resolution: Duplicate > Performance test on dense and sparse datasets > - > > Key: SPARK-31782 > URL: https://issues.apache.org/jira/browse/SPARK-31782 > Project: Spark > Issue Type: Test > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112816#comment-17112816 ] Jungtaek Lim commented on SPARK-31754: -- I can also take a look if the input and checkpoint are shareable (say, no production inputs), but I imagine it's unlikely. > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > Attachments: CodeGen.txt, Logical-Plan.txt > > > When joining 2 streams with watermarking and windowing, we get a > NullPointerException after running for a few minutes. > After the failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files do not contain any null values > in the join columns. > We even restarted the job with those files and the application ran. From this we > concluded that the exception is not caused by the data from the streams. 
> *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at >
[jira] [Updated] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31761: - Target Version/s: (was: 3.1.0) > Sql Div operator can result in incorrect output for int_min > --- > > Key: SPARK-31761 > URL: https://issues.apache.org/jira/browse/SPARK-31761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > > Input in csv : -2147483648,-1 --> (_c0, _c1) > {code} > val res = df.selectExpr("_c0 div _c1") > res.collect > res1: Array[org.apache.spark.sql.Row] = Array([-2147483648]) > {code} > The result should be 2147483648 instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
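The overflow in the quoted report is classic 32-bit two's-complement behavior: -2147483648 div -1 is mathematically 2147483648, which does not fit in a signed 32-bit int and wraps back to -2147483648. A minimal pure-Python sketch of the wraparound (the `to_int32` helper is ours for illustration, not Spark code):

```python
def to_int32(n):
    """Wrap an arbitrary Python int into signed 32-bit two's complement."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

INT_MIN = -2147483648

# Mathematically the quotient is 2147483648 ...
quotient = INT_MIN // -1          # Python ints are unbounded, so no wrap here
# ... but it exceeds the int32 maximum (2147483647) and wraps:
wrapped = to_int32(quotient)

print(quotient, wrapped)  # 2147483648 -2147483648
```

This is why a correct `div` on int inputs must either widen to a larger type or special-case INT_MIN div -1 rather than reuse raw 32-bit division.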
[jira] [Updated] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString
[ https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31762: Parent: SPARK-31408 Issue Type: Sub-task (was: Bug) > Fix perf regression of date/timestamp formatting in toHiveString > > > Key: SPARK-31762 > URL: https://issues.apache.org/jira/browse/SPARK-31762 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > HiveResult.toHiveString has to convert incoming Java date/timestamps types to > days/microseconds because existing API of DateFormatter/TimestampFormatter > don't accept java.sql.Timestamp/java.util.Date and > java.time.Instant/java.time.LocalDate. Internally, the formatters perform > conversions to Java types again. This badly impacts on the performance. The > ticket aims to add new APIs to DateFormatter and TimestampFormatter that > should accept Java types. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString
[ https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31762: --- Assignee: Maxim Gekk > Fix perf regression of date/timestamp formatting in toHiveString > > > Key: SPARK-31762 > URL: https://issues.apache.org/jira/browse/SPARK-31762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > HiveResult.toHiveString has to convert incoming Java date/timestamps types to > days/microseconds because existing API of DateFormatter/TimestampFormatter > don't accept java.sql.Timestamp/java.util.Date and > java.time.Instant/java.time.LocalDate. Internally, the formatters perform > conversions to Java types again. This badly impacts on the performance. The > ticket aims to add new APIs to DateFormatter and TimestampFormatter that > should accept Java types. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31762) Fix perf regression of date/timestamp formatting in toHiveString
[ https://issues.apache.org/jira/browse/SPARK-31762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31762. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28582 [https://github.com/apache/spark/pull/28582] > Fix perf regression of date/timestamp formatting in toHiveString > > > Key: SPARK-31762 > URL: https://issues.apache.org/jira/browse/SPARK-31762 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > HiveResult.toHiveString has to convert incoming Java date/timestamps types to > days/microseconds because existing API of DateFormatter/TimestampFormatter > don't accept java.sql.Timestamp/java.util.Date and > java.time.Instant/java.time.LocalDate. Internally, the formatters perform > conversions to Java types again. This badly impacts on the performance. The > ticket aims to add new APIs to DateFormatter and TimestampFormatter that > should accept Java types. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31760) Simplification Based on Containment
[ https://issues.apache.org/jira/browse/SPARK-31760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31760: Summary: Simplification Based on Containment (was: Consolidating Single Table Predicates) > Simplification Based on Containment > --- > > Key: SPARK-31760 > URL: https://issues.apache.org/jira/browse/SPARK-31760 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
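Simplification based on containment rewrites conjunctions/disjunctions of range predicates on the same column when one range contains the other: for example `x > 5 OR x > 10` simplifies to `x > 5`, and `x > 5 AND x > 10` to `x > 10`. A toy sketch of the idea for `>` predicates only (illustrative, not Spark's optimizer API):

```python
def simplify_gt(op, thresholds):
    """Combine predicates of the form `x > t` joined by AND or OR.

    AND keeps the tightest bound (max t), since that range is contained
    in all the others; OR keeps the loosest bound (min t), since that
    range contains all the others.
    """
    if op not in ("AND", "OR"):
        raise ValueError("op must be 'AND' or 'OR'")
    return max(thresholds) if op == "AND" else min(thresholds)

print(simplify_gt("OR", [5, 10]))   # 5  -> keep x > 5
print(simplify_gt("AND", [5, 10]))  # 10 -> keep x > 10
```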
[jira] [Commented] (SPARK-31781) Move param k (number of clusters) to shared params
[ https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112716#comment-17112716 ] Apache Spark commented on SPARK-31781: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28595 > Move param k (number of clusters) to shared params > -- > > Key: SPARK-31781 > URL: https://issues.apache.org/jira/browse/SPARK-31781 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Param k (number of clusters) is used for all the clustering algorithms, so > move it to shared params. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31781) Move param k (number of clusters) to shared params
[ https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31781: Assignee: Apache Spark > Move param k (number of clusters) to shared params > -- > > Key: SPARK-31781 > URL: https://issues.apache.org/jira/browse/SPARK-31781 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > > Param k (number of clusters) is used for all the clustering algorithms, so > move it to shared params. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31781) Move param k (number of clusters) to shared params
[ https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112715#comment-17112715 ] Apache Spark commented on SPARK-31781: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28595 > Move param k (number of clusters) to shared params > -- > > Key: SPARK-31781 > URL: https://issues.apache.org/jira/browse/SPARK-31781 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Param k (number of clusters) is used for all the clustering algorithms, so > move it to shared params. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31781) Move param k (number of clusters) to shared params
[ https://issues.apache.org/jira/browse/SPARK-31781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31781: Assignee: (was: Apache Spark) > Move param k (number of clusters) to shared params > -- > > Key: SPARK-31781 > URL: https://issues.apache.org/jira/browse/SPARK-31781 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Minor > > Param k (number of clusters) is used for all the clustering algorithms, so > move it to shared params. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31781) Move param k (number of clusters) to shared params
Huaxin Gao created SPARK-31781: -- Summary: Move param k (number of clusters) to shared params Key: SPARK-31781 URL: https://issues.apache.org/jira/browse/SPARK-31781 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 3.1.0 Reporter: Huaxin Gao Param k (number of clusters) is used for all the clustering algorithms, so move it to shared params. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
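Moving k into shared params means each clustering estimator inherits one common mixin instead of redefining the param. A rough pure-Python sketch of that mixin pattern (class and method names hypothetical, not pyspark.ml code):

```python
class HasK:
    """Shared mixin holding the number-of-clusters param."""
    def __init__(self):
        self._k = 2  # an assumed default

    def setK(self, value):
        if value <= 0:
            raise ValueError("k must be > 0")
        self._k = value
        return self  # fluent setter, as in pyspark.ml style

    def getK(self):
        return self._k

# Every clustering estimator reuses the same definition:
class KMeans(HasK):
    pass

class GaussianMixture(HasK):
    pass

print(KMeans().setK(3).getK())  # 3
```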
[jira] [Assigned] (SPARK-31780) Add R test tag to exclude R K8s image building and test
[ https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31780: - Assignee: Dongjoon Hyun > Add R test tag to exclude R K8s image building and test > --- > > Key: SPARK-31780 > URL: https://issues.apache.org/jira/browse/SPARK-31780 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31780) Add R test tag to exclude R K8s image building and test
[ https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31780. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28594 [https://github.com/apache/spark/pull/28594] > Add R test tag to exclude R K8s image building and test > --- > > Key: SPARK-31780 > URL: https://issues.apache.org/jira/browse/SPARK-31780 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31780) Add R test tag to exclude R K8s image building and test
[ https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31780: -- Priority: Minor (was: Major) > Add R test tag to exclude R K8s image building and test > --- > > Key: SPARK-31780 > URL: https://issues.apache.org/jira/browse/SPARK-31780 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31780) Add R test tag to exclude R K8s image building and test
[ https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31780: -- Affects Version/s: (was: 3.1.0) 3.0.0 > Add R test tag to exclude R K8s image building and test > --- > > Key: SPARK-31780 > URL: https://issues.apache.org/jira/browse/SPARK-31780 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31780) Add R test tag to exclude R K8s image building and test
[ https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31780: Assignee: Apache Spark > Add R test tag to exclude R K8s image building and test > --- > > Key: SPARK-31780 > URL: https://issues.apache.org/jira/browse/SPARK-31780 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31780) Add R test tag to exclude R K8s image building and test
[ https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112662#comment-17112662 ] Apache Spark commented on SPARK-31780: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/28594 > Add R test tag to exclude R K8s image building and test > --- > > Key: SPARK-31780 > URL: https://issues.apache.org/jira/browse/SPARK-31780 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31780) Add R test tag to exclude R K8s image building and test
[ https://issues.apache.org/jira/browse/SPARK-31780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31780: Assignee: (was: Apache Spark) > Add R test tag to exclude R K8s image building and test > --- > > Key: SPARK-31780 > URL: https://issues.apache.org/jira/browse/SPARK-31780 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31780) Add R test tag to exclude R K8s image building and test
Dongjoon Hyun created SPARK-31780: - Summary: Add R test tag to exclude R K8s image building and test Key: SPARK-31780 URL: https://issues.apache.org/jira/browse/SPARK-31780 Project: Spark Issue Type: Improvement Components: Kubernetes, Tests Affects Versions: 3.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31779) Redefining struct inside array incorrectly wraps child fields in array
Jeff Evans created SPARK-31779: -- Summary: Redefining struct inside array incorrectly wraps child fields in array Key: SPARK-31779 URL: https://issues.apache.org/jira/browse/SPARK-31779 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5 Reporter: Jeff Evans It seems that redefining a {{struct}} for the purpose of removing a sub-field, when that {{struct}} is itself inside an {{array}}, results in the remaining (non-removed) {{struct}} fields themselves being incorrectly wrapped in an array. For more context, see [this|https://stackoverflow.com/a/46084983/375670] StackOverflow answer and discussion thread. I have debugged this code and distilled it down to what I believe represents a bug in Spark itself. Consider the following {{spark-shell}} session (version 2.4.5): {code} // use a nested JSON structure that contains a struct inside an array val jsonData = """{ "foo": "bar", "top": { "child1": 5, "child2": [ { "child2First": "one", "child2Second": 2 } ] } }""" // read into a DataFrame val df = spark.read.option("multiline", "true").json(Seq(jsonData).toDS()) // create a new definition for "top", which will remove the "top.child2.child2First" column val newTop = struct(df("top").getField("child1").alias("child1"), array(struct(df("top").getField("child2").getField("child2Second").alias("child2Second"))).alias("child2")) // show the schema before and after swapping out the struct definition df.schema.toDDL // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2First`: STRING, `child2Second`: BIGINT>>> df.withColumn("top", newTop).schema.toDDL // `foo` STRING,`top` STRUCT<`child1`: BIGINT, `child2`: ARRAY<STRUCT<`child2Second`: ARRAY<BIGINT>>>> {code} Notice in this case that the new definition for {{top.child2.child2Second}} is an {{ARRAY}}. This is incorrect; it should simply be {{BIGINT}}. There is nothing in the definition of the {{newTop}} {{struct}} that should have caused the type to become wrapped in an array like this. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31778) Support cross-building docker images
Holden Karau created SPARK-31778: Summary: Support cross-building docker images Key: SPARK-31778 URL: https://issues.apache.org/jira/browse/SPARK-31778 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0, 3.0.1, 3.1.0 Reporter: Holden Karau We have a CI for Spark on ARM; we should support building images for both ARM and AMD64 now that docker buildx can cross-compile. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31777) CrossValidator supports user-supplied folds
Xiangrui Meng created SPARK-31777: - Summary: CrossValidator supports user-supplied folds Key: SPARK-31777 URL: https://issues.apache.org/jira/browse/SPARK-31777 Project: Spark Issue Type: New Feature Components: ML Affects Versions: 3.1.0 Reporter: Xiangrui Meng As a user, I can specify how CrossValidator should create folds by specifying a foldCol, which should be integer type with range [0, numFolds). If foldCol is specified, Spark won't do random k-fold split. This is useful if there are custom logics to create folds, e.g., random split by users instead of random splits of events. This is similar to SPARK-16206, which is for the RDD-based APIs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
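With a user-supplied foldCol as proposed here, CrossValidator would split deterministically on an integer column in [0, numFolds) instead of doing a random k-fold split. A small pure-Python sketch of the fold bookkeeping (names and structure hypothetical, not the pyspark.ml implementation):

```python
def make_folds(rows, num_folds, fold_col):
    """Group rows by their user-assigned fold id; each fold is held out once."""
    folds = [[] for _ in range(num_folds)]
    for row in rows:
        f = row[fold_col]
        if not (0 <= f < num_folds):
            raise ValueError(f"fold id {f} outside [0, {num_folds})")
        folds[f].append(row)
    # One (validation_set, training_set) pair per fold.
    return [(folds[i],
             [r for j, fold in enumerate(folds) if j != i for r in fold])
            for i in range(num_folds)]

# Six rows with custom fold ids, e.g. assigned per user rather than per event:
rows = [{"x": i, "fold": i % 3} for i in range(6)]
splits = make_folds(rows, 3, "fold")
print([len(v) for v, _ in splits])  # [2, 2, 2]
```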
[jira] [Updated] (SPARK-31776) Literal lit() supports lists and numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-31776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-31776: -- Issue Type: Improvement (was: New Feature) > Literal lit() supports lists and numpy arrays > - > > Key: SPARK-31776 > URL: https://issues.apache.org/jira/browse/SPARK-31776 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiangrui Meng >Priority: Major > > In ML workload, it is common to replace null feature vectors with some > default value. However, lit() does not support Python list and numpy arrays > at input. Users cannot simply use fillna() to get the job done. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31776) Literal lit() supports lists and numpy arrays
Xiangrui Meng created SPARK-31776: - Summary: Literal lit() supports lists and numpy arrays Key: SPARK-31776 URL: https://issues.apache.org/jira/browse/SPARK-31776 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Xiangrui Meng In ML workload, it is common to replace null feature vectors with some default value. However, lit() does not support Python list and numpy arrays at input. Users cannot simply use fillna() to get the job done. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
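The request is for lit() to accept a Python list (or numpy array) and build an array literal from its elements instead of failing. A sketch of just the dispatch logic, with a stand-in tuple representation for expressions (this is ours for illustration, not the real pyspark.sql.functions.lit):

```python
def lit(value):
    """Toy lit(): recurse on sequences so lists become array literals.

    A numpy array would be handled the same way after value.tolist().
    """
    if isinstance(value, (list, tuple)):
        return ("array", tuple(lit(v) for v in value))
    return ("lit", value)

print(lit(3))          # ('lit', 3)
print(lit([1, 2, 3]))  # ('array', (('lit', 1), ('lit', 2), ('lit', 3)))
```

With something like this in place, fillna-style replacement of null feature vectors with a default list becomes expressible.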
[jira] [Created] (SPARK-31775) Support tensor type (TensorType) in Spark SQL/DataFrame
Xiangrui Meng created SPARK-31775: - Summary: Support tensor type (TensorType) in Spark SQL/DataFrame Key: SPARK-31775 URL: https://issues.apache.org/jira/browse/SPARK-31775 Project: Spark Issue Type: New Feature Components: ML, SQL Affects Versions: 3.1.0 Reporter: Xiangrui Meng More and more DS/ML workloads are dealing with tensors. For example, a decoded color image can be represented by a 3D tensor. It would be nice to natively support tensor type. A local tensor is essentially an array with shape info, stored together. Native support is better than user-defined type (UDT) because PyArrow does not support UDTs but it already supports tensors. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
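As the ticket notes, a local tensor is essentially a flat array plus shape info stored together. A minimal sketch of that representation with row-major indexing (illustrative only; a native TensorType would live in Spark SQL's type system, not user code):

```python
class Tensor:
    """Flat row-major storage plus a shape tuple."""
    def __init__(self, data, shape):
        size = 1
        for d in shape:
            size *= d
        assert len(data) == size, "data length must match product of shape"
        self.data, self.shape = list(data), tuple(shape)

    def __getitem__(self, idx):
        # Row-major offset: ((i0 * d1 + i1) * d2 + i2) ...
        offset = 0
        for i, d in zip(idx, self.shape):
            offset = offset * d + i
        return self.data[offset]

# A tiny stand-in for a decoded image: 2 x 3 x 4 values stored flat.
t = Tensor(range(24), (2, 3, 4))
print(t[(1, 2, 3)])  # 23
```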
[jira] [Assigned] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id
[ https://issues.apache.org/jira/browse/SPARK-31387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31387: - Assignee: Ali Smesseim > HiveThriftServer2Listener update methods fail with unknown operation/session > id > --- > > Key: SPARK-31387 > URL: https://issues.apache.org/jira/browse/SPARK-31387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Ali Smesseim >Assignee: Ali Smesseim >Priority: Major > > HiveThriftServer2Listener update methods, such as onSessionClosed and > onOperationError throw a NullPointerException (in Spark 3) or a > NoSuchElementException (in Spark 2) when the input session/operation id is > unknown. In Spark 2, this can cause control flow issues with the caller of > the listener. In Spark 3, the listener is called by a ListenerBus which > catches the exception, but it would still be nicer if an invalid update is > logged and does not throw an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31387) HiveThriftServer2Listener update methods fail with unknown operation/session id
[ https://issues.apache.org/jira/browse/SPARK-31387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31387. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28544 [https://github.com/apache/spark/pull/28544] > HiveThriftServer2Listener update methods fail with unknown operation/session > id > --- > > Key: SPARK-31387 > URL: https://issues.apache.org/jira/browse/SPARK-31387 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Ali Smesseim >Assignee: Ali Smesseim >Priority: Major > Fix For: 3.0.0 > > > HiveThriftServer2Listener update methods, such as onSessionClosed and > onOperationError throw a NullPointerException (in Spark 3) or a > NoSuchElementException (in Spark 2) when the input session/operation id is > unknown. In Spark 2, this can cause control flow issues with the caller of > the listener. In Spark 3, the listener is called by a ListenerBus which > catches the exception, but it would still be nicer if an invalid update is > logged and does not throw an exception. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
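[Editor's note] The fix resolved above, logging an invalid update instead of throwing on an unknown session/operation id, follows a common defensive-listener pattern. A hedged pure-Python sketch (the real listener is Scala; the dict-based session store and function names here are illustrative assumptions):

```python
# Sketch of the defensive-update pattern: look the id up, warn and return
# on a miss, rather than raising (the NPE/NoSuchElementException reported).
# Names (sessions dict, on_session_closed) are illustrative, not Spark's API.
import logging

logger = logging.getLogger("thriftserver.listener")

def on_session_closed(sessions: dict, session_id: str) -> bool:
    session = sessions.get(session_id)
    if session is None:
        # Invalid update is logged, not thrown, so callers (or the
        # ListenerBus) never see an exception for a stale id.
        logger.warning("onSessionClosed called with unknown session id: %s",
                       session_id)
        return False
    session["finished"] = True
    return True

sessions = {"s1": {"finished": False}}
```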
[jira] [Created] (SPARK-31773) getting the Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, at org.apache.spark.sql.catalyst.errors.package$.attachTree(pac
Pankaj Tiwari created SPARK-31773: - Summary: getting the Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) Key: SPARK-31773 URL: https://issues.apache.org/jira/browse/SPARK-31773 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Environment: spark 2.2 Reporter: Pankaj Tiwari Actually I am loading the excel which has some 90 columns and the some columns name contains special character as well like @ % -> . etc etc so while I am doing one use case like : sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq))); this is working fine but as soon as I am running sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq)).count() it is failing with error like : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange SinglePartition +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#26596L]) +- *HashAggregate(keys=columns name Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree:column namet#14050 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) at scala.collection.immutable.Stream.foreach(Stream.scala:595) at scala.collection.TraversableOnce$class.count(TraversableOnce.scala:115) at scala.collection.AbstractTraversable.count(Traversable.scala:104) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:312) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:702) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:156) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36) Caused by: java.lang.RuntimeException: Couldn't find here one name of column following with at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
[jira] [Created] (SPARK-31774) getting the Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, at org.apache.spark.sql.catalyst.errors.package$.attachTree(pac
Pankaj Tiwari created SPARK-31774: - Summary: getting the Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) Key: SPARK-31774 URL: https://issues.apache.org/jira/browse/SPARK-31774 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0 Environment: spark 2.2 Reporter: Pankaj Tiwari Actually I am loading the excel which has some 90 columns and the some columns name contains special character as well like @ % -> . etc etc so while I am doing one use case like : sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq))); this is working fine but as soon as I am running sourceDataSet.select(columnSeq).except(targetDataset.select(columnSeq)).count() it is failing with error like : org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree: Exchange SinglePartition +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#26596L]) +- *HashAggregate(keys=columns name Caused by: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree:column namet#14050 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266) at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256) at 
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703) at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$40.apply(HashAggregateExec.scala:703) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:418) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1233) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1223) at scala.collection.immutable.Stream.foreach(Stream.scala:595) at scala.collection.TraversableOnce$class.count(TraversableOnce.scala:115) at scala.collection.AbstractTraversable.count(Traversable.scala:104) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateUnsafeProjection$.createCode(GenerateUnsafeProjection.scala:312) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsumeWithKeys(HashAggregateExec.scala:702) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.doConsume(HashAggregateExec.scala:156) at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155) at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36) Caused by: java.lang.RuntimeException: Couldn't find here one name of column following with at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at 
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
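[Editor's note] A common workaround for the reports above, where column names contain characters like @, %, or ->, is to normalize the names before running except()/count(). A hedged pure-Python sketch of such a renaming pass (this is a user-side workaround assumption, not a fix to Spark's attribute binding; in Spark one would then apply the cleaned names via toDF or withColumnRenamed):

```python
# Illustrative helper: replace any non-identifier character with an
# underscore so downstream set operations see plain column names.
# A workaround sketch only, not Spark code.
import re

def sanitize_column(name: str) -> str:
    return re.sub(r"[^0-9A-Za-z_]", "_", name)

cols = ["price@usd", "growth%", "a->b"]
safe = [sanitize_column(c) for c in cols]
```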
[jira] [Created] (SPARK-31772) Json schema reading is not consistent between int and string types
yaniv oren created SPARK-31772: -- Summary: Json schema reading is not consistent between int and string types Key: SPARK-31772 URL: https://issues.apache.org/jira/browse/SPARK-31772 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: yaniv oren When reading a JSON file with an explicit schema, an int value is converted to string when the schema field is a string, but a string value is not converted to int when the schema field is an int. Sample code:

read_schema = StructType([StructField("a", IntegerType()), StructField("b", StringType())])
df = self.spark_session.read.schema(read_schema).json("input/json/temp_test")
df.show()

Contents of input/json/temp_test:
{"a": 1,"b": "b1"}
{"a": 2,"b": "b2"}
{"a": 3,"b": 3}
{"a": "4","b": 4}

actual:
+----+----+
|   a|   b|
+----+----+
|   1|  b1|
|   2|  b2|
|   3|   3|
|null|null|
+----+----+

expected: the third row should be nulled like the fourth row, since its b is an int in the data while the schema declares it a string.
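[Editor's note] The asymmetry reported here comes down to coercion rules: any JSON scalar can be rendered as a string, but a string does not convert to an int, and a failed conversion nulls the row. A small pure-Python sketch of that rule (illustrative assumption about the behavior, not Spark's actual JSON parser):

```python
# Sketch of the asymmetric coercion described in the report:
# - target string: any scalar is stringified, so 3 -> "3" succeeds
# - target int: only values already ints survive; "4" fails, and the
#   failed conversion nulls the whole row (permissive-style behavior).
# Not Spark's parser; a toy model of the reported semantics.
import json

def apply_schema(record: dict, schema: dict):
    out = {}
    for col, typ in schema.items():
        v = record.get(col)
        if typ == "string":
            out[col] = str(v)            # always succeeds
        elif typ == "int":
            if isinstance(v, int):
                out[col] = v
            else:
                return {c: None for c in schema}   # null the row
    return out

schema = {"a": "int", "b": "string"}
rows = [json.loads(s) for s in (
    '{"a": 1, "b": "b1"}', '{"a": 3, "b": 3}', '{"a": "4", "b": 4}')]
parsed = [apply_schema(r, schema) for r in rows]
```

Under this model row 3 yields (3, "3") while row 4 yields (null, null), matching the observed output; the reporter's expectation would require string-to-int to be rejected symmetrically with int-to-string.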
[jira] [Comment Edited] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112368#comment-17112368 ] Sandeep Katta edited comment on SPARK-31761 at 5/20/20, 3:46 PM: - [~sowen] [~hyukjin.kwon] [~dongjoon] I have executed the same query in spark-2.4.4 it works as per expectation. As you can see from 2.4.4 plan, columns are casted to double, so there won't be *Integer overflow.* == Parsed Logical Plan == 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4|#4] +- Relation[col0#0,col1#1|#0,col1#1] csv == Analyzed Logical Plan == CAST((col0 / col1) AS BIGINT): bigint Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1|#0,col1#1] csv == Optimized Logical Plan == Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1|#0,col1#1] csv == Physical Plan == *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv], PartitionFilters: 
[], PushedFilters: [], ReadSchema: struct *Spark-3.0 Plan* == Parsed Logical Plan == 'Project [('col0 div 'col1) AS (col0 div col1)#4|#4] +- RelationV2[col0#0, col1#1|#0, col1#1] csv [file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv] == Analyzed Logical Plan == (col0 div col1): int Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div col1)#4] +- RelationV2[col0#0, col1#1|#0, col1#1] csv [file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv] == Optimized Logical Plan == Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div col1)#4] +- RelationV2[col0#0, col1#1|#0, col1#1] csv [file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv] == Physical Plan == *(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div col1)#4] +- BatchScan[col0#0, col1#1|#0, col1#1] CSVScan Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv], ReadSchema: struct In Spark3 do I need to cast the columns as in spark-2.4, or user should manually add cast to their query as per below example val schema = "col0 int,col1 int"; val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv"); val res = df.selectExpr("col0 div col1") val res = df.selectExpr("Cast(col0 as Decimal) div col1 ") res.collect please let us know your opinion was (Author: sandeep.katta2007): [~sowen] [~hyukjin.kwon] I have executed the same query in spark-2.4.4 it works as per expectation. 
As you can see from 2.4.4 plan, columns are casted to double, so there won't be *Integer overflow.* == Parsed Logical Plan == 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4|#4] +- Relation[col0#0,col1#1|#0,col1#1] csv == Analyzed Logical Plan == CAST((col0 / col1) AS BIGINT): bigint Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1|#0,col1#1] csv == Optimized Logical Plan == Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1|#0,col1#1] csv == Physical Plan == *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(1) Project
[jira] [Comment Edited] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112368#comment-17112368 ] Sandeep Katta edited comment on SPARK-31761 at 5/20/20, 3:44 PM: - [~sowen] [~hyukjin.kwon] I have executed the same query in spark-2.4.4 it works as per expectation. As you can see from 2.4.4 plan, columns are casted to double, so there won't be *Integer overflow.* == Parsed Logical Plan == 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4|#4] +- Relation[col0#0,col1#1|#0,col1#1] csv == Analyzed Logical Plan == CAST((col0 / col1) AS BIGINT): bigint Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1|#0,col1#1] csv == Optimized Logical Plan == Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1|#0,col1#1] csv == Physical Plan == *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L|#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1|#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv], PartitionFilters: [], 
PushedFilters: [], ReadSchema: struct *Spark-3.0 Plan* == Parsed Logical Plan == 'Project [('col0 div 'col1) AS (col0 div col1)#4|#4] +- RelationV2[col0#0, col1#1|#0, col1#1] csv [file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv] == Analyzed Logical Plan == (col0 div col1): int Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div col1)#4] +- RelationV2[col0#0, col1#1|#0, col1#1] csv [file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv] == Optimized Logical Plan == Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div col1)#4] +- RelationV2[col0#0, col1#1|#0, col1#1] csv [file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv] == Physical Plan == *(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4|#0 div col1#1) AS (col0 div col1)#4] +- BatchScan[col0#0, col1#1|#0, col1#1] CSVScan Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv|file:///opt/fordebug/divTest.csv], ReadSchema: struct In Spark3 do I need to cast the columns as in spark-2.4, or user should manually add cast to their query as per below example val schema = "col0 int,col1 int"; val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv"); val res = df.selectExpr("col0 div col1") val res = df.selectExpr("Cast(col0 as Decimal) div col1 ") res.collect please let us know your opinion was (Author: sandeep.katta2007): [~sowen] [~hyukjin.kwon] I have executed the same query in spark-2.4.4 it works as per expectation. 
As you can see from 2.4.4 plan, columns are casted to double, so there won't be *Integer overflow.* == Parsed Logical Plan == 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4] +- Relation[col0#0,col1#1] csv == Analyzed Logical Plan == CAST((col0 / col1) AS BIGINT): bigint Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1] csv == Optimized Logical Plan == Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1] csv == Physical Plan == *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct Spark-3.0 == Parsed Logical Plan == 'Project [('col0 div 'col1) AS (col0
[jira] [Commented] (SPARK-31761) Sql Div operator can result in incorrect output for int_min
[ https://issues.apache.org/jira/browse/SPARK-31761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112368#comment-17112368 ] Sandeep Katta commented on SPARK-31761: --- [~sowen] [~hyukjin.kwon] I have executed the same query in spark-2.4.4 it works as per expectation. As you can see from 2.4.4 plan, columns are casted to double, so there won't be *Integer overflow.* == Parsed Logical Plan == 'Project [cast(('col0 / 'col1) as bigint) AS CAST((col0 / col1) AS BIGINT)#4] +- Relation[col0#0,col1#1] csv == Analyzed Logical Plan == CAST((col0 / col1) AS BIGINT): bigint Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1] csv == Optimized Logical Plan == Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- Relation[col0#0,col1#1] csv == Physical Plan == *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct *(1) Project [cast((cast(col0#0 as double) / cast(col1#1 as double)) as bigint) AS CAST((col0 / col1) AS BIGINT)#4L] +- *(1) FileScan csv [col0#0,col1#1] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], PartitionFilters: [], PushedFilters: [], ReadSchema: struct Spark-3.0 == Parsed Logical Plan == 'Project [('col0 div 'col1) AS (col0 div col1)#4] +- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv == Analyzed Logical Plan == (col0 div col1): int Project [(col0#0 div col1#1) AS (col0 div col1)#4] +- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv == Optimized Logical Plan == Project [(col0#0 div col1#1) AS (col0 div col1)#4] +- RelationV2[col0#0, col1#1] csv file:/opt/fordebug/divTest.csv 
== Physical Plan == *(1) Project [(col0#0 div col1#1) AS (col0 div col1)#4] +- BatchScan[col0#0, col1#1] CSVScan Location: InMemoryFileIndex[file:/opt/fordebug/divTest.csv], ReadSchema: struct In Spark3 do I need to cast the columns as in spark-2.4, or user should manually add cast to their query as per below example val schema = "col0 int,col1 int"; val df = spark.read.schema(schema).csv("file:/opt/fordebug/divTest.csv"); val res = df.selectExpr("col0 div col1") val res = df.selectExpr("Cast(col0 as Decimal) div col1 ") res.collect please let us know your opinion > Sql Div operator can result in incorrect output for int_min > --- > > Key: SPARK-31761 > URL: https://issues.apache.org/jira/browse/SPARK-31761 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > > Input in csv : -2147483648,-1 --> (_c0, _c1) > {code} > val res = df.selectExpr("_c0 div _c1") > res.collect > res1: Array[org.apache.spark.sql.Row] = Array([-2147483648]) > {code} > The result should be 2147483648 instead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
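[Editor's note] The incorrect result in SPARK-31761 is 32-bit two's-complement overflow: INT_MIN div -1 is mathematically 2^31, which does not fit in a signed 32-bit int and wraps back to INT_MIN. A pure-Python sketch of JVM-style truncating division with the wrap simulated explicitly (Python ints are arbitrary precision, so the wrap must be modeled by hand):

```python
# Truncating (toward-zero) integer division as the JVM performs it,
# followed by an explicit wrap to signed 32 bits to model the overflow.
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def jvm_int_div(a: int, b: int) -> int:
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    q &= 0xFFFFFFFF                        # keep only the low 32 bits
    return q - 2**32 if q > INT_MAX else q # reinterpret as signed

# The case from the report: mathematically 2147483648, but the 32-bit
# result wraps back to INT_MIN.
wrapped = jvm_int_div(INT_MIN, -1)

# Widening first (to 64-bit/bigint, double, or decimal, as the 2.4 plan's
# casts and the suggested "Cast(col0 as Decimal) div col1" do) avoids it.
widened = (-2147483648) // -1
```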
[jira] [Comment Edited] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112359#comment-17112359 ] Puviarasu edited comment on SPARK-31754 at 5/20/20, 3:36 PM: - Hello [~kabhwan] , Please find the comments below in *bold*. # Is it always reproducible under the same input & checkpoint (start from initial checkpoint or specific checkpoint)? *Same Checkpoint: With the same checkpoint and input, the application fails with the same exception at the same offset.* *New Checkpoint: We also tested clearing the checkpoint with the same input. In this case (after clearing the checkpoint) the exception didn't happen for that particular input.* # Could you share the query plan (logical/physical)? Query plan from previous batch would be OK. *Sure. Please find the attachment [^Logical-Plan.txt]* # Could you try it out with a recent version like 2.4.5 or 3.0.0-preview2 so that we can avoid investigating an issue which might already be resolved? *For this we might need some more time, as we need some changes to be made in our cluster settings. Kindly bear with us on the delay.* Thank you. was (Author: puviarasu): Hello [~kabhwan] , Please find below the comments in *bold*. # Is it always reproducible under the same input & checkpoint (start from initial checkpoint or specific checkpoint)? *With the same checkpoint and input the application is failing with the same exception in the same offset. Also we tested clearing the checkpoint with the same input. In this case exception dint happen for that particular input.* # Could you share the query plan (logical/physical)? Query plan from previous batch would be OK. *Sure. Please find the attachment [^Logical-Plan.txt]* # Could you try it out with recent version like 2.4.5 or 3.0.0-preview2 so that we can avoid investigating issue which might be already resolved? *For this we might need some more time as we need some changes to be done in our cluster settings. 
Kindly bear us with the delay.* Thank you. > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > Attachments: CodeGen.txt, Logical-Plan.txt > > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. > *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b 
on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. > Most recent failure: > Lost
[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112359#comment-17112359 ] Puviarasu commented on SPARK-31754: --- Hello [~kabhwan] , Please find the comments below in *bold*. # Is it always reproducible under the same input & checkpoint (start from initial checkpoint or specific checkpoint)? *With the same checkpoint and input, the application fails with the same exception at the same offset. We also tested clearing the checkpoint with the same input. In this case the exception didn't happen for that particular input.* # Could you share the query plan (logical/physical)? Query plan from previous batch would be OK. *Sure. Please find the attachment [^Logical-Plan.txt]* # Could you try it out with a recent version like 2.4.5 or 3.0.0-preview2 so that we can avoid investigating an issue which might already be resolved? *For this we might need some more time, as we need some changes to be made in our cluster settings. Kindly bear with us on the delay.* Thank you. > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > Attachments: CodeGen.txt, Logical-Plan.txt > > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. 
> *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at
[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puviarasu updated SPARK-31754: -- Attachment: Logical-Plan.txt > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > Attachments: CodeGen.txt, Logical-Plan.txt > > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. 
> *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) > at >
[jira] [Updated] (SPARK-31766) Add Spark version prefix to K8s UUID test image tag
[ https://issues.apache.org/jira/browse/SPARK-31766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31766: -- Fix Version/s: (was: 3.1.0) 3.0.0 > Add Spark version prefix to K8s UUID test image tag > --- > > Key: SPARK-31766 > URL: https://issues.apache.org/jira/browse/SPARK-31766 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31766) Add Spark version prefix to K8s UUID test image tag
[ https://issues.apache.org/jira/browse/SPARK-31766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31766: -- Affects Version/s: (was: 3.1.0) 3.0.0 > Add Spark version prefix to K8s UUID test image tag > --- > > Key: SPARK-31766 > URL: https://issues.apache.org/jira/browse/SPARK-31766 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112339#comment-17112339 ] Agrim Bansal commented on SPARK-12312: -- Are we not solving this for the Hive database? > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to lack of Kerberos ticket or ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environment where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied
[ https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112245#comment-17112245 ] Thomas Graves commented on SPARK-31437: --- Originally, when I thought about this briefly, I was thinking of adding something to ResourceProfile similar to what you suggest: an option as to whether to always create new executors using the ExecutorResourceRequests specified in the ResourceProfile, or to try to reuse existing executors that fit. I think what you are proposing is along those lines. Now there are a bunch of edge cases here, like what if there aren't any existing executors that it would fit in. That is a good reason to keep the ExecutorResourceRequirements in the Profile and make the user specify them, even if it might fit another one. Another option: I was wanting to give resource profiles names. You could potentially use that, or perhaps give the ExecutorResourceRequirements names as well and let the user optionally specify the names to use if matching executors already exist. >> We don't have to change any existing behaviour for cases like this. Just >> boot up new executors. Note it looks like I had a typo: the 4 cpus should be 4 gpus. I guess that is ok if you are using an exact-match option for the ExecutorResourceRequests. Otherwise, if it's just a fits option, it would be harder to control. Let's say you have 2 active profiles, one with 8 cores and 4 gpus and one with 8 cores, and then you have a third where you want to start something that uses 2 cores. You tell the third to use existing executors, but that means it could use the ones with 8 cores and 4 gpus and waste the gpus. If you tell it not to reuse executors, then it would boot up more and not do what you really want. This might be where something like the names could come in as include/exclude lists, but then that gets more complicated for the user as well. 
So overall, if we just do exact matching for the ExecutorResourceRequests, that could be fairly straightforward. The tracking in the allocation manager could just change from per ResourceProfile to per ExecutorResourceRequests, and I think the scheduler could do something similar. I would just want to make sure whatever end-user API we come up with is extensible to the other cases where an executor merely needs to fit. > Try assigning tasks to existing executors by which required resources in > ResourceProfile are satisfied > -- > > Key: SPARK-31437 > URL: https://issues.apache.org/jira/browse/SPARK-31437 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 3.1.0 >Reporter: Hongze Zhang >Priority: Major > > By the change in [PR|https://github.com/apache/spark/pull/27773] of > SPARK-29154, submitted tasks are scheduled onto executors only if resource > profile IDs strictly match. As a result Spark always starts new executors for > customized ResourceProfiles. > This limitation makes working with process-local jobs unfriendly. E.g. Task > cores has been increased from 1 to 4 in a new stage, and executor has 8 > slots, it is expected that 2 new tasks can be run on the existing executor > but Spark starts new executors for new ResourceProfile. The behavior is > unnecessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
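The exact-match versus fits trade-off discussed above can be sketched with a toy model (illustrative only: the dictionaries and the `exact_match`/`fits` helpers below are invented for this sketch and are not part of any Spark API):

```python
# Toy model of the two executor-reuse policies discussed above.
# Resources are plain dicts mapping resource name -> amount.

def exact_match(executor, requests):
    """Reuse an executor only when its resources equal the requests exactly."""
    return executor == requests

def fits(executor, requests):
    """Reuse an executor when it has at least the requested amount of each resource."""
    return all(executor.get(name, 0) >= amount for name, amount in requests.items())

ml_executor = {"cores": 8, "gpus": 4}   # profile 1: 8 cores + 4 gpus
etl_executor = {"cores": 8}             # profile 2: 8 cores only
small_request = {"cores": 2}            # profile 3: only needs 2 cores

# Under "fits", the small profile could land on the GPU executor and
# waste its gpus -- the control problem described in the comment above.
print(fits(ml_executor, small_request))          # True
print(exact_match(etl_executor, small_request))  # False
```

Under an exact-match policy the third profile always boots new executors; an include/exclude list keyed by profile name, as floated above, would sit between these two extremes.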
[jira] [Resolved] (SPARK-31759) Support configurable max number of rotate logs for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-31759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31759. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28580 [https://github.com/apache/spark/pull/28580] > Support configurable max number of rotate logs for spark daemons > > > Key: SPARK-31759 > URL: https://issues.apache.org/jira/browse/SPARK-31759 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core, Spark Submit >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.1.0 > > > in `spark-daemon.sh`, `spark_rotate_log` accepts $2 as a custom setting for > the number of max rotate log files, but this part of code is actually never > used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
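For context, the helper in question could look roughly like the following sketch (hedged: this mirrors the shape of `spark_rotate_log` in `spark-daemon.sh`, with `$2` as the now-honored max-rotate-files knob, but the real script may differ in details):

```shell
# Sketch of a log-rotation helper where $2 caps the number of kept files.
spark_rotate_log () {
  log="$1"
  num=${2:-5}                       # $2: max number of rotated logs, default 5
  if [ -f "$log" ]; then            # rotate only if the live log exists
    while [ "$num" -gt 1 ]; do
      prev=$((num - 1))
      if [ -f "$log.$prev" ]; then
        mv "$log.$prev" "$log.$num" # shift each older log up by one slot
      fi
      num=$prev
    done
    mv "$log" "$log.$num"           # the live log becomes .1
  fi
}
```

Calling e.g. `spark_rotate_log /var/log/spark/master.out 3` before starting a daemon would then keep at most three rotated files, with the oldest silently dropped on the next rotation.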
[jira] [Assigned] (SPARK-31759) Support configurable max number of rotate logs for spark daemons
[ https://issues.apache.org/jira/browse/SPARK-31759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31759: Assignee: Kent Yao > Support configurable max number of rotate logs for spark daemons > > > Key: SPARK-31759 > URL: https://issues.apache.org/jira/browse/SPARK-31759 > Project: Spark > Issue Type: Bug > Components: Deploy, Spark Core, Spark Submit >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > > in `spark-daemon.sh`, `spark_rotate_log` accepts $2 as a custom setting for > the number of max rotate log files, but this part of code is actually never > used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17112021#comment-17112021 ] Apache Spark commented on SPARK-31710: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28593 > result is the not the same when query and execute jobs > -- > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd > hh:mm:ss[.fffffffff] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.<init>(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at 
org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I tried hive, it works well, and the conversion is fine and correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this conversion, should we keep all cases the same? > q2: > if we allow the conversion in some cases, should we decide based on the long's > length? The code seems to force the conversion to us with times*1000000 no matter how long > the data is; if it converts to a timestamp with an incorrect length, we can raise > an error. > {code:java} > // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
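The magnitude mismatch in this report can be checked outside Spark (a plain-Python sketch, assuming the cast treats the long as epoch seconds; 31,556,952 is the mean Gregorian year in seconds):

```python
from datetime import datetime, timezone

raw = 1586318188000  # the inserted value: epoch *milliseconds* for 2020-04-08

# Interpreted as milliseconds, as the reporter intended:
as_ms = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)
print(as_ms.date())  # 2020-04-08

# Interpreted as seconds, as a seconds->microseconds cast would do: the
# resulting year is too large for datetime, so approximate it arithmetically.
approx_year = 1970 + raw / 31_556_952
print(round(approx_year))  # 52238, matching the "52238-06-04" shown in Case 1
```

So the "wrong" result in Case 1 is the millisecond value read as seconds, landing roughly 50,000 years in the future.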
[jira] [Assigned] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
[ https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31771: Assignee: Apache Spark > Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' > - > > Key: SPARK-31771 > URL: https://issues.apache.org/jira/browse/SPARK-31771 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > Five continuous pattern characters with 'G/M/L/E/u/Q/q' means Narrow-Text > Style in java.time.DateTimeFormatterBuilder which output the leading single > letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they > means Full-Text Style. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
[ https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111985#comment-17111985 ] Apache Spark commented on SPARK-31771: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28592 > Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' > - > > Key: SPARK-31771 > URL: https://issues.apache.org/jira/browse/SPARK-31771 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Five continuous pattern characters with 'G/M/L/E/u/Q/q' means Narrow-Text > Style in java.time.DateTimeFormatterBuilder which output the leading single > letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they > means Full-Text Style. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
[ https://issues.apache.org/jira/browse/SPARK-31771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31771: Assignee: (was: Apache Spark) > Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' > - > > Key: SPARK-31771 > URL: https://issues.apache.org/jira/browse/SPARK-31771 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Five continuous pattern characters with 'G/M/L/E/u/Q/q' means Narrow-Text > Style in java.time.DateTimeFormatterBuilder which output the leading single > letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they > means Full-Text Style. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied
[ https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084552#comment-17084552 ] Hongze Zhang edited comment on SPARK-31437 at 5/20/20, 9:14 AM: (edited) Thanks [~tgraves]. I got your point of making them tied. Actually I was thinking of something like this: 1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile; 2. ResourceProfile still contains both resource requirement of executor and task; 3. ExecutorResourceProfile only includes executor's resource req; 4. ExecutorResourceProfile is required to allocate new executor instances from scheduler backend; 5. Similar to current solution, user specifies ResourceProfile for RDD, then tasks are scheduled onto executors that are allocated using ExecutorResourceProfile; 6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected within one of several strategies; Strategies types: s1. Always creates new ExecutorResourceProfile; s2. If executor resource requirement in ResourceProfile meets existing ExecutorResourceProfile ("meets" may mean exactly matches/equals), use the existing one; s3. ... bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus. How do I keep my etl tasks from running on the ML executors without wasting resources? We don't have to change any existing behaviour for cases like this. Just boot up new executors. For now the major problem is that even ResourceProfile.executorResources is not changed in a new ResourceProfile (e.g. task.cpu changed from 1 to 2), we still have to shut down the old executors then start new ones, right? This is the thing to be optimized. was (Author: zhztheplayer): Thanks [~tgraves]. I got your point of making them tied. Actually I was thinking of something like this: 1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile; 2. ResourceProfile still contains both resource requirement of executor and task; 3. 
ExecutorResourceProfile only includes executor's resource req; 4. ExecutorResourceProfile is required to allocate new executor instances from scheduler backend; 5. Similar to current solution, user specifies ResourceProfile for RDD, then tasks are scheduled onto executors that are allocated using ExecutorResourceProfile; 6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected within one of several strategies; Strategies types: s1. Always creates new ExecutorResourceProfile; s2. If executor resource requirement in ResourceProfile meets existing ExecutorResourceProfile, use the existing one; s3. ... bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus. How do I keep my etl tasks from running on the ML executors without wasting resources? By just using strategy s1, everything should work as current implementation. > Try assigning tasks to existing executors by which required resources in > ResourceProfile are satisfied > -- > > Key: SPARK-31437 > URL: https://issues.apache.org/jira/browse/SPARK-31437 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core >Affects Versions: 3.1.0 >Reporter: Hongze Zhang >Priority: Major > > By the change in [PR|https://github.com/apache/spark/pull/27773] of > SPARK-29154, submitted tasks are scheduled onto executors only if resource > profile IDs strictly match. As a result Spark always starts new executors for > customized ResourceProfiles. > This limitation makes working with process-local jobs unfriendly. E.g. Task > cores has been increased from 1 to 4 in a new stage, and executor has 8 > slots, it is expected that 2 new tasks can be run on the existing executor > but Spark starts new executors for new ResourceProfile. The behavior is > unnecessary. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31771) Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
Kent Yao created SPARK-31771: Summary: Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q' Key: SPARK-31771 URL: https://issues.apache.org/jira/browse/SPARK-31771 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao Five continuous pattern characters with 'G/M/L/E/u/Q/q' means Narrow-Text Style in java.time.DateTimeFormatterBuilder which output the leading single letter of the value, e.g. `December` would be `D`, while in Spark 2.4 they means Full-Text Style. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
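The Narrow-Text behavior can be demonstrated directly against `java.time` (a standalone sketch; the class name `NarrowStyleDemo` is ours):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class NarrowStyleDemo {
    public static void main(String[] args) {
        LocalDate d = LocalDate.of(2019, 12, 25);
        // Four pattern letters -> Full-Text style (what Spark 2.4 produced
        // even for five letters)
        String full = d.format(DateTimeFormatter.ofPattern("MMMM", Locale.US));
        // Five pattern letters -> Narrow-Text style in
        // java.time.DateTimeFormatterBuilder: only the leading letter
        String narrow = d.format(DateTimeFormatter.ofPattern("MMMMM", Locale.US));
        System.out.println(full + " -> " + narrow);  // December -> D
    }
}
```

The same collapse to a single letter happens for the other affected pattern letters (G/M/L/E/u/Q/q) when repeated five times.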
[jira] [Created] (SPARK-31770) spark double count issue when table is recreated from hive
Nithin created SPARK-31770: -- Summary: spark double count issue when table is recreated from hive Key: SPARK-31770 URL: https://issues.apache.org/jira/browse/SPARK-31770 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.3.2 Reporter: Nithin Spark sets a serde property named path when a table is created. Along with it, it adds a property called sql.source.provider that ensures consistency. But if we recreate the same table using Hive, Hive will not respect the serde properties set by Spark, including sql.source.provider. This leads to a conflict if we read the newly created table using Spark. It can be argued that the issue is with Hive, because it does not properly ensure that the table properties are copied. But these are properties set by Spark on a Hive table, and Spark should be the one adhering to Hive's standards. From a corporate standpoint, we cannot tell users to avoid creating or altering tables in Hive because Spark will end up giving wrong results. *Worst of all, if it at least failed that would be okay, but it simply gives a double count without even a failure message, and that is very dangerous for critical applications*. Either give an error message if the serde properties contain path but not sql.source.provider, or do not create the path property in the first place. A few related issues: # SPARK-31751 # SPARK-28266 was resolved as non-reproducible, which is not entirely true *Steps to reproduce:* {code:java} df = spark.createDataFrame([{"a": "x", "b": "y", "c": "3"}]) df.createOrReplaceTempView("test1") spark.sql("CREATE TABLE test1_using_orc USING ORC AS (SELECT * from test1)") {code} Run the below from Hive (preferably Tez): {code:java} create table test2_like_test1 like test1_using_orc; insert into test2_like_test1 select * from test1_using_orc;{code} ---from Spark again: Now test2_like_test1 will have a serde property called path, but will not have sql.source.provider. 
{code:java}
> >>> spark.sql("set spark.sql.hive.convertMetastoreOrc=false").show()
> +--------------------+-----+
> |                 key|value|
> +--------------------+-----+
> |spark.sql.hive.co...|false|
> +--------------------+-----+
> >>> spark.sql("select count(*) from test2_like_test1").show()
> +--------+
> |count(1)|
> +--------+
> |       1|
> +--------+
> >>> spark.sql("set spark.sql.hive.convertMetastoreOrc=true").show()
> +--------------------+-----+
> |                 key|value|
> +--------------------+-----+
> |spark.sql.hive.co...| true|
> +--------------------+-----+
> >>> spark.sql("select count(*) from test2_like_test1").show()
> +--------+
> |count(1)|
> +--------+
> |       2|
> +--------+
{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31769) Add support of MDC in spark driver logs.
[ https://issues.apache.org/jira/browse/SPARK-31769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111955#comment-17111955 ] Izek Greenfield commented on SPARK-31769: - Hi [~cloud_fan] Following previous discussions on [PR-26624|https://github.com/apache/spark/pull/26624], I'd like us to agree on the necessity of this feature, after which I'll create a dedicated PR. What are your thoughts? > Add support of MDC in spark driver logs. > > > Key: SPARK-31769 > URL: https://issues.apache.org/jira/browse/SPARK-31769 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Izek Greenfield >Priority: Major > > In order to align driver logs with spark executor logs, add support to log > MDC state for properties that are shared between the driver and the > executors (e.g. application name, task id) > The use case is applicable in 2 cases: > # streaming > # running spark under job server (in this case you have a long-running spark > context that handles many tasks from different sources). > See also [SPARK-8981|https://issues.apache.org/jira/browse/SPARK-8981] that > handles the MDC in executor context. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31769) Add support of MDC in spark driver logs.
Izek Greenfield created SPARK-31769: --- Summary: Add support of MDC in spark driver logs. Key: SPARK-31769 URL: https://issues.apache.org/jira/browse/SPARK-31769 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Izek Greenfield In order to align driver logs with spark executor logs, add support to log MDC state for properties that are shared between the driver and the executors (e.g. application name, task id) The use case is applicable in 2 cases: # streaming # running spark under job server (in this case you have a long-running spark context that handles many tasks from different sources). See also [SPARK-8981|https://issues.apache.org/jira/browse/SPARK-8981] that handles the MDC in executor context. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
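On the JVM this is what log4j's MDC provides. As a rough stdlib-only illustration of the same idea (not Spark code; the `appName`/`taskId` keys are invented for the example), the pattern of stamping per-context fields onto every log record looks like:

```python
# Analogous sketch of MDC using Python's stdlib logging:
# a contextvar holds the diagnostic context, a Filter copies it onto records.
import contextvars
import io
import logging

mdc = contextvars.ContextVar("mdc", default={})

class MdcFilter(logging.Filter):
    def filter(self, record):
        # Copy the per-context fields onto every log record so the
        # formatter can reference them like built-in attributes.
        for key, value in mdc.get().items():
            setattr(record, key, value)
        return True

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(appName)s %(taskId)s %(message)s"))
logger = logging.getLogger("driver")
logger.addHandler(handler)
logger.addFilter(MdcFilter())
logger.setLevel(logging.INFO)

mdc.set({"appName": "my-app", "taskId": "task-7"})
logger.info("stage finished")
print(stream.getvalue().strip())  # my-app task-7 stage finished
```

The Spark proposal is the log4j equivalent: populate the MDC on the driver with the same keys the executors already stamp (see SPARK-8981), so driver and executor log lines carry matching context.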
[jira] [Comment Edited] (SPARK-31437) Try assigning tasks to existing executors by which required resources in ResourceProfile are satisfied
[ https://issues.apache.org/jira/browse/SPARK-31437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084552#comment-17084552 ] Hongze Zhang edited comment on SPARK-31437 at 5/20/20, 8:43 AM:
Thanks [~tgraves]. I get your point about making them tied. Actually I was thinking of something like this:
1. Break ResourceProfile up into ExecutorResourceProfile and ResourceProfile;
2. ResourceProfile still contains the resource requirements of both executor and task;
3. ExecutorResourceProfile includes only the executor's resource requirements;
4. An ExecutorResourceProfile is required to allocate new executor instances from the scheduler backend;
5. Similar to the current solution, the user specifies a ResourceProfile for an RDD, then tasks are scheduled onto executors that were allocated using an ExecutorResourceProfile;
6. Each time a ResourceProfile comes in, an ExecutorResourceProfile is created/selected via one of several strategies.
Strategy types:
s1. Always create a new ExecutorResourceProfile;
s2. If the executor resource requirements in the ResourceProfile are met by an existing ExecutorResourceProfile, use the existing one;
s3. ...
bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus. How do I keep my etl tasks from running on the ML executors without wasting resources?
By just using strategy s1, everything should work as in the current implementation.
was (Author: zhztheplayer): Thanks [~tgraves]. I got your point of making them tied. Actually I was thinking of something like this: 1. to break ResourceProfile up to ExecutorResourceProfile and ResourceProfile; 2. ResourceProfile still contains both resource requirement of executor and task; 3. ExecutorResourceProfile only includes executor's resource req; 4. ExecutorResourceProfile is required to allocate new executor instances from scheduler backend; 5. Similar to current solution, user specifies ResourceProfile for RDD, then tasks are scheduled onto executors that are allocated using ExecutorResourceProfile; 6. Each time ResourceProfile comes, ExecutorResourceProfile is created/selected within one of several strategies; Strategies types: s1. Always creates new ExecutorSpec; s2. If executor resource requirement in ResourceProfile meets existing ExecutorSpec, use the existing one; s3. ... bq. My etl tasks uses 8 cores, my ml tasks use 8 cores and 4 cpus. How do I keep my etl tasks from running on the ML executors without wasting resources? By just using strategy s1, everything should work as current implementation.
> Try assigning tasks to existing executors by which required resources in
> ResourceProfile are satisfied
> --
>
> Key: SPARK-31437
> URL: https://issues.apache.org/jira/browse/SPARK-31437
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler, Spark Core
> Affects Versions: 3.1.0
> Reporter: Hongze Zhang
> Priority: Major
>
> By the change in [PR|https://github.com/apache/spark/pull/27773] of SPARK-29154, submitted tasks are scheduled onto executors only if resource profile IDs strictly match. As a result, Spark always starts new executors for customized ResourceProfiles.
> This limitation makes working with process-local jobs unfriendly. E.g. task cores have been increased from 1 to 4 in a new stage and the executor has 8 slots; it is expected that 2 new tasks can run on the existing executor, but Spark starts new executors for the new ResourceProfile. This behavior is unnecessary.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
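[Editor's note] The strategy selection discussed in the comment above (s1 vs. s2) can be sketched in a few lines. This is a hypothetical Python model for illustration only, not Spark's actual API: `ExecutorResourceProfile`, `satisfies`, and `select_profile` are illustrative names introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass(frozen=True)
class ExecutorResourceProfile:
    """Executor-side requirements only (hypothetical split of ResourceProfile)."""
    resources: Dict[str, int] = field(default_factory=dict)  # e.g. {"cores": 8, "gpu": 4}

    def satisfies(self, required: Dict[str, int]) -> bool:
        # An existing profile can host the new workload only if it provides
        # at least the requested amount of every requested resource.
        return all(self.resources.get(name, 0) >= amount
                   for name, amount in required.items())

def select_profile(existing: List[ExecutorResourceProfile],
                   required: Dict[str, int],
                   always_new: bool = False) -> ExecutorResourceProfile:
    """always_new=True is strategy s1; otherwise try strategy s2 first."""
    if not always_new:
        for profile in existing:
            if profile.satisfies(required):
                return profile  # s2: reuse an existing satisfying profile
    return ExecutorResourceProfile(dict(required))  # s1: allocate a new one
```

With `always_new=True` this degenerates to the current behavior (a new executor profile per ResourceProfile), which is why strategy s1 preserves today's semantics.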
[jira] [Assigned] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-8981: -- Assignee: Izek Greenfield
> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Paweł Kopiczko
> Assignee: Izek Greenfield
> Priority: Minor
> Fix For: 3.1.0
>
> It would be nice to have, because it's good to have logs in one file when using log agents (like Logentries) in standalone mode. It also allows configuring a rolling file appender without a mess when multiple applications are running.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8981) Set applicationId and appName in log4j MDC
[ https://issues.apache.org/jira/browse/SPARK-8981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-8981. Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 26624 [https://github.com/apache/spark/pull/26624]
> Set applicationId and appName in log4j MDC
> --
>
> Key: SPARK-8981
> URL: https://issues.apache.org/jira/browse/SPARK-8981
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core
> Reporter: Paweł Kopiczko
> Priority: Minor
> Fix For: 3.1.0
>
> It would be nice to have, because it's good to have logs in one file when using log agents (like Logentries) in standalone mode. It also allows configuring a rolling file appender without a mess when multiple applications are running.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
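[Editor's note] In log4j, values placed in the MDC (Mapped Diagnostic Context) become available to any appender's layout via the %X conversion pattern, e.g. %X{appId} in a PatternLayout, which is what makes per-application log files and agent-friendly log lines possible. As a rough illustration of the same idea, here is a Python logging analog that injects the application identity into every record via a Filter; the field names `app_id`/`app_name` are assumptions for this sketch, not Spark configuration keys.

```python
import io
import logging

class AppContextFilter(logging.Filter):
    """Attach application identity to every LogRecord, analogous to
    putting applicationId/appName into the log4j MDC."""
    def __init__(self, app_id: str, app_name: str):
        super().__init__()
        self.app_id = app_id
        self.app_name = app_name

    def filter(self, record: logging.LogRecord) -> bool:
        record.app_id = self.app_id      # available as %(app_id)s in formats
        record.app_name = self.app_name  # available as %(app_name)s
        return True

def make_logger(app_id: str, app_name: str, stream) -> logging.Logger:
    """Build a logger whose output lines carry the application identity,
    like a log4j PatternLayout using %X{appId} %X{appName}."""
    logger = logging.getLogger(app_id)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(app_id)s %(app_name)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.addFilter(AppContextFilter(app_id, app_name))
    logger.setLevel(logging.INFO)
    return logger
```

Because the context rides on each record rather than on the message text, a rolling file appender (or a log agent) can route or tag lines per application without every call site repeating the identity.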
[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign keys
[ https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111866#comment-17111866 ] Takeshi Yamamuro commented on SPARK-21784: -- Not yet. IIUC there is no active work now on supporting constraints in Spark.
> Add ALTER TABLE ADD CONSTRAINT DDL to support defining primary key and foreign keys
> --
>
> Key: SPARK-21784
> URL: https://issues.apache.org/jira/browse/SPARK-21784
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Suresh Thalamati
> Priority: Major
>
> Currently Spark SQL does not have DDL support to define primary key and foreign key constraints. This Jira is to add DDL support to define primary key and foreign key informational constraints using ALTER TABLE syntax. These constraints will be used in query optimization; you can find more details about this in the spec in SPARK-19842.
> *Syntax:*
> {code}
> ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName]
> (PRIMARY KEY (col_names) |
> FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)])
> [VALIDATE | NOVALIDATE] [RELY | NORELY]
> {code}
> Examples:
> {code:sql}
> ALTER TABLE employee ADD CONSTRAINT pk PRIMARY KEY(empno) VALIDATE RELY
> ALTER TABLE department ADD CONSTRAINT emp_fk FOREIGN KEY (mgrno) REFERENCES employee(empno) NOVALIDATE NORELY
> {code}
> *Constraint name generated by the system:*
> {code:sql}
> ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY
> ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) VALIDATE RELY;
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30689) Allow custom resource scheduling to work with YARN versions that don't support custom resource scheduling
[ https://issues.apache.org/jira/browse/SPARK-30689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111814#comment-17111814 ] Apache Spark commented on SPARK-30689: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28591
> Allow custom resource scheduling to work with YARN versions that don't
> support custom resource scheduling
> --
>
> Key: SPARK-30689
> URL: https://issues.apache.org/jira/browse/SPARK-30689
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, YARN
> Affects Versions: 3.0.0
> Reporter: Thomas Graves
> Assignee: Thomas Graves
> Priority: Major
> Fix For: 3.0.0
>
> Many people/companies will not be moving to Hadoop 3.1 or greater (which supports custom resource scheduling for things like GPUs) any time soon, and have requested support for it in older Hadoop 2.x versions. This also means that they may not have isolation enabled, which is what the default behavior relies on.
> Right now the option is to write a custom discovery script and handle it on their own. This is OK but has some limitations, because the script runs as a separate process and is just a shell script.
> I think we can make this a lot more flexible by making the entire resource discovery class pluggable. The default one would stay as is and call the discovery script, but if an advanced user wanted to replace the entire thing they could implement a pluggable class with custom code for how to discover resource addresses.
> This will also help users who are running Hadoop 3.1.x or greater but don't have the resources configured or aren't running in an isolated environment.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
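[Editor's note] The "pluggable resource discovery class" idea in the description above can be sketched as a small interface with two implementations: the default script-based discovery and a custom in-process one. This is a hypothetical Python sketch, not Spark's actual plugin API; `ResourceDiscoveryPlugin`, `StaticDiscovery`, and the JSON shape are illustrative assumptions modeled on the ticket's description.

```python
import json
import subprocess
from abc import ABC, abstractmethod
from typing import Dict, List

class ResourceDiscoveryPlugin(ABC):
    """Hypothetical pluggable interface replacing the fixed discovery script."""
    @abstractmethod
    def discover(self, resource_name: str) -> Dict[str, object]:
        """Return a dict like {"name": "gpu", "addresses": ["0", "1"]}."""

class ScriptDiscovery(ResourceDiscoveryPlugin):
    """Default behavior: shell out to a user-provided discovery script
    that prints the resource info as JSON (a separate process)."""
    def __init__(self, script_path: str):
        self.script_path = script_path

    def discover(self, resource_name: str) -> Dict[str, object]:
        out = subprocess.check_output([self.script_path, resource_name])
        return json.loads(out)

class StaticDiscovery(ResourceDiscoveryPlugin):
    """Custom plugin: addresses known up front, so no script and no
    cluster-manager isolation support is needed."""
    def __init__(self, addresses: Dict[str, List[str]]):
        self.addresses = addresses

    def discover(self, resource_name: str) -> Dict[str, object]:
        return {"name": resource_name,
                "addresses": self.addresses.get(resource_name, [])}
```

The point of the abstraction is that the scheduler only depends on `discover()`, so an advanced user on Hadoop 2.x (or without isolation) can swap in an implementation that enumerates devices in-process instead of relying on the external shell script.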
[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue
[ https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111807#comment-17111807 ] Hyukjin Kwon commented on SPARK-31693: -- Awesome!!
> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 2.4.6, 3.0.0, 3.1.0
> Reporter: Dongjoon Hyun
> Assignee: Shane Knapp
> Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that there is a network issue in the AmpLab Jenkins cluster.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download from the maven mirror. (SPARK-31691) -> The primary host is okay.
> - The node failed to communicate with repository.apache.org. (Current master branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve remote metadata org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could not transfer metadata org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): Transfer failed for https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml: Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org