[jira] [Created] (SPARK-31697) HistoryServer should set Content-Type header
Kousuke Saruta created SPARK-31697:
--
Summary: HistoryServer should set Content-Type header
Key: SPARK-31697
URL: https://issues.apache.org/jira/browse/SPARK-31697
Project: Spark
Issue Type: Bug
Components: Web UI
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta

I noticed that we get HTML as plain text when we access a wrong URL on the HistoryServer.

{code:java}
setUIRoot('') Not Found 3.1.0-SNAPSHOT Not Found Application local-1589239 not found. {code}

The reason is that the Content-Type header is not set on the response:

{code:java}
HTTP/1.1 404 Not Found
Date: Wed, 13 May 2020 06:59:29 GMT
Cache-Control: no-cache, no-store, must-revalidate
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Content-Length: 1778
Server: Jetty(9.4.18.v20190429) {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
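The fix the issue calls for is to set the Content-Type header explicitly on the error response. As a hedged illustration (plain Python stdlib, not Spark/Jetty code; the handler class, path, and function name are invented for the sketch), a 404 handler that declares `text/html` lets clients render the error page as HTML instead of plain text, which matters especially under `X-Content-Type-Options: nosniff`:

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from urllib.error import HTTPError
from urllib.request import urlopen

class NotFoundHandler(BaseHTTPRequestHandler):
    """Every request gets a 404 whose Content-Type is set explicitly."""

    def do_GET(self):
        body = b"<html><body><h1>Not Found</h1></body></html>"
        self.send_response(404)
        # Without this header, clients fall back to treating the HTML
        # body as plain text (sniffing is disabled by nosniff).
        self.send_header("Content-Type", "text/html;charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

def fetch_error_content_type():
    """Start a throwaway server, request a missing URL, return the
    Content-Type header carried by the 404 response."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), NotFoundHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        urlopen("http://127.0.0.1:%d/missing" % server.server_port)
    except HTTPError as e:  # urlopen raises on 404, headers are attached
        return e.headers.get("Content-Type")
    finally:
        server.shutdown()

print(fetch_error_content_type())  # text/html;charset=utf-8
```

The sketch only shows the header contract; the actual patch lives in the HistoryServer's Jetty error handling (see the linked pull request).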
[jira] [Updated] (SPARK-31697) HistoryServer should set Content-Type
[ https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-31697:
---
Summary: HistoryServer should set Content-Type (was: HistoryServer should set Content-Type header)
[jira] [Assigned] (SPARK-31697) HistoryServer should set Content-Type
[ https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31697:

Assignee: Kousuke Saruta (was: Apache Spark)
[jira] [Commented] (SPARK-31697) HistoryServer should set Content-Type
[ https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106034#comment-17106034 ]

Apache Spark commented on SPARK-31697:
--
User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28519
[jira] [Assigned] (SPARK-31697) HistoryServer should set Content-Type
[ https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-31697:

Assignee: Apache Spark (was: Kousuke Saruta)
[jira] [Commented] (SPARK-31697) HistoryServer should set Content-Type
[ https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106035#comment-17106035 ]

Apache Spark commented on SPARK-31697:
--
User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28519
[jira] [Created] (SPARK-31698) NPE on big dataset plans
Viacheslav Tradunsky created SPARK-31698:

Summary: NPE on big dataset plans
Key: SPARK-31698
URL: https://issues.apache.org/jira/browse/SPARK-31698
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.4
Environment: AWS EMR
Reporter: Viacheslav Tradunsky

We have a big dataset whose plan contains 275 SQL operations and more than 275 joins. On the terminal operation to write data, it fails with a NullPointerException. I understand that such a large number of operations might not be what Spark is designed for, but a NullPointerException is not an ideal way to fail in this case. For more details, please see the stacktrace.
[jira] [Updated] (SPARK-31698) NPE on big dataset plans
[ https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viacheslav Tradunsky updated SPARK-31698:
-
Attachment: Spark_NPE_big_dataset.log
[jira] [Updated] (SPARK-31698) NPE on big dataset plans
[ https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viacheslav Tradunsky updated SPARK-31698:
-
Docs Text: (was:
org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
	at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
	at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
	at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
	at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
	at com.company.app.executor.spark.SparkDatasetGenerationJob.generateDataset(SparkDatasetGenerationJob.scala:51)
	at com.company.app.executor.spark.SparkDatasetGenerationJob.call(SparkDatasetGenerationJob.scala:82)
	at com.company.app.executor.spark.SparkDatasetGenerationJob.call(SparkDatasetGenerationJob.scala:11)
	at org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:40)
	at org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:27)
	at org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:64)
[jira] [Updated] (SPARK-31698) NPE on big dataset plans
[ https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viacheslav Tradunsky updated SPARK-31698:
-
Environment: AWS EMR: 30 machine, 7TB RAM total. (was: AWS EMR)
[jira] [Updated] (SPARK-31698) NPE on big dataset plans
[ https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Viacheslav Tradunsky updated SPARK-31698:
-
Environment: AWS EMR: 30 machines, 7TB RAM total. (was: AWS EMR: 30 machine, 7TB RAM total.)
[jira] [Resolved] (SPARK-31695) BigDecimal setScale is not working in Spark UDF
[ https://issues.apache.org/jira/browse/SPARK-31695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-31695.
--
Resolution: Not A Problem

> BigDecimal setScale is not working in Spark UDF
> ---
>
> Key: SPARK-31695
> URL: https://issues.apache.org/jira/browse/SPARK-31695
> Project: Spark
> Issue Type: Bug
> Components: Spark Core, SQL
> Affects Versions: 2.3.4
> Reporter: Saravanan Raju
> Priority: Major
>
> I was trying to convert a JSON column to a map using a UDF, but it is not working as expected.
>
> {code:java}
> val df1 = Seq(("{\"k\":10.004}")).toDF("json")
> def udfJsonStrToMapDecimal = udf((jsonStr: String) => {
>   var jsonMap: Map[String, Any] = parse(jsonStr).values.asInstanceOf[Map[String, Any]]
>   jsonMap.map { case (k, v) => (k, BigDecimal.decimal(v.asInstanceOf[Double]).setScale(6)) }.toMap
> })
> val f = df1.withColumn("map", udfJsonStrToMapDecimal($"json"))
> scala> f.printSchema
> root
>  |-- json: string (nullable = true)
>  |-- map: map (nullable = true)
>  |    |-- key: string
>  |    |-- value: decimal(38,18) (valueContainsNull = true)
> {code}
>
> *Instead of decimal(38,6), it converts the value as decimal(38,18).*
[jira] [Commented] (SPARK-31695) BigDecimal setScale is not working in Spark UDF
[ https://issues.apache.org/jira/browse/SPARK-31695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106076#comment-17106076 ]

Hyukjin Kwon commented on SPARK-31695:
--
You can explicitly set the scale and precision:

{code}
val df1 = Seq(("{\"k\":10.004}")).toDF("json")
def udfJsonStrToMapDecimal = udf((jsonStr: String) => {
  var jsonMap: Map[String, Any] = parse(jsonStr).values.asInstanceOf[Map[String, Any]]
  jsonMap.map { case (k, v) => (k, BigDecimal.decimal(v.asInstanceOf[Double]).setScale(6)) }.toMap
}, DecimalType(38, 6))

val f = df1.withColumn("map", udfJsonStrToMapDecimal($"json"))
f.printSchema
{code}

Spark cannot automatically detect a scale that is only set at runtime.
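The distinction the comment above draws can be seen outside Spark with Python's decimal module (an illustrative analogue, not Spark code): quantizing fixes the fraction digits of an individual value, the way setScale(6) does, but the declared result type still decides the stored scale, which is why the UDF must declare DecimalType(38, 6) explicitly:

```python
from decimal import Decimal

# Like Scala's BigDecimal.setScale(6): fix this value's scale to 6.
scaled = Decimal("10.004").quantize(Decimal("1.000000"))
print(scaled)  # 10.004000

# What the default Spark UDF result type decimal(38,18) effectively
# does to the value regardless of the per-value setScale call:
widened = scaled.quantize(Decimal("1." + "0" * 18))
print(widened)  # 10.004000000000000000
```

Per-value scaling survives only if the consumer's declared type agrees with it; otherwise the declared type wins.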
[jira] [Created] (SPARK-31699) Optimize OpenSession speed in thriftserver
angerszhu created SPARK-31699:
-
Summary: Optimize OpenSession speed in thriftserver
Key: SPARK-31699
URL: https://issues.apache.org/jira/browse/SPARK-31699
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0
Reporter: angerszhu
[jira] [Resolved] (SPARK-31698) NPE on big dataset plans
[ https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-31698.
--
Resolution: Duplicate

The error message and stack trace match SPARK-29046, which was fixed in Spark 2.4.5. I'll mark this as a duplicate.
[jira] [Commented] (SPARK-31690) Backport pyspark Interaction to Spark 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-31690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106081#comment-17106081 ]

Hyukjin Kwon commented on SPARK-31690:
--
It seems to be a new API, which isn't usually backported per https://spark.apache.org/versioning-policy.html. Also, there is no need to create a JIRA next time for a backport; you can reuse the original ticket.

> Backport pyspark Interaction to Spark 2.4.x
> ---
>
> Key: SPARK-31690
> URL: https://issues.apache.org/jira/browse/SPARK-31690
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.4.5
> Reporter: Luca Giovagnoli
> Priority: Minor
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> In our company, we could really make use of the Interaction PySpark wrapper on Spark 2.4.x. "Interaction" is available in Spark 3.0, so I'm proposing to backport the following code to the current Spark 2.4.6-rc1:
> - https://issues.apache.org/jira/browse/SPARK-26970
> - https://github.com/apache/spark/pull/24426/files
>
> I'm available to pick this up if it's approved.
[jira] [Resolved] (SPARK-31690) Backport pyspark Interaction to Spark 2.4.x
[ https://issues.apache.org/jira/browse/SPARK-31690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-31690.
--
Resolution: Won't Fix
[jira] [Commented] (SPARK-31686) Return of String instead of array in function get_json_object
[ https://issues.apache.org/jira/browse/SPARK-31686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106084#comment-17106084 ]

Hyukjin Kwon commented on SPARK-31686:
--
Yes, you don't know the output type before actually parsing, and the type should be known before execution. It's by design.

> Return of String instead of array in function get_json_object
> -
>
> Key: SPARK-31686
> URL: https://issues.apache.org/jira/browse/SPARK-31686
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.5
> Environment:
> {code:json}
> {
>   "customer": {
>     "addesses": [
>       { "location": "arizona" }
>     ]
>   }
> }
> {code}
> get_json_object(string(customer),'$addresses[*].location') returns "arizona"; the expected result should be ["arizona"].
> Reporter: Touopi Touopi
> Priority: Major
>
> When we select a node of a JSON object that is an array and the array contains one element, get_json_object returns a String with quote characters instead of an array of one element.
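The "by design" answer above can be sketched in a few lines (plain Python, a hedged model of the described behavior, not Spark's implementation; the function name is invented): get_json_object's output type is string and is fixed before any row is parsed, so a wildcard path with exactly one match yields the bare value rather than a one-element array:

```python
import json

def get_json_object_like(path_values: list) -> str:
    """Model of the reported behavior: the return type is always a
    string, decided before execution, so a single match is unwrapped
    instead of being kept as a one-element JSON array."""
    if len(path_values) == 1:
        v = path_values[0]
        return v if isinstance(v, str) else json.dumps(v)
    return json.dumps(path_values)

doc = json.loads('{"customer": {"addresses": [{"location": "arizona"}]}}')
matches = [a["location"] for a in doc["customer"]["addresses"]]
print(get_json_object_like(matches))  # arizona, not ["arizona"]
```

A caller who needs the array form has to re-wrap the value itself, since the engine cannot change the column type per row.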
[jira] [Resolved] (SPARK-31686) Return of String instead of array in function get_json_object
[ https://issues.apache.org/jira/browse/SPARK-31686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-31686.
--
Resolution: Not A Problem
[jira] [Commented] (SPARK-29046) Possible NPE on SQLConf.get when SparkContext is stopping in another thread
[ https://issues.apache.org/jira/browse/SPARK-29046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106087#comment-17106087 ]

Viacheslav Tradunsky commented on SPARK-29046:
--
[~kabhwan] Do you know a lower version of Spark that does not have this issue? Maybe Spark 2.3.2?

> Possible NPE on SQLConf.get when SparkContext is stopping in another thread
> ---
>
> Key: SPARK-29046
> URL: https://issues.apache.org/jira/browse/SPARK-29046
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
> Reporter: Jungtaek Lim
> Assignee: Jungtaek Lim
> Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
> We encountered an NPE in listener code which deals with the query plan, and according to the stack trace below, the only possible cause of the NPE is SparkContext._dagScheduler being null, which is only possible while stopping the SparkContext (unless null is set from outside).
>
> {code:java}
> 19/09/11 00:22:24 INFO server.AbstractConnector: Stopped Spark@49d8c117{HTTP/1.1,[http/1.1]}{0.0.0.0:0}
> 19/09/11 00:22:24 INFO ui.SparkUI: Stopped Spark web UI at http://:32770
> 19/09/11 00:22:24 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
> 19/09/11 00:22:24 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
> 19/09/11 00:22:24 INFO cluster.SchedulerExtensionServices: Stopping SchedulerExtensionServices(serviceOption=None, services=List(), started=false)
> 19/09/11 00:22:24 WARN sql.SparkExecutionPlanProcessor: Caught exception during parsing event
> java.lang.NullPointerException
> 	at org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133)
> 	at org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133)
> 	at scala.Option.map(Option.scala:146)
> 	at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:133)
> 	at org.apache.spark.sql.types.StructType.simpleString(StructType.scala:352)
> 	at com.hortonworks.spark.atlas.types.internal$.sparkTableToEntity(internal.scala:102)
> 	at com.hortonworks.spark.atlas.types.AtlasEntityUtils$class.tableToEntity(AtlasEntityUtils.scala:62)
> 	at com.hortonworks.spark.atlas.sql.CommandsHarvester$.tableToEntity(CommandsHarvester.scala:45)
> 	at com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:240)
> 	at com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:239)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> 	at com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities(CommandsHarvester.scala:239)
> 	at com.hortonworks.spark.atlas.sql.CommandsHarvester$CreateDataSourceTableAsSelectHarvester$.harvest(CommandsHarvester.scala:104)
> 	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:138)
> 	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:89)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> 	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> 	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> 	at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> 	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> 	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:89)
> 	at com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:63)
> 	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72)
> 	at com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(Abs
[jira] [Commented] (SPARK-29046) Possible NPE on SQLConf.get when SparkContext is stopping in another thread
[ https://issues.apache.org/jira/browse/SPARK-29046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106104#comment-17106104 ] Jungtaek Lim commented on SPARK-29046: -- Sorry I don't know. Also worth noting that Spark 2.3 version line was EOLed AFAIK. > Possible NPE on SQLConf.get when SparkContext is stopping in another thread > --- > > Key: SPARK-29046 > URL: https://issues.apache.org/jira/browse/SPARK-29046 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > We encountered NPE in listener code which deals with query plan - and > according to the stack trace below, only possible case of NPE is > SparkContext._dagScheduler being null, which is only possible while stopping > SparkContext (unless null is set from outside). > > {code:java} > 19/09/11 00:22:24 INFO server.AbstractConnector: Stopped > Spark@49d8c117{HTTP/1.1,[http/1.1]}{0.0.0.0:0}19/09/11 00:22:24 INFO > server.AbstractConnector: Stopped > Spark@49d8c117{HTTP/1.1,[http/1.1]}{0.0.0.0:0}19/09/11 00:22:24 INFO > ui.SparkUI: Stopped Spark web UI at http://:3277019/09/11 00:22:24 INFO > cluster.YarnClusterSchedulerBackend: Shutting down all executors19/09/11 > 00:22:24 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each > executor to shut down19/09/11 00:22:24 INFO > cluster.SchedulerExtensionServices: Stopping > SchedulerExtensionServices(serviceOption=None, services=List(), > started=false)19/09/11 00:22:24 WARN sql.SparkExecutionPlanProcessor: Caught > exception during parsing eventjava.lang.NullPointerException at > org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133) at > org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133) at > scala.Option.map(Option.scala:146) at > org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:133) at > 
org.apache.spark.sql.types.StructType.simpleString(StructType.scala:352) at > com.hortonworks.spark.atlas.types.internal$.sparkTableToEntity(internal.scala:102) > at > com.hortonworks.spark.atlas.types.AtlasEntityUtils$class.tableToEntity(AtlasEntityUtils.scala:62) > at > com.hortonworks.spark.atlas.sql.CommandsHarvester$.tableToEntity(CommandsHarvester.scala:45) > at > com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:240) > at > com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:239) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at > scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at > com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities(CommandsHarvester.scala:239) > at > com.hortonworks.spark.atlas.sql.CommandsHarvester$CreateDataSourceTableAsSelectHarvester$.harvest(CommandsHarvester.scala:104) > at > com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:138) > at > com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:89) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at > scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at > scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at > com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:89) > at > com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:63) > at > com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72) > at > com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:71) > at
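A Python-for-illustration sketch (the actual fix is in Scala's `SQLConf.get`; all names below are made up) of the race this ticket describes and the usual defensive pattern: read the shared field once into a local and fall back when it has been nulled by a concurrent `stop()`.

```python
class SparkContextLike:
    """Stand-in for SparkContext with a field nulled during shutdown."""
    def __init__(self):
        self._dag_scheduler = object()   # mirrors SparkContext._dagScheduler

    def stop(self):
        self._dag_scheduler = None       # what stop() does, possibly concurrently

def active_conf(ctx):
    # Single read into a local: no window for a second dereference to
    # observe None after a check passed, which is how the NPE arises.
    scheduler = ctx._dag_scheduler
    return "session-conf" if scheduler is not None else "fallback-conf"

ctx = SparkContextLike()
assert active_conf(ctx) == "session-conf"
ctx.stop()                               # context stopped "in another thread"
assert active_conf(ctx) == "fallback-conf"
```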
[jira] [Resolved] (SPARK-31697) HistoryServer should set Content-Type
[ https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31697. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28519 [https://github.com/apache/spark/pull/28519] > HistoryServer should set Content-Type > - > > Key: SPARK-31697 > URL: https://issues.apache.org/jira/browse/SPARK-31697 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.0 > > > I noticed that we will get html as plain text when we access to wrong URLs on > HistoryServer. > {code:java} > > > type="text/css"/> href="/static/vis-timeline-graph2d.min.css" type="text/css"/> rel="stylesheet" href="/static/webui.css" type="text/css"/> rel="stylesheet" href="/static/timeline-view.css" type="text/css"/> src="/static/sorttable.js"> src="/static/jquery-3.4.1.min.js"> src="/static/vis-timeline-graph2d.min.js"> src="/static/bootstrap.bundle.min.js"> src="/static/initialize-tooltips.js"> src="/static/table.js"> src="/static/timeline-view.js"> src="/static/log-view.js"> src="/static/webui.js">setUIRoot('') > > href="/static/spark-logo-77x50px-hd.png"> > Not Found > > > > > > > > > 3.1.0-SNAPSHOT > > Not Found > > > > > > Application local-1589239 not found. > > > > > {code} > > The reason is Content-Type not set. > {code:java} > HTTP/1.1 404 Not Found > Date: Wed, 13 May 2020 06:59:29 GMT > Cache-Control: no-cache, no-store, must-revalidate > X-Frame-Options: SAMEORIGIN > X-XSS-Protection: 1; mode=block > X-Content-Type-Options: nosniff > Content-Length: 1778 > Server: Jetty(9.4.18.v20190429) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
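The patch itself is in the Scala HistoryServer, but the essence of the fix, per the description in this thread, is simply that an error page must carry a Content-Type header to be rendered as HTML. A plain-Python sketch with stdlib `http.server`, using made-up handler and page content:

```python
import http.server
import threading
import urllib.error
import urllib.request

class NotFoundHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body><h1>Not Found</h1></body></html>"
        self.send_response(404)
        # Without this header (and with X-Content-Type-Options: nosniff, as
        # in the ticket's response dump), browsers show the HTML as text.
        self.send_header("Content-Type", "text/html;charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, content_type = None, None
try:
    urllib.request.urlopen("http://127.0.0.1:%d/missing" % server.server_port)
except urllib.error.HTTPError as err:
    status, content_type = err.code, err.headers["Content-Type"]
server.shutdown()
```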
[jira] [Created] (SPARK-31700) spark sql write orc file outformat
Dexter Morgan created SPARK-31700: - Summary: spark sql write orc file outformat Key: SPARK-31700 URL: https://issues.apache.org/jira/browse/SPARK-31700 Project: Spark Issue Type: Task Components: Input/Output Affects Versions: 2.3.3 Reporter: Dexter Morgan !image-2020-05-13-16-53-49-678.png! Can you give me an example of writing an ORC file with a Spark SQL output format, please? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
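Jira threads aren't the usual venue for usage questions, but a minimal Spark SQL sketch of writing ORC (table and column names invented for illustration) would look like:

```sql
-- Declare a table backed by Spark's native ORC source, then write into it:
CREATE TABLE orc_example (id BIGINT, value STRING) USING ORC;
INSERT INTO orc_example VALUES (1, 'a'), (2, 'b');

-- Or create and populate in one step (CTAS):
CREATE TABLE orc_ctas USING ORC AS SELECT id, 'x' AS value FROM range(3);
```

From the DataFrame API the equivalent is `df.write.orc(path)`.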
[jira] [Created] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR
Hyukjin Kwon created SPARK-31701: Summary: Bump up the minimum Arrow version as 0.15.1 in SparkR Key: SPARK-31701 URL: https://issues.apache.org/jira/browse/SPARK-31701 Project: Spark Issue Type: Bug Components: SparkR, SQL Affects Versions: 3.0.0 Reporter: Hyukjin Kwon The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should bump up the version on the SparkR side as well, to match 0.15.1. There's no backward-compatibility concern because Arrow optimization in SparkR is new. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR
[ https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31701: - Issue Type: Improvement (was: Bug) > Bump up the minimum Arrow version as 0.15.1 in SparkR > - > > Key: SPARK-31701 > URL: https://issues.apache.org/jira/browse/SPARK-31701 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > PySpark side bumped up the minimum Arrow version at SPARK-29376. We should > better bump up the version in SparkR side to match with 0.15.1. There's no > backward compatibility concern because Arrow optimization in SparkR is new. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR
[ https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106225#comment-17106225 ] Apache Spark commented on SPARK-31701: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28520 > Bump up the minimum Arrow version as 0.15.1 in SparkR > - > > Key: SPARK-31701 > URL: https://issues.apache.org/jira/browse/SPARK-31701 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > PySpark side bumped up the minimum Arrow version at SPARK-29376. We should > better bump up the version in SparkR side to match with 0.15.1. There's no > backward compatibility concern because Arrow optimization in SparkR is new. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR
[ https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31701: Assignee: Apache Spark > Bump up the minimum Arrow version as 0.15.1 in SparkR > - > > Key: SPARK-31701 > URL: https://issues.apache.org/jira/browse/SPARK-31701 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Minor > > PySpark side bumped up the minimum Arrow version at SPARK-29376. We should > better bump up the version in SparkR side to match with 0.15.1. There's no > backward compatibility concern because Arrow optimization in SparkR is new. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR
[ https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31701: Assignee: (was: Apache Spark) > Bump up the minimum Arrow version as 0.15.1 in SparkR > - > > Key: SPARK-31701 > URL: https://issues.apache.org/jira/browse/SPARK-31701 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Minor > > PySpark side bumped up the minimum Arrow version at SPARK-29376. We should > better bump up the version in SparkR side to match with 0.15.1. There's no > backward compatibility concern because Arrow optimization in SparkR is new. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31702) Old POSIXlt, POSIXct and Date become corrupt due to calendar difference
Hyukjin Kwon created SPARK-31702: Summary: Old POSIXlt, POSIXct and Date become corrupt due to calendar difference Key: SPARK-31702 URL: https://issues.apache.org/jira/browse/SPARK-31702 Project: Spark Issue Type: Bug Components: SparkR, SQL Affects Versions: 2.4.5, 3.0.0 Reporter: Hyukjin Kwon Old POSIXlt, POSIXct and Date become corrupt in SparkR. For example, see below: {code} # Non-existent timestamp in hybrid Julian and Gregorian Calendar showDF(createDataFrame(as.data.frame(list(list(POSIXct=as.POSIXct("1582-10-10 00:01:00"), POSIXlt=as.POSIXlt("1582-10-10 00:01:00")) {code} {code} +---+---+ |POSIXct|POSIXlt| +---+---+ |1582-09-30 00:33:08|1582-09-30 00:33:08| +---+---+ {code} See https://docs.google.com/document/d/1Upf6c5fNM59Q6nko-ipjLLae86x9mBejwuXshii-Azg/edit?usp=sharing Note that the results appear to have been wrong since the very first implementation. The cause seems to be that the R side uses the proleptic Gregorian calendar while the JVM side uses the hybrid Julian and Gregorian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
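Illustrative only (not SparkR or JVM code): the Fliegel-Van Flandern Julian Day Number formulas show why 1582-10-10 in R's proleptic Gregorian calendar resolves to 1582-09-30 under a hybrid Julian/Gregorian calendar, as in the output above.

```python
# Julian Day Number conversions; names and structure are illustrative.

def tdiv(a, b):
    # Integer division truncating toward zero, as the original Fortran
    # formulas assume (Python's // floors, which differs for negatives).
    return int(a / b)

def gregorian_to_jdn(y, m, d):
    # Proleptic Gregorian calendar (what R's Date/POSIXct arithmetic uses).
    return (tdiv(1461 * (y + 4800 + tdiv(m - 14, 12)), 4)
            + tdiv(367 * (m - 2 - 12 * tdiv(m - 14, 12)), 12)
            - tdiv(3 * tdiv(y + 4900 + tdiv(m - 14, 12), 100), 4)
            + d - 32075)

def julian_to_jdn(y, m, d):
    # Julian calendar (what the hybrid calendar applies before the
    # 1582-10-15 cutover).
    return (367 * y - tdiv(7 * (y + 5001 + tdiv(m - 9, 7)), 4)
            + tdiv(275 * m, 9) + d + 1729777)

# Oct 5-14, 1582 do not exist in the hybrid calendar, so the day count of
# proleptic Gregorian 1582-10-10 lands on Julian 1582-09-30 -- the same
# shifted date SparkR prints in the example.
assert gregorian_to_jdn(1582, 10, 10) == julian_to_jdn(1582, 9, 30)
```

The remaining 33-minute shift in the printed timestamps comes from historical local-mean-time offsets in old timezone data, a separate effect from the calendar gap.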
[jira] [Updated] (SPARK-31702) Old POSIXlt, POSIXct and Date become corrupt due to calendar difference
[ https://issues.apache.org/jira/browse/SPARK-31702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31702: - Description: Old POSIXlt, POSIXct and Date become corrupt in SparkR. For example, see below: {code} # Non-existent timestamp in hybrid Julian and Gregorian Calendar showDF(createDataFrame(as.data.frame(list(list(POSIXct=as.POSIXct("1582-10-10 00:01:00"), POSIXlt=as.POSIXlt("1582-10-10 00:01:00")) {code} {code} +---+---+ |POSIXct|POSIXlt| +---+---+ |1582-09-30 00:33:08|1582-09-30 00:33:08| +---+---+ {code} See https://docs.google.com/document/d/1an3Mzv6s0naO4mDwGFHJ48gLT--6EliA1GG3kbgBymo/edit?usp=sharing Note that the results seem wrong from the very first implementation. The cause seems because R side uses Proleptic Gregorian calendar but JVM side is using hybrid Juilian and Gregoiran calendar. was: Old POSIXlt, POSIXct and Date become corrupt in SparkR. For example, see below: {code} # Non-existent timestamp in hybrid Julian and Gregorian Calendar showDF(createDataFrame(as.data.frame(list(list(POSIXct=as.POSIXct("1582-10-10 00:01:00"), POSIXlt=as.POSIXlt("1582-10-10 00:01:00")) {code} {code} +---+---+ |POSIXct|POSIXlt| +---+---+ |1582-09-30 00:33:08|1582-09-30 00:33:08| +---+---+ {code} See https://docs.google.com/document/d/1Upf6c5fNM59Q6nko-ipjLLae86x9mBejwuXshii-Azg/edit?usp=sharing Note that the results seem wrong from the very first implementation. The cause seems because R side uses Proleptic Gregorian calendar but JVM side is using hybrid Juilian and Gregoiran calendar. > Old POSIXlt, POSIXct and Date become corrupt due to calendar difference > --- > > Key: SPARK-31702 > URL: https://issues.apache.org/jira/browse/SPARK-31702 > Project: Spark > Issue Type: Bug > Components: SparkR, SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Old POSIXlt, POSIXct and Date become corrupt in SparkR. 
For example, see > below: > {code} > # Non-existent timestamp in hybrid Julian and Gregorian Calendar > showDF(createDataFrame(as.data.frame(list(list(POSIXct=as.POSIXct("1582-10-10 > 00:01:00"), POSIXlt=as.POSIXlt("1582-10-10 00:01:00")) > {code} > {code} > +---+---+ > |POSIXct|POSIXlt| > +---+---+ > |1582-09-30 00:33:08|1582-09-30 00:33:08| > +---+---+ > {code} > See > https://docs.google.com/document/d/1an3Mzv6s0naO4mDwGFHJ48gLT--6EliA1GG3kbgBymo/edit?usp=sharing > Note that the results seem wrong from the very first implementation. The > cause seems because R side uses Proleptic Gregorian calendar but JVM side is > using hybrid Juilian and Gregoiran calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
Michail Giannakopoulos created SPARK-31703: -- Summary: Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64) Key: SPARK-31703 URL: https://issues.apache.org/jira/browse/SPARK-31703 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.5, 3.0.0 Environment: AIX 7.2 LinuxPPC64 with RedHat. Reporter: Michail Giannakopoulos Attachments: Data_problem_Spark.gif Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) so as to be able to read data stored in parquet format, we notice that values associated with DOUBLE and DECIMAL types are parsed in the wrong form. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
[ https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos updated SPARK-31703: --- Attachment: Data_problem_Spark.gif > Changes made by SPARK-26985 break reading parquet files correctly in > BigEndian architectures (AIX + LinuxPPC64) > --- > > Key: SPARK-31703 > URL: https://issues.apache.org/jira/browse/SPARK-31703 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 > Environment: AIX 7.2 > LinuxPPC64 with RedHat. >Reporter: Michail Giannakopoulos >Priority: Critical > Attachments: Data_problem_Spark.gif > > > Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) > so as to be able to read data stored in parquet format, we notice that values > associated with DOUBLE and DECIMAL types are parsed in the wrong form. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
[ https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos updated SPARK-31703: --- Labels: BigEndian (was: ) > Changes made by SPARK-26985 break reading parquet files correctly in > BigEndian architectures (AIX + LinuxPPC64) > --- > > Key: SPARK-31703 > URL: https://issues.apache.org/jira/browse/SPARK-31703 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 > Environment: AIX 7.2 > LinuxPPC64 with RedHat. >Reporter: Michail Giannakopoulos >Priority: Critical > Labels: BigEndian > Attachments: Data_problem_Spark.gif > > > Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) > so as to be able to read data stored in parquet format, we notice that values > associated with DOUBLE and DECIMAL types are parsed in the wrong form. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)
[ https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michail Giannakopoulos updated SPARK-31703: --- Description: Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) so as to be able to read data stored in parquet format, we notice that values associated with DOUBLE and DECIMAL types are parsed in the wrong form. According to the Parquet documentation, values are always stored using a little-endian representation: [https://github.com/apache/parquet-format/blob/master/Encodings.md] {noformat} The plain encoding is used whenever a more efficient encoding can not be used. It stores the data in the following format: BOOLEAN: Bit Packed, LSB first INT32: 4 bytes little endian INT64: 8 bytes little endian INT96: 12 bytes little endian (deprecated) FLOAT: 4 bytes IEEE little endian DOUBLE: 8 bytes IEEE little endian BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in the array FIXED_LEN_BYTE_ARRAY: the bytes contained in the array For native types, this outputs the data as little endian. Floating point types are encoded in IEEE. For the byte array type, it encodes the length as a 4 byte little endian, followed by the bytes.{noformat} was:Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) so as to be able to read data stored in parquet format, we notice that values associated with DOUBLE and DECIMAL types are parsed in the wrong form. > Changes made by SPARK-26985 break reading parquet files correctly in > BigEndian architectures (AIX + LinuxPPC64) > --- > > Key: SPARK-31703 > URL: https://issues.apache.org/jira/browse/SPARK-31703 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 > Environment: AIX 7.2 > LinuxPPC64 with RedHat. 
>Reporter: Michail Giannakopoulos >Priority: Critical > Labels: BigEndian > Attachments: Data_problem_Spark.gif > > > Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) > so as to be able to read data stored in parquet format, we notice that values > associated with DOUBLE and DECIMAL types are parsed in the wrong form. > According toe parquet documentation, they always opt to store the values > using left-endian representation for values: > [https://github.com/apache/parquet-format/blob/master/Encodings.md] > {noformat} > The plain encoding is used whenever a more efficient encoding can not be > used. It > stores the data in the following format: > BOOLEAN: Bit Packed, LSB first > INT32: 4 bytes little endian > INT64: 8 bytes little endian > INT96: 12 bytes little endian (deprecated) > FLOAT: 4 bytes IEEE little endian > DOUBLE: 8 bytes IEEE little endian > BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained > in the array > FIXED_LEN_BYTE_ARRAY: the bytes contained in the array > For native types, this outputs the data as little endian. Floating > point types are encoded in IEEE. > For the byte array type, it encodes the length as a 4 byte little > endian, followed by the bytes.{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
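Independent of Spark, the stdlib `struct` module shows the failure mode described here: Parquet's plain encoding fixes DOUBLE as 8 bytes IEEE little-endian, so a reader on a big-endian machine that decodes in native byte order gets a different value.

```python
import struct

# Parquet plain encoding: DOUBLE is 8 bytes IEEE little-endian on disk.
encoded = struct.pack("<d", 1.5)            # bytes as they appear in the file

correct = struct.unpack("<d", encoded)[0]   # decode honoring the spec
misread = struct.unpack(">d", encoded)[0]   # decode in big-endian native order

assert correct == 1.5
assert misread != 1.5   # a reader that ignores the spec sees garbage
```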
[jira] [Created] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11
Markus Tretzmüller created SPARK-31704: -- Summary: PandasUDFType.GROUPED_AGG with Java 11 Key: SPARK-31704 URL: https://issues.apache.org/jira/browse/SPARK-31704 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.0 Environment: java jdk: 11 python: 3.7 Reporter: Markus Tretzmüller Running the example from the [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions] gives an error with java 11. It works with java 8. {code:python} import findspark findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7') from pyspark.sql.functions import pandas_udf, PandasUDFType from pyspark.sql import Window from pyspark.sql import SparkSession if __name__ == '__main__': spark = SparkSession \ .builder \ .appName('test') \ .getOrCreate() df = spark.createDataFrame( [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v")) @pandas_udf("double", PandasUDFType.GROUPED_AGG) def mean_udf(v): return v.mean() w = (Window.partitionBy('id') .orderBy('v') .rowsBetween(-1, 0)) df.withColumn('mean_v', mean_udf(df['v']).over(w)).show() {code} {noformat} File "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString. 
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 (TID 37, 131.130.32.15, executor driver): java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available at io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473) at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243) at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233) at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245) at org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222) at org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240) at org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132) at org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120) at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101) at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932) at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
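No fix appears in this thread, but the `sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available` failure is the known Netty direct-buffer restriction on JDK 9+ that Arrow-based UDFs run into. Spark 3.0's documentation recommends enabling Netty's reflective access when running on Java 11 (verify the exact flags against your Spark release):

```properties
# spark-defaults.conf (or pass each via --conf on spark-submit)
spark.driver.extraJavaOptions   -Dio.netty.tryReflectionSetAccessible=true
spark.executor.extraJavaOptions -Dio.netty.tryReflectionSetAccessible=true
```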
[jira] [Commented] (SPARK-11664) Add methods to get bisecting k-means cluster structure
[ https://issues.apache.org/jira/browse/SPARK-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106325#comment-17106325 ] Dan Griffin commented on SPARK-11664: - Hey. I'm wondering what the status of this capability being integrated into the official spark release? It seems that many in the community would like to have this feature in addition to the final sets of clusters. > Add methods to get bisecting k-means cluster structure > -- > > Key: SPARK-11664 > URL: https://issues.apache.org/jira/browse/SPARK-11664 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Reporter: Yu Ishikawa >Priority: Minor > Labels: bulk-closed > > I think users want to visualize the result of bisecting k-means clustering as > a dendrogram in order to confirm it. So it would be great to support method > to get the cluster tree structure as an adjacency list, linkage matrix and so > on. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31705) Rewrite join condition to conjunctive normal form
Yuming Wang created SPARK-31705: --- Summary: Rewrite join condition to conjunctive normal form Key: SPARK-31705 URL: https://issues.apache.org/jira/browse/SPARK-31705 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang Assignee: Yuming Wang Rewrite join condition to [conjunctive normal form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more conditions to filter. PostgreSQL: {code:sql} CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE, l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); CREATE TABLE orders ( o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); explain select count(*) from lineitem, orders where l_orderkey = o_orderkey and ((l_suppkey > 10 and o_custkey > 20) or (l_suppkey > 30 and o_custkey > 40)) and l_partkey > 0; explain select count(*) from lineitem join orders on l_orderkey = o_orderkey and ((l_suppkey > 10 and o_custkey > 20) or (l_suppkey > 30 and o_custkey > 40)) and l_partkey > 0; {code} {noformat} postgres=# explain select count(*) from lineitem, orders postgres-# where l_orderkey = o_orderkey postgres-# and ((l_suppkey > 10 and o_custkey > 20) postgres(# or (l_suppkey > 30 and o_custkey > 40)) postgres-# and l_partkey > 0; QUERY PLAN --- Aggregate (cost=21.18..21.19 rows=1 width=8) -> Hash Join (cost=10.60..21.17 rows=2 width=0) Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > 20)) OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40))) -> Seq Scan on orders 
(cost=0.00..10.45 rows=17 width=16) Filter: ((o_custkey > 20) OR (o_custkey > 40)) -> Hash (cost=10.53..10.53 rows=6 width=16) -> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR (l_suppkey > 30))) (9 rows) postgres=# postgres=# explain select count(*) from lineitem join orders postgres-# on l_orderkey = o_orderkey postgres-# and ((l_suppkey > 10 and o_custkey > 20) postgres(# or (l_suppkey > 30 and o_custkey > 40)) postgres-# and l_partkey > 0; QUERY PLAN --- Aggregate (cost=21.18..21.19 rows=1 width=8) -> Hash Join (cost=10.60..21.17 rows=2 width=0) Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > 20)) OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40))) -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) Filter: ((o_custkey > 20) OR (o_custkey > 40)) -> Hash (cost=10.53..10.53 rows=6 width=16) -> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR (l_suppkey > 30))) (9 rows) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
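The CNF expansion behind those PostgreSQL plans can be sketched in a few lines of Python. This is an illustrative toy, not Spark's or PostgreSQL's actual optimizer rule: distributing OR over AND turns the predicate into a conjunction of clauses, and any clause whose columns all belong to one table can be pushed below the join, which is exactly how the `Seq Scan` filters in the plans above arise.

```python
# Toy CNF conversion (illustrative only, not the Spark implementation).
# A predicate is a leaf string, or a tuple ('and'|'or', left, right).
# The result is a list of clauses; each clause is a frozenset of leaves
# that are implicitly OR-ed together, with the clauses AND-ed.

def to_cnf(expr):
    op = expr[0] if isinstance(expr, tuple) else None
    if op == 'and':
        # (A) AND (B): concatenate the clause lists.
        return to_cnf(expr[1]) + to_cnf(expr[2])
    if op == 'or':
        # (A1 ^ ... ^ An) v (B1 ^ ... ^ Bm)  ==>  conjunction of (Ai v Bj).
        return [a | b for a in to_cnf(expr[1]) for b in to_cnf(expr[2])]
    return [frozenset([expr])]

# The join condition from the example above.
pred = ('or',
        ('and', 'l_suppkey > 10', 'o_custkey > 20'),
        ('and', 'l_suppkey > 30', 'o_custkey > 40'))

clauses = to_cnf(pred)

# Clauses touching only one table can be pushed to that table's scan,
# matching the plan's Filter: ((l_suppkey > 10) OR (l_suppkey > 30))
# on lineitem and ((o_custkey > 20) OR (o_custkey > 40)) on orders.
pushable_to_lineitem = [c for c in clauses if all(x.startswith('l_') for x in c)]
pushable_to_orders = [c for c in clauses if all(x.startswith('o_') for x in c)]
```

The remaining mixed clauses stay in the join filter, which is why the `Join Filter` line in both plans still carries the original disjunction.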
[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106344#comment-17106344 ] Yuming Wang commented on SPARK-31705: - I'm working on. > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > explain select count(*) from lineitem, orders > where l_orderkey = o_orderkey > and ((l_suppkey > 10 and o_custkey > 20) > or (l_suppkey > 30 and o_custkey > 40)) > and l_partkey > 0; > explain select count(*) from lineitem join orders > on l_orderkey = o_orderkey > and ((l_suppkey > 10 and o_custkey > 20) > or (l_suppkey > 30 and o_custkey > 40)) > and l_partkey > 0; > {code} > {noformat} > postgres=# explain select count(*) from lineitem, orders > postgres-# where l_orderkey = o_orderkey > postgres-# and ((l_suppkey > 10 and o_custkey > 20) > postgres(# or (l_suppkey > 30 and o_custkey > 40)) > postgres-# and l_partkey > 0; > QUERY PLAN > --- > Aggregate 
(cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > > 20)) OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 20) OR (o_custkey > 40)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR > (l_suppkey > 30))) > (9 rows) > postgres=# > postgres=# explain select count(*) from lineitem join orders > postgres-# on l_orderkey = o_orderkey > postgres-# and ((l_suppkey > 10 and o_custkey > 20) > postgres(# or (l_suppkey > 30 and o_custkey > 40)) > postgres-# and l_partkey > 0; > QUERY PLAN > --- > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > > 20)) OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 20) OR (o_custkey > 40)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR > (l_suppkey > 30))) > (9 rows) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR
[ https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31701: - Assignee: Hyukjin Kwon > Bump up the minimum Arrow version as 0.15.1 in SparkR > - > > Key: SPARK-31701 > URL: https://issues.apache.org/jira/browse/SPARK-31701 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > PySpark side bumped up the minimum Arrow version at SPARK-29376. We should > better bump up the version in SparkR side to match with 0.15.1. There's no > backward compatibility concern because Arrow optimization in SparkR is new. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR
[ https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31701. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28520 [https://github.com/apache/spark/pull/28520] > Bump up the minimum Arrow version as 0.15.1 in SparkR > - > > Key: SPARK-31701 > URL: https://issues.apache.org/jira/browse/SPARK-31701 > Project: Spark > Issue Type: Improvement > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.0.0 > > > PySpark side bumped up the minimum Arrow version at SPARK-29376. We should > better bump up the version in SparkR side to match with 0.15.1. There's no > backward compatibility concern because Arrow optimization in SparkR is new. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31706) add back the support of streaming update mode
Wenchen Fan created SPARK-31706: --- Summary: add back the support of streaming update mode Key: SPARK-31706 URL: https://issues.apache.org/jira/browse/SPARK-31706 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11
[ https://issues.apache.org/jira/browse/SPARK-31704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106500#comment-17106500 ] Bryan Cutler commented on SPARK-31704: -- This is due to a Netty API that Arrow uses and unfortunately, it currently needs the following Java option set to get working {{-Dio.netty.tryReflectionSetAccessible=true}}. See https://issues.apache.org/jira/browse/SPARK-29924 which added documentation for this here https://github.com/apache/spark/blob/master/docs/index.md#downloading. > PandasUDFType.GROUPED_AGG with Java 11 > -- > > Key: SPARK-31704 > URL: https://issues.apache.org/jira/browse/SPARK-31704 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 > Environment: java jdk: 11 > python: 3.7 > >Reporter: Markus Tretzmüller >Priority: Minor > Labels: newbie > > Running the example from the > [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions] > gives an error with java 11. It works with java 8. > {code:python} > import findspark > findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7') > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql import Window > from pyspark.sql import SparkSession > if __name__ == '__main__': > spark = SparkSession \ > .builder \ > .appName('test') \ > .getOrCreate() > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("double", PandasUDFType.GROUPED_AGG) > def mean_udf(v): > return v.mean() > w = (Window.partitionBy('id') > .orderBy('v') > .rowsBetween(-1, 0)) > df.withColumn('mean_v', mean_udf(df['v']).over(w)).show() > {code} > {noformat} > File > "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 > in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 > (TID 37, 131.130.32.15, executor driver): > java.lang.UnsupportedOperationException: sun.misc.Unsafe or > java.nio.DirectByteBuffer.(long, int) not available > at > io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473) > at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243) > at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233) > at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245) > at > org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240) > at > org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132) > at > org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
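As a hedged sketch of how the JVM flag mentioned in the comment is typically passed to a job: the flag must reach both the driver and the executors. `spark.driver.extraJavaOptions` and `spark.executor.extraJavaOptions` are standard Spark configuration keys; `your_app.py` is a placeholder for the user's application.

```shell
# Config fragment (assumption: standalone/local submission; adjust for your deployment).
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  --conf "spark.executor.extraJavaOptions=-Dio.netty.tryReflectionSetAccessible=true" \
  your_app.py
```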
[jira] [Assigned] (SPARK-31706) add back the support of streaming update mode
[ https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31706: Assignee: Apache Spark (was: Wenchen Fan) > add back the support of streaming update mode > - > > Key: SPARK-31706 > URL: https://issues.apache.org/jira/browse/SPARK-31706 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31706) add back the support of streaming update mode
[ https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31706: Assignee: Wenchen Fan (was: Apache Spark) > add back the support of streaming update mode > - > > Key: SPARK-31706 > URL: https://issues.apache.org/jira/browse/SPARK-31706 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31706) add back the support of streaming update mode
[ https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106504#comment-17106504 ] Apache Spark commented on SPARK-31706: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28523 > add back the support of streaming update mode > - > > Key: SPARK-31706 > URL: https://issues.apache.org/jira/browse/SPARK-31706 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance
[ https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106562#comment-17106562 ] Zirui Li commented on SPARK-23607: -- Hi [~zhouyejoe] wondering do you have any plan to post the PR? Thanks > Use HDFS extended attributes to store application summary to improve the > Spark History Server performance > - > > Key: SPARK-23607 > URL: https://issues.apache.org/jira/browse/SPARK-23607 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 2.3.0 >Reporter: Ye Zhou >Priority: Minor > Labels: bulk-closed > > Currently in Spark History Server, checkForLogs thread will create replaying > tasks for log files which have file size change. The replaying task will > filter out most of the log file content and keep the application summary > including applicationId, user, attemptACL, start time, end time. The > application summary data will get updated into listing.ldb and serve the > application list on SHS home page. For a long running application, its log > file which name ends with "inprogress" will get replayed for multiple times > to get these application summary. This is a waste of computing and data > reading resource to SHS, which results in the delay for application to get > showing up on home page. Internally we have a patch which utilizes HDFS > extended attributes to improve the performance for getting application > summary in SHS. With this patch, Driver will write the application summary > information into extended attributes as key/value. SHS will try to read from > extended attributes. If SHS fails to read from extended attributes, it will > fall back to read from the log file content as usual. This feature can be > enable/disable through configuration. > It has been running fine for 4 months internally with this patch and the last > updated timestamp on SHS keeps within 1 minute as we configure the interval > to 1 minute. 
Originally we had long delay which could be as long as 30 > minutes in our scale where we have a large number of Spark applications > running per day. > We want to see whether this kind of approach is also acceptable to community. > Please comment. If so, I will post a pull request for the changes. Thanks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
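The extended-attribute mechanism described above can be sketched with plain HDFS shell commands. This is illustrative only: the attribute name and JSON payload are hypothetical, not the keys used by the internal patch.

```shell
# Config/CLI fragment. The driver side would write the summary once...
hdfs dfs -setfattr -n user.appSummary \
  -v '{"appId":"application_123","user":"alice","startTime":1589239000}' \
  /spark-logs/application_123.inprogress

# ...and the History Server could read it back without replaying the log,
# falling back to a full replay only if the attribute is absent.
hdfs dfs -getfattr -n user.appSummary /spark-logs/application_123.inprogress
```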
[jira] [Created] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
Jungtaek Lim created SPARK-31707: Summary: Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax Key: SPARK-31707 URL: https://issues.apache.org/jira/browse/SPARK-31707 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Jungtaek Lim We need to consider the behavior change of SPARK-30098 . This is a placeholder to keep the discussion and the final decision. `CREATE TABLE` syntax changes its behavior silently. The following is one example of the breaking the existing user data pipelines. *Apache Spark 2.4.5* {code} spark-sql> CREATE TABLE t(a STRING); spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t; spark-sql> SELECT * FROM t LIMIT 1; # Apache Spark Time taken: 2.05 seconds, Fetched 1 row(s) {code} {code} spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 3 {code} *Apache Spark 3.0.0-preview2* {code} spark-sql> CREATE TABLE t(a STRING); spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t; Error in query: LOAD DATA is not supported for datasource tables: `default`.`t`; {code} {code} spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 2 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31707: - Description: According to the latest status of discussion in the dev@ mailing list, [[DISCUSS] Resolve ambiguous parser rule between two "create table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html], we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0. This issue tracks the effort of revert. was: We need to consider the behavior change of SPARK-30098 . This is a placeholder to keep the discussion and the final decision. `CREATE TABLE` syntax changes its behavior silently. The following is one example of the breaking the existing user data pipelines. *Apache Spark 2.4.5* {code} spark-sql> CREATE TABLE t(a STRING); spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t; spark-sql> SELECT * FROM t LIMIT 1; # Apache Spark Time taken: 2.05 seconds, Fetched 1 row(s) {code} {code} spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 3 {code} *Apache Spark 3.0.0-preview2* {code} spark-sql> CREATE TABLE t(a STRING); spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t; Error in query: LOAD DATA is not supported for datasource tables: `default`.`t`; {code} {code} spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 2 {code} > Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax > - > > Key: SPARK-31707 > URL: https://issues.apache.org/jira/browse/SPARK-31707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Blocker > > According to the latest status of discussion in the dev@ mailing list, > [[DISCUSS] Resolve ambiguous parser rule 
between two "create > table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html], > we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0. > This issue tracks the effort of revert. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31707: Assignee: (was: Apache Spark) > Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax > - > > Key: SPARK-31707 > URL: https://issues.apache.org/jira/browse/SPARK-31707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Blocker > > According to the latest status of discussion in the dev@ mailing list, > [[DISCUSS] Resolve ambiguous parser rule between two "create > table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html], > we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0. > This issue tracks the effort of revert. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106638#comment-17106638 ] Apache Spark commented on SPARK-31707: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/28517 > Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax > - > > Key: SPARK-31707 > URL: https://issues.apache.org/jira/browse/SPARK-31707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Blocker > > According to the latest status of discussion in the dev@ mailing list, > [[DISCUSS] Resolve ambiguous parser rule between two "create > table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html], > we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0. > This issue tracks the effort of revert. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
[ https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31707: Assignee: Apache Spark > Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax > - > > Key: SPARK-31707 > URL: https://issues.apache.org/jira/browse/SPARK-31707 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Blocker > > According to the latest status of discussion in the dev@ mailing list, > [[DISCUSS] Resolve ambiguous parser rule between two "create > table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html], > we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0. > This issue tracks the effort of revert. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-31700) spark sql write orc file outformat
[ https://issues.apache.org/jira/browse/SPARK-31700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-31700. - > spark sql write orc file outformat > --- > > Key: SPARK-31700 > URL: https://issues.apache.org/jira/browse/SPARK-31700 > Project: Spark > Issue Type: Task > Components: Input/Output >Affects Versions: 2.3.3 >Reporter: Dexter Morgan >Priority: Major > > !image-2020-05-13-16-53-49-678.png! > > can you give me an example of sparksql outputformat orc file ,plz > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31700) spark sql write orc file outformat
[ https://issues.apache.org/jira/browse/SPARK-31700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31700. --- Resolution: Invalid Hi, please use dev mailing list. > spark sql write orc file outformat > --- > > Key: SPARK-31700 > URL: https://issues.apache.org/jira/browse/SPARK-31700 > Project: Spark > Issue Type: Task > Components: Input/Output >Affects Versions: 2.3.3 >Reporter: Dexter Morgan >Priority: Major > > !image-2020-05-13-16-53-49-678.png! > > can you give me an example of sparksql outputformat orc file ,plz > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11
[ https://issues.apache.org/jira/browse/SPARK-31704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106642#comment-17106642 ] Dongjoon Hyun commented on SPARK-31704: --- +1 for [~bryanc]'s advice. You may see Apache Spark 3.0.0 RC1 document. - https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/_site/index.html > PandasUDFType.GROUPED_AGG with Java 11 > -- > > Key: SPARK-31704 > URL: https://issues.apache.org/jira/browse/SPARK-31704 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 > Environment: java jdk: 11 > python: 3.7 > >Reporter: Markus Tretzmüller >Priority: Minor > Labels: newbie > > Running the example from the > [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions] > gives an error with java 11. It works with java 8. > {code:python} > import findspark > findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7') > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql import Window > from pyspark.sql import SparkSession > if __name__ == '__main__': > spark = SparkSession \ > .builder \ > .appName('test') \ > .getOrCreate() > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("double", PandasUDFType.GROUPED_AGG) > def mean_udf(v): > return v.mean() > w = (Window.partitionBy('id') > .orderBy('v') > .rowsBetween(-1, 0)) > df.withColumn('mean_v', mean_udf(df['v']).over(w)).show() > {code} > {noformat} > File > "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 > in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 > (TID 37, 131.130.32.15, executor driver): > java.lang.UnsupportedOperationException: sun.misc.Unsafe or > java.nio.DirectByteBuffer.(long, int) not available > at > io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473) > at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243) > at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233) > at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245) > at > org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240) > at > org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132) > at > org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11
[ https://issues.apache.org/jira/browse/SPARK-31704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31704. --- Resolution: Duplicate > PandasUDFType.GROUPED_AGG with Java 11 > -- > > Key: SPARK-31704 > URL: https://issues.apache.org/jira/browse/SPARK-31704 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 > Environment: java jdk: 11 > python: 3.7 > >Reporter: Markus Tretzmüller >Priority: Minor > Labels: newbie > > Running the example from the > [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions] > gives an error with java 11. It works with java 8. > {code:python} > import findspark > findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7') > from pyspark.sql.functions import pandas_udf, PandasUDFType > from pyspark.sql import Window > from pyspark.sql import SparkSession > if __name__ == '__main__': > spark = SparkSession \ > .builder \ > .appName('test') \ > .getOrCreate() > df = spark.createDataFrame( > [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], > ("id", "v")) > @pandas_udf("double", PandasUDFType.GROUPED_AGG) > def mean_udf(v): > return v.mean() > w = (Window.partitionBy('id') > .orderBy('v') > .rowsBetween(-1, 0)) > df.withColumn('mean_v', mean_udf(df['v']).over(w)).show() > {code} > {noformat} > File > "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py", > line 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 > in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 > (TID 37, 131.130.32.15, executor driver): > java.lang.UnsupportedOperationException: sun.misc.Unsafe or > java.nio.DirectByteBuffer.(long, int) not available > at > io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473) > at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243) > at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233) > at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245) > at > org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222) > at > org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240) > at > org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132) > at > org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > at > org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932) > at > org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
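The ticket above was closed as a duplicate, so the fix is tracked elsewhere; for readers hitting the same `sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available` error, the Spark 3.x docs note that Arrow-based features on JDK 9+ additionally require Netty's reflective direct-buffer access to be enabled via a JVM flag. A minimal sketch of wiring that flag into a PySpark session (the helper function is illustrative, not from this thread):

```python
# Sketch: enable Netty's reflective direct-buffer access for Arrow on JDK 9+.
# The flag itself is documented by Spark; the helper below is illustrative.
NETTY_FLAG = "-Dio.netty.tryReflectionSetAccessible=true"

def arrow_jdk11_conf():
    """Spark conf entries that let Arrow allocate direct buffers under JDK 11."""
    return {
        "spark.driver.extraJavaOptions": NETTY_FLAG,
        "spark.executor.extraJavaOptions": NETTY_FLAG,
    }

# Usage (requires a PySpark installation):
#   builder = SparkSession.builder.appName('test')
#   for k, v in arrow_jdk11_conf().items():
#       builder = builder.config(k, v)
#   spark = builder.getOrCreate()
```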
[jira] [Assigned] (SPARK-31696) Support spark.kubernetes.driver.service.annotation
[ https://issues.apache.org/jira/browse/SPARK-31696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31696: - Assignee: Dongjoon Hyun > Support spark.kubernetes.driver.service.annotation > -- > > Key: SPARK-31696 > URL: https://issues.apache.org/jira/browse/SPARK-31696 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major >
[jira] [Resolved] (SPARK-31696) Support spark.kubernetes.driver.service.annotation
[ https://issues.apache.org/jira/browse/SPARK-31696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31696. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28518 [https://github.com/apache/spark/pull/28518] > Support spark.kubernetes.driver.service.annotation > -- > > Key: SPARK-31696 > URL: https://issues.apache.org/jira/browse/SPARK-31696 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > >
[jira] [Created] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector
Huaxin Gao created SPARK-31708: -- Summary: Add docs and examples for ANOVASelector and FValueSelector Key: SPARK-31708 URL: https://issues.apache.org/jira/browse/SPARK-31708 Project: Spark Issue Type: Sub-task Components: Documentation, ML Affects Versions: 3.1.0 Reporter: Huaxin Gao Add docs and examples for ANOVASelector and FValueSelector
[jira] [Assigned] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector
[ https://issues.apache.org/jira/browse/SPARK-31708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31708: Assignee: Apache Spark > Add docs and examples for ANOVASelector and FValueSelector > -- > > Key: SPARK-31708 > URL: https://issues.apache.org/jira/browse/SPARK-31708 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Add docs and examples for ANOVASelector and FValueSelector
[jira] [Assigned] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector
[ https://issues.apache.org/jira/browse/SPARK-31708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31708: Assignee: (was: Apache Spark) > Add docs and examples for ANOVASelector and FValueSelector > -- > > Key: SPARK-31708 > URL: https://issues.apache.org/jira/browse/SPARK-31708 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add docs and examples for ANOVASelector and FValueSelector
[jira] [Commented] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector
[ https://issues.apache.org/jira/browse/SPARK-31708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106752#comment-17106752 ] Apache Spark commented on SPARK-31708: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28524 > Add docs and examples for ANOVASelector and FValueSelector > -- > > Key: SPARK-31708 > URL: https://issues.apache.org/jira/browse/SPARK-31708 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add docs and examples for ANOVASelector and FValueSelector
[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31705: Description: Rewrite join condition to [conjunctive normal form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more conditions to filter. PostgreSQL: {code:sql} CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE, l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); CREATE TABLE orders ( o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); EXPLAIN SELECT Count(*) FROM lineitem, orders WHERE l_orderkey = o_orderkey AND ( ( l_suppkey > 3 AND o_custkey > 13 ) OR ( l_suppkey > 1 AND o_custkey > 11 ) ) AND l_partkey > 19; EXPLAIN SELECT Count(*) FROM lineitem JOIN orders ON l_orderkey = o_orderkey AND ( ( l_suppkey > 3 AND o_custkey > 13 ) OR ( l_suppkey > 1 AND o_custkey > 11 ) ) AND l_partkey > 19; {code} {noformat} postgres=# EXPLAIN postgres-# SELECT Count(*) postgres-# FROM lineitem, postgres-#orders postgres-# WHERE l_orderkey = o_orderkey postgres-#AND ( ( l_suppkey > 3 postgres(#AND o_custkey > 13 ) postgres(# OR ( l_suppkey > 1 postgres(#AND o_custkey > 11 ) ) postgres-#AND l_partkey > 19; QUERY PLAN - Aggregate (cost=21.18..21.19 rows=1 width=8) -> Hash Join (cost=10.60..21.17 rows=2 width=0) Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) Filter: ((o_custkey > 13) OR 
(o_custkey > 11)) -> Hash (cost=10.53..10.53 rows=6 width=16) -> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR (l_suppkey > 1))) (9 rows) postgres=# EXPLAIN postgres-# SELECT Count(*) postgres-# FROM lineitem postgres-#JOIN orders postgres-# ON l_orderkey = o_orderkey postgres-# AND ( ( l_suppkey > 3 postgres(# AND o_custkey > 13 ) postgres(#OR ( l_suppkey > 1 postgres(# AND o_custkey > 11 ) ) postgres-# AND l_partkey > 19; QUERY PLAN - Aggregate (cost=21.18..21.19 rows=1 width=8) -> Hash Join (cost=10.60..21.17 rows=2 width=0) Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) Filter: ((o_custkey > 13) OR (o_custkey > 11)) -> Hash (cost=10.53..10.53 rows=6 width=16) -> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR (l_suppkey > 1))) (9 rows) {noformat} was: Rewrite join condition to [conjunctive normal form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more conditions to filter. PostgreSQL: {code:sql} CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE, l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); CREATE TABLE orders ( o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority va
[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31705: Issue Type: Improvement (was: New Feature) > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > 
postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@s
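The transformation SPARK-31705 proposes can be illustrated outside Spark. Treat the join condition as DNF (an OR of AND-groups), distribute OR over AND to reach conjunctive normal form, and push down any resulting clause that mentions only one table's columns. This is a hedged Python sketch of the idea over predicate strings, not Spark's actual optimizer code:

```python
from itertools import product

def to_cnf(dnf):
    """Distribute OR over AND: a condition given in DNF (a list of
    AND-groups, each a list of predicates) becomes CNF (a list of
    OR-clauses).  (A AND B) OR (C AND D) ->
    (A OR C) AND (A OR D) AND (B OR C) AND (B OR D)."""
    return [list(clause) for clause in product(*dnf)]

def pushable_filters(cnf, table_of):
    """Group the CNF clauses whose predicates all reference a single
    table; those clauses can be pushed below the join as filters."""
    filters = {}
    for clause in cnf:
        tables = {table_of(p) for p in clause}
        if len(tables) == 1:
            filters.setdefault(tables.pop(), []).append(clause)
    return filters

# The join condition from the ticket, written as DNF over predicate strings.
dnf = [["l_suppkey > 3", "o_custkey > 13"],
       ["l_suppkey > 1", "o_custkey > 11"]]
cnf = to_cnf(dnf)
pushed = pushable_filters(
    cnf, lambda p: "lineitem" if p.startswith("l_") else "orders")
print(pushed)
# {'lineitem': [['l_suppkey > 3', 'l_suppkey > 1']],
#  'orders': [['o_custkey > 13', 'o_custkey > 11']]}
```

The two pushed clauses are exactly the per-table `Seq Scan` filters — `(l_suppkey > 3) OR (l_suppkey > 1)` and `(o_custkey > 13) OR (o_custkey > 11)` — that PostgreSQL derives in the plans above.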
[jira] [Commented] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106774#comment-17106774 ] Yunbo Fan commented on SPARK-21033: --- [~clehene] Since it's 2020, have you solved the problem? I'm seeing the same one. > fix the potential OOM in UnsafeExternalSorter > - > > Key: SPARK-21033 > URL: https://issues.apache.org/jira/browse/SPARK-21033 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 2.3.0 > > > In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for > pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary > buffer for radix sort. > In `UnsafeExternalSorter`, we set the > `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to be `1024 * 1024 * 1024 / 2`, > and hoping the max size of point array to be 8 GB. However this is wrong, > `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the point > array before reach this limitation, we may hit the max-page-size error. > Users may see exception like this on large dataset: > {code} > Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with > more than 17179869176 bytes > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241) > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94) > ... > {code}
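The arithmetic behind the quoted description can be checked directly: at 32 bytes per record, a spill threshold of `1024 * 1024 * 1024 / 2` records implies a 16 GiB pointer array, which just exceeds the 17179869176-byte page limit quoted in the exception:

```python
BYTES_PER_RECORD = 32                      # 8 (pointer) + 8 (key prefix) + 16 (radix-sort scratch)
SPILL_THRESHOLD = 1024 * 1024 * 1024 // 2  # DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD
MAX_PAGE_BYTES = 17179869176               # limit from the IllegalArgumentException message

array_bytes = SPILL_THRESHOLD * BYTES_PER_RECORD
assert array_bytes == 16 * 1024**3         # 16 GiB, not the intended 8 GiB
assert array_bytes > MAX_PAGE_BYTES        # so growing up to the threshold can overflow a page
```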
[jira] [Updated] (SPARK-31693) Investigate AmpLab Jenkins server network issue
[ https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31693: - Priority: Critical (was: Major) > Investigate AmpLab Jenkins server network issue > --- > > Key: SPARK-31693 > URL: https://issues.apache.org/jira/browse/SPARK-31693 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Critical > > Given the series of failures in Spark packaging Jenkins job, it seems that > there is a network issue in AmpLab Jenkins cluster. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ > - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay. > - The node failed to download the maven mirror. (SPARK-31691) -> The primary > host is okay. > - The node failed to communicate repository.apache.org. (Current master > branch Jenkins job failure) > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) > on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve > remote metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could > not transfer metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to > apache.snapshots.https > (https://repository.apache.org/content/repositories/snapshots): Transfer > failed for > https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml: > Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] > failed: Connection timed out (Connection timed out) -> [Help 1] > {code}
[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue
[ https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106825#comment-17106825 ] Hyukjin Kwon commented on SPARK-31693: -- Seems it's blocking many other PRs ... > Investigate AmpLab Jenkins server network issue > --- > > Key: SPARK-31693 > URL: https://issues.apache.org/jira/browse/SPARK-31693 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Critical > > Given the series of failures in Spark packaging Jenkins job, it seems that > there is a network issue in AmpLab Jenkins cluster. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ > - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay. > - The node failed to download the maven mirror. (SPARK-31691) -> The primary > host is okay. > - The node failed to communicate repository.apache.org. (Current master > branch Jenkins job failure) > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) > on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve > remote metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could > not transfer metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to > apache.snapshots.https > (https://repository.apache.org/content/repositories/snapshots): Transfer > failed for > https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml: > Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] > failed: Connection timed out (Connection timed out) -> [Help 1] > {code}
[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue
[ https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106830#comment-17106830 ] shane knapp commented on SPARK-31693: - grrr. ok, sorry. today was my zoom meeting day. i'll reboot the master and all nodes tomorrow and see how that goes. i really don't see how this is an issue on our end. -- Shane Knapp Computer Guy / Voice of Reason UC Berkeley EECS Research / RISELab Staff Technical Lead https://rise.cs.berkeley.edu > Investigate AmpLab Jenkins server network issue > --- > > Key: SPARK-31693 > URL: https://issues.apache.org/jira/browse/SPARK-31693 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Critical > > Given the series of failures in Spark packaging Jenkins job, it seems that > there is a network issue in AmbLab Jenkins cluster. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ > - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay. > - The node failed to download the maven mirror. (SPARK-31691) -> The primary > host is okay. > - The node failed to communicate repository.apache.org. 
(Current master > branch Jenkins job failure) > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) > on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve > remote metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could > not transfer metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to > apache.snapshots.https > (https://repository.apache.org/content/repositories/snapshots): Transfer > failed for > https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml: > Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] > failed: Connection timed out (Connection timed out) -> [Help 1] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31632) The ApplicationInfo in KVStore may be accessed before it's prepared
[ https://issues.apache.org/jira/browse/SPARK-31632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31632. -- Fix Version/s: 2.4.7 3.0.0 Assignee: Xingcan Cui Resolution: Fixed Fixed in https://github.com/apache/spark/pull/28444 > The ApplicationInfo in KVStore may be accessed before it's prepared > --- > > Key: SPARK-31632 > URL: https://issues.apache.org/jira/browse/SPARK-31632 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 3.0.0 >Reporter: Xingcan Cui >Assignee: Xingcan Cui >Priority: Minor > Fix For: 3.0.0, 2.4.7 > > > While starting some local tests, I occasionally encountered the following > exceptions for Web UI. > {noformat} > 23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/ > java.util.NoSuchElementException > at java.util.Collections$EmptyIterator.next(Collections.java:4191) > at > org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) > at > org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) > at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) > at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) > at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) > at > org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) > at > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) > at > 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) > at > org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) > at > org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) > at org.eclipse.jetty.server.Server.handle(Server.java:505) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) > at > org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) > at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) > at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) > at java.lang.Thread.run(Thread.java:748){noformat} > *Reason* > That is because {{AppStatusStore.applicationInfo()}} accesses an empty view > (iterator) returned by {{InMemoryStore}}. > AppStatusStore > {code:java} > def applicationInfo(): v1.ApplicationInfo = { > store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info > } > {code} > InMemoryStore > {code:java} > public KVStoreView view(Class type){ > InstanceList list = inMemoryLists.get(type); > return list != null ? 
list.view() : emptyView(); > } > {code} > During the initialization of {{SparkContext}}, it first starts the Web UI > (SparkContext: L475 _ui.foreach(_.bind())) and then setup the > {{LiveListenerBus}} thread (SparkContext: L608 > {{setupAndStartListenerBus()}}) for dispatching the > {{SparkListenerApplicationStart}} event (which will trigger writing the > requested {{ApplicationInfo}} to {{InMemoryStore}}). > *Solution* > Since the {{applicationInfo()}} method is expected to always return a valid > {{ApplicationInfo}}, maybe we can add a while-loop-check here to guarantee > the availability of {{ApplicationInfo}}. > {code:java} > def applicationInfo(): v1.ApplicationInfo = { > var iterator = store.view(classOf[ApplicationInfoWrapper]).max(1).iterator() > while (!iterator.
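The quoted Scala snippet of the proposed solution is cut off by the archive, but the shape of the guard — retry until the KVStore view is non-empty — can be modeled in a few lines of Python (a sketch with a made-up timeout, not the patch that was eventually merged):

```python
import time

def wait_for_application_info(fetch, timeout_s=5.0, poll_s=0.05):
    """Poll `fetch` until it yields a value, mirroring the proposed
    while-loop around AppStatusStore.applicationInfo().  A raising
    `fetch` models InMemoryStore returning its empty view (which
    surfaces as NoSuchElementException on the JVM)."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fetch()
        except LookupError:          # stand-in for NoSuchElementException
            if time.monotonic() >= deadline:
                raise                # give up rather than spin forever
            time.sleep(poll_s)
```

Bounding the wait matters: the event that populates `ApplicationInfo` is dispatched asynchronously by the listener bus, so an unbounded loop would hang the request thread if that event never arrives.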
[jira] [Updated] (SPARK-31632) The ApplicationInfo in KVStore may be accessed before it's prepared
[ https://issues.apache.org/jira/browse/SPARK-31632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xingcan Cui updated SPARK-31632: Description: While starting some local tests, I occasionally encountered the following exceptions for Web UI. {noformat} 23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/ java.util.NoSuchElementException at java.util.Collections$EmptyIterator.next(Collections.java:4191) at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) at 
org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) at org.eclipse.jetty.server.Server.handle(Server.java:505) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804) at java.lang.Thread.run(Thread.java:748){noformat} *Reason* That is because {{AppStatusStore.applicationInfo()}} accesses an empty view (iterator) returned by {{InMemoryStore}}. AppStatusStore {code:java} def applicationInfo(): v1.ApplicationInfo = { store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info } {code} InMemoryStore {code:java} public KVStoreView view(Class type){ InstanceList list = inMemoryLists.get(type); return list != null ? list.view() : emptyView(); } {code} During the initialization of {{SparkContext}}, it first starts the Web UI (SparkContext: L475 _ui.foreach(_.bind())) and then setup the {{LiveListenerBus}} thread (SparkContext: L608 {{setupAndStartListenerBus()}}) for dispatching the {{SparkListenerApplicationStart}} event (which will trigger writing the requested {{ApplicationInfo}} to {{InMemoryStore}}). was: While starting some local tests, I occasionally encountered the following exceptions for Web UI. 
{noformat} 23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/ java.util.NoSuchElementException at java.util.Collections$EmptyIterator.next(Collections.java:4191) at org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467) at org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39) at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266) at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80) at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623) at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610) at org.eclipse.jetty.servlet.ServletHandler.doHand
[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue
[ https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106836#comment-17106836 ] Hyukjin Kwon commented on SPARK-31693:
--
Thank you so much [~shaneknapp].

> Investigate AmpLab Jenkins server network issue
> -----------------------------------------------
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
> Issue Type: Bug
> Components: Project Infra
> Affects Versions: 3.1.0
> Reporter: Dongjoon Hyun
> Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that there is a network issue in the AmpLab Jenkins cluster.
> - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download from the maven mirror. (SPARK-31691) -> The primary host is okay.
> - The node failed to communicate with repository.apache.org. (Current master branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve remote metadata org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could not transfer metadata org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to apache.snapshots.https (https://repository.apache.org/content/repositories/snapshots): Transfer failed for https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml: Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data
[ https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106845#comment-17106845 ] Apache Spark commented on SPARK-27562:
--
User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/28525

> Complete the verification mechanism for shuffle transmitted data
> ----------------------------------------------------------------
>
> Key: SPARK-27562
> URL: https://issues.apache.org/jira/browse/SPARK-27562
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle
> Affects Versions: 3.1.0
> Reporter: feiwang
> Priority: Major
>
> We've seen some shuffle data corruption during the shuffle read phase.
> As described in SPARK-26089, Spark only checked small shuffle blocks before PR #23453, which was proposed by ankuriitg.
> There are two changes/improvements made in PR #23453:
> 1. Large blocks are checked up to maxBytesInFlight/3 size in a similar way as smaller blocks, so if a large block is corrupt at the start, that block will be re-fetched, and if that also fails, FetchFailureException will be thrown.
> 2. If a large block is corrupt after size maxBytesInFlight/3, then any IOException thrown while reading the stream will be converted to FetchFailureException. This is slightly more aggressive than originally intended, but since the consumer of the stream may have already read and processed some records, we can't just re-fetch the block; we need to fail the whole task. Additionally, we also thought about adding a new type of TaskEndReason, which would retry the task a couple of times before failing the previous stage, but given the complexity involved in that solution we decided not to proceed in that direction.
> However, I think there still exist some problems with the current verification mechanism for shuffle transmitted data:
> - For a large block, it is checked up to maxBytesInFlight/3 size when fetching shuffle data. So if a large block is corrupt after size maxBytesInFlight/3, the corruption cannot be detected in the data fetch phase. This has been described in the previous section.
> - Only the compressed or wrapped blocks are checked; I think we should also check those blocks which are not wrapped.
> We complete the verification mechanism for shuffle transmitted data:
> Firstly, we choose crc32 for the checksum verification of shuffle data. Crc is also used for checksum verification in Hadoop; it is simple and fast.
> In the shuffle write phase, after completing the partitionedFile, we compute the crc32 value for each partition and then write these digests with the indexes into the shuffle index file.
> For the sortShuffleWriter and unsafe shuffle writer, there is only one partitionedFile for a shuffleMapTask, so the computation of the digests (computed for each partition based on the indexes of this partitionedFile) is cheap.
> For the bypassShuffleWriter, the number of reduce partitions is less than byPassMergeThreshold, so the cost of digest computation is acceptable.
> In the shuffle read phase, the digest value will be passed with the block data. We then recompute the digest of the data obtained and compare it with the original digest value.
> When recomputing the digest of the data obtained, it only needs an additional buffer (2048 bytes) for computing the crc32 value.
> After recomputing, we will reset the obtained data inputStream: if it is markSupported we only need to reset it, otherwise it is a fileSegmentManagerBuffer and we need to recreate it.
> So, this verification mechanism proposed for shuffle transmitted data is cheap and complete.
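The per-partition crc32 write/verify cycle described above can be sketched with the JDK's {{java.util.zip.CRC32}} (the same class Spark's shuffle code could use); the partition layout here is a toy stand-in, not the actual partitionedFile/index-file format.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class ShuffleChecksum {
    // Compute the crc32 digest of one partition's bytes.
    static long crc32(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[][] partitions = {
            "partition-0-records".getBytes(StandardCharsets.UTF_8),
            "partition-1-records".getBytes(StandardCharsets.UTF_8),
        };

        // Write side: one digest per partition, stored alongside the index entries.
        long[] digests = new long[partitions.length];
        for (int i = 0; i < partitions.length; i++) {
            digests[i] = crc32(partitions[i]);
        }

        // Read side: recompute over the fetched bytes and compare;
        // a mismatch signals corruption of the transmitted block.
        for (int i = 0; i < partitions.length; i++) {
            boolean intact = crc32(partitions[i]) == digests[i];
            System.out.println("partition " + i + " intact: " + intact);
        }
    }
}
```

Because crc32 is computed incrementally over a small buffer, the read-side verification matches the ticket's claim of needing only a modest extra buffer rather than materializing the block.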
[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31705:

Description:
Rewrite join condition to [conjunctive normal form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more conditions to filter. PostgreSQL:

{code:sql}
CREATE TABLE lineitem (
  l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT,
  l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), l_discount DECIMAL(10,0), l_tax DECIMAL(10,0),
  l_returnflag varchar(255), l_linestatus varchar(255),
  l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
  l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));

CREATE TABLE orders (
  o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), o_totalprice DECIMAL(10,0),
  o_orderdate DATE, o_orderpriority varchar(255), o_clerk varchar(255),
  o_shippriority INT, o_comment varchar(255));

EXPLAIN
SELECT Count(*)
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
  AND ( ( l_suppkey > 3 AND o_custkey > 13 )
     OR ( l_suppkey > 1 AND o_custkey > 11 ) )
  AND l_partkey > 19;

EXPLAIN
SELECT Count(*)
FROM lineitem
  JOIN orders ON l_orderkey = o_orderkey
    AND ( ( l_suppkey > 3 AND o_custkey > 13 )
       OR ( l_suppkey > 1 AND o_custkey > 11 ) )
    AND l_partkey > 19;
{code}

{noformat}
postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM lineitem,
postgres-#    orders
postgres-# WHERE l_orderkey = o_orderkey
postgres-#    AND ( ( l_suppkey > 3
postgres(#    AND o_custkey > 13 )
postgres(#  OR ( l_suppkey > 1
postgres(#    AND o_custkey > 11 ) )
postgres-#    AND l_partkey > 19;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
         Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
         Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
         ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
               Filter: ((o_custkey > 13) OR (o_custkey > 11))
         ->  Hash  (cost=10.53..10.53 rows=6 width=16)
               ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
                     Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR (l_suppkey > 1)))
(9 rows)

postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM lineitem
postgres-#    JOIN orders
postgres-#      ON l_orderkey = o_orderkey
postgres-#     AND ( ( l_suppkey > 3
postgres(#     AND o_custkey > 13 )
postgres(#   OR ( l_suppkey > 1
postgres(#     AND o_custkey > 11 ) )
postgres-#     AND l_partkey > 19;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
         Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
         Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
         ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
               Filter: ((o_custkey > 13) OR (o_custkey > 11))
         ->  Hash  (cost=10.53..10.53 rows=6 width=16)
               ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
                     Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR (l_suppkey > 1)))
(9 rows)
{noformat}

https://docs.teradata.com/reader/i_VlYHwN0b8knh6AEWrv1Q/Bh~37Qcc2~24P_jn2~0w6w

was:
Rewrite join condition to [conjunctive normal form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more conditions to filter. PostgreSQL:

{code:sql}
CREATE TABLE lineitem (
  l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT,
  l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), l_discount DECIMAL(10,0), l_tax DECIMAL(10,0),
  l_returnflag varchar(255), l_linestatus varchar(255),
  l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
  l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));

CREATE TABLE orders ( o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(2
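The rewrite the ticket proposes can be illustrated with a toy expression tree (not Catalyst's expression API): distributing OR over AND turns the disjunction of per-pair conjuncts into a conjunction whose single-table conjuncts, such as (o_custkey > 13 OR o_custkey > 11), can be pushed below the join, matching the Seq Scan filters in the plans above.

```java
public class Cnf {
    interface Expr {}
    static final class Atom implements Expr {
        final String s; Atom(String s) { this.s = s; }
        public String toString() { return s; }
    }
    static final class And implements Expr {
        final Expr l, r; And(Expr l, Expr r) { this.l = l; this.r = r; }
        public String toString() { return "(" + l + " AND " + r + ")"; }
    }
    static final class Or implements Expr {
        final Expr l, r; Or(Expr l, Expr r) { this.l = l; this.r = r; }
        public String toString() { return "(" + l + " OR " + r + ")"; }
    }

    // Convert to CNF by recursing, then distributing OR over AND:
    //   a OR (b AND c)  ==>  (a OR b) AND (a OR c)
    static Expr toCnf(Expr e) {
        if (e instanceof And) {
            And a = (And) e;
            return new And(toCnf(a.l), toCnf(a.r));
        }
        if (e instanceof Or) {
            Or o = (Or) e;
            Expr l = toCnf(o.l), r = toCnf(o.r);
            if (l instanceof And) {
                And a = (And) l;
                return new And(toCnf(new Or(a.l, r)), toCnf(new Or(a.r, r)));
            }
            if (r instanceof And) {
                And a = (And) r;
                return new And(toCnf(new Or(l, a.l)), toCnf(new Or(l, a.r)));
            }
            return new Or(l, r);
        }
        return e; // atoms are already in CNF
    }

    public static void main(String[] args) {
        // (l_suppkey > 3 AND o_custkey > 13) OR (l_suppkey > 1 AND o_custkey > 11)
        Expr cond = new Or(
            new And(new Atom("l_suppkey > 3"), new Atom("o_custkey > 13")),
            new And(new Atom("l_suppkey > 1"), new Atom("o_custkey > 11")));
        System.out.println(toCnf(cond));
    }
}
```

Note the classic trade-off: distribution can blow up the expression size exponentially, so an optimizer applying this rewrite would typically cap the number of generated conjuncts.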
[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31705:

Description:
Rewrite join condition to [conjunctive normal form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more conditions to filter. PostgreSQL:

{code:sql}
CREATE TABLE lineitem (
  l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, l_linenumber INT,
  l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), l_discount DECIMAL(10,0), l_tax DECIMAL(10,0),
  l_returnflag varchar(255), l_linestatus varchar(255),
  l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
  l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));

CREATE TABLE orders (
  o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), o_totalprice DECIMAL(10,0),
  o_orderdate DATE, o_orderpriority varchar(255), o_clerk varchar(255),
  o_shippriority INT, o_comment varchar(255));

EXPLAIN
SELECT Count(*)
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
  AND ( ( l_suppkey > 3 AND o_custkey > 13 )
     OR ( l_suppkey > 1 AND o_custkey > 11 ) )
  AND l_partkey > 19;

EXPLAIN
SELECT Count(*)
FROM lineitem
  JOIN orders ON l_orderkey = o_orderkey
    AND ( ( l_suppkey > 3 AND o_custkey > 13 )
       OR ( l_suppkey > 1 AND o_custkey > 11 ) )
    AND l_partkey > 19;

EXPLAIN
SELECT Count(*)
FROM lineitem, orders
WHERE l_orderkey = o_orderkey
  AND NOT ( ( l_suppkey > 3 AND ( l_suppkey > 2 OR o_custkey > 13 ) )
         OR ( l_suppkey > 1 AND o_custkey > 11 ) )
  AND l_partkey > 19;
{code}

{noformat}
postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM lineitem,
postgres-#    orders
postgres-# WHERE l_orderkey = o_orderkey
postgres-#    AND ( ( l_suppkey > 3
postgres(#    AND o_custkey > 13 )
postgres(#  OR ( l_suppkey > 1
postgres(#    AND o_custkey > 11 ) )
postgres-#    AND l_partkey > 19;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
         Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
         Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
         ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
               Filter: ((o_custkey > 13) OR (o_custkey > 11))
         ->  Hash  (cost=10.53..10.53 rows=6 width=16)
               ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
                     Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR (l_suppkey > 1)))
(9 rows)

postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM lineitem
postgres-#    JOIN orders
postgres-#      ON l_orderkey = o_orderkey
postgres-#     AND ( ( l_suppkey > 3
postgres(#     AND o_custkey > 13 )
postgres(#   OR ( l_suppkey > 1
postgres(#     AND o_custkey > 11 ) )
postgres-#     AND l_partkey > 19;
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
         Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
         Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
         ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
               Filter: ((o_custkey > 13) OR (o_custkey > 11))
         ->  Hash  (cost=10.53..10.53 rows=6 width=16)
               ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
                     Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR (l_suppkey > 1)))
(9 rows)

postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM lineitem,
postgres-#    orders
postgres-# WHERE l_orderkey = o_orderkey
postgres-#    AND NOT ( ( l_suppkey > 3
postgres(#    AND ( l_suppkey > 2
postgres(#  OR o_custkey > 13 ) )
postgres(#  OR ( l_suppkey > 1
postgres(#    AND o_custkey > 11 ) )
postgres-#    AND l_partkey > 19;
[jira] [Resolved] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory
[ https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31692.
---
Fix Version/s: 3.0.0
Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28516

> Hadoop confs passed via spark config are not set in URLStream Handler Factory
> -----------------------------------------------------------------------------
>
> Key: SPARK-31692
> URL: https://issues.apache.org/jira/browse/SPARK-31692
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Karuppayya
> Priority: Major
> Fix For: 3.0.0
>
> Hadoop confs passed via spark config (as "spark.hadoop.*") are not set in URLStreamHandlerFactory
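The convention the ticket relies on is that keys prefixed "spark.hadoop." in the Spark config should be copied, prefix stripped, into the Hadoop configuration used by the URL stream handler. A minimal sketch of that prefix-stripping step, assuming plain Maps as stand-ins for SparkConf and Hadoop's Configuration:

```java
import java.util.HashMap;
import java.util.Map;

public class HadoopConfProps {
    // Copy "spark.hadoop.*" entries into a Hadoop-style config, dropping the prefix.
    static Map<String, String> extractHadoopConf(Map<String, String> sparkConf) {
        final String prefix = "spark.hadoop.";
        Map<String, String> hadoopConf = new HashMap<>();
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                hadoopConf.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return hadoopConf;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020"); // copied, prefix stripped
        conf.put("spark.app.name", "demo");                            // not a Hadoop conf, ignored
        System.out.println(extractHadoopConf(conf));
    }
}
```

The bug was that this derived configuration was not the one handed to the URLStreamHandlerFactory, so handlers resolved URLs with defaults instead of the user's settings.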
[jira] [Assigned] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory
[ https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31692:
---
Assignee: Karuppayya

> Hadoop confs passed via spark config are not set in URLStream Handler Factory
> -----------------------------------------------------------------------------
>
> Key: SPARK-31692
> URL: https://issues.apache.org/jira/browse/SPARK-31692
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Karuppayya
> Assignee: Karuppayya
> Priority: Major
> Fix For: 3.0.0
>
> Hadoop confs passed via spark config (as "spark.hadoop.*") are not set in URLStreamHandlerFactory
[jira] [Commented] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not
[ https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106958#comment-17106958 ] Apache Spark commented on SPARK-31405:
--
User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28526

> fail by default when read/write datetime values and not sure if they need rebase or not
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-31405
> URL: https://issues.apache.org/jira/browse/SPARK-31405
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Priority: Major