[jira] [Created] (SPARK-31697) HistoryServer should set Content-Type header

2020-05-13 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-31697:
--

 Summary: HistoryServer should set Content-Type header
 Key: SPARK-31697
 URL: https://issues.apache.org/jira/browse/SPARK-31697
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


I noticed that we get HTML served as plain text when we access a wrong URL on the 
HistoryServer.
{code:html}
<html>
  <head>
    <!-- stylesheet links, the shortcut icon (/static/spark-logo-77x50px-hd.png) and
         script tags under /static/: vis-timeline-graph2d, webui.css, timeline-view.css,
         sorttable.js, jquery-3.4.1.min.js, bootstrap.bundle.min.js, initialize-tooltips.js,
         table.js, timeline-view.js, log-view.js, webui.js -->
    <script>setUIRoot('')</script>
    <title>Not Found</title>
  </head>
  <body>
    <!-- Spark header: logo, version 3.1.0-SNAPSHOT, heading "Not Found" -->
    Application local-1589239 not found.
  </body>
</html>
{code}

The reason is that the Content-Type header is not set.
{code:java}
HTTP/1.1 404 Not Found
Date: Wed, 13 May 2020 06:59:29 GMT
Cache-Control: no-cache, no-store, must-revalidate
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
X-Content-Type-Options: nosniff
Content-Length: 1778
Server: Jetty(9.4.18.v20190429) {code}
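
As an aside, a minimal sketch (my own, not the patch that was eventually merged for this 
ticket) of declaring the header from a servlet, which is what makes browsers render the 
body as HTML; the servlet name here is hypothetical:

{code:scala}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

// Hypothetical 404 servlet: once Content-Type is declared on the response,
// the HTML body is rendered by the browser instead of being shown as plain text.
class NotFoundServlet extends HttpServlet {
  override protected def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    resp.setStatus(HttpServletResponse.SC_NOT_FOUND)
    resp.setContentType("text/html;charset=utf-8")
    resp.getWriter.write("<html><body><h1>Not Found</h1></body></html>")
  }
}
{code}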



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31697) HistoryServer should set Content-Type

2020-05-13 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-31697:
---
Summary: HistoryServer should set Content-Type  (was: HistoryServer should 
set Content-Type header)

> HistoryServer should set Content-Type
> -
>
> Key: SPARK-31697
> URL: https://issues.apache.org/jira/browse/SPARK-31697
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that we get HTML served as plain text when we access a wrong URL on the 
> HistoryServer.
> {code:html}
> <html>
>   <head>
>     <!-- stylesheet links, the shortcut icon (/static/spark-logo-77x50px-hd.png) and
>          script tags under /static/: vis-timeline-graph2d, webui.css, timeline-view.css,
>          sorttable.js, jquery-3.4.1.min.js, bootstrap.bundle.min.js, initialize-tooltips.js,
>          table.js, timeline-view.js, log-view.js, webui.js -->
>     <script>setUIRoot('')</script>
>     <title>Not Found</title>
>   </head>
>   <body>
>     <!-- Spark header: logo, version 3.1.0-SNAPSHOT, heading "Not Found" -->
>     Application local-1589239 not found.
>   </body>
> </html>
> {code}
>  
> The reason is that the Content-Type header is not set.
> {code:java}
> HTTP/1.1 404 Not Found
> Date: Wed, 13 May 2020 06:59:29 GMT
> Cache-Control: no-cache, no-store, must-revalidate
> X-Frame-Options: SAMEORIGIN
> X-XSS-Protection: 1; mode=block
> X-Content-Type-Options: nosniff
> Content-Length: 1778
> Server: Jetty(9.4.18.v20190429) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31697) HistoryServer should set Content-Type

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31697:


Assignee: Kousuke Saruta  (was: Apache Spark)

> HistoryServer should set Content-Type
> -
>
> Key: SPARK-31697
> URL: https://issues.apache.org/jira/browse/SPARK-31697
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that we get HTML served as plain text when we access a wrong URL on the 
> HistoryServer.
> {code:html}
> <html>
>   <head>
>     <!-- stylesheet links, the shortcut icon (/static/spark-logo-77x50px-hd.png) and
>          script tags under /static/: vis-timeline-graph2d, webui.css, timeline-view.css,
>          sorttable.js, jquery-3.4.1.min.js, bootstrap.bundle.min.js, initialize-tooltips.js,
>          table.js, timeline-view.js, log-view.js, webui.js -->
>     <script>setUIRoot('')</script>
>     <title>Not Found</title>
>   </head>
>   <body>
>     <!-- Spark header: logo, version 3.1.0-SNAPSHOT, heading "Not Found" -->
>     Application local-1589239 not found.
>   </body>
> </html>
> {code}
>  
> The reason is that the Content-Type header is not set.
> {code:java}
> HTTP/1.1 404 Not Found
> Date: Wed, 13 May 2020 06:59:29 GMT
> Cache-Control: no-cache, no-store, must-revalidate
> X-Frame-Options: SAMEORIGIN
> X-XSS-Protection: 1; mode=block
> X-Content-Type-Options: nosniff
> Content-Length: 1778
> Server: Jetty(9.4.18.v20190429) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31697) HistoryServer should set Content-Type

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106034#comment-17106034
 ] 

Apache Spark commented on SPARK-31697:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28519

> HistoryServer should set Content-Type
> -
>
> Key: SPARK-31697
> URL: https://issues.apache.org/jira/browse/SPARK-31697
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that we get HTML served as plain text when we access a wrong URL on the 
> HistoryServer.
> {code:html}
> <html>
>   <head>
>     <!-- stylesheet links, the shortcut icon (/static/spark-logo-77x50px-hd.png) and
>          script tags under /static/: vis-timeline-graph2d, webui.css, timeline-view.css,
>          sorttable.js, jquery-3.4.1.min.js, bootstrap.bundle.min.js, initialize-tooltips.js,
>          table.js, timeline-view.js, log-view.js, webui.js -->
>     <script>setUIRoot('')</script>
>     <title>Not Found</title>
>   </head>
>   <body>
>     <!-- Spark header: logo, version 3.1.0-SNAPSHOT, heading "Not Found" -->
>     Application local-1589239 not found.
>   </body>
> </html>
> {code}
>  
> The reason is that the Content-Type header is not set.
> {code:java}
> HTTP/1.1 404 Not Found
> Date: Wed, 13 May 2020 06:59:29 GMT
> Cache-Control: no-cache, no-store, must-revalidate
> X-Frame-Options: SAMEORIGIN
> X-XSS-Protection: 1; mode=block
> X-Content-Type-Options: nosniff
> Content-Length: 1778
> Server: Jetty(9.4.18.v20190429) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31697) HistoryServer should set Content-Type

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31697:


Assignee: Apache Spark  (was: Kousuke Saruta)

> HistoryServer should set Content-Type
> -
>
> Key: SPARK-31697
> URL: https://issues.apache.org/jira/browse/SPARK-31697
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> I noticed that we get HTML served as plain text when we access a wrong URL on the 
> HistoryServer.
> {code:html}
> <html>
>   <head>
>     <!-- stylesheet links, the shortcut icon (/static/spark-logo-77x50px-hd.png) and
>          script tags under /static/: vis-timeline-graph2d, webui.css, timeline-view.css,
>          sorttable.js, jquery-3.4.1.min.js, bootstrap.bundle.min.js, initialize-tooltips.js,
>          table.js, timeline-view.js, log-view.js, webui.js -->
>     <script>setUIRoot('')</script>
>     <title>Not Found</title>
>   </head>
>   <body>
>     <!-- Spark header: logo, version 3.1.0-SNAPSHOT, heading "Not Found" -->
>     Application local-1589239 not found.
>   </body>
> </html>
> {code}
>  
> The reason is that the Content-Type header is not set.
> {code:java}
> HTTP/1.1 404 Not Found
> Date: Wed, 13 May 2020 06:59:29 GMT
> Cache-Control: no-cache, no-store, must-revalidate
> X-Frame-Options: SAMEORIGIN
> X-XSS-Protection: 1; mode=block
> X-Content-Type-Options: nosniff
> Content-Length: 1778
> Server: Jetty(9.4.18.v20190429) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31697) HistoryServer should set Content-Type

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106035#comment-17106035
 ] 

Apache Spark commented on SPARK-31697:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/28519

> HistoryServer should set Content-Type
> -
>
> Key: SPARK-31697
> URL: https://issues.apache.org/jira/browse/SPARK-31697
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> I noticed that we get HTML served as plain text when we access a wrong URL on the 
> HistoryServer.
> {code:html}
> <html>
>   <head>
>     <!-- stylesheet links, the shortcut icon (/static/spark-logo-77x50px-hd.png) and
>          script tags under /static/: vis-timeline-graph2d, webui.css, timeline-view.css,
>          sorttable.js, jquery-3.4.1.min.js, bootstrap.bundle.min.js, initialize-tooltips.js,
>          table.js, timeline-view.js, log-view.js, webui.js -->
>     <script>setUIRoot('')</script>
>     <title>Not Found</title>
>   </head>
>   <body>
>     <!-- Spark header: logo, version 3.1.0-SNAPSHOT, heading "Not Found" -->
>     Application local-1589239 not found.
>   </body>
> </html>
> {code}
>  
> The reason is that the Content-Type header is not set.
> {code:java}
> HTTP/1.1 404 Not Found
> Date: Wed, 13 May 2020 06:59:29 GMT
> Cache-Control: no-cache, no-store, must-revalidate
> X-Frame-Options: SAMEORIGIN
> X-XSS-Protection: 1; mode=block
> X-Content-Type-Options: nosniff
> Content-Length: 1778
> Server: Jetty(9.4.18.v20190429) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31698) NPE on big dataset plans

2020-05-13 Thread Viacheslav Tradunsky (Jira)
Viacheslav Tradunsky created SPARK-31698:


 Summary: NPE on big dataset plans
 Key: SPARK-31698
 URL: https://issues.apache.org/jira/browse/SPARK-31698
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.4
 Environment: AWS EMR
Reporter: Viacheslav Tradunsky


We have a big dataset containing 275 SQL operations, more than 275 joins.

On the terminal operation that writes the data, it fails with a NullPointerException.

I understand that such a big number of operations might not be what Spark is 
designed for, but a NullPointerException is not an ideal way to fail in this case.

For more details, please see the stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31698) NPE on big dataset plans

2020-05-13 Thread Viacheslav Tradunsky (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viacheslav Tradunsky updated SPARK-31698:
-
Attachment: Spark_NPE_big_dataset.log

> NPE on big dataset plans
> 
>
> Key: SPARK-31698
> URL: https://issues.apache.org/jira/browse/SPARK-31698
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR
>Reporter: Viacheslav Tradunsky
>Priority: Major
> Attachments: Spark_NPE_big_dataset.log
>
>
> We have a big dataset containing 275 SQL operations, more than 275 joins.
> On the terminal operation that writes the data, it fails with a NullPointerException.
>  
> I understand that such a big number of operations might not be what Spark is 
> designed for, but a NullPointerException is not an ideal way to fail in this 
> case.
>  
> For more details, please see the stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31698) NPE on big dataset plans

2020-05-13 Thread Viacheslav Tradunsky (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viacheslav Tradunsky updated SPARK-31698:
-
Docs Text:   (was: org.apache.spark.SparkException: Job aborted.
./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:198)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:156)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:229)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:566)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
com.company.app.executor.spark.SparkDatasetGenerationJob.generateDataset(SparkDatasetGenerationJob.scala:51)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
com.company.app.executor.spark.SparkDatasetGenerationJob.call(SparkDatasetGenerationJob.scala:82)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
com.company.app.executor.spark.SparkDatasetGenerationJob.call(SparkDatasetGenerationJob.scala:11)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:40)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.livy.rsc.driver.BypassJob.call(BypassJob.java:27)
  ./livy-livy-server.out.gz:20/05/12 22:46:54 INFO LineBufferedStream:  at 
org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:64)
  ./livy-livy-se

[jira] [Updated] (SPARK-31698) NPE on big dataset plans

2020-05-13 Thread Viacheslav Tradunsky (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viacheslav Tradunsky updated SPARK-31698:
-
Environment: AWS EMR: 30 machine, 7TB RAM total.  (was: AWS EMR)

> NPE on big dataset plans
> 
>
> Key: SPARK-31698
> URL: https://issues.apache.org/jira/browse/SPARK-31698
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR: 30 machine, 7TB RAM total.
>Reporter: Viacheslav Tradunsky
>Priority: Major
> Attachments: Spark_NPE_big_dataset.log
>
>
> We have a big dataset containing 275 SQL operations, more than 275 joins.
> On the terminal operation that writes the data, it fails with a NullPointerException.
>  
> I understand that such a big number of operations might not be what Spark is 
> designed for, but a NullPointerException is not an ideal way to fail in this 
> case.
>  
> For more details, please see the stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31698) NPE on big dataset plans

2020-05-13 Thread Viacheslav Tradunsky (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viacheslav Tradunsky updated SPARK-31698:
-
Environment: AWS EMR: 30 machines, 7TB RAM total.  (was: AWS EMR: 30 
machine, 7TB RAM total.)

> NPE on big dataset plans
> 
>
> Key: SPARK-31698
> URL: https://issues.apache.org/jira/browse/SPARK-31698
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR: 30 machines, 7TB RAM total.
>Reporter: Viacheslav Tradunsky
>Priority: Major
> Attachments: Spark_NPE_big_dataset.log
>
>
> We have a big dataset containing 275 SQL operations, more than 275 joins.
> On the terminal operation that writes the data, it fails with a NullPointerException.
>  
> I understand that such a big number of operations might not be what Spark is 
> designed for, but a NullPointerException is not an ideal way to fail in this 
> case.
>  
> For more details, please see the stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31695) BigDecimal setScale is not working in Spark UDF

2020-05-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31695.
--
Resolution: Not A Problem

> BigDecimal setScale is not working in Spark UDF
> ---
>
> Key: SPARK-31695
> URL: https://issues.apache.org/jira/browse/SPARK-31695
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.4
>Reporter: Saravanan Raju
>Priority: Major
>
> I was trying to convert a JSON column to a map. I tried a UDF for converting JSON 
> to a map, but it is not working as expected.
>   
> {code:java}
> val df1 = Seq(("{\"k\":10.004}")).toDF("json")
>
> def udfJsonStrToMapDecimal = udf((jsonStr: String) => {
>   val jsonMap: Map[String, Any] = parse(jsonStr).values.asInstanceOf[Map[String, Any]]
>   jsonMap.map { case (k, v) => (k, BigDecimal.decimal(v.asInstanceOf[Double]).setScale(6)) }.toMap
> })
>
> val f = df1.withColumn("map", udfJsonStrToMapDecimal($"json"))
>
> scala> f.printSchema
> root
>  |-- json: string (nullable = true)
>  |-- map: map (nullable = true)
>  |    |-- key: string
>  |    |-- value: decimal(38,18) (valueContainsNull = true)
> {code}
>  
> *Instead of decimal(38,6), it converts the value to decimal(38,18).*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31695) BigDecimal setScale is not working in Spark UDF

2020-05-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106076#comment-17106076
 ] 

Hyukjin Kwon commented on SPARK-31695:
--

You can explicitly set the scale and precision:

{code}
val df1 = Seq(("{\"k\":10.004}")).toDF("json")

def udfJsonStrToMapDecimal = udf((jsonStr: String) => {
  val jsonMap: Map[String, Any] = parse(jsonStr).values.asInstanceOf[Map[String, Any]]
  jsonMap.map { case (k, v) => (k, BigDecimal.decimal(v.asInstanceOf[Double]).setScale(6)) }.toMap
}, DecimalType(38, 6))

val f = df1.withColumn("map", udfJsonStrToMapDecimal($"json"))
f.printSchema
{code}

Spark is unable to automatically detect a scale that is only set at runtime.
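
As a hedged aside (not part of the original comment): since the UDF returns a 
Map[String, BigDecimal], declaring the full map schema would keep the values at scale 6. 
A minimal sketch, assuming json4s is used for parsing as in the snippet above:

{code:scala}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{DecimalType, MapType, StringType}
import org.json4s.jackson.JsonMethods.parse  // same parser assumed by the snippet above

// Sketch: the function returns Map[String, BigDecimal], so declare MapType with
// DecimalType(38, 6) values rather than a bare DecimalType.
def udfJsonStrToMapDecimal = udf((jsonStr: String) => {
  val jsonMap = parse(jsonStr).values.asInstanceOf[Map[String, Any]]
  jsonMap.map { case (k, v) => (k, BigDecimal.decimal(v.asInstanceOf[Double]).setScale(6)) }
}, MapType(StringType, DecimalType(38, 6)))
{code}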

> BigDecimal setScale is not working in Spark UDF
> ---
>
> Key: SPARK-31695
> URL: https://issues.apache.org/jira/browse/SPARK-31695
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.4
>Reporter: Saravanan Raju
>Priority: Major
>
> I was trying to convert a JSON column to a map. I tried a UDF for converting JSON 
> to a map, but it is not working as expected.
>   
> {code:java}
> val df1 = Seq(("{\"k\":10.004}")).toDF("json")
>
> def udfJsonStrToMapDecimal = udf((jsonStr: String) => {
>   val jsonMap: Map[String, Any] = parse(jsonStr).values.asInstanceOf[Map[String, Any]]
>   jsonMap.map { case (k, v) => (k, BigDecimal.decimal(v.asInstanceOf[Double]).setScale(6)) }.toMap
> })
>
> val f = df1.withColumn("map", udfJsonStrToMapDecimal($"json"))
>
> scala> f.printSchema
> root
>  |-- json: string (nullable = true)
>  |-- map: map (nullable = true)
>  |    |-- key: string
>  |    |-- value: decimal(38,18) (valueContainsNull = true)
> {code}
>  
> *Instead of decimal(38,6), it converts the value to decimal(38,18).*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31699) Optimize OpenSession speed in thriftserver

2020-05-13 Thread angerszhu (Jira)
angerszhu created SPARK-31699:
-

 Summary: Optimize OpenSession speed in thriftserver
 Key: SPARK-31699
 URL: https://issues.apache.org/jira/browse/SPARK-31699
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31698) NPE on big dataset plans

2020-05-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-31698.
--
Resolution: Duplicate

The weird error message and stack trace match SPARK-29046, which was fixed in 
Spark 2.4.5. I'll mark this as a duplicate.

> NPE on big dataset plans
> 
>
> Key: SPARK-31698
> URL: https://issues.apache.org/jira/browse/SPARK-31698
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
> Environment: AWS EMR: 30 machines, 7TB RAM total.
>Reporter: Viacheslav Tradunsky
>Priority: Major
> Attachments: Spark_NPE_big_dataset.log
>
>
> We have a big dataset containing 275 SQL operations, more than 275 joins.
> On the terminal operation that writes the data, it fails with a NullPointerException.
>  
> I understand that such a big number of operations might not be what Spark is 
> designed for, but a NullPointerException is not an ideal way to fail in this 
> case.
>  
> For more details, please see the stack trace.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31690) Backport pyspark Interaction to Spark 2.4.x

2020-05-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106081#comment-17106081
 ] 

Hyukjin Kwon commented on SPARK-31690:
--

It seems to be a new API, which isn't usually backported per 
https://spark.apache.org/versioning-policy.html. Also, you don't need to create a 
JIRA next time for a backport; you can reuse the original ticket.

> Backport pyspark Interaction to Spark 2.4.x
> ---
>
> Key: SPARK-31690
> URL: https://issues.apache.org/jira/browse/SPARK-31690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Luca Giovagnoli
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In our company, we could really make use of the Interaction PySpark wrapper 
> on Spark 2.4.x.
> "Interaction" is available in Spark 3.0, so I'm proposing to backport the 
> following code to the current Spark 2.4.6-rc1:
> - https://issues.apache.org/jira/browse/SPARK-26970
> - [https://github.com/apache/spark/pull/24426/files]
>  
> I'm available to pick this up if it's approved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31690) Backport pyspark Interaction to Spark 2.4.x

2020-05-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31690.
--
Resolution: Won't Fix

> Backport pyspark Interaction to Spark 2.4.x
> ---
>
> Key: SPARK-31690
> URL: https://issues.apache.org/jira/browse/SPARK-31690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Luca Giovagnoli
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In our company, we could really make use of the Interaction PySpark wrapper 
> on Spark 2.4.x.
> "Interaction" is available in Spark 3.0, so I'm proposing to backport the 
> following code to the current Spark 2.4.6-rc1:
> - https://issues.apache.org/jira/browse/SPARK-26970
> - [https://github.com/apache/spark/pull/24426/files]
>  
> I'm available to pick this up if it's approved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31686) Return of String instead of array in function get_json_object

2020-05-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106084#comment-17106084
 ] 

Hyukjin Kwon commented on SPARK-31686:
--

Yes, you don't know the output type before actually parsing, but the type has to be 
known before execution. It's by design.
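
For illustration only (a sketch of my own, not taken from this ticket): if a typed array 
is needed, parsing with an explicit schema via from_json yields a real array<string> 
column, unlike get_json_object, which always returns a string. The field names below 
follow the example in the quoted report:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

val spark = SparkSession.builder().appName("json-array-sketch").getOrCreate()
import spark.implicits._

val df = Seq("""{"customer":{"addesses":[{"location":"arizona"}]}}""").toDF("json")

// Declare the schema up front, since the type must be known before execution.
val schema = new StructType()
  .add("customer", new StructType()
    .add("addesses", ArrayType(new StructType().add("location", StringType))))

val parsed = df
  .withColumn("parsed", from_json($"json", schema))
  .selectExpr("parsed.customer.addesses.location AS locations")

parsed.printSchema()  // locations: array (element: string)
parsed.show(false)    // [arizona]
{code}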

> Return of String instead of array in function get_json_object
> -
>
> Key: SPARK-31686
> URL: https://issues.apache.org/jira/browse/SPARK-31686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: {code:json}
> {
>   "customer": {
>     "addesses": [
>       { "location": "arizona" }
>     ]
>   }
> }
> {code}
> get_json_object(string(customer),'$addresses[*].location')
> returns "arizona"
> The expected result should be
> ["arizona"]
>Reporter: Touopi Touopi
>Priority: Major
>
> When we select a node of a JSON object that is an array and the array contains one 
> element, get_json_object returns a String with quote (") characters instead of an 
> array of one element.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31686) Return of String instead of array in function get_json_object

2020-05-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31686.
--
Resolution: Not A Problem

> Return of String instead of array in function get_json_object
> -
>
> Key: SPARK-31686
> URL: https://issues.apache.org/jira/browse/SPARK-31686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: {code:json}
> {
>   "customer": {
>     "addesses": [
>       { "location": "arizona" }
>     ]
>   }
> }
> {code}
> get_json_object(string(customer),'$addresses[*].location')
> returns "arizona"
> The expected result should be
> ["arizona"]
>Reporter: Touopi Touopi
>Priority: Major
>
> When we select a node of a JSON object that is an array and the array contains one 
> element, get_json_object returns a String with quote (") characters instead of an 
> array of one element.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29046) Possible NPE on SQLConf.get when SparkContext is stopping in another thread

2020-05-13 Thread Viacheslav Tradunsky (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106087#comment-17106087
 ] 

Viacheslav Tradunsky commented on SPARK-29046:
--

[~kabhwan] Do you know a lower version of Spark which does not have this issue? 
Maybe Spark 2.3.2?

 

> Possible NPE on SQLConf.get when SparkContext is stopping in another thread
> ---
>
> Key: SPARK-29046
> URL: https://issues.apache.org/jira/browse/SPARK-29046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> We encountered an NPE in listener code which deals with the query plan - and 
> according to the stack trace below, the only possible cause of the NPE is 
> SparkContext._dagScheduler being null, which is only possible while stopping 
> SparkContext (unless null is set from outside).
>  
> {code:java}
> 19/09/11 00:22:24 INFO server.AbstractConnector: Stopped 
> Spark@49d8c117{HTTP/1.1,[http/1.1]}{0.0.0.0:0}19/09/11 00:22:24 INFO 
> server.AbstractConnector: Stopped 
> Spark@49d8c117{HTTP/1.1,[http/1.1]}{0.0.0.0:0}19/09/11 00:22:24 INFO 
> ui.SparkUI: Stopped Spark web UI at http://:3277019/09/11 00:22:24 INFO 
> cluster.YarnClusterSchedulerBackend: Shutting down all executors19/09/11 
> 00:22:24 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each 
> executor to shut down19/09/11 00:22:24 INFO 
> cluster.SchedulerExtensionServices: Stopping 
> SchedulerExtensionServices(serviceOption=None, services=List(), 
> started=false)19/09/11 00:22:24 WARN sql.SparkExecutionPlanProcessor: Caught 
> exception during parsing eventjava.lang.NullPointerException at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133) at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133) at 
> scala.Option.map(Option.scala:146) at 
> org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:133) at 
> org.apache.spark.sql.types.StructType.simpleString(StructType.scala:352) at 
> com.hortonworks.spark.atlas.types.internal$.sparkTableToEntity(internal.scala:102)
>  at 
> com.hortonworks.spark.atlas.types.AtlasEntityUtils$class.tableToEntity(AtlasEntityUtils.scala:62)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$.tableToEntity(CommandsHarvester.scala:45)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:240)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:239)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities(CommandsHarvester.scala:239)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$CreateDataSourceTableAsSelectHarvester$.harvest(CommandsHarvester.scala:104)
>  at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:138)
>  at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:89)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:89)
>  at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:63)
>  at 
> com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72)
>  at 
> com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(Abs

[jira] [Commented] (SPARK-29046) Possible NPE on SQLConf.get when SparkContext is stopping in another thread

2020-05-13 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106104#comment-17106104
 ] 

Jungtaek Lim commented on SPARK-29046:
--

Sorry, I don't know. Also worth noting that the Spark 2.3 version line was EOLed, 
AFAIK.

> Possible NPE on SQLConf.get when SparkContext is stopping in another thread
> ---
>
> Key: SPARK-29046
> URL: https://issues.apache.org/jira/browse/SPARK-29046
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> We encountered an NPE in listener code which deals with the query plan - and 
> according to the stack trace below, the only possible cause of the NPE is 
> SparkContext._dagScheduler being null, which is only possible while stopping 
> SparkContext (unless null is set from outside).
>  
> {code:java}
> 19/09/11 00:22:24 INFO server.AbstractConnector: Stopped 
> Spark@49d8c117{HTTP/1.1,[http/1.1]}{0.0.0.0:0}19/09/11 00:22:24 INFO 
> server.AbstractConnector: Stopped 
> Spark@49d8c117{HTTP/1.1,[http/1.1]}{0.0.0.0:0}19/09/11 00:22:24 INFO 
> ui.SparkUI: Stopped Spark web UI at http://:3277019/09/11 00:22:24 INFO 
> cluster.YarnClusterSchedulerBackend: Shutting down all executors19/09/11 
> 00:22:24 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each 
> executor to shut down19/09/11 00:22:24 INFO 
> cluster.SchedulerExtensionServices: Stopping 
> SchedulerExtensionServices(serviceOption=None, services=List(), 
> started=false)19/09/11 00:22:24 WARN sql.SparkExecutionPlanProcessor: Caught 
> exception during parsing eventjava.lang.NullPointerException at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133) at 
> org.apache.spark.sql.internal.SQLConf$$anonfun$15.apply(SQLConf.scala:133) at 
> scala.Option.map(Option.scala:146) at 
> org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:133) at 
> org.apache.spark.sql.types.StructType.simpleString(StructType.scala:352) at 
> com.hortonworks.spark.atlas.types.internal$.sparkTableToEntity(internal.scala:102)
>  at 
> com.hortonworks.spark.atlas.types.AtlasEntityUtils$class.tableToEntity(AtlasEntityUtils.scala:62)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$.tableToEntity(CommandsHarvester.scala:45)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:240)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$$anonfun$com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities$1.apply(CommandsHarvester.scala:239)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$.com$hortonworks$spark$atlas$sql$CommandsHarvester$$discoverInputsEntities(CommandsHarvester.scala:239)
>  at 
> com.hortonworks.spark.atlas.sql.CommandsHarvester$CreateDataSourceTableAsSelectHarvester$.harvest(CommandsHarvester.scala:104)
>  at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:138)
>  at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor$$anonfun$2.apply(SparkExecutionPlanProcessor.scala:89)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>  at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at 
> scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:89)
>  at 
> com.hortonworks.spark.atlas.sql.SparkExecutionPlanProcessor.process(SparkExecutionPlanProcessor.scala:63)
>  at 
> com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:72)
>  at 
> com.hortonworks.spark.atlas.AbstractEventProcessor$$anonfun$eventProcess$1.apply(AbstractEventProcessor.scala:71)
>  at 

[jira] [Resolved] (SPARK-31697) HistoryServer should set Content-Type

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31697.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28519
[https://github.com/apache/spark/pull/28519]

> HistoryServer should set Content-Type
> -
>
> Key: SPARK-31697
> URL: https://issues.apache.org/jira/browse/SPARK-31697
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.0
>
>
> I noticed that we get HTML served as plain text when we access a wrong URL on the 
> HistoryServer.
> {code:html}
> <html>
>   <head>
>     <!-- stylesheet links, the shortcut icon (/static/spark-logo-77x50px-hd.png) and
>          script tags under /static/: vis-timeline-graph2d, webui.css, timeline-view.css,
>          sorttable.js, jquery-3.4.1.min.js, bootstrap.bundle.min.js, initialize-tooltips.js,
>          table.js, timeline-view.js, log-view.js, webui.js -->
>     <script>setUIRoot('')</script>
>     <title>Not Found</title>
>   </head>
>   <body>
>     <!-- Spark header: logo, version 3.1.0-SNAPSHOT, heading "Not Found" -->
>     Application local-1589239 not found.
>   </body>
> </html>
> {code}
>  
> The reason is that the Content-Type header is not set.
> {code:java}
> HTTP/1.1 404 Not Found
> Date: Wed, 13 May 2020 06:59:29 GMT
> Cache-Control: no-cache, no-store, must-revalidate
> X-Frame-Options: SAMEORIGIN
> X-XSS-Protection: 1; mode=block
> X-Content-Type-Options: nosniff
> Content-Length: 1778
> Server: Jetty(9.4.18.v20190429) {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31700) spark sql write orc file outformat

2020-05-13 Thread Dexter Morgan (Jira)
Dexter Morgan created SPARK-31700:
-

 Summary: spark sql write orc file  outformat
 Key: SPARK-31700
 URL: https://issues.apache.org/jira/browse/SPARK-31700
 Project: Spark
  Issue Type: Task
  Components: Input/Output
Affects Versions: 2.3.3
Reporter: Dexter Morgan


!image-2020-05-13-16-53-49-678.png!

 

Can you give me an example of Spark SQL writing an ORC file as the output format, please?
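
Not an answer from the ticket itself, just a minimal sketch of the usual ways to write 
ORC output with Spark SQL; the paths and table names here are made up:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-write-sketch").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "value")

// DataFrame API: write the rows as ORC files under a directory.
df.write.mode("overwrite").orc("/tmp/orc-output")

// Pure SQL: create an ORC-backed table from a temporary view.
df.createOrReplaceTempView("src")
spark.sql("CREATE TABLE orc_output USING ORC AS SELECT * FROM src")
{code}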

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR

2020-05-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31701:


 Summary: Bump up the minimum Arrow version as 0.15.1 in SparkR
 Key: SPARK-31701
 URL: https://issues.apache.org/jira/browse/SPARK-31701
 Project: Spark
  Issue Type: Bug
  Components: SparkR, SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should 
bump up the version on the SparkR side as well, to match 0.15.1. There's no 
backward compatibility concern because Arrow optimization in SparkR is new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR

2020-05-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31701:
-
Issue Type: Improvement  (was: Bug)

> Bump up the minimum Arrow version as 0.15.1 in SparkR
> -
>
> Key: SPARK-31701
> URL: https://issues.apache.org/jira/browse/SPARK-31701
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should 
> bump up the version on the SparkR side as well, to match 0.15.1. There's no 
> backward compatibility concern because Arrow optimization in SparkR is new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106225#comment-17106225
 ] 

Apache Spark commented on SPARK-31701:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/28520

> Bump up the minimum Arrow version as 0.15.1 in SparkR
> -
>
> Key: SPARK-31701
> URL: https://issues.apache.org/jira/browse/SPARK-31701
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should 
> bump up the version on the SparkR side as well, to match 0.15.1. There's no 
> backward compatibility concern because Arrow optimization in SparkR is new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31701:


Assignee: Apache Spark

> Bump up the minimum Arrow version as 0.15.1 in SparkR
> -
>
> Key: SPARK-31701
> URL: https://issues.apache.org/jira/browse/SPARK-31701
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should 
> bump up the version on the SparkR side as well, to match 0.15.1. There's no 
> backward compatibility concern because Arrow optimization in SparkR is new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31701:


Assignee: (was: Apache Spark)

> Bump up the minimum Arrow version as 0.15.1 in SparkR
> -
>
> Key: SPARK-31701
> URL: https://issues.apache.org/jira/browse/SPARK-31701
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should 
> bump up the version on the SparkR side as well, to match 0.15.1. There's no 
> backward compatibility concern because Arrow optimization in SparkR is new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31702) Old POSIXlt, POSIXct and Date become corrupt due to calendar difference

2020-05-13 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31702:


 Summary: Old POSIXlt, POSIXct and Date become corrupt due to 
calendar difference
 Key: SPARK-31702
 URL: https://issues.apache.org/jira/browse/SPARK-31702
 Project: Spark
  Issue Type: Bug
  Components: SparkR, SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Hyukjin Kwon


Old POSIXlt, POSIXct and Date become corrupt in SparkR. For example, see below:

{code}
# Non-existent timestamp in hybrid Julian and Gregorian Calendar
showDF(createDataFrame(as.data.frame(list(list(
  POSIXct = as.POSIXct("1582-10-10 00:01:00"),
  POSIXlt = as.POSIXlt("1582-10-10 00:01:00"))))))
{code}

{code}
+-------------------+-------------------+
|            POSIXct|            POSIXlt|
+-------------------+-------------------+
|1582-09-30 00:33:08|1582-09-30 00:33:08|
+-------------------+-------------------+
{code}

See 
https://docs.google.com/document/d/1Upf6c5fNM59Q6nko-ipjLLae86x9mBejwuXshii-Azg/edit?usp=sharing

Note that the results seem wrong from the very first implementation. The cause seems 
to be that the R side uses the Proleptic Gregorian calendar while the JVM side uses 
the hybrid Julian and Gregorian calendar.
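
As an illustration of that JVM-side calendar behaviour (my own sketch, not from this 
ticket): the legacy java.sql types go through the hybrid Julian/Gregorian calendar, 
while java.time uses the proleptic Gregorian calendar, so a timestamp that falls inside 
the October 1582 cutover gap is handled differently:

{code:scala}
import java.sql.Timestamp
import java.time.LocalDateTime

// 1582-10-05..14 do not exist in the hybrid Julian/Gregorian calendar, so the
// legacy java.sql path normalizes the value past the cutover gap.
val hybrid = Timestamp.valueOf("1582-10-10 00:01:00")
val proleptic = LocalDateTime.parse("1582-10-10T00:01:00")

println(hybrid)     // shifted past the cutover gap by the hybrid calendar
println(proleptic)  // 1582-10-10T00:01 (proleptic Gregorian keeps the date as written)
{code}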



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31702) Old POSIXlt, POSIXct and Date become corrupt due to calendar difference

2020-05-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31702:
-
Description: 
Old POSIXlt, POSIXct and Date become corrupt in SparkR. For example, see below:

{code}
# Non-existent timestamp in hybrid Julian and Gregorian Calendar
showDF(createDataFrame(as.data.frame(list(list(
  POSIXct = as.POSIXct("1582-10-10 00:01:00"),
  POSIXlt = as.POSIXlt("1582-10-10 00:01:00"))))))
{code}

{code}
+-------------------+-------------------+
|            POSIXct|            POSIXlt|
+-------------------+-------------------+
|1582-09-30 00:33:08|1582-09-30 00:33:08|
+-------------------+-------------------+
{code}

See 
https://docs.google.com/document/d/1an3Mzv6s0naO4mDwGFHJ48gLT--6EliA1GG3kbgBymo/edit?usp=sharing

Note that the results seem wrong from the very first implementation. The cause seems 
to be that the R side uses the Proleptic Gregorian calendar while the JVM side uses 
the hybrid Julian and Gregorian calendar.

  was:
Old POSIXlt, POSIXct and Date become corrupt in SparkR. For example, see below:

{code}
# Non-existent timestamp in hybrid Julian and Gregorian Calendar
showDF(createDataFrame(as.data.frame(list(list(
  POSIXct = as.POSIXct("1582-10-10 00:01:00"),
  POSIXlt = as.POSIXlt("1582-10-10 00:01:00"))))))
{code}

{code}
+-------------------+-------------------+
|            POSIXct|            POSIXlt|
+-------------------+-------------------+
|1582-09-30 00:33:08|1582-09-30 00:33:08|
+-------------------+-------------------+
{code}

See 
https://docs.google.com/document/d/1Upf6c5fNM59Q6nko-ipjLLae86x9mBejwuXshii-Azg/edit?usp=sharing

Note that the results seem wrong from the very first implementation. The cause seems 
to be that the R side uses the Proleptic Gregorian calendar while the JVM side uses 
the hybrid Julian and Gregorian calendar.


> Old POSIXlt, POSIXct and Date become corrupt due to calendar difference
> ---
>
> Key: SPARK-31702
> URL: https://issues.apache.org/jira/browse/SPARK-31702
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Old POSIXlt, POSIXct and Date become corrupt in SparkR. For example, see 
> below:
> {code}
> # Non-existent timestamp in hybrid Julian and Gregorian Calendar
> showDF(createDataFrame(as.data.frame(list(list(
>   POSIXct = as.POSIXct("1582-10-10 00:01:00"),
>   POSIXlt = as.POSIXlt("1582-10-10 00:01:00"))))))
> {code}
> {code}
> +-------------------+-------------------+
> |            POSIXct|            POSIXlt|
> +-------------------+-------------------+
> |1582-09-30 00:33:08|1582-09-30 00:33:08|
> +-------------------+-------------------+
> {code}
> See 
> https://docs.google.com/document/d/1an3Mzv6s0naO4mDwGFHJ48gLT--6EliA1GG3kbgBymo/edit?usp=sharing
> Note that the results seem wrong from the very first implementation. The 
> cause seems to be that the R side uses the Proleptic Gregorian calendar while 
> the JVM side uses the hybrid Julian and Gregorian calendar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-05-13 Thread Michail Giannakopoulos (Jira)
Michail Giannakopoulos created SPARK-31703:
--

 Summary: Changes made by SPARK-26985 break reading parquet files 
correctly in BigEndian architectures (AIX + LinuxPPC64)
 Key: SPARK-31703
 URL: https://issues.apache.org/jira/browse/SPARK-31703
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.5, 3.0.0
 Environment: AIX 7.2
LinuxPPC64 with RedHat.
Reporter: Michail Giannakopoulos
 Attachments: Data_problem_Spark.gif

Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) so 
as to be able to read data stored in parquet format, we notice that values 
associated with DOUBLE and DECIMAL types are parsed in the wrong form.
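
As a small illustration of the endianness concern (a sketch of my own, not code from 
Spark or from the linked change): Parquet's PLAIN encoding stores a DOUBLE as 8 bytes 
little endian, so a reader on a big-endian JVM has to decode with an explicit byte order 
rather than the platform default:

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

// The little-endian byte layout of the IEEE 754 double 3.14, as PLAIN encoding stores it.
val raw = Array(0x1f, 0x85, 0xeb, 0x51, 0xb8, 0x1e, 0x09, 0x40).map(_.toByte)

// Decoding with an explicit LITTLE_ENDIAN order yields 3.14 on any platform;
// relying on the platform's native order would misread the value on big-endian systems.
val d = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getDouble
println(d)  // 3.14
{code}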



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-05-13 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31703:
---
Attachment: Data_problem_Spark.gif

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Critical
> Attachments: Data_problem_Spark.gif
>
>
> Trying to upgrade to Apache Spark 2.4.5 in our IBM systems (AIX and PowerPC) 
> so as to be able to read data stored in parquet format, we notice that values 
> associated with DOUBLE and DECIMAL types are parsed in the wrong form.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-05-13 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31703:
---
Labels: BigEndian  (was: )

> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Critical
>  Labels: BigEndian
> Attachments: Data_problem_Spark.gif
>
>
> While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and 
> PowerPC) so as to be able to read data stored in Parquet format, we noticed 
> that values associated with DOUBLE and DECIMAL types are parsed incorrectly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31703) Changes made by SPARK-26985 break reading parquet files correctly in BigEndian architectures (AIX + LinuxPPC64)

2020-05-13 Thread Michail Giannakopoulos (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michail Giannakopoulos updated SPARK-31703:
---
Description: 
While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and PowerPC) 
so as to be able to read data stored in Parquet format, we noticed that values 
associated with DOUBLE and DECIMAL types are parsed incorrectly.

According to the Parquet documentation, it always stores the values using a 
little-endian representation:
[https://github.com/apache/parquet-format/blob/master/Encodings.md]
{noformat}
The plain encoding is used whenever a more efficient encoding can not be used. 
It
stores the data in the following format:

BOOLEAN: Bit Packed, LSB first
INT32: 4 bytes little endian
INT64: 8 bytes little endian
INT96: 12 bytes little endian (deprecated)
FLOAT: 4 bytes IEEE little endian
DOUBLE: 8 bytes IEEE little endian
BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained in 
the array
FIXED_LEN_BYTE_ARRAY: the bytes contained in the array

For native types, this outputs the data as little endian. Floating
point types are encoded in IEEE.
For the byte array type, it encodes the length as a 4 byte little
endian, followed by the bytes.{noformat}
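
For illustration, here is a minimal Python sketch (my own, not from the Parquet or 
Spark code base) of why the byte order matters when decoding a PLAIN-encoded DOUBLE 
on a big-endian host: the 8 bytes in the file are always little endian, so the 
reader must decode them as such rather than in the platform's native order.
{code:python}
import struct
import sys

# PLAIN-encoded Parquet DOUBLE: 8 bytes, IEEE 754, little endian in the file.
raw = struct.pack('<d', 3.14)

# Decoding must always use the little-endian format code '<d'.
value_ok = struct.unpack('<d', raw)[0]    # 3.14 on every platform

# Decoding with the platform's native order ('=d') only happens to work on
# little-endian machines; on a big-endian host (AIX, big-endian PowerPC) it
# produces a garbled value, matching the symptom reported for DOUBLE/DECIMAL.
value_native = struct.unpack('=d', raw)[0]

print(sys.byteorder, value_ok, value_native)
{code}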

  was:While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and 
PowerPC) so as to be able to read data stored in Parquet format, we noticed that 
values associated with DOUBLE and DECIMAL types are parsed incorrectly.


> Changes made by SPARK-26985 break reading parquet files correctly in 
> BigEndian architectures (AIX + LinuxPPC64)
> ---
>
> Key: SPARK-31703
> URL: https://issues.apache.org/jira/browse/SPARK-31703
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.5, 3.0.0
> Environment: AIX 7.2
> LinuxPPC64 with RedHat.
>Reporter: Michail Giannakopoulos
>Priority: Critical
>  Labels: BigEndian
> Attachments: Data_problem_Spark.gif
>
>
> While trying to upgrade to Apache Spark 2.4.5 on our IBM systems (AIX and 
> PowerPC) so as to be able to read data stored in Parquet format, we noticed 
> that values associated with DOUBLE and DECIMAL types are parsed incorrectly.
> According to the Parquet documentation, it always stores the values using a 
> little-endian representation:
> [https://github.com/apache/parquet-format/blob/master/Encodings.md]
> {noformat}
> The plain encoding is used whenever a more efficient encoding can not be 
> used. It
> stores the data in the following format:
> BOOLEAN: Bit Packed, LSB first
> INT32: 4 bytes little endian
> INT64: 8 bytes little endian
> INT96: 12 bytes little endian (deprecated)
> FLOAT: 4 bytes IEEE little endian
> DOUBLE: 8 bytes IEEE little endian
> BYTE_ARRAY: length in 4 bytes little endian followed by the bytes contained 
> in the array
> FIXED_LEN_BYTE_ARRAY: the bytes contained in the array
> For native types, this outputs the data as little endian. Floating
> point types are encoded in IEEE.
> For the byte array type, it encodes the length as a 4 byte little
> endian, followed by the bytes.{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11

2020-05-13 Thread Jira
Markus Tretzmüller created SPARK-31704:
--

 Summary: PandasUDFType.GROUPED_AGG with Java 11
 Key: SPARK-31704
 URL: https://issues.apache.org/jira/browse/SPARK-31704
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.0
 Environment: java jdk: 11

python: 3.7

 
Reporter: Markus Tretzmüller


Running the example from the 
[docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions]
 gives an error with Java 11. It works with Java 8.


{code:python}
import findspark
findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7')
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql import Window
from pyspark.sql import SparkSession

if __name__ == '__main__':
spark = SparkSession \
.builder \
.appName('test') \
.getOrCreate()

df = spark.createDataFrame(
[(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
("id", "v"))

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def mean_udf(v):
return v.mean()

w = (Window.partitionBy('id')
 .orderBy('v')
 .rowsBetween(-1, 0))
df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
{code}


{noformat}
File 
"/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
 line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 in 
stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 (TID 
37, 131.130.32.15, executor driver): java.lang.UnsupportedOperationException: 
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available
at 
io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
at 
org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
at 
org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240)
at 
org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
at 
org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94)
at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at 
org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101)
at 
org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
at 
org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213)
{noformat}





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11664) Add methods to get bisecting k-means cluster structure

2020-05-13 Thread Dan Griffin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-11664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106325#comment-17106325
 ] 

Dan Griffin commented on SPARK-11664:
-

Hey. I'm wondering what the status is of getting this capability integrated into 
the official Spark release? It seems that many in the community would like to 
have this feature in addition to the final set of clusters.

> Add methods to get bisecting k-means cluster structure
> --
>
> Key: SPARK-11664
> URL: https://issues.apache.org/jira/browse/SPARK-11664
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Yu Ishikawa
>Priority: Minor
>  Labels: bulk-closed
>
> I think users want to visualize the result of bisecting k-means clustering as 
> a dendrogram in order to confirm it. So it would be great to support method 
> to get the cluster tree structure as an adjacency list, linkage matrix and so 
> on.
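
For context, a minimal PySpark sketch (my own illustration, assuming the standard 
pyspark.ml API) of what is available today: the fitted model exposes only the flat 
list of final cluster centers, while the requested feature would additionally 
expose the intermediate tree (adjacency list, linkage matrix, etc.) for plotting a 
dendrogram.
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName('bkm-demo').getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ['features'])

model = BisectingKMeans(k=2, seed=1).fit(df)

# Flat result only; a dendrogram would need the cluster tree, which is not exposed.
for center in model.clusterCenters():
    print(center)
{code}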



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-05-13 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-31705:
---

 Summary: Rewrite join condition to conjunctive normal form
 Key: SPARK-31705
 URL: https://issues.apache.org/jira/browse/SPARK-31705
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang
Assignee: Yuming Wang


Rewrite the join condition to [conjunctive normal 
form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] so that more 
conditions can be pushed down as filters.

PostgreSQL:
{code:sql}
CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT,   
l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),  
  
l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255),   

l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
DATE,
l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
  
CREATE TABLE orders (
o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  

explain select count(*) from lineitem, orders
 where l_orderkey = o_orderkey
 and ((l_suppkey > 10 and o_custkey > 20)
  or (l_suppkey > 30 and o_custkey > 40))
 and l_partkey > 0;

explain select count(*) from lineitem join orders
 on l_orderkey = o_orderkey
 and ((l_suppkey > 10 and o_custkey > 20)
  or (l_suppkey > 30 and o_custkey > 40))
 and l_partkey > 0;
{code}



{noformat}
postgres=# explain select count(*) from lineitem, orders
postgres-#  where l_orderkey = o_orderkey
postgres-#  and ((l_suppkey > 10 and o_custkey > 20)
postgres(#   or (l_suppkey > 30 and o_custkey > 40))
postgres-#  and l_partkey > 0;
QUERY PLAN
---
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > 20)) 
OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 20) OR (o_custkey > 40))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR 
(l_suppkey > 30)))
(9 rows)

postgres=#
postgres=# explain select count(*) from lineitem join orders
postgres-#  on l_orderkey = o_orderkey
postgres-#  and ((l_suppkey > 10 and o_custkey > 20)
postgres(#   or (l_suppkey > 30 and o_custkey > 40))
postgres-#  and l_partkey > 0;
QUERY PLAN
---
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > 20)) 
OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 20) OR (o_custkey > 40))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR 
(l_suppkey > 30)))
(9 rows)
{noformat}
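
To make the transformation concrete, here is a toy Python sketch (my own 
illustration, not the proposed optimizer rule) that distributes OR over AND for 
the example condition above; the single-table clauses that fall out, such as 
(l_suppkey > 10 OR l_suppkey > 30), are exactly what can be pushed below the join 
as filters.
{code:python}
from itertools import product

def or_of_ands_to_cnf(dnf):
    """dnf is a list of conjunctions, each given as a list of atomic predicates.
    Returns the equivalent CNF: a list of disjunctions (lists of atoms)."""
    # (A AND B) OR (C AND D)  ==>  (A OR C) AND (A OR D) AND (B OR C) AND (B OR D)
    return [list(clause) for clause in product(*dnf)]

condition = [['l_suppkey > 10', 'o_custkey > 20'],
             ['l_suppkey > 30', 'o_custkey > 40']]

for clause in or_of_ands_to_cnf(condition):
    print(' OR '.join(clause))
# Prints four clauses; the ones touching only one table can be pushed down.
{code}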






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-05-13 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106344#comment-17106344
 ] 

Yuming Wang commented on SPARK-31705:
-

I'm working on it.

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Rewrite join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
> conditions to filter.
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, 
>   
> l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
> 
> l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), 
>   
> l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
> DATE,
> l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>   
> CREATE TABLE orders (
> o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
> o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
> o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  
> explain select count(*) from lineitem, orders
>  where l_orderkey = o_orderkey
>  and ((l_suppkey > 10 and o_custkey > 20)
>   or (l_suppkey > 30 and o_custkey > 40))
>  and l_partkey > 0;
> explain select count(*) from lineitem join orders
>  on l_orderkey = o_orderkey
>  and ((l_suppkey > 10 and o_custkey > 20)
>   or (l_suppkey > 30 and o_custkey > 40))
>  and l_partkey > 0;
> {code}
> {noformat}
> postgres=# explain select count(*) from lineitem, orders
> postgres-#  where l_orderkey = o_orderkey
> postgres-#  and ((l_suppkey > 10 and o_custkey > 20)
> postgres(#   or (l_suppkey > 30 and o_custkey > 40))
> postgres-#  and l_partkey > 0;
> QUERY PLAN
> ---
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > 
> 20)) OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 20) OR (o_custkey > 40))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR 
> (l_suppkey > 30)))
> (9 rows)
> postgres=#
> postgres=# explain select count(*) from lineitem join orders
> postgres-#  on l_orderkey = o_orderkey
> postgres-#  and ((l_suppkey > 10 and o_custkey > 20)
> postgres(#   or (l_suppkey > 30 and o_custkey > 40))
> postgres-#  and l_partkey > 0;
> QUERY PLAN
> ---
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 10) AND (orders.o_custkey > 
> 20)) OR ((lineitem.l_suppkey > 30) AND (orders.o_custkey > 40)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 20) OR (o_custkey > 40))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 0) AND ((l_suppkey > 10) OR 
> (l_suppkey > 30)))
> (9 rows)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31701:
-

Assignee: Hyukjin Kwon

> Bump up the minimum Arrow version as 0.15.1 in SparkR
> -
>
> Key: SPARK-31701
> URL: https://issues.apache.org/jira/browse/SPARK-31701
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should 
> bump up the version on the SparkR side as well, to match 0.15.1. There is no 
> backward compatibility concern because Arrow optimization in SparkR is new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31701) Bump up the minimum Arrow version as 0.15.1 in SparkR

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31701.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28520
[https://github.com/apache/spark/pull/28520]

> Bump up the minimum Arrow version as 0.15.1 in SparkR
> -
>
> Key: SPARK-31701
> URL: https://issues.apache.org/jira/browse/SPARK-31701
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 3.0.0
>
>
> The PySpark side bumped up the minimum Arrow version in SPARK-29376. We should 
> bump up the version on the SparkR side as well, to match 0.15.1. There is no 
> backward compatibility concern because Arrow optimization in SparkR is new.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31706) add back the support of streaming update mode

2020-05-13 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31706:
---

 Summary: add back the support of streaming update mode
 Key: SPARK-31706
 URL: https://issues.apache.org/jira/browse/SPARK-31706
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11

2020-05-13 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106500#comment-17106500
 ] 

Bryan Cutler commented on SPARK-31704:
--

This is due to a Netty API that Arrow uses; unfortunately, it currently 
needs the following Java option to be set in order to work: 
{{-Dio.netty.tryReflectionSetAccessible=true}}. See 
https://issues.apache.org/jira/browse/SPARK-29924, which added documentation for 
this here: https://github.com/apache/spark/blob/master/docs/index.md#downloading.
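
As a rough sketch of applying that advice from PySpark (my own example; the option 
itself is the one documented above), the flag can be passed to both the driver and 
the executors through the standard extra-Java-options configs:
{code:python}
from pyspark.sql import SparkSession

# Both the driver and the executor JVMs need the flag so Arrow/Netty can
# allocate direct buffers on Java 11.
netty_flag = '-Dio.netty.tryReflectionSetAccessible=true'

spark = (SparkSession.builder
         .appName('arrow-on-java11')
         .config('spark.driver.extraJavaOptions', netty_flag)
         .config('spark.executor.extraJavaOptions', netty_flag)
         .getOrCreate())
{code}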

> PandasUDFType.GROUPED_AGG with Java 11
> --
>
> Key: SPARK-31704
> URL: https://issues.apache.org/jira/browse/SPARK-31704
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: java jdk: 11
> python: 3.7
>  
>Reporter: Markus Tretzmüller
>Priority: Minor
>  Labels: newbie
>
> Running the example from the 
> [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions]
>  gives an error with java 11. It works with java 8.
> {code:python}
> import findspark
> findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7')
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql import Window
> from pyspark.sql import SparkSession
> if __name__ == '__main__':
> spark = SparkSession \
> .builder \
> .appName('test') \
> .getOrCreate()
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
> def mean_udf(v):
> return v.mean()
> w = (Window.partitionBy('id')
>  .orderBy('v')
>  .rowsBetween(-1, 0))
> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
> {code}
> {noformat}
> File 
> "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 
> in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 
> (TID 37, 131.130.32.15, executor driver): 
> java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
> java.nio.DirectByteBuffer.<init>(long, int) not available
>   at 
> io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
>   at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
>   at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
>   at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
>   at 
> org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31706) add back the support of streaming update mode

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31706:


Assignee: Apache Spark  (was: Wenchen Fan)

> add back the support of streaming update mode
> -
>
> Key: SPARK-31706
> URL: https://issues.apache.org/jira/browse/SPARK-31706
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31706) add back the support of streaming update mode

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31706:


Assignee: Wenchen Fan  (was: Apache Spark)

> add back the support of streaming update mode
> -
>
> Key: SPARK-31706
> URL: https://issues.apache.org/jira/browse/SPARK-31706
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31706) add back the support of streaming update mode

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106504#comment-17106504
 ] 

Apache Spark commented on SPARK-31706:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28523

> add back the support of streaming update mode
> -
>
> Key: SPARK-31706
> URL: https://issues.apache.org/jira/browse/SPARK-31706
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2020-05-13 Thread Zirui Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106562#comment-17106562
 ] 

Zirui Li commented on SPARK-23607:
--

Hi [~zhouyejoe], I'm wondering whether you have any plans to post the PR? Thanks

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
>  Labels: bulk-closed
>
> Currently in the Spark History Server, the checkForLogs thread creates replay 
> tasks for log files whose size has changed. A replay task filters out most of 
> the log file content and keeps the application summary, including 
> applicationId, user, attemptACL, start time, and end time. The application 
> summary data is then written into listing.ldb and serves the application list 
> on the SHS home page. For a long-running application, its log file, whose name 
> ends with "inprogress", gets replayed multiple times to obtain this application 
> summary. This wastes computing and data-reading resources in SHS and delays 
> applications showing up on the home page. Internally we have a patch which 
> utilizes HDFS extended attributes to improve the performance of getting the 
> application summary in SHS. With this patch, the Driver writes the application 
> summary information into extended attributes as key/value pairs. SHS tries to 
> read from the extended attributes; if that fails, it falls back to reading the 
> log file content as usual. This feature can be enabled or disabled through 
> configuration.
> It has been running fine internally for 4 months with this patch, and the last 
> updated timestamp on SHS stays within 1 minute, as we configure the interval to 
> 1 minute. Originally we had a long delay, which could be as long as 30 minutes 
> at our scale, where we have a large number of Spark applications running per 
> day.
> We want to see whether this kind of approach is also acceptable to the 
> community. Please comment. If so, I will post a pull request for the changes. 
> Thanks.
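
As a rough illustration of the mechanism described above (my own sketch; the 
actual patch presumably uses the Hadoop FileSystem API from the driver), HDFS 
extended attributes can be written and read back as key/value pairs, for example 
via the hdfs shell. The attribute name and paths below are hypothetical.
{code:python}
import json
import subprocess

xattr = 'user.spark.appsummary'                                  # hypothetical name
log_path = '/spark-history/app-20200513120000-0001.inprogress'  # hypothetical path
summary = {'appId': 'app-20200513120000-0001', 'user': 'alice',
           'startTime': 1589371200000}

# Driver side: attach the summary to the event log file as an extended attribute.
subprocess.run(['hdfs', 'dfs', '-setfattr', '-n', xattr,
                '-v', json.dumps(summary), log_path], check=True)

# History-server side: read the attribute back; fall back to replaying the log
# content if the attribute is missing.
result = subprocess.run(['hdfs', 'dfs', '-getfattr', '-n', xattr, log_path],
                        capture_output=True, text=True)
if result.returncode == 0:
    print(result.stdout)
else:
    print('no xattr, falling back to replaying the event log')
{code}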



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-05-13 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-31707:


 Summary: Revert SPARK-30098 Use default datasource as provider for 
CREATE TABLE syntax
 Key: SPARK-31707
 URL: https://issues.apache.org/jira/browse/SPARK-31707
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


We need to consider the behavior change of SPARK-30098.
This is a placeholder to keep track of the discussion and the final decision.

The `CREATE TABLE` syntax changes its behavior silently.

The following is one example of how it breaks existing user data pipelines.
*Apache Spark 2.4.5*
{code}
spark-sql> CREATE TABLE t(a STRING);

spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;

spark-sql> SELECT * FROM t LIMIT 1;
# Apache Spark
Time taken: 2.05 seconds, Fetched 1 row(s)
{code}

{code}
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a   3
{code}

*Apache Spark 3.0.0-preview2*
{code}
spark-sql> CREATE TABLE t(a STRING);

spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
Error in query: LOAD DATA is not supported for datasource tables: `default`.`t`;
{code}

{code}
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a   2
{code}
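
For illustration only (my addition, not part of the original report): stating the 
table provider explicitly makes a pipeline independent of whichever default ends 
up being chosen, e.g.:
{code:python}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('explicit-provider')
         .enableHiveSupport()
         .getOrCreate())

# Spelling out the provider avoids depending on the default of CREATE TABLE:
spark.sql("CREATE TABLE t_hive (a STRING) USING hive")     # Hive serde table, LOAD DATA works
spark.sql("CREATE TABLE t_ds   (a STRING) USING parquet")  # native datasource table
{code}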



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-05-13 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-31707:
-
Description: 
According to the latest status of discussion in the dev@ mailing list, 
[[DISCUSS] Resolve ambiguous parser rule between two "create 
table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html],
 we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0.

This issue tracks the effort of revert.

  was:
We need to consider the behavior change of SPARK-30098.
This is a placeholder to keep track of the discussion and the final decision.

The `CREATE TABLE` syntax changes its behavior silently.

The following is one example of how it breaks existing user data pipelines.
*Apache Spark 2.4.5*
{code}
spark-sql> CREATE TABLE t(a STRING);

spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;

spark-sql> SELECT * FROM t LIMIT 1;
# Apache Spark
Time taken: 2.05 seconds, Fetched 1 row(s)
{code}

{code}
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a   3
{code}

*Apache Spark 3.0.0-preview2*
{code}
spark-sql> CREATE TABLE t(a STRING);

spark-sql> LOAD DATA INPATH '/usr/local/spark/README.md' INTO TABLE t;
Error in query: LOAD DATA is not supported for datasource tables: `default`.`t`;
{code}

{code}
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a   2
{code}


> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31707
> URL: https://issues.apache.org/jira/browse/SPARK-31707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> According to the latest status of discussion in the dev@ mailing list, 
> [[DISCUSS] Resolve ambiguous parser rule between two "create 
> table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html],
>  we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0.
> This issue tracks the effort of revert.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31707:


Assignee: (was: Apache Spark)

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31707
> URL: https://issues.apache.org/jira/browse/SPARK-31707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> According to the latest status of discussion in the dev@ mailing list, 
> [[DISCUSS] Resolve ambiguous parser rule between two "create 
> table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html],
>  we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0.
> This issue tracks the effort of revert.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106638#comment-17106638
 ] 

Apache Spark commented on SPARK-31707:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/28517

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31707
> URL: https://issues.apache.org/jira/browse/SPARK-31707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> According to the latest status of discussion in the dev@ mailing list, 
> [[DISCUSS] Resolve ambiguous parser rule between two "create 
> table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html],
>  we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0.
> This issue tracks the effort of revert.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106637#comment-17106637
 ] 

Apache Spark commented on SPARK-31707:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/28517

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31707
> URL: https://issues.apache.org/jira/browse/SPARK-31707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> According to the latest status of discussion in the dev@ mailing list, 
> [[DISCUSS] Resolve ambiguous parser rule between two "create 
> table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html],
>  we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0.
> This issue tracks the effort of revert.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31707) Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31707:


Assignee: Apache Spark

> Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
> -
>
> Key: SPARK-31707
> URL: https://issues.apache.org/jira/browse/SPARK-31707
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Blocker
>
> According to the latest status of discussion in the dev@ mailing list, 
> [[DISCUSS] Resolve ambiguous parser rule between two "create 
> table"s|http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html],
>  we'd want to revert the change of SPARK-30098 first to unblock Spark 3.0.0.
> This issue tracks the effort of revert.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-31700) spark sql write orc file outformat

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-31700.
-

> spark sql write orc file  outformat
> ---
>
> Key: SPARK-31700
> URL: https://issues.apache.org/jira/browse/SPARK-31700
> Project: Spark
>  Issue Type: Task
>  Components: Input/Output
>Affects Versions: 2.3.3
>Reporter: Dexter Morgan
>Priority: Major
>
> !image-2020-05-13-16-53-49-678.png!
>  
> Can you give me an example of Spark SQL writing an ORC file via an output format, please?
>  
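
For reference, a minimal PySpark sketch (my own, not an official answer to the 
question above) of writing a DataFrame out in ORC format; the output paths are 
hypothetical.
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('orc-write').getOrCreate()

df = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])

# Either of these writes ORC files.
df.write.format('orc').mode('overwrite').save('/tmp/orc-demo')
df.write.orc('/tmp/orc-demo-2', mode='overwrite')
{code}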



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31700) spark sql write orc file outformat

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31700.
---
Resolution: Invalid

Hi, please use the dev mailing list.

> spark sql write orc file  outformat
> ---
>
> Key: SPARK-31700
> URL: https://issues.apache.org/jira/browse/SPARK-31700
> Project: Spark
>  Issue Type: Task
>  Components: Input/Output
>Affects Versions: 2.3.3
>Reporter: Dexter Morgan
>Priority: Major
>
> !image-2020-05-13-16-53-49-678.png!
>  
> Can you give me an example of Spark SQL writing an ORC file via an output format, please?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11

2020-05-13 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106642#comment-17106642
 ] 

Dongjoon Hyun commented on SPARK-31704:
---

+1 for [~bryanc]'s advice.

You may see Apache Spark 3.0.0 RC1 document.
- https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/_site/index.html

> PandasUDFType.GROUPED_AGG with Java 11
> --
>
> Key: SPARK-31704
> URL: https://issues.apache.org/jira/browse/SPARK-31704
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: java jdk: 11
> python: 3.7
>  
>Reporter: Markus Tretzmüller
>Priority: Minor
>  Labels: newbie
>
> Running the example from the 
> [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions]
>  gives an error with java 11. It works with java 8.
> {code:python}
> import findspark
> findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7')
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql import Window
> from pyspark.sql import SparkSession
> if __name__ == '__main__':
> spark = SparkSession \
> .builder \
> .appName('test') \
> .getOrCreate()
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
> def mean_udf(v):
> return v.mean()
> w = (Window.partitionBy('id')
>  .orderBy('v')
>  .rowsBetween(-1, 0))
> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
> {code}
> {noformat}
> File 
> "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 
> in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 
> (TID 37, 131.130.32.15, executor driver): 
> java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
> java.nio.DirectByteBuffer.<init>(long, int) not available
>   at 
> io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
>   at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
>   at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
>   at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
>   at 
> org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31704) PandasUDFType.GROUPED_AGG with Java 11

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31704.
---
Resolution: Duplicate

> PandasUDFType.GROUPED_AGG with Java 11
> --
>
> Key: SPARK-31704
> URL: https://issues.apache.org/jira/browse/SPARK-31704
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: java jdk: 11
> python: 3.7
>  
>Reporter: Markus Tretzmüller
>Priority: Minor
>  Labels: newbie
>
> Running the example from the 
> [docs|https://spark.apache.org/docs/3.0.0-preview2/api/python/pyspark.sql.html#module-pyspark.sql.functions]
>  gives an error with java 11. It works with java 8.
> {code:python}
> import findspark
> findspark.init('/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7')
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql import Window
> from pyspark.sql import SparkSession
> if __name__ == '__main__':
> spark = SparkSession \
> .builder \
> .appName('test') \
> .getOrCreate()
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
> def mean_udf(v):
> return v.mean()
> w = (Window.partitionBy('id')
>  .orderBy('v')
>  .rowsBetween(-1, 0))
> df.withColumn('mean_v', mean_udf(df['v']).over(w)).show()
> {code}
> {noformat}
> File 
> "/usr/local/lib/spark-3.0.0-preview2-bin-hadoop2.7/python/lib/py4j-0.10.8.1-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o81.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 44 
> in stage 7.0 failed 1 times, most recent failure: Lost task 44.0 in stage 7.0 
> (TID 37, 131.130.32.15, executor driver): 
> java.lang.UnsupportedOperationException: sun.misc.Unsafe or 
> java.nio.DirectByteBuffer.<init>(long, int) not available
>   at 
> io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)
>   at io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)
>   at io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)
>   at io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)
>   at 
> org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.serialize(MessageSerializer.java:240)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeRecordBatch(ArrowWriter.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowWriter.writeBatch(ArrowWriter.java:120)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.$anonfun$writeIteratorToStream$1(ArrowPythonRunner.scala:94)
>   at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.writeIteratorToStream(ArrowPythonRunner.scala:101)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:373)
>   at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1932)
>   at 
> org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:213)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31696) Support spark.kubernetes.driver.service.annotation

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31696:
-

Assignee: Dongjoon Hyun

> Support spark.kubernetes.driver.service.annotation
> --
>
> Key: SPARK-31696
> URL: https://issues.apache.org/jira/browse/SPARK-31696
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31696) Support spark.kubernetes.driver.service.annotation

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31696.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28518
[https://github.com/apache/spark/pull/28518]

> Support spark.kubernetes.driver.service.annotation
> --
>
> Key: SPARK-31696
> URL: https://issues.apache.org/jira/browse/SPARK-31696
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector

2020-05-13 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31708:
--

 Summary: Add docs and examples for ANOVASelector and FValueSelector
 Key: SPARK-31708
 URL: https://issues.apache.org/jira/browse/SPARK-31708
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, ML
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Add docs and examples for ANOVASelector and FValueSelector
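
A rough sketch of the kind of example the docs could include (my own, assuming the 
selectors follow the same parameter names as ChiSqSelector, e.g. numTopFeatures, 
featuresCol, labelCol, outputCol):
{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import ANOVASelector
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName('anova-selector-demo').getOrCreate()

# Continuous features with a categorical label: the ANOVA F-test use case.
df = spark.createDataFrame(
    [(Vectors.dense([1.7, 4.4, 7.6, 5.8]), 0.0),
     (Vectors.dense([1.9, 4.1, 7.1, 5.2]), 0.0),
     (Vectors.dense([2.1, 4.5, 7.9, 5.5]), 0.0),
     (Vectors.dense([8.6, 5.9, 2.2, 7.7]), 1.0),
     (Vectors.dense([8.1, 6.2, 2.5, 7.1]), 1.0),
     (Vectors.dense([8.8, 5.5, 2.9, 7.9]), 1.0)],
    ['features', 'label'])

selector = ANOVASelector(numTopFeatures=2, featuresCol='features',
                         labelCol='label', outputCol='selectedFeatures')
selector.fit(df).transform(df).show()
{code}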



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31708:


Assignee: Apache Spark

> Add docs and examples for ANOVASelector and FValueSelector
> --
>
> Key: SPARK-31708
> URL: https://issues.apache.org/jira/browse/SPARK-31708
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>
> Add docs and examples for ANOVASelector and FValueSelector



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector

2020-05-13 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31708:


Assignee: (was: Apache Spark)

> Add docs and examples for ANOVASelector and FValueSelector
> --
>
> Key: SPARK-31708
> URL: https://issues.apache.org/jira/browse/SPARK-31708
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add docs and examples for ANOVASelector and FValueSelector



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31708) Add docs and examples for ANOVASelector and FValueSelector

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106752#comment-17106752
 ] 

Apache Spark commented on SPARK-31708:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/28524

> Add docs and examples for ANOVASelector and FValueSelector
> --
>
> Key: SPARK-31708
> URL: https://issues.apache.org/jira/browse/SPARK-31708
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Priority: Major
>
> Add docs and examples for ANOVASelector and FValueSelector



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-05-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31705:

Description: 
Rewrite join condition to [conjunctive normal 
form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
conditions to filter.

PostgreSQL:
{code:sql}
CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT,   
l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),  
  
l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255),   

l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
DATE,
l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
  
CREATE TABLE orders (
o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  

EXPLAIN
SELECT Count(*)
FROM   lineitem,
   orders
WHERE  l_orderkey = o_orderkey
   AND ( ( l_suppkey > 3
   AND o_custkey > 13 )
  OR ( l_suppkey > 1
   AND o_custkey > 11 ) )
   AND l_partkey > 19;

EXPLAIN
SELECT Count(*)
FROM   lineitem
   JOIN orders
 ON l_orderkey = o_orderkey
AND ( ( l_suppkey > 3
AND o_custkey > 13 )
   OR ( l_suppkey > 1
AND o_custkey > 11 ) )
AND l_partkey > 19;
{code}



{noformat}
postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM   lineitem,
postgres-#orders
postgres-# WHERE  l_orderkey = o_orderkey
postgres-#AND ( ( l_suppkey > 3
postgres(#AND o_custkey > 13 )
postgres(#   OR ( l_suppkey > 1
postgres(#AND o_custkey > 11 ) )
postgres-#AND l_partkey > 19;
   QUERY PLAN
-
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 13) OR (o_custkey > 11))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
(l_suppkey > 1)))
(9 rows)

postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM   lineitem
postgres-#JOIN orders
postgres-#  ON l_orderkey = o_orderkey
postgres-# AND ( ( l_suppkey > 3
postgres(# AND o_custkey > 13 )
postgres(#OR ( l_suppkey > 1
postgres(# AND o_custkey > 11 ) )
postgres-# AND l_partkey > 19;
   QUERY PLAN
-
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 13) OR (o_custkey > 11))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
(l_suppkey > 1)))
(9 rows)
{noformat}




  was:
Rewrite join condition to [conjunctive normal 
form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
conditions to filter.

PostgreSQL:
{code:sql}
CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT,   
l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),  
  
l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255),   

l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate 
DATE,
l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
  
CREATE TABLE orders (
o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority va

[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-05-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31705:

Issue Type: Improvement  (was: New Feature)

> Rewrite join condition to conjunctive normal form
> -
>
> Key: SPARK-31705
> URL: https://issues.apache.org/jira/browse/SPARK-31705
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Rewrite join condition to [conjunctive normal 
> form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
> conditions to filter.
> PostgreSQL:
> {code:sql}
> CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT,
> l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
> l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255),
> l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
> l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
>   
> CREATE TABLE orders (
> o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
> o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
> o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem,
>orders
> WHERE  l_orderkey = o_orderkey
>AND ( ( l_suppkey > 3
>AND o_custkey > 13 )
>   OR ( l_suppkey > 1
>AND o_custkey > 11 ) )
>AND l_partkey > 19;
> EXPLAIN
> SELECT Count(*)
> FROM   lineitem
>JOIN orders
>  ON l_orderkey = o_orderkey
> AND ( ( l_suppkey > 3
> AND o_custkey > 13 )
>OR ( l_suppkey > 1
> AND o_custkey > 11 ) )
> AND l_partkey > 19;
> {code}
> {noformat}
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem,
> postgres-#orders
> postgres-# WHERE  l_orderkey = o_orderkey
> postgres-#AND ( ( l_suppkey > 3
> postgres(#AND o_custkey > 13 )
> postgres(#   OR ( l_suppkey > 1
> postgres(#AND o_custkey > 11 ) )
> postgres-#AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> postgres=# EXPLAIN
> postgres-# SELECT Count(*)
> postgres-# FROM   lineitem
> postgres-#JOIN orders
> postgres-#  ON l_orderkey = o_orderkey
> postgres-# AND ( ( l_suppkey > 3
> postgres(# AND o_custkey > 13 )
> postgres(#OR ( l_suppkey > 1
> postgres(# AND o_custkey > 11 ) )
> postgres-# AND l_partkey > 19;
>QUERY PLAN
> -
>  Aggregate  (cost=21.18..21.19 rows=1 width=8)
>->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
>  Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
>  Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
> OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
>  ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
>Filter: ((o_custkey > 13) OR (o_custkey > 11))
>  ->  Hash  (cost=10.53..10.53 rows=6 width=16)
>->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
>  Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
> (l_suppkey > 1)))
> (9 rows)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@s

[jira] [Commented] (SPARK-21033) fix the potential OOM in UnsafeExternalSorter

2020-05-13 Thread Yunbo Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106774#comment-17106774
 ] 

Yunbo Fan commented on SPARK-21033:
---

[~clehene] Since it's 2020, have you solved the problem?

I'm seeing the same one.

> fix the potential OOM in UnsafeExternalSorter
> -
>
> Key: SPARK-21033
> URL: https://issues.apache.org/jira/browse/SPARK-21033
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 2.3.0
>
>
> In `UnsafeInMemorySorter`, one record may take 32 bytes: 1 `long` for 
> pointer, 1 `long` for key-prefix, and another 2 `long`s as the temporary 
> buffer for radix sort.
> In `UnsafeExternalSorter`, we set
> `DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD` to `1024 * 1024 * 1024 / 2`,
> hoping to cap the pointer array at 8 GB. However, this is wrong:
> `1024 * 1024 * 1024 / 2 * 32` is actually 16 GB, and if we grow the pointer
> array before reaching this limit, we may hit the max-page-size error.
> Users may see an exception like this on large datasets:
> {code}
> Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with 
> more than 17179869176 bytes
> at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:241)
> at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:121)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:374)
> at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:396)
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:94)
> ...
> {code}
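
A quick back-of-the-envelope check of the sizes quoted above (a standalone sketch, not Spark code):

{code:scala}
// Sizes from the description: spill threshold in records and bytes per sorted
// record (8-byte pointer + 8-byte key prefix + 16-byte radix-sort buffer).
val spillThresholdRecords = 1024L * 1024 * 1024 / 2   // DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD
val bytesPerRecord = 32L
val pointerArrayBytes = spillThresholdRecords * bytesPerRecord

println(pointerArrayBytes)                // 17179869184 bytes, i.e. 16 GB rather than 8 GB
println(pointerArrayBytes > 17179869176L) // true: exceeds the max page size in the error above
{code}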



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2020-05-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-31693:
-
Priority: Critical  (was: Major)

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download the maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2020-05-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106825#comment-17106825
 ] 

Hyukjin Kwon commented on SPARK-31693:
--

Seems it's blocking many other PRs ...

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download the maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2020-05-13 Thread shane knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106830#comment-17106830
 ] 

shane knapp commented on SPARK-31693:
-

grrr.  ok, sorry.  today was my zoom meeting day.  i'll reboot the master
and all nodes tomorrow and see how that goes.

i really don't see how this is an issue on our end.




-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download the maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31632) The ApplicationInfo in KVStore may be accessed before it's prepared

2020-05-13 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31632.
--
Fix Version/s: 2.4.7
   3.0.0
 Assignee: Xingcan Cui
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/28444

> The ApplicationInfo in KVStore may be accessed before it's prepared
> ---
>
> Key: SPARK-31632
> URL: https://issues.apache.org/jira/browse/SPARK-31632
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Web UI
>Affects Versions: 3.0.0
>Reporter: Xingcan Cui
>Assignee: Xingcan Cui
>Priority: Minor
> Fix For: 3.0.0, 2.4.7
>
>
> While starting some local tests, I occasionally encountered the following 
> exceptions for Web UI.
> {noformat}
> 23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/
>  java.util.NoSuchElementException
>  at java.util.Collections$EmptyIterator.next(Collections.java:4191)
>  at 
> org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467)
>  at 
> org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39)
>  at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266)
>  at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
>  at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
>  at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>  at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
>  at 
> org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
>  at 
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
>  at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
>  at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
>  at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
>  at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
>  at 
> org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
>  at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
>  at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
>  at org.eclipse.jetty.server.Server.handle(Server.java:505)
>  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
>  at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
>  at 
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>  at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
>  at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
>  at java.lang.Thread.run(Thread.java:748){noformat}
> *Reason*
>  That is because {{AppStatusStore.applicationInfo()}} accesses an empty view 
> (iterator) returned by {{InMemoryStore}}.
> AppStatusStore
> {code:java}
> def applicationInfo(): v1.ApplicationInfo = {
> store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info
> }
> {code}
> InMemoryStore
> {code:java}
> public  KVStoreView view(Class type){
> InstanceList list = inMemoryLists.get(type);
> return list != null ? list.view() : emptyView();
>  }
> {code}
> During the initialization of {{SparkContext}}, it first starts the Web UI 
> (SparkContext: L475 _ui.foreach(_.bind())) and then sets up the 
> {{LiveListenerBus}} thread (SparkContext: L608 
> {{setupAndStartListenerBus()}}) for dispatching the 
> {{SparkListenerApplicationStart}} event (which will trigger writing the 
> requested {{ApplicationInfo}} to {{InMemoryStore}}).
> *Solution*
>  Since the {{applicationInfo()}} method is expected to always return a valid 
> {{ApplicationInfo}}, maybe we can add a while-loop-check here to guarantee 
> the availability of {{ApplicationInfo}}.
> {code:java}
> def applicationInfo(): v1.ApplicationInfo = {
>  var iterator = store.view(classOf[ApplicationInfoWrapper]).max(1).iterator()
>  while (!iterator.
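
The quoted snippet is cut off above. Purely as an illustration of the while-loop-check idea (a sketch, not the actual patch), a retry inside AppStatusStore could look roughly like this:

{code:scala}
// Illustrative sketch only -- assumes the surrounding AppStatusStore context
// (store, ApplicationInfoWrapper, v1.ApplicationInfo) shown in the code above.
def applicationInfo(): v1.ApplicationInfo = {
  var it = store.view(classOf[ApplicationInfoWrapper]).max(1).iterator()
  while (!it.hasNext) {
    // The listener bus has not written the ApplicationInfo yet; back off briefly.
    Thread.sleep(10)
    it = store.view(classOf[ApplicationInfoWrapper]).max(1).iterator()
  }
  it.next().info
}
{code}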

[jira] [Updated] (SPARK-31632) The ApplicationInfo in KVStore may be accessed before it's prepared

2020-05-13 Thread Xingcan Cui (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xingcan Cui updated SPARK-31632:

Description: 
While starting some local tests, I occasionally encountered the following 
exceptions for Web UI.
{noformat}
23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/
 java.util.NoSuchElementException
 at java.util.Collections$EmptyIterator.next(Collections.java:4191)
 at 
org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467)
 at 
org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39)
 at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266)
 at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
 at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
 at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203)
 at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201)
 at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144)
 at 
org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:753)
 at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220)
 at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
 at org.eclipse.jetty.server.Server.handle(Server.java:505)
 at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:370)
 at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:267)
 at 
org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305)
 at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
 at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:698)
 at 
org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:804)
 at java.lang.Thread.run(Thread.java:748){noformat}
*Reason*
 That is because {{AppStatusStore.applicationInfo()}} accesses an empty view 
(iterator) returned by {{InMemoryStore}}.

AppStatusStore
{code:java}
def applicationInfo(): v1.ApplicationInfo = {
store.view(classOf[ApplicationInfoWrapper]).max(1).iterator().next().info
}
{code}
InMemoryStore
{code:java}
public  KVStoreView view(Class type){
InstanceList list = inMemoryLists.get(type);
return list != null ? list.view() : emptyView();
 }
{code}
During the initialization of {{SparkContext}}, it first starts the Web UI 
(SparkContext: L475 _ui.foreach(_.bind())) and then sets up the 
{{LiveListenerBus}} thread (SparkContext: L608 {{setupAndStartListenerBus()}}) 
for dispatching the {{SparkListenerApplicationStart}} event (which will trigger 
writing the requested {{ApplicationInfo}} to {{InMemoryStore}}).

  was:
While starting some local tests, I occasionally encountered the following 
exceptions for Web UI.
{noformat}
23:00:29.845 WARN org.eclipse.jetty.server.HttpChannel: /jobs/
 java.util.NoSuchElementException
 at java.util.Collections$EmptyIterator.next(Collections.java:4191)
 at 
org.apache.spark.util.kvstore.InMemoryStore$InMemoryIterator.next(InMemoryStore.java:467)
 at 
org.apache.spark.status.AppStatusStore.applicationInfo(AppStatusStore.scala:39)
 at org.apache.spark.ui.jobs.AllJobsPage.render(AllJobsPage.scala:266)
 at org.apache.spark.ui.WebUI.$anonfun$attachPage$1(WebUI.scala:89)
 at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:80)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:873)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1623)
 at org.apache.spark.ui.HttpSecurityFilter.doFilter(HttpSecurityFilter.scala:95)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1610)
 at org.eclipse.jetty.servlet.ServletHandler.doHand

[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue

2020-05-13 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106836#comment-17106836
 ] 

Hyukjin Kwon commented on SPARK-31693:
--

Thank you so much [~shaneknapp].

> Investigate AmpLab Jenkins server network issue
> ---
>
> Key: SPARK-31693
> URL: https://issues.apache.org/jira/browse/SPARK-31693
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Critical
>
> Given the series of failures in the Spark packaging Jenkins job, it seems that 
> there is a network issue in the AmpLab Jenkins cluster.
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/
> - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay.
> - The node failed to download the maven mirror. (SPARK-31691) -> The primary 
> host is okay.
> - The node failed to communicate with repository.apache.org. (Current master 
> branch Jenkins job failure)
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) 
> on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve 
> remote metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could 
> not transfer metadata 
> org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to 
> apache.snapshots.https 
> (https://repository.apache.org/content/repositories/snapshots): Transfer 
> failed for 
> https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml:
>  Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] 
> failed: Connection timed out (Connection timed out) -> [Help 1]
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106845#comment-17106845
 ] 

Apache Spark commented on SPARK-27562:
--

User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/28525

> Complete the verification mechanism for shuffle transmitted data
> 
>
> Key: SPARK-27562
> URL: https://issues.apache.org/jira/browse/SPARK-27562
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: feiwang
>Priority: Major
>
> We've seen some shuffle data corruption during the shuffle read phase.
> As described in SPARK-26089, Spark only checked small shuffle blocks before
> PR #23453, which was proposed by ankuriitg.
> Two changes/improvements were made in PR #23453:
> 1. Large blocks are checked up to maxBytesInFlight/3 bytes in a similar way
> to smaller blocks, so if a large block is corrupt at the beginning, that
> block will be re-fetched, and if that also fails, a FetchFailureException
> will be thrown.
> 2. If a large block is corrupt beyond maxBytesInFlight/3, then any
> IOException thrown while reading the stream will be converted to a
> FetchFailureException. This is slightly more aggressive than originally
> intended, but since the consumer of the stream may have already read and
> processed some records, we can't just re-fetch the block; we need to fail
> the whole task. Additionally, we also thought about adding a new type of
> TaskEndReason, which would retry the task a couple of times before failing
> the previous stage, but given the complexity involved in that solution we
> decided not to proceed in that direction.
> However, I think there are still some problems with the current verification
> mechanism for shuffle transmitted data:
> - A large block is only checked up to maxBytesInFlight/3 bytes when fetching
> shuffle data, so if a large block is corrupt beyond maxBytesInFlight/3, the
> corruption cannot be detected in the data fetch phase. This has been
> described in the previous section.
> - Only the compressed or wrapped blocks are checked; I think we should also
> check the blocks which are not wrapped.
> We complete the verification mechanism for shuffle transmitted data:
> First, we choose CRC32 for the checksum verification of shuffle data.
> CRC is also used for checksum verification in Hadoop; it is simple and fast.
> In the shuffle write phase, after completing the partitionedFile, we compute
> the CRC32 value for each partition and then write these digests together
> with the indexes into the shuffle index file.
> For the SortShuffleWriter and the unsafe shuffle writer, there is only one
> partitionedFile per shuffleMapTask, so computing the digests (one per
> partition, based on the indexes of this partitionedFile) is cheap.
> For the bypassShuffleWriter, the number of reduce partitions is less than
> byPassMergeThreshold, so the cost of computing the digests is acceptable.
> In the shuffle read phase, the digest value is passed along with the block
> data, and we recompute the digest of the received data to compare it with
> the original digest value.
> Recomputing the digest only needs an additional 2048-byte buffer for the
> CRC32 computation.
> After recomputing, we reset the obtained data InputStream: if it supports
> mark, we only need to reset it; otherwise it is a FileSegmentManagedBuffer
> and we need to recreate it.
> So this proposed verification mechanism for shuffle transmitted data is
> cheap and complete.
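
As a rough illustration of the read-side check described above (a hypothetical helper, not the code in the linked pull request), streaming a fetched block through a CRC32 with a small reusable buffer could look like this:

{code:scala}
import java.io.InputStream
import java.util.zip.CRC32

// Recompute the CRC32 of a fetched block with a small reusable buffer and
// compare it against the digest shipped alongside the block.
def verifyCrc32(block: InputStream, expectedDigest: Long): Boolean = {
  val crc = new CRC32()
  val buffer = new Array[Byte](2048)  // the 2048-byte buffer mentioned above
  var read = block.read(buffer)
  while (read != -1) {
    crc.update(buffer, 0, read)
    read = block.read(buffer)
  }
  crc.getValue == expectedDigest
}
{code}

On the write side, the same CRC32 could be run over each partition's byte range of the partitionedFile, reusing the offsets already recorded in the shuffle index file.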



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27562) Complete the verification mechanism for shuffle transmitted data

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106846#comment-17106846
 ] 

Apache Spark commented on SPARK-27562:
--

User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/28525

> Complete the verification mechanism for shuffle transmitted data
> 
>
> Key: SPARK-27562
> URL: https://issues.apache.org/jira/browse/SPARK-27562
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: feiwang
>Priority: Major
>
> We've seen some shuffle data corruption during the shuffle read phase.
> As described in SPARK-26089, Spark only checked small shuffle blocks before
> PR #23453, which was proposed by ankuriitg.
> Two changes/improvements were made in PR #23453:
> 1. Large blocks are checked up to maxBytesInFlight/3 bytes in a similar way
> to smaller blocks, so if a large block is corrupt at the beginning, that
> block will be re-fetched, and if that also fails, a FetchFailureException
> will be thrown.
> 2. If a large block is corrupt beyond maxBytesInFlight/3, then any
> IOException thrown while reading the stream will be converted to a
> FetchFailureException. This is slightly more aggressive than originally
> intended, but since the consumer of the stream may have already read and
> processed some records, we can't just re-fetch the block; we need to fail
> the whole task. Additionally, we also thought about adding a new type of
> TaskEndReason, which would retry the task a couple of times before failing
> the previous stage, but given the complexity involved in that solution we
> decided not to proceed in that direction.
> However, I think there are still some problems with the current verification
> mechanism for shuffle transmitted data:
> - A large block is only checked up to maxBytesInFlight/3 bytes when fetching
> shuffle data, so if a large block is corrupt beyond maxBytesInFlight/3, the
> corruption cannot be detected in the data fetch phase. This has been
> described in the previous section.
> - Only the compressed or wrapped blocks are checked; I think we should also
> check the blocks which are not wrapped.
> We complete the verification mechanism for shuffle transmitted data:
> First, we choose CRC32 for the checksum verification of shuffle data.
> CRC is also used for checksum verification in Hadoop; it is simple and fast.
> In the shuffle write phase, after completing the partitionedFile, we compute
> the CRC32 value for each partition and then write these digests together
> with the indexes into the shuffle index file.
> For the SortShuffleWriter and the unsafe shuffle writer, there is only one
> partitionedFile per shuffleMapTask, so computing the digests (one per
> partition, based on the indexes of this partitionedFile) is cheap.
> For the bypassShuffleWriter, the number of reduce partitions is less than
> byPassMergeThreshold, so the cost of computing the digests is acceptable.
> In the shuffle read phase, the digest value is passed along with the block
> data, and we recompute the digest of the received data to compare it with
> the original digest value.
> Recomputing the digest only needs an additional 2048-byte buffer for the
> CRC32 computation.
> After recomputing, we reset the obtained data InputStream: if it supports
> mark, we only need to reset it; otherwise it is a FileSegmentManagedBuffer
> and we need to recreate it.
> So this proposed verification mechanism for shuffle transmitted data is
> cheap and complete.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-05-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31705:

Description: 
Rewrite join condition to [conjunctive normal 
form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
conditions to filter.

PostgreSQL:
{code:sql}
CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT,
l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255),
l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
  
CREATE TABLE orders (
o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  

EXPLAIN
SELECT Count(*)
FROM   lineitem,
   orders
WHERE  l_orderkey = o_orderkey
   AND ( ( l_suppkey > 3
   AND o_custkey > 13 )
  OR ( l_suppkey > 1
   AND o_custkey > 11 ) )
   AND l_partkey > 19;

EXPLAIN
SELECT Count(*)
FROM   lineitem
   JOIN orders
 ON l_orderkey = o_orderkey
AND ( ( l_suppkey > 3
AND o_custkey > 13 )
   OR ( l_suppkey > 1
AND o_custkey > 11 ) )
AND l_partkey > 19;
{code}



{noformat}
postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM   lineitem,
postgres-#orders
postgres-# WHERE  l_orderkey = o_orderkey
postgres-#AND ( ( l_suppkey > 3
postgres(#AND o_custkey > 13 )
postgres(#   OR ( l_suppkey > 1
postgres(#AND o_custkey > 11 ) )
postgres-#AND l_partkey > 19;
   QUERY PLAN
-
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 13) OR (o_custkey > 11))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
(l_suppkey > 1)))
(9 rows)

postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM   lineitem
postgres-#JOIN orders
postgres-#  ON l_orderkey = o_orderkey
postgres-# AND ( ( l_suppkey > 3
postgres(# AND o_custkey > 13 )
postgres(#OR ( l_suppkey > 1
postgres(# AND o_custkey > 11 ) )
postgres-# AND l_partkey > 19;
   QUERY PLAN
-
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 13) OR (o_custkey > 11))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
(l_suppkey > 1)))
(9 rows)
{noformat}




https://docs.teradata.com/reader/i_VlYHwN0b8knh6AEWrv1Q/Bh~37Qcc2~24P_jn2~0w6w

  was:
Rewrite join condition to [conjunctive normal 
form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
conditions to filter.

PostgreSQL:
{code:sql}
CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT,
l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255),
l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
  
CREATE TABLE orders (
o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(2

[jira] [Updated] (SPARK-31705) Rewrite join condition to conjunctive normal form

2020-05-13 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-31705:

Description: 
Rewrite join condition to [conjunctive normal 
form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more 
conditions to filter.

PostgreSQL:
{code:sql}
CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT,
l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0),
l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255),
l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate DATE,
l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255));
  
CREATE TABLE orders (
o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255),   
o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255),
o_clerk varchar(255), o_shippriority INT, o_comment varchar(255));  

EXPLAIN
SELECT Count(*)
FROM   lineitem,
   orders
WHERE  l_orderkey = o_orderkey
   AND ( ( l_suppkey > 3
   AND o_custkey > 13 )
  OR ( l_suppkey > 1
   AND o_custkey > 11 ) )
   AND l_partkey > 19;

EXPLAIN
SELECT Count(*)
FROM   lineitem
   JOIN orders
 ON l_orderkey = o_orderkey
AND ( ( l_suppkey > 3
AND o_custkey > 13 )
   OR ( l_suppkey > 1
AND o_custkey > 11 ) )
AND l_partkey > 19;

EXPLAIN
SELECT Count(*) 
FROM   lineitem, 
   orders 
WHERE  l_orderkey = o_orderkey 
   AND NOT ( ( l_suppkey > 3 
   AND ( l_suppkey > 2 
  OR o_custkey > 13 ) ) 
  OR ( l_suppkey > 1 
   AND o_custkey > 11 ) ) 
   AND l_partkey > 19;
{code}
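
The third query above adds a negated disjunction, which needs one extra step before the CNF conversion: pushing NOT inward with De Morgan's laws (negation normal form). A toy sketch of that step, independent of Spark's actual implementation:

{code:scala}
// Toy tree with negation -- illustrative only.
sealed trait BExpr
case class Leaf(sql: String, negated: Boolean = false) extends BExpr
case class BAnd(left: BExpr, right: BExpr) extends BExpr
case class BOr(left: BExpr, right: BExpr) extends BExpr
case class BNot(child: BExpr) extends BExpr

// Push NOT down to the leaves:
//   NOT (a AND b) = (NOT a) OR  (NOT b)
//   NOT (a OR b)  = (NOT a) AND (NOT b)
def toNnf(e: BExpr): BExpr = e match {
  case BNot(BAnd(l, r)) => BOr(toNnf(BNot(l)), toNnf(BNot(r)))
  case BNot(BOr(l, r))  => BAnd(toNnf(BNot(l)), toNnf(BNot(r)))
  case BNot(BNot(x))    => toNnf(x)
  case BNot(Leaf(s, n)) => Leaf(s, !n)
  case BAnd(l, r)       => BAnd(toNnf(l), toNnf(r))
  case BOr(l, r)        => BOr(toNnf(l), toNnf(r))
  case leaf             => leaf
}
{code}

After this step, the OR-over-AND distribution from the earlier sketch applies unchanged.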



{noformat}
postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM   lineitem,
postgres-#orders
postgres-# WHERE  l_orderkey = o_orderkey
postgres-#AND ( ( l_suppkey > 3
postgres(#AND o_custkey > 13 )
postgres(#   OR ( l_suppkey > 1
postgres(#AND o_custkey > 11 ) )
postgres-#AND l_partkey > 19;
   QUERY PLAN
-
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 13) OR (o_custkey > 11))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
(l_suppkey > 1)))
(9 rows)

postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM   lineitem
postgres-#JOIN orders
postgres-#  ON l_orderkey = o_orderkey
postgres-# AND ( ( l_suppkey > 3
postgres(# AND o_custkey > 13 )
postgres(#OR ( l_suppkey > 1
postgres(# AND o_custkey > 11 ) )
postgres-# AND l_partkey > 19;
   QUERY PLAN
-
 Aggregate  (cost=21.18..21.19 rows=1 width=8)
   ->  Hash Join  (cost=10.60..21.17 rows=2 width=0)
 Hash Cond: (orders.o_orderkey = lineitem.l_orderkey)
 Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) 
OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11)))
 ->  Seq Scan on orders  (cost=0.00..10.45 rows=17 width=16)
   Filter: ((o_custkey > 13) OR (o_custkey > 11))
 ->  Hash  (cost=10.53..10.53 rows=6 width=16)
   ->  Seq Scan on lineitem  (cost=0.00..10.53 rows=6 width=16)
 Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR 
(l_suppkey > 1)))
(9 rows)

postgres=# EXPLAIN
postgres-# SELECT Count(*)
postgres-# FROM   lineitem,
postgres-#orders
postgres-# WHERE  l_orderkey = o_orderkey
postgres-#AND NOT ( ( l_suppkey > 3
postgres(#AND ( l_suppkey > 2
postgres(#   OR o_custkey > 13 ) )
postgres(#   OR ( l_suppkey > 1
postgres(#AND o_custkey > 11 ) )
postgres-#AND l_partkey > 19;

[jira] [Resolved] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31692.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28516

> Hadoop confs passed via spark config are not set in URLStream Handler Factory
> -
>
> Key: SPARK-31692
> URL: https://issues.apache.org/jira/browse/SPARK-31692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Priority: Major
> Fix For: 3.0.0
>
>
> Hadoop confs passed via spark config (as "spark.hadoop.*") are not set in 
> URLStreamHandlerFactory
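
For illustration only, a sketch of the expected behaviour (not the actual fix in the pull request above): every spark.hadoop.* entry should end up in the Hadoop Configuration that backs the JVM-wide URL stream handler factory.

{code:scala}
import java.net.URL

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory
import org.apache.spark.SparkConf

// Copy "spark.hadoop.*" entries into a Hadoop Configuration and register it
// with the JVM-wide URL stream handler factory (which can be set only once
// per JVM).
def installFsUrlHandler(sparkConf: SparkConf): Unit = {
  val hadoopConf = new Configuration()
  sparkConf.getAll.foreach { case (key, value) =>
    if (key.startsWith("spark.hadoop.")) {
      hadoopConf.set(key.stripPrefix("spark.hadoop."), value)
    }
  }
  URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(hadoopConf))
}
{code}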



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory

2020-05-13 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31692:
-

Assignee: Karuppayya

> Hadoop confs passed via spark config are not set in URLStream Handler Factory
> -
>
> Key: SPARK-31692
> URL: https://issues.apache.org/jira/browse/SPARK-31692
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Assignee: Karuppayya
>Priority: Major
> Fix For: 3.0.0
>
>
> Hadoop confs passed via spark config (as "spark.hadoop.*") are not set in 
> URLStreamHandlerFactory



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31405) fail by default when read/write datetime values and not sure if they need rebase or not

2020-05-13 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106958#comment-17106958
 ] 

Apache Spark commented on SPARK-31405:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/28526

> fail by default when read/write datetime values and not sure if they need 
> rebase or not
> ---
>
> Key: SPARK-31405
> URL: https://issues.apache.org/jira/browse/SPARK-31405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org