[jira] [Created] (SPARK-31079) Add RuleExecutor metrics in Explain Formatted

2020-03-06 Thread Xin Wu (Jira)
Xin Wu created SPARK-31079:
--

 Summary: Add RuleExecutor metrics in Explain Formatted
 Key: SPARK-31079
 URL: https://issues.apache.org/jira/browse/SPARK-31079
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Xin Wu


RuleExecutor already supports metering for the analyzer/optimizer rules. Exposing 
this information in the EXPLAIN FORMATTED command would give users a better 
experience when debugging a specific query.
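
For context, the existing metering can already be inspected from spark-shell (a 
sketch that assumes the current RuleExecutor companion object; how the numbers 
would surface in EXPLAIN FORMATTED is exactly what this ticket proposes, so the 
output shape below is not final):
{code:java}
import org.apache.spark.sql.catalyst.rules.RuleExecutor

spark.range(10).filter("id > 5").collect()

// Prints per-rule totals (number of runs, effective runs, time spent) that the
// analyzer/optimizer RuleExecutors have accumulated in this session.
println(RuleExecutor.dumpTimeSpent())
{code}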



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite

2020-03-06 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053762#comment-17053762
 ] 

Gabor Somogyi commented on SPARK-30541:
---

Going to check them next week...

> Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-30541
> URL: https://issues.apache.org/jira/browse/SPARK-30541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
> Attachments: consoleText_NOK.txt, consoleText_OK.txt, 
> unit-tests_NOK.log, unit-tests_OK.log
>
>
> The test suite has been failing intermittently as of now:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/]
>  
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it 
> is a sbt.testing.SuiteSelector)
>   
> {noformat}
> Error Details
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
> Stack Trace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: 
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed for /brokers/ids
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at 
> kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554)
>   at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
>   at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455)
>   at 
> kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293)
>   at 
> org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395)
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409)
>   ... 20 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite

2020-03-06 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-30541:
--
Attachment: unit-tests_OK.log
unit-tests_NOK.log
consoleText_OK.txt
consoleText_NOK.txt

> Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-30541
> URL: https://issues.apache.org/jira/browse/SPARK-30541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
> Attachments: consoleText_NOK.txt, consoleText_OK.txt, 
> unit-tests_NOK.log, unit-tests_OK.log
>
>
> The test suite has been failing intermittently as of now:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/]
>  
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it 
> is a sbt.testing.SuiteSelector)
>   
> {noformat}
> Error Details
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
> Stack Trace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: 
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed for /brokers/ids
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at 
> kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554)
>   at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
>   at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455)
>   at 
> kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293)
>   at 
> org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395)
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409)
>   ... 20 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite

2020-03-06 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053760#comment-17053760
 ] 

Gabor Somogyi commented on SPARK-30541:
---

As I see it, there are 2 problems:
 * Kafka broker is not coming up in time: I've discussed this with the Kafka 
guys, and their solution is to execute their test with a retry.
 * Client not found in Kerberos database: which is reproduced here: 
[https://github.com/apache/spark/pull/27810]

I'm going to attach the logs for the second problem here, because the Jenkins 
results are going to disappear.

 

> Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-30541
> URL: https://issues.apache.org/jira/browse/SPARK-30541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> The test suite has been failing intermittently as of now:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/]
>  
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it 
> is a sbt.testing.SuiteSelector)
>   
> {noformat}
> Error Details
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
> Stack Trace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: 
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed for /brokers/ids
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at 
> kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554)
>   at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
>   at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455)
>   at 
> kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293)
>   at 
> org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395)
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409)
>   ... 20 more
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31078) outputOrdering should handle aliases correctly

2020-03-06 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-31078:
--
Description: 
Currently, `outputOrdering` doesn't respect aliases. Thus, the following query:

{code:java}
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  val df = (0 until 20).toDF("i").as("df")
  df.repartition(8, df("i")).write.format("parquet")
.bucketBy(8, "i").sortBy("i").saveAsTable("t")
  val t1 = spark.table("t")
  val t2 = t1.selectExpr("i as ii")
  t1.join(t2, t1("i") === t2("ii")).explain
}
{code}

would produce an unnecessary sort node:
{code:java}
== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
: +- *(1) ColumnarToRow
:+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
[isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
+- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0
   +- *(2) Project [i#8 AS ii#10]
  +- *(2) Filter isnotnull(i#8)
 +- *(2) ColumnarToRow
+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
[isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
{code}


  was:
Currently, `outputOrdering` doesn't respect aliases. Thus, the following would 
produce an unnecessary sort node:

{code:java}
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  val df = (0 until 20).toDF("i").as("df")
  df.repartition(8, df("i")).write.format("parquet")
.bucketBy(8, "i").sortBy("i").saveAsTable("t")
  val t1 = spark.table("t")
  val t2 = t1.selectExpr("i as ii")
  t1.join(t2, t1("i") === t2("ii")).explain
}

== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
: +- *(1) ColumnarToRow
:+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
[isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
+- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0
   +- *(2) Project [i#8 AS ii#10]
  +- *(2) Filter isnotnull(i#8)
 +- *(2) ColumnarToRow
+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
[isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8

{code}
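
One way to observe the ordering each physical node reports for the query above (a 
sketch reusing t1/t2 from the snippet; `outputOrdering` here is the SparkPlan 
attribute that this issue argues should be alias-aware):
{code:java}
// Inspect the ordering reported by every node of the executed plan.
val plan = t1.join(t2, t1("i") === t2("ii")).queryExecution.executedPlan
plan.foreach(p => println(s"${p.nodeName}: ${p.outputOrdering}"))
{code}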



> outputOrdering should handle aliases correctly
> --
>
> Key: SPARK-31078
> URL: https://issues.apache.org/jira/browse/SPARK-31078
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> Currently, `outputOrdering` doesn't respect aliases. Thus, the following 
> query:
> {code:java}
> withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
>   val df = (0 until 20).toDF("i").as("df")
>   df.repartition(8, df("i")).write.format("parquet")
> .bucketBy(8, "i").sortBy("i").saveAsTable("t")
>   val t1 = spark.table("t")
>   val t2 = t1.selectExpr("i as ii")
>   t1.join(t2, t1("i") === t2("ii")).explain
> }
> {code}
> would produce an unnecessary sort node:
> {code:java}
> == Physical Plan ==
> *(3) SortMergeJoin [i#8], [ii#10], Inner
> :- *(1) Project [i#8]
> :  +- *(1) Filter isnotnull(i#8)
> : +- *(1) ColumnarToRow
> :+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
> [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: 
> struct, SelectedBucketsCount: 8 out of 8
> +- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0
>+- *(2) Project [i#8 AS ii#10]
>   +- *(2) Filter isnotnull(i#8)
>  +- *(2) ColumnarToRow
> +- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
> [isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
> PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: 
> struct, SelectedBucketsCount: 8 out of 8
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31078) outputOrdering should handle aliases correctly

2020-03-06 Thread Terry Kim (Jira)
Terry Kim created SPARK-31078:
-

 Summary: outputOrdering should handle aliases correctly
 Key: SPARK-31078
 URL: https://issues.apache.org/jira/browse/SPARK-31078
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Terry Kim


Currently, `outputOrdering` doesn't respect aliases. Thus, the following would 
produce an unnecessary sort node:

{code:java}
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "0") {
  val df = (0 until 20).toDF("i").as("df")
  df.repartition(8, df("i")).write.format("parquet")
.bucketBy(8, "i").sortBy("i").saveAsTable("t")
  val t1 = spark.table("t")
  val t2 = t1.selectExpr("i as ii")
  t1.join(t2, t1("i") === t2("ii")).explain
}

== Physical Plan ==
*(3) SortMergeJoin [i#8], [ii#10], Inner
:- *(1) Project [i#8]
:  +- *(1) Filter isnotnull(i#8)
: +- *(1) ColumnarToRow
:+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
[isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8
+- *(2) Sort [ii#10 ASC NULLS FIRST], false, 0
   +- *(2) Project [i#8 AS ii#10]
  +- *(2) Filter isnotnull(i#8)
 +- *(2) ColumnarToRow
+- FileScan parquet default.t[i#8] Batched: true, DataFilters: 
[isnotnull(i#8)], Format: Parquet, Location: InMemoryFileIndex[file:/..., 
PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct, 
SelectedBucketsCount: 8 out of 8

{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31064) New Parquet Predicate Filter APIs with multi-part Identifier Support

2020-03-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31064:
---

Assignee: DB Tsai

> New Parquet Predicate Filter APIs with multi-part Identifier Support
> 
>
> Key: SPARK-31064
> URL: https://issues.apache.org/jira/browse/SPARK-31064
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses *dots* as 
> separators to split a column name into the parts of a nested field. The 
> drawback is that this causes issues when a field name itself contains a *dot*.
> The new APIs that will be added take an array of strings directly for the 
> parts of a nested field, so there is no ambiguity from using *dot* as a separator.
> The intent is to move this code back to the Parquet community. See [PARQUET-1809]
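
For illustration, a sketch of the dot ambiguity with the existing FilterApi (the 
proposed multi-part API itself is not shown, since its exact shape is defined by 
the ticket's PR):
{code:java}
import org.apache.parquet.filter2.predicate.FilterApi

// With the dot-path API, a nested field "b" inside a group "a" and a top-level
// column literally named "a.b" produce exactly the same predicate, so the two
// cases cannot be told apart.
val nestedField  = FilterApi.eq(FilterApi.intColumn("a.b"), Integer.valueOf(1))
val dottedColumn = FilterApi.eq(FilterApi.intColumn("a.b"), Integer.valueOf(1))
{code}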



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31064) New Parquet Predicate Filter APIs with multi-part Identifier Support

2020-03-06 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31064.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27824
[https://github.com/apache/spark/pull/27824]

> New Parquet Predicate Filter APIs with multi-part Identifier Support
> 
>
> Key: SPARK-31064
> URL: https://issues.apache.org/jira/browse/SPARK-31064
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>
> Parquet's *org.apache.parquet.filter2.predicate.FilterApi* uses *dots* as 
> separators to split a column name into the parts of a nested field. The 
> drawback is that this causes issues when a field name itself contains a *dot*.
> The new APIs that will be added take an array of strings directly for the 
> parts of a nested field, so there is no ambiguity from using *dot* as a separator.
> The intent is to move this code back to the Parquet community. See [PARQUET-1809]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31077) Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel

2020-03-06 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31077:
--

 Summary: Remove ChiSqSelector dependency on 
mllib.ChiSqSelectorModel
 Key: SPARK-31077
 URL: https://issues.apache.org/jira/browse/SPARK-31077
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.1.0
Reporter: Huaxin Gao


Currently, ChiSqSelector depends on mllib.ChiSqSelectorModel. Remove this 
dependency. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time

2020-03-06 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31076:
--

 Summary: Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp 
via local date-time
 Key: SPARK-31076
 URL: https://issues.apache.org/jira/browse/SPARK-31076
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


By default, collect() returns java.sql.Timestamp/Date instances with offsets 
derived from the internal values of Catalyst's TIMESTAMP/DATE, which store 
microseconds (or days) since the epoch. The conversion from internal values to 
java.sql.Timestamp/Date is based on the Proleptic Gregorian calendar, but 
converting the resulting values for years before 1582 to strings produces 
timestamp/date strings in the Julian calendar. For example:
{code}
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
{code} 

This can be fixed by converting Catalyst's internal values to a local date-time 
in the Proleptic Gregorian calendar, and then constructing java.sql.Timestamp/Date 
from the resulting year, month, ..., second fields, which those classes interpret 
in the hybrid Julian calendar.
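
A rough sketch of that direction for DATE values (illustrative only; the real 
change would presumably live next to Catalyst's existing date-time conversion 
utilities, and TIMESTAMP needs the analogous treatment for the time fields):
{code:java}
import java.time.LocalDate

// daysSinceEpoch is the internal Catalyst DATE value (Proleptic Gregorian).
def toJavaDate(daysSinceEpoch: Int): java.sql.Date = {
  val localDate = LocalDate.ofEpochDay(daysSinceEpoch)
  // java.sql.Date.valueOf rebuilds the date from the local year/month/day
  // fields, which java.sql.Date interprets in the hybrid Julian/Gregorian
  // calendar, so a pre-1582 date keeps the same printed fields.
  java.sql.Date.valueOf(localDate)
}

// 1100-10-10 in the Proleptic Gregorian calendar stays 1100-10-10 when printed:
println(toJavaDate(LocalDate.of(1100, 10, 10).toEpochDay.toInt))
{code}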



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-31072:
-
Summary: Default to ParquetOutputCommitter even after configuring s3a 
committer as "partitioned"  (was: Default to ParquetOutputCommitter even after 
configuring committer as "partitioned")

> Default to ParquetOutputCommitter even after configuring s3a committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started looking into this because, when I was trying to save data to S3 
> directly with partitionBy() on two columns, I was getting file-not-found 
> exceptions intermittently.
> So how can I avoid this issue with *Parquet, writing from Spark to S3 via s3a, 
> without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31072) Default to ParquetOutputCommitter even after configuring committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Kizhakkel Jose updated SPARK-31072:
-
Summary: Default to ParquetOutputCommitter even after configuring committer 
as "partitioned"  (was: Default to ParquetOutputCommitter even after 
configuring setting committer as "partitioned")

> Default to ParquetOutputCommitter even after configuring committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started looking into this because, when I was trying to save data to S3 
> directly with partitionBy() on two columns, I was getting file-not-found 
> exceptions intermittently.
> So how can I avoid this issue with *Parquet, writing from Spark to S3 via s3a, 
> without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-03-06 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053717#comment-17053717
 ] 

Bryan Cutler commented on SPARK-30961:
--

Just to be clear, this is only an issue with Spark 2.4.x. The issue does not 
affect Spark 3.0.0 and above.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas(){code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30961) Arrow enabled: to_pandas with date column fails

2020-03-06 Thread Bryan Cutler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-30961.
--
Resolution: Won't Fix

Thanks [~KevinAppel] and [~nicornk] for the info, I'll go ahead and close this 
then.

> Arrow enabled: to_pandas with date column fails
> ---
>
> Key: SPARK-30961
> URL: https://issues.apache.org/jira/browse/SPARK-30961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.5
> Environment: Apache Spark 2.4.5
>Reporter: Nicolas Renkamp
>Priority: Major
>  Labels: ready-to-commit
>
> Hi,
> there seems to be a bug in the arrow enabled to_pandas conversion from spark 
> dataframe to pandas dataframe when the dataframe has a column of type 
> DateType. Here is a minimal example to reproduce the issue:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> spark_df = spark.createDataFrame(
>     [['2019-12-06']], 'created_at: string') \
>     .withColumn('created_at', F.to_date('created_at'))
> # works
> spark_df.toPandas()
> spark.conf.set("spark.sql.execution.arrow.enabled", 'true')
> is_arrow_enabled = spark.conf.get("spark.sql.execution.arrow.enabled")
> print("Arrow optimization is enabled: " + is_arrow_enabled)
> # raises AttributeError: Can only use .dt accessor with datetimelike values
> # series is still of type object, .dt does not exist
> spark_df.toPandas(){code}
> A fix would be to modify the _check_series_convert_date function in 
> pyspark.sql.types to:
> {code:java}
> def _check_series_convert_date(series, data_type):
>     """
>     Cast the series to datetime.date if it's a date type, otherwise returns
>     the original series.
>
>     :param series: pandas.Series
>     :param data_type: a Spark data type for the series
>     """
>     from pyspark.sql.utils import require_minimum_pandas_version
>     require_minimum_pandas_version()
>
>     from pandas import to_datetime
>     if type(data_type) == DateType:
>         return to_datetime(series).dt.date
>     else:
>         return series
> {code}
> Let me know if I should prepare a Pull Request for the 2.4.5 branch.
> I have not tested the behavior on master branch.
>  
> Thanks,
> Nicolas



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29367) pandas udf not working with latest pyarrow release (0.15.0)

2020-03-06 Thread Alexander Tronchin-James (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053715#comment-17053715
 ] 

Alexander Tronchin-James commented on SPARK-29367:
--

The fix suggested above to add ARROW_PRE_0_15_IPC_FORMAT=1 in 
SPARK_HOME/conf/spark-env.sh did NOT resolve the issue for me.

Instead, I needed to set this environment variable in my python environment as 
described here: 
https://stackoverflow.com/questions/58269115/how-to-enable-apache-arrow-in-pyspark/58273294#58273294

> pandas udf not working with latest pyarrow release (0.15.0)
> ---
>
> Key: SPARK-29367
> URL: https://issues.apache.org/jira/browse/SPARK-29367
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark
>Affects Versions: 2.4.0, 2.4.1, 2.4.3
>Reporter: Julien Peloton
>Assignee: Bryan Cutler
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Hi,
> I recently upgraded pyarrow from 0.14 to 0.15 (released on Oct 5th), and my 
> pyspark jobs using pandas udf are failing with 
> java.lang.IllegalArgumentException (tested with Spark 2.4.0, 2.4.1, and 
> 2.4.3). Here is a full example to reproduce the failure with pyarrow 0.15:
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType
> from pyspark.sql.types import BooleanType
> import pandas as pd
> @pandas_udf(BooleanType(), PandasUDFType.SCALAR)
> def qualitycuts(nbad: int, rb: float, magdiff: float) -> pd.Series:
>     """ Apply simple quality cuts
>
>     Returns
>     --
>     out: pandas.Series of booleans
>         Return a Pandas DataFrame with the appropriate flag: false for bad alert,
>         and true for good alert.
>     """
>     mask = nbad.values == 0
>     mask *= rb.values >= 0.55
>     mask *= abs(magdiff.values) <= 0.1
>     return pd.Series(mask)
>
> spark = SparkSession.builder.getOrCreate()
>
> # Create dummy DF
> colnames = ["nbad", "rb", "magdiff"]
> df = spark.sparkContext.parallelize(
>     zip(
>         [0, 1, 0, 0],
>         [0.01, 0.02, 0.6, 0.01],
>         [0.02, 0.05, 0.1, 0.01]
>     )
> ).toDF(colnames)
> df.show()
>
> # Apply cuts
> df = df\
>     .withColumn("toKeep", qualitycuts(*colnames))\
>     .filter("toKeep == true")\
>     .drop("toKeep")
>
> # This will fail if latest pyarrow 0.15.0 is used
> df.show()
> {code}
> and the log is:
> {code}
> Driver stacktrace:
> 19/10/07 09:37:49 INFO DAGScheduler: Job 3 failed: showString at 
> NativeMethodAccessorImpl.java:0, took 0.660523 s
> Traceback (most recent call last):
>   File 
> "/Users/julien/Documents/workspace/myrepos/fink-broker/test_pyarrow.py", line 
> 44, in 
> df.show()
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py",
>  line 378, in show
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
>   File 
> "/Users/julien/Documents/workspace/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o64.showString.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 
> (TID 5, localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at 
> org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.(ArrowEvalPythonExec.scala:98)
>   at 
> 

[jira] [Created] (SPARK-31075) Add documentation for ALTER TABLE ... ADD PARTITION

2020-03-06 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31075:


 Summary: Add documentation for ALTER TABLE ... ADD PARTITION
 Key: SPARK-31075
 URL: https://issues.apache.org/jira/browse/SPARK-31075
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


Our docs for {{ALTER TABLE}} 
[currently|https://github.com/apache/spark/blob/cba17e07e9f15673f274de1728f6137d600026e1/docs/sql-ref-syntax-ddl-alter-table.md]
 make no mention of {{ADD PARTITION}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24640) size(null) returns null

2020-03-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24640:
--
Labels:   (was: api bulk-closed)

> size(null) returns null 
> 
>
> Key: SPARK-24640
> URL: https://issues.apache.org/jira/browse/SPARK-24640
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Xiao Li
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Size(null) should return null instead of -1 in 3.0 release. This is a 
> behavior change. 
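
For illustration, a sketch of the before/after behavior (the legacy flag name 
below is the one used around the 2.4/3.0 line and is an assumption here, not 
something tracked by this ticket):
{code:java}
// Spark 2.4 default: size(NULL) returns -1.  Spark 3.0 default: returns NULL.
spark.sql("SELECT size(NULL)").show()

// The old behavior can be restored with the legacy flag:
spark.conf.set("spark.sql.legacy.sizeOfNull", "true")
spark.sql("SELECT size(NULL)").show()   // -1 again
{code}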



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Show Maven errors from within make-distribution.sh

2020-03-06 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:
{code:java}
./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud {code}
 
 But this doesn't:
{code:java}
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip{code}
 

The latter invocation yields the following, confusing output:
{code:java}
 + VERSION=' -X,--debug Produce execution debug output'{code}
 That's because Maven is accepting {{--pip}} as an option and failing, but the 
user doesn't get to see the error from Maven.

  was:
This works:
{code:java}
./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud {code}
 
 But this doesn't:
{code:java}
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip{code}
 

The latter invocation yields the following, confusing output:
{code:java}
 + VERSION=' -X,--debug Produce execution debug output'{code}
 


> Show Maven errors from within make-distribution.sh
> --
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> {code:java}
> ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud {code}
>  
>  But this doesn't:
> {code:java}
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip{code}
>  
> The latter invocation yields the following, confusing output:
> {code:java}
>  + VERSION=' -X,--debug Produce execution debug output'{code}
>  That's because Maven is accepting {{--pip}} as an option and failing, but 
> the user doesn't get to see the error from Maven.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Show Maven errors from within make-distribution.sh

2020-03-06 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Summary: Show Maven errors from within make-distribution.sh  (was: Make 
arguments to make-distribution.sh position-independent)

> Show Maven errors from within make-distribution.sh
> --
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> {code:java}
> ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud {code}
>  
>  But this doesn't:
> {code:java}
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip{code}
>  
> The latter invocation yields the following, confusing output:
> {code:java}
>  + VERSION=' -X,--debug Produce execution debug output'{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column

2020-03-06 Thread Kyrill Alyoshin (Jira)
Kyrill Alyoshin created SPARK-31074:
---

 Summary: Avro serializer should not fail when a nullable Spark 
field is written to a non-null Avro column
 Key: SPARK-31074
 URL: https://issues.apache.org/jira/browse/SPARK-31074
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.4
Reporter: Kyrill Alyoshin


Spark StructType schemas are strongly biased towards having _nullable_ fields. 
In fact, this is what _Encoders.bean()_ does - any non-primitive field is 
automatically _nullable_. When we attempt to serialize dataframes into 
*user-supplied* Avro schemas where the corresponding fields are marked as 
_non-null_ (i.e., they are not of _union_ type), the attempt fails with 
the following exception:

 
{code:java}
Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string"
at org.apache.avro.Schema.getTypes(Schema.java:299)
at 
org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229)
at 
org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209)
 {code}
This seems rather draconian. We should certainly be able to write a field of 
the same type and name into a non-nullable Avro column as long as its value is 
not null. In fact, the problem is *severe* enough that it is not clear what can 
be done in situations where the Avro schema is given to you as part of an API 
communication contract (i.e., it cannot be changed).

This is an important issue.
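
A minimal reproduction sketch of the scenario described above (assumes the 
spark-avro module is on the classpath; the `avroSchema` option supplies the 
user-defined, non-union schema):
{code:java}
import spark.implicits._

// Spark side: an ordinary, nullable string column.
val df = Seq("a", "b").toDF("s")   // StructField("s", StringType, nullable = true)

// User-supplied Avro schema: field "s" is plain "string", not ["null", "string"].
val avroSchema =
  """{"type":"record","name":"Rec","fields":[{"name":"s","type":"string"}]}"""

df.write
  .format("avro")
  .option("avroSchema", avroSchema)
  .save("/tmp/rec")   // fails with AvroRuntimeException: Not a union: "string"
{code}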

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31073) Add "shuffle write time" to task metrics summary in StagePage.

2020-03-06 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-31073:
---
Summary: Add "shuffle write time" to task metrics summary in StagePage.  
(was: Add shuffle write time to task metrics summary in StagePage.)

> Add "shuffle write time" to task metrics summary in StagePage.
> --
>
> Key: SPARK-31073
> URL: https://issues.apache.org/jira/browse/SPARK-31073
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
>
> In StagePage, "shuffle write time" is not shown in task metrics summary even 
> though "shuffle read blocked time" is shown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31073) Add shuffle write time to task metrics summary in StagePage.

2020-03-06 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-31073:
--

 Summary: Add shuffle write time to task metrics summary in 
StagePage.
 Key: SPARK-31073
 URL: https://issues.apache.org/jira/browse/SPARK-31073
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 3.0.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


In StagePage, "shuffle write time" is not shown in task metrics summary even 
though "shuffle read blocked time" is shown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053543#comment-17053543
 ] 

Felix Kizhakkel Jose edited comment on SPARK-31072 at 3/6/20, 3:48 PM:
---

[~steve_l], 
 I have seen some issues you have addressed in this area (zero rename with s3a, 
etc.); could you please give me some insights?

All,
 Please provide some help on this issue.


was (Author: felixkjose):
[~steve_l], 
I have seen some issues you have addressed in this area, could you please give 
me some insights?

All,
Please provide some help on this issue.

> Default to ParquetOutputCommitter even after configuring setting committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started looking into this because, when I was trying to save data to S3 
> directly with partitionBy() on two columns, I was getting file-not-found 
> exceptions intermittently.
> So how can I avoid this issue with *Parquet, writing from Spark to S3 via s3a, 
> without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053543#comment-17053543
 ] 

Felix Kizhakkel Jose commented on SPARK-31072:
--

[~steve_l], 
I have seen some issues you have addressed in this area, could you please give 
me some insights?

All,
Please provide some help on this issue.

> Default to ParquetOutputCommitter even after configuring setting committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?
> I started looking into this because, when saving data to S3 directly with 
> partitionBy() on two columns, I was getting file-not-found exceptions 
> intermittently.
> So how can I avoid this issue when writing *Parquet from Spark to S3 over 
> s3a, without S3Guard?*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31054) Turn on deprecation in Scala REPL/spark-shell by default

2020-03-06 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31054.
--
Resolution: Won't Fix

> Turn on deprecation in Scala REPL/spark-shell  by default
> -
>
> Key: SPARK-31054
> URL: https://issues.apache.org/jira/browse/SPARK-31054
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell
>Affects Versions: 3.0.0
>Reporter: wuyi
>Priority: Major
>
> Turn on deprecation in the Scala REPL/spark-shell by default, so users can 
> always see the details about deprecated APIs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31072) Default to ParquetOutputCommitter even after configuring setting committer as "partitioned"

2020-03-06 Thread Felix Kizhakkel Jose (Jira)
Felix Kizhakkel Jose created SPARK-31072:


 Summary: Default to ParquetOutputCommitter even after configuring 
setting committer as "partitioned"
 Key: SPARK-31072
 URL: https://issues.apache.org/jira/browse/SPARK-31072
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.4.5
Reporter: Felix Kizhakkel Jose


My program log says it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
even after I configure it to use "PartitionedStagingCommitter" with the following 
configuration:
 * 
sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
 "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
 * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
 * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", "append");
 * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
 * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
false);

Application log output:

20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
_temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 2
20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
_temporary folders under output directory:false, ignore cleanup failures: false
20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output committer 
class org.apache.parquet.hadoop.ParquetOutputCommitter

But when I use _*ORC*_ as the file format, with the same configuration as above, 
it correctly picks "PartitionedStagingCommitter":
20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 1
20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
_temporary folders under output directory:false, ignore cleanup failures: false
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer partitioned 
to output data to s3a:
20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
PartitionedStagingCommitter**

So I am wondering why Parquet and ORC have different behavior?
How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter?

I started looking into this because, when saving data to S3 directly with 
partitionBy() on two columns, I was getting file-not-found exceptions 
intermittently.
So how can I avoid this issue when writing *Parquet from Spark to S3 over s3a, 
without S3Guard?*
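
For what it's worth, a minimal configuration sketch of what is usually suggested here. It assumes a Spark build that bundles the optional spark-hadoop-cloud module together with hadoop-aws (Spark 3.0+; the two org.apache.spark.internal.io.cloud classes below come from that module and may be absent from older or stock builds), and the bucket path is purely illustrative:

{code:java}
import org.apache.spark.sql.SparkSession;

public class S3APartitionedCommitterSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("s3a-partitioned-committer-sketch")
        // Commit protocol that understands Hadoop's PathOutputCommitter factories
        .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
        // Lets Parquet delegate to the factory-chosen committer instead of
        // hard-wiring ParquetOutputCommitter
        .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
        .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
            "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
        .getOrCreate();

    // Hypothetical partitioned write straight to S3A.
    spark.range(100)
        .selectExpr("id", "id % 10 AS bucket")
        .write()
        .partitionBy("bucket")
        .parquet("s3a://example-bucket/output");  // placeholder path
  }
}
{code}

With those two extra settings in place, the logs should report the factory-selected committer (e.g. PartitionedStagingCommitter) for Parquet as well; if the cloud classes cannot be found, the build was likely compiled without the hadoop-cloud profile.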



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31071) Spark Encoders.bean() should allow marking non-null fields in its Spark schema

2020-03-06 Thread Kyrill Alyoshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kyrill Alyoshin updated SPARK-31071:

Summary: Spark Encoders.bean() should allow marking non-null fields in its 
Spark schema  (was: Spark Encoders.bean() should allow setting non-null fields 
in its Spark schema)

> Spark Encoders.bean() should allow marking non-null fields in its Spark schema
> --
>
> Key: SPARK-31071
> URL: https://issues.apache.org/jira/browse/SPARK-31071
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Kyrill Alyoshin
>Priority: Major
>
> The Spark _Encoders.bean()_ method should allow the generated StructType schema 
> fields to be *non-nullable*.
> Currently, any non-primitive type is automatically _nullable_. It is 
> hard-coded in the _org.apache.spark.sql.catalyst.JavaTypeReference_ class.  
> This can lead to rather interesting situations... For example, let's say I 
> want to save a dataframe in Avro format with my own, non-Spark-generated 
> Avro schema. Let's also say that my Avro schema has a field that is non-null 
> (i.e., not a union type). Well, it appears *impossible* to store a dataframe 
> using such an Avro schema, since Spark would assume that the field is nullable 
> (as it is in its own schema), which would conflict with the Avro schema 
> semantics and throw an exception.
> I propose making a change to the _JavaTypeReference_ class to observe the 
> JSR-305 _Nonnull_ annotation (and its children) on the provided bean class 
> during StructType schema generation. This would allow bean creators to 
> control the resulting Spark schema so much better.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31071) Spark Encoders.bean() should allow setting non-null fields in its Spark schema

2020-03-06 Thread Kyrill Alyoshin (Jira)
Kyrill Alyoshin created SPARK-31071:
---

 Summary: Spark Encoders.bean() should allow setting non-null 
fields in its Spark schema
 Key: SPARK-31071
 URL: https://issues.apache.org/jira/browse/SPARK-31071
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Kyrill Alyoshin


The Spark _Encoders.bean()_ method should allow the generated StructType schema 
fields to be *non-nullable*.

Currently, any non-primitive type is automatically _nullable_. It is hard-coded 
in the _org.apache.spark.sql.catalyst.JavaTypeReference_ class.  This can lead 
to rather interesting situations... For example, let's say I want to save a 
dataframe in Avro format with my own, non-Spark-generated Avro schema. 
Let's also say that my Avro schema has a field that is non-null (i.e., not a 
union type). Well, it appears *impossible* to store a dataframe using such an 
Avro schema, since Spark would assume that the field is nullable (as it is in 
its own schema), which would conflict with the Avro schema semantics and throw 
an exception.

I propose making a change to the _JavaTypeReference_ class to observe the 
JSR-305 _Nonnull_ annotation (and its children) on the provided bean class 
during StructType schema generation. This would allow bean creators to control 
the resulting Spark schema so much better.
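
To make the request concrete, a small hypothetical sketch follows; the @Nonnull handling described in the comments is the *proposed* behavior, not what Spark does today, and the bean itself is invented for illustration:

{code:java}
import javax.annotation.Nonnull;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

public class NonNullBeanSketch {
  // A plain Java bean. Today Encoders.bean() marks both String fields as
  // nullable, because every non-primitive type is hard-coded as nullable.
  public static class Event {
    @Nonnull
    private String id;       // proposal: honor @Nonnull => nullable = false
    private String comment;  // no annotation => stays nullable

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getComment() { return comment; }
    public void setComment(String comment) { this.comment = comment; }
  }

  public static void main(String[] args) {
    Encoder<Event> encoder = Encoders.bean(Event.class);
    // Current output: both fields print as nullable = true.
    // With the proposal, "id" would print as nullable = false, so the dataframe
    // could be written against an Avro schema whose "id" field is not a union.
    System.out.println(encoder.schema().treeString());
  }
}
{code}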



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31070) make skew join split skewed partitions more evenly

2020-03-06 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31070:
---

 Summary: make skew join split skewed partitions more evenly
 Key: SPARK-31070
 URL: https://issues.apache.org/jira/browse/SPARK-31070
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30899) CreateArray/CreateMap's data type should not depend on SQLConf.get

2020-03-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30899:
---

Assignee: Rakesh Raushan

> CreateArray/CreateMap's data type should not depend on SQLConf.get
> --
>
> Key: SPARK-30899
> URL: https://issues.apache.org/jira/browse/SPARK-30899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Rakesh Raushan
>Priority: Blocker
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30899) CreateArray/CreateMap's data type should not depend on SQLConf.get

2020-03-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30899.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27657
[https://github.com/apache/spark/pull/27657]

> CreateArray/CreateMap's data type should not depend on SQLConf.get
> --
>
> Key: SPARK-30899
> URL: https://issues.apache.org/jira/browse/SPARK-30899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Blocker
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31069) high cpu caused by chunksBeingTransferred in external shuffle service

2020-03-06 Thread Xiaoju Wu (Jira)
Xiaoju Wu created SPARK-31069:
-

 Summary: high cpu caused by chunksBeingTransferred in external 
shuffle service
 Key: SPARK-31069
 URL: https://issues.apache.org/jira/browse/SPARK-31069
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.0.0
Reporter: Xiaoju Wu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31011) Failed to register signal handler for PWR

2020-03-06 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-31011:
-
Affects Version/s: (was: 3.0.0)
   3.1.0

> Failed to register signal handler for PWR
> -
>
> Key: SPARK-31011
> URL: https://issues.apache.org/jira/browse/SPARK-31011
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> I've just tried to test something in standalone mode, but the application 
> fails.
> Environment:
>  * MacOS Catalina 10.15.3 (19D76)
>  * Scala 2.12.10
>  * Java 1.8.0_241-b07
> Steps to reproduce:
>  * Compile Spark (mvn -DskipTests clean install -Dskip)
>  * ./sbin/start-master.sh
>  * ./sbin/start-slave.sh spark://host:7077
>  * submit an empty application
> Error:
> {code:java}
> 20/03/02 14:25:44 INFO SignalUtils: Registering signal handler for PWR
> 20/03/02 14:25:44 WARN SignalUtils: Failed to register signal handler for PWR
> java.lang.IllegalArgumentException: Unknown signal: PWR
>   at sun.misc.Signal.(Signal.java:143)
>   at 
> org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:64)
>   at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
>   at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:62)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:85)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
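
A standalone reproduction sketch, outside Spark, using the same unsupported sun.misc API that SignalUtils relies on; behavior is platform-dependent (Linux defines SIGPWR, macOS/BSD do not):

{code:java}
import sun.misc.Signal;

public class PwrSignalProbe {
  public static void main(String[] args) {
    try {
      // On Linux this registers a handler for SIGPWR; on macOS the Signal
      // constructor throws IllegalArgumentException("Unknown signal: PWR"),
      // which is exactly what the warning above reports.
      Signal.handle(new Signal("PWR"), sig -> System.out.println("Received " + sig));
      System.out.println("PWR handler registered");
    } catch (IllegalArgumentException e) {
      System.out.println("PWR is not available on this platform: " + e.getMessage());
    }
  }
}
{code}

This shows that the IllegalArgumentException comes from the JDK's signal table on macOS, not from Spark itself.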



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime

2020-03-06 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-31030:

Summary: Backward Compatibility for Parsing and Formatting Datetime  (was: 
Backward Compatibility for Parsing Datetime)

> Backward Compatibility for Parsing and Formatting Datetime
> --
>
> Key: SPARK-31030
> URL: https://issues.apache.org/jira/browse/SPARK-31030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
> Attachments: image-2020-03-04-10-54-05-208.png, 
> image-2020-03-04-10-54-13-238.png
>
>
> *Background*
> In Spark version 2.4 and earlier, datetime parsing, formatting and conversion 
> are performed by using the hybrid calendar ([Julian + 
> Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
>  
> Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as 
> well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by 
> using Java 8 API classes (the java.time packages that are based on [ISO 
> chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]
>  ).
> The switch was completed in SPARK-26651. 
>  
> *Problem*
> Switching to the Java 8 datetime API breaks backward compatibility with Spark 
> 2.4 and earlier when parsing datetimes. Spark needs its own pattern definitions 
> for datetime parsing and formatting.
>  
> *Solution*
> To avoid unexpected result changes after the underlying datetime API switch, 
> we propose the following solution. 
>  * Introduce the fallback mechanism: when the Java 8-based parser fails, we 
> need to detect these behavior differences by falling back to the legacy 
> parser, and fail with a user-friendly error message to tell users what gets 
> changed and how to fix the pattern.
>  * Document Spark’s datetime patterns: Spark’s date-time formatter is 
> decoupled from the Java patterns. Spark’s patterns are mainly based on 
> the [Java 7’s 
> pattern|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
>  (for better backward compatibility) with the customized logic (caused by the 
> breaking changes between [Java 
> 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] 
> and [Java 
> 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]
>  pattern string). Below are the customized rules:
> ||Pattern||Java 7||Java 8|| Example||Rule||
> |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (Unlike y, u 
> accepts a negative value to represent BC, while y has to be combined with 
> G to do the same thing.)|!image-2020-03-04-10-54-05-208.png!  |Substitute ‘u’ 
> to ‘e’ and use Java 8 parser to parse the string. If parsable, return the 
> result; otherwise, fall back to ‘u’, and then use the legacy Java 7 parser to 
> parse. When it is successfully parsed, throw an exception and ask users to 
> change the pattern strings or turn on the legacy mode; otherwise, return NULL 
> as what Spark 2.4 does.|
> | z| General time zone which also accepts
>  [RFC 822 time zones|#rfc822timezone]]|Only accepts time-zone names, e.g. 
> Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png!  |The 
> semantics of ‘z’ are different between Java 7 and Java 8. Here, Spark 3.0 
> follows the semantics of Java 8. 
>  Use Java 8 to parse the string. If parsable, return the result; otherwise, 
> use the legacy Java 7 parser to parse. When it is successfully parsed, throw 
> an exception and ask users to change the pattern strings or turn on the 
> legacy mode; otherwise, return NULL as what Spark 2.4 does.|
>  
>  
>  
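
To make the first table row concrete, here is a tiny standalone Java comparison (plain JDK code, not Spark) of how the two underlying APIs read the 'u' letter; the printed values are examples and depend on the current date:

{code:java}
import java.text.SimpleDateFormat;
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Date;

public class PatternLetterUDemo {
  public static void main(String[] args) {
    // Java 7-style API: 'u' is the day number of the week (1 = Monday ... 7 = Sunday)
    SimpleDateFormat legacy = new SimpleDateFormat("u");
    System.out.println("SimpleDateFormat  'u' -> " + legacy.format(new Date()));      // e.g. 5

    // Java 8 API: 'u' is the (proleptic) year
    DateTimeFormatter modern = DateTimeFormatter.ofPattern("u");
    System.out.println("DateTimeFormatter 'u' -> " + modern.format(LocalDate.now())); // e.g. 2020
  }
}
{code}

This difference is why the rule in the first row substitutes ‘u’ with ‘e’ (the localized day-of-week number in the Java 8 formatter) before trying the new parser, and only then falls back to the legacy one.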



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31030) Backward Compatibility for Parsing Datetime

2020-03-06 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-31030:

Description: 
*Background*

In Spark version 2.4 and earlier, datetime parsing, formatting and conversion 
are performed by using the hybrid calendar ([Julian + 
Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
 

Since the Proleptic Gregorian calendar is the de facto calendar worldwide, as well 
as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by using Java 
8 API classes (the java.time packages that are based on [ISO 
chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]
 ).

The switch was completed in SPARK-26651. 

 

*Problem*

Switching to the Java 8 datetime API breaks backward compatibility with Spark 2.4 
and earlier when parsing datetimes. Spark needs its own pattern definitions for 
datetime parsing and formatting.

 

*Solution*

To avoid unexpected result changes after the underlying datetime API switch, we 
propose the following solution. 
 * Introduce the fallback mechanism: when the Java 8-based parser fails, we 
need to detect these behavior differences by falling back to the legacy parser, 
and fail with a user-friendly error message to tell users what gets changed and 
how to fix the pattern.

 * Document Spark’s datetime patterns: Spark’s date-time formatter is 
decoupled from the Java patterns. Spark’s patterns are mainly based on the 
[Java 7’s 
pattern|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
 (for better backward compatibility) with the customized logic (caused by the 
breaking changes between [Java 
7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] 
and [Java 
8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html]
 pattern string). Below are the customized rules:

||Pattern||Java 7||Java 8|| Example||Rule||
|u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (Unlike y, u 
accepts a negative value to represent BC, while y has to be combined with G 
to do the same thing.)|!image-2020-03-04-10-54-05-208.png!  |Substitute ‘u’ to 
‘e’ and use Java 8 parser to parse the string. If parsable, return the result; 
otherwise, fall back to ‘u’, and then use the legacy Java 7 parser to parse. 
When it is successfully parsed, throw an exception and ask users to change the 
pattern strings or turn on the legacy mode; otherwise, return NULL as what 
Spark 2.4 does.|
| z| General time zone which also accepts
 [RFC 822 time zones|#rfc822timezone]]|Only accepts time-zone names, e.g. Pacific 
Standard Time; PST|!image-2020-03-04-10-54-13-238.png!  |The semantics of ‘z’ 
are different between Java 7 and Java 8. Here, Spark 3.0 follows the semantics 
of Java 8. 
 Use Java 8 to parse the string. If parsable, return the result; otherwise, use 
the legacy Java 7 parser to parse. When it is successfully parsed, throw an 
exception and ask users to change the pattern strings or turn on the legacy 
mode; otherwise, return NULL as what Spark 2.4 does.|

 

 

 

  was:
*Background*

In Spark version 2.4 and earlier, datetime parsing, formatting and conversion 
are performed by using the hybrid calendar ([Julian + 
Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]).
 

Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as well 
as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by using Java 
8 API classes (the java.time packages that are based on [ISO 
chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html]
 ).

The switching job is completed in SPARK-26651. 

 

*Problem*

Switching to Java 8 datetime API breaks the backward compatibility of Spark 2.4 
and earlier when parsing datetime. Moreover, for the build-in SQL expressions 
like to_date, to_timestamp and etc,  in the existing implementation of Spark 
3.0 will catch all the exceptions and return `null` when hitting the parsing 
errors. This will cause the silent result changes, which are hard to debug for 
end-users when the data volume is huge and business logics are complex.

 

*Solution*

To avoid unexpected result changes after the underlying datetime API switch, we 
propose the following solution. 
 * Introduce the fallback mechanism: when the Java 8-based parser fails, we 
need to detect these behavior differences by falling back to the legacy parser, 
and fail with a user-friendly error message to tell users what gets changed and 
how to fix the pattern.

 * Document the Spark’s datetime patterns: The date-time formatter of Spark is 
decoupled with the Java patterns. The Spark’s patterns are mainly based on the 
[Java 7’s 
pattern|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html]
 (for better backward compatibility) 

[jira] [Resolved] (SPARK-30279) Support 32 or more grouping attributes for GROUPING_ID

2020-03-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-30279.
--
Fix Version/s: 3.1.0
 Assignee: Takeshi Yamamuro
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/26918]

> Support 32 or more grouping attributes for GROUPING_ID 
> ---
>
> Key: SPARK-30279
> URL: https://issues.apache.org/jira/browse/SPARK-30279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 2.4.6
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> This ticket targets support for 32 or more grouping attributes for 
> GROUPING_ID. In the current master, an integer overflow can occur when 
> computing grouping IDs, since they are built with 32-bit Int arithmetic;
> https://github.com/apache/spark/blob/e75d9afb2f282ce79c9fd8bce031287739326a4f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L613
> For example, the query below generates wrong grouping IDs in the master;
> {code}
> scala> val numCols = 32 // or, 31
> scala> val cols = (0 until numCols).map { i => s"c$i" }
> scala> sql(s"create table test_$numCols (${cols.map(c => s"$c 
> int").mkString(",")}, v int) using parquet")
> scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
> scala> sql(s"insert into test_$numCols values ($insertVals,3)")
> scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by 
> grouping sets ((${cols.mkString(",")}), 
> (${cols.init.mkString(",")}))").show(10, false)
> scala> sql(s"drop table test_$numCols")
> // numCols = 32
> +-------------+------+
> |grouping_id()|sum(v)|
> +-------------+------+
> |0            |3     |
> |0            |3     | // Wrong Grouping ID
> +-------------+------+
> // numCols = 31
> +-------------+------+
> |grouping_id()|sum(v)|
> +-------------+------+
> |0            |3     |
> |1            |3     |
> +-------------+------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org