[jira] [Created] (SPARK-31050) Disable flaky KafkaDelegationTokenSuite

2020-03-04 Thread wuyi (Jira)
wuyi created SPARK-31050:


 Summary: Disable flaky KafkaDelegationTokenSuite
 Key: SPARK-31050
 URL: https://issues.apache.org/jira/browse/SPARK-31050
 Project: Spark
  Issue Type: Bug
  Components: SQL, Structured Streaming
Affects Versions: 3.0.0
 Environment: Disable flaky KafkaDelegationTokenSuite since it's too 
flaky.
Reporter: wuyi









[jira] [Updated] (SPARK-30886) Deprecate LTRIM, RTRIM, and two-parameter TRIM functions

2020-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Target Version/s: 3.0.0

> Deprecate LTRIM, RTRIM, and two-parameter TRIM functions
> 
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show that warning.
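For reference, a minimal spark-sql sketch of the forms involved (run in spark-shell; the two-parameter argument order shown is only an assumption for illustration, which is exactly why these forms are considered esoteric):

{code}
// Two-parameter forms that this JIRA adds a deprecation warning for
// (assumption: trim string listed first; check the warning text for the exact order).
spark.sql("SELECT trim('x', 'xxxhelloxxx')").show()
spark.sql("SELECT ltrim('x', 'xxxhello')").show()

// The SQL-standard form is unaffected and remains the recommended spelling.
spark.sql("SELECT trim(BOTH 'x' FROM 'xxxhelloxxx')").show()
{code}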






[jira] [Updated] (SPARK-30886) Deprecate LTRIM, RTRIM, and two-parameter TRIM functions

2020-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Issue Type: Bug  (was: Task)

> Deprecate LTRIM, RTRIM, and two-parameter TRIM functions
> 
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show that warning.






[jira] [Updated] (SPARK-30886) Deprecate LTRIM, RTRIM, and two-parameter TRIM functions

2020-03-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30886:
--
Summary: Deprecate LTRIM, RTRIM, and two-parameter TRIM functions  (was: 
Warn two-parameter TRIM/LTRIM/RTRIM functions)

> Deprecate LTRIM, RTRIM, and two-parameter TRIM functions
> 
>
> Key: SPARK-30886
> URL: https://issues.apache.org/jira/browse/SPARK-30886
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> The Apache Spark community decided to keep the existing esoteric two-parameter 
> use cases with a proper warning. This JIRA aims to show that warning.






[jira] [Updated] (SPARK-30541) Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite

2020-03-04 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-30541:
-
Priority: Blocker  (was: Major)

> Flaky test: org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite
> ---
>
> Key: SPARK-30541
> URL: https://issues.apache.org/jira/browse/SPARK-30541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Blocker
>
> The test suite has been failing intermittently as of now:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/116862/testReport/]
>  
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.(It is not a test it 
> is a sbt.testing.SuiteSelector)
>   
> {noformat}
> Error Details
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
> Stack Trace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 3939 times over 
> 1.000122353532 minutes. Last failure message: KeeperErrorCode = 
> AuthFailed for /brokers/ids.
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:432)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:439)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:391)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at org.scalatest.concurrent.Eventually.eventually(Eventually.scala:337)
>   at org.scalatest.concurrent.Eventually.eventually$(Eventually.scala:336)
>   at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:479)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.setup(KafkaTestUtils.scala:292)
>   at 
> org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite.beforeAll(KafkaDelegationTokenSuite.scala:49)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
>   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
>   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:58)
>   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:317)
>   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:510)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
>   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: sbt.ForkMain$ForkError: 
> org.apache.zookeeper.KeeperException$AuthFailedException: KeeperErrorCode = 
> AuthFailed for /brokers/ids
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:130)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
>   at 
> kafka.zookeeper.AsyncResponse.resultException(ZooKeeperClient.scala:554)
>   at kafka.zk.KafkaZkClient.getChildren(KafkaZkClient.scala:719)
>   at kafka.zk.KafkaZkClient.getSortedBrokerList(KafkaZkClient.scala:455)
>   at 
> kafka.zk.KafkaZkClient.getAllBrokersInCluster(KafkaZkClient.scala:404)
>   at 
> org.apache.spark.sql.kafka010.KafkaTestUtils.$anonfun$setup$3(KafkaTestUtils.scala:293)
>   at 
> org.scalatest.concurrent.Eventually.makeAValiantAttempt$1(Eventually.scala:395)
>   at 
> org.scalatest.concurrent.Eventually.tryTryAgain$1(Eventually.scala:409)
>   ... 20 more
> {noformat}






[jira] [Created] (SPARK-31049) Support nested adjacent generators, e.g., explode(explode(v))

2020-03-04 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-31049:


 Summary: Support nested adjacent generators, e.g., 
explode(explode(v))
 Key: SPARK-31049
 URL: https://issues.apache.org/jira/browse/SPARK-31049
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 2.4.6
Reporter: Takeshi Yamamuro


In master, we currently don't support any nested generators, but I think 
supporting limited nested cases, e.g., explode(explode(v)), would be somewhat 
useful for users.
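A minimal spark-shell sketch of the case in question (column names are illustrative); today the nested call has to be rewritten as two explode steps:

{code}
import org.apache.spark.sql.functions.explode

val df = Seq(Seq(Seq(1, 2), Seq(3))).toDF("v")   // v: array<array<int>>

// What this ticket would like to allow directly:
// df.select(explode(explode($"v"))).show()

// Current workaround: explode the outer array first, then the inner one.
df.select(explode($"v").as("inner"))
  .select(explode($"inner").as("value"))
  .show()
{code}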






[jira] [Created] (SPARK-31048) alter hive column datatype is not supported

2020-03-04 Thread Sunil Aryal (Jira)
Sunil Aryal created SPARK-31048:
---

 Summary: alter hive column datatype is not supported
 Key: SPARK-31048
 URL: https://issues.apache.org/jira/browse/SPARK-31048
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 2.2.2
 Environment: spark sql with hive metadata store.
Reporter: Sunil Aryal


describe tb2;
Getting log thread is interrupted, since query is done!
+--------------------------+------------+----------+
|         col_name         | data_type  | comment  |
+--------------------------+------------+----------+
| fn                       | int        | NULL     |
| ln                       | string     | NULL     |
| age                      | int        | NULL     |
| # Partition Information  |            |          |
| # col_name               | data_type  | comment  |
| age                      | int        | NULL     |
+--------------------------+------------+----------+
6 rows selected (0.213 seconds)
 alter table tb2 change fn fn bigint;
Getting log thread is interrupted, since query is done!
Error: org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not 
supported for changing column 'fn' with type 'IntegerType' to 'fn' with type 
'LongType'; (state=,code=0)
java.sql.SQLException: org.apache.spark.sql.AnalysisException: ALTER TABLE 
CHANGE COLUMN is not supported for changing column 'fn' with type 'IntegerType' 
to 'fn' with type 'LongType';
 at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
 at org.apache.hive.beeline.Commands.execute(Commands.java:848)
 at org.apache.hive.beeline.Commands.sql(Commands.java:713)
 at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:973)
 at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:813)
 at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:771)
 at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:484)
 at org.apache.hive.beeline.BeeLine.main(BeeLine.java:467)
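For context, a minimal spark-shell sketch of the limitation (behavior assumed for Spark 2.x with a Hive metastore; table and column names taken from the report): ALTER TABLE ... CHANGE COLUMN accepts a comment-only change but rejects a type change such as int -> bigint.

{code}
// Accepted: same name and type, only the comment changes (assumption).
spark.sql("ALTER TABLE tb2 CHANGE COLUMN fn fn INT COMMENT 'first name'")

// Rejected with the AnalysisException shown above: the data type differs.
spark.sql("ALTER TABLE tb2 CHANGE COLUMN fn fn BIGINT")
{code}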






[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051844#comment-17051844
 ] 

Hyukjin Kwon commented on SPARK-29058:
--

Yeah, so there's a kind of tradeoff. If we want the same results, we would have to 
parse and convert everything all the time, which is pretty costly. The workaround 
itself seems simple enough, though.
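A minimal spark-shell sketch of the tradeoff being described, assuming the fruit.csv and schema from the issue: count() can answer without parsing the value columns, while forcing full row materialization (e.g. going through the RDD) pays the parsing cost and drops the malformed row.

{code}
val schema = "Fruit string, color string, price int, quantity int"
val df = spark.read
  .schema(schema)
  .option("mode", "DROPMALFORMED")
  .csv("fruit.csv")

df.count()      // may report 3: column pruning lets the scan skip full parsing
df.rdd.count()  // reports 2: every column is parsed, so the malformed row is dropped
{code}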

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  






[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051839#comment-17051839
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon] I agree with you on this.

However, the DataFrame is being created without the second, malformed row. This 
can be observed from df.show():

>>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
>>> df.show()
+------+------+-----+--------+
| Fruit| color|price|quantity|
+------+------+-----+--------+
| apple|   red|    1|       3|
|orange|orange|    3|       5|
+------+------+-----+--------+

So, ideally it should return the correct row count accordingly. What do you say?

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  






[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051827#comment-17051827
 ] 

Hyukjin Kwon commented on SPARK-29058:
--

Yeah, so essentially it doesn't need to parse and convert anything. That's why 
it doesn't treat the second row as malformed.

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  






[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051820#comment-17051820
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon] How does column pruning work here, given that count() does not need 
any columns to perform the count? It just returns the row count.

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  






[jira] [Resolved] (SPARK-30913) Add version information to the configuration of Tests.scala

2020-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30913.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27783
[https://github.com/apache/spark/pull/27783]

> Add version information to the configuration of Tests.scala
> ---
>
> Key: SPARK-30913
> URL: https://issues.apache.org/jira/browse/SPARK-30913
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> core/src/main/scala/org/apache/spark/internal/config/Tests.scala
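For readers unfamiliar with these sub-tasks, a hypothetical sketch of the kind of change they make inside org.apache.spark.internal.config (the config name and default below are illustrative, not actual Tests.scala entries): each ConfigEntry records the version that introduced it via ConfigBuilder's version() method.

{code}
private[spark] object ExampleTestConfigs {
  // Illustrative entry only; the real entries live in Tests.scala.
  val EXAMPLE_FLAG = ConfigBuilder("spark.testing.exampleFlag")
    .version("3.1.0")
    .booleanConf
    .createWithDefault(false)
}
{code}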






[jira] [Resolved] (SPARK-30889) Add version information to the configuration of Worker

2020-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30889.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27783
[https://github.com/apache/spark/pull/27783]

> Add version information to the configuration of Worker
> --
>
> Key: SPARK-30889
> URL: https://issues.apache.org/jira/browse/SPARK-30889
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
> Fix For: 3.1.0
>
>
> core/src/main/scala/org/apache/spark/internal/config/Worker.scala






[jira] [Updated] (SPARK-31047) Improve file listing for ViewFileSystem

2020-03-04 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-31047:
---
Component/s: (was: Input/Output)
 SQL

> Improve file listing for ViewFileSystem
> ---
>
> Key: SPARK-31047
> URL: https://issues.apache.org/jira/browse/SPARK-31047
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Manu Zhang
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-27801 improved file listing for 
> DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes use of 
> DistributedFileSystem's single {{listLocatedStatus}} call to the namenode. 
> This ticket intends to improve the case where ViewFileSystem is used to 
> manage multiple DistributedFileSystems. ViewFileSystem also overrides the 
> {{listLocatedStatus}} method, delegating to the filesystem it resolves to, 
> e.g. DistributedFileSystem.






[jira] [Commented] (SPARK-30980) Issue not resolved of Caught Hive MetaException attempting to get partition metadata by filter from Hive

2020-03-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051767#comment-17051767
 ] 

Hyukjin Kwon commented on SPARK-30980:
--

[~coderbond007], can you provide a reproducible example? Are you able to 
reproduce it locally too?
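The exception quoted below spells out a documented workaround; a minimal sketch of applying it when building the session (with the performance caveat the message itself mentions):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partition-metadata-workaround")   // illustrative app name
  .enableHiveSupport()
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .getOrCreate()
{code}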

> Issue not resolved of Caught Hive MetaException attempting to get partition 
> metadata by filter from Hive
> 
>
> Key: SPARK-30980
> URL: https://issues.apache.org/jira/browse/SPARK-30980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.2
> Environment: 2.4.0-CDH6.3.1 (which I guess points to Spark Version 
> 2.4.2)
>Reporter: Pradyumn Agrawal
>Priority: Major
>
> I am querying on table created in Hive. Getting repetitive exception of 
> failing to query data with following stacktrace.
>  
> {code:java}
> // code placeholder
> java.lang.RuntimeException: Caught Hive MetaException attempting to get 
> partition metadata by filter from Hive. You can set the Spark configuration 
> setting spark.sql.hive.manageFilesourcePartitions to false to work around 
> this problem, however this will result in degraded performance. Please report 
> a bug: https://issues.apache.org/jira/browse/SPARKjava.lang.RuntimeException: 
> Caught Hive MetaException attempting to get partition metadata by filter from 
> Hive. You can set the Spark configuration setting 
> spark.sql.hive.manageFilesourcePartitions to false to work around this 
> problem, however this will result in degraded performance. Please report a 
> bug: https://issues.apache.org/jira/browse/SPARK at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1258)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1251)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1251)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957)
>  at 
> org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:261)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:326)
>  at 
> 

[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051765#comment-17051765
 ] 

Hyukjin Kwon commented on SPARK-29058:
--

It parses only what it needs internally, via column pruning. It's kind of a feature.

{code}
spark.read.csv(path="tmp.csv",mode="DROPMALFORMED",schema=schema).rdd.count()
{code}

This workaround is pretty feasible, I guess?

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> Malformed record is getting dropped as expected, but incorrect record count 
> is getting displayed.
> Here the df.count() should give value as 2
>  
>  






[jira] [Created] (SPARK-31047) Improve file listing for ViewFileSystem

2020-03-04 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31047:
--

 Summary: Improve file listing for ViewFileSystem
 Key: SPARK-31047
 URL: https://issues.apache.org/jira/browse/SPARK-31047
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 3.1.0
Reporter: Manu Zhang


https://issues.apache.org/jira/browse/SPARK-27801 improved file listing for 
DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes use of 
DistributedFileSystem's single {{listLocatedStatus}} call to the namenode. This 
ticket intends to improve the case where ViewFileSystem is used to manage 
multiple DistributedFileSystems. ViewFileSystem also overrides the 
{{listLocatedStatus}} method, delegating to the filesystem it resolves to, 
e.g. DistributedFileSystem.
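A minimal sketch of the single-call listing this relies on (the viewfs mount and path are illustrative assumptions): because ViewFileSystem, like DistributedFileSystem, overrides listLocatedStatus, one call returns file statuses together with their block locations instead of a listStatus followed by per-file getFileBlockLocations calls.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val dir = new Path("viewfs://cluster/warehouse/some_table")   // illustrative path
val fs = FileSystem.get(dir.toUri, conf)

val it = fs.listLocatedStatus(dir)
while (it.hasNext) {
  val status = it.next()
  println(s"${status.getPath} -> ${status.getBlockLocations.length} block location(s)")
}
{code}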






[jira] [Comment Edited] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-04 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051664#comment-17051664
 ] 

Thomas Graves edited comment on SPARK-31043 at 3/4/20, 10:18 PM:
-

rebuilt and still see the error. The full exception in the master log is:

 

java.lang.NoClassDefFoundError: org/w3c/dom/ElementTraversal
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
 at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source)
 at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown Source)
 at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source)
 at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown 
Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
 at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
 at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
 at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482)
 at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470)
 at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
 at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494)
 at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407)
 at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
 at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
 at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1432)
 at org.apache.hadoop.security.SecurityUtil.(SecurityUtil.java:72)
 at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:274)
 at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:262)
 at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:807)
 at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:777)
 at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:650)
 at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2412)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2412)
 at org.apache.spark.SecurityManager.(SecurityManager.scala:79)
 at 
org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1137)
 at org.apache.spark.deploy.master.Master$.main(Master.scala:1122)
 at org.apache.spark.deploy.master.Master.main(Master.scala)
Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
 ... 44 more


was (Author: tgraves):
rebuilt and still see the error. The full exception in the master log is:

 

Exception in thread "main" java.lang.NoClassDefFoundError: 
org/w3c/dom/ElementTraversal
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
 at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source)
 at 

[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-04 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051664#comment-17051664
 ] 

Thomas Graves commented on SPARK-31043:
---

rebuilt and still see the error. The full exception in the master log is:

 

Exception in thread "main" java.lang.NoClassDefFoundError: 
org/w3c/dom/ElementTraversal
 at java.lang.ClassLoader.defineClass1(Native Method)
 at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
 at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
 at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
 at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
 at org.apache.xerces.parsers.AbstractDOMParser.startDocument(Unknown Source)
 at org.apache.xerces.xinclude.XIncludeHandler.startDocument(Unknown Source)
 at org.apache.xerces.impl.dtd.XMLDTDValidator.startDocument(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentScannerImpl.startEntity(Unknown Source)
 at org.apache.xerces.impl.XMLVersionDetector.startDocumentParsing(Unknown 
Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
 at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
 at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
 at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2482)
 at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2470)
 at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
 at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2494)
 at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2407)
 at org.apache.hadoop.conf.Configuration.get(Configuration.java:981)
 at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1031)
 at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1432)
 at org.apache.hadoop.security.SecurityUtil.(SecurityUtil.java:72)
 at 
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:274)
 at 
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:262)
 at 
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:807)
 at 
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:777)
 at 
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:650)
 at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2412)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2412)
 at org.apache.spark.SecurityManager.(SecurityManager.scala:79)
 at 
org.apache.spark.deploy.history.HistoryServer$.createSecurityManager(HistoryServer.scala:327)
 at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:288)
 at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
 ... 44 more

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at 
> [java.net|http://java.net/]
> 

[jira] [Created] (SPARK-31046) Make more efficient and clean up AQE update UI code

2020-03-04 Thread Wei Xue (Jira)
Wei Xue created SPARK-31046:
---

 Summary: Make more efficient and clean up AQE update UI code
 Key: SPARK-31046
 URL: https://issues.apache.org/jira/browse/SPARK-31046
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue









[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-04 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051655#comment-17051655
 ] 

Thomas Graves commented on SPARK-31043:
---

A couple of my colleagues actually ran into this and reported it to me. I built 
and saw the same thing. I did a clean when building, but I'll run again just to 
verify.

I was building with:

build/mvn -Phadoop-2.7 -Pyarn -Pkinesis-asl -Pkubernetes -Pmesos -Phadoop-cloud 
-Pspark-ganglia-lgpl clean package -DskipTests 2>&1 | tee out

I reverted the one xerces version change commit and rebuilt with command above 
and the error went away.

One thing is that I don't have hadoop env variables set - not sure if you do or 
have them in path such that it might be picking up jars from there.

Yeah, I actually started looking at other things because it was complaining about 
xml-apis, so I thought it was odd that the Xerces change caused this, but I 
haven't investigated further.

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more






[jira] [Created] (SPARK-31045) Add config for AQE logging level

2020-03-04 Thread Wei Xue (Jira)
Wei Xue created SPARK-31045:
---

 Summary: Add config for AQE logging level
 Key: SPARK-31045
 URL: https://issues.apache.org/jira/browse/SPARK-31045
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue









[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-04 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051643#comment-17051643
 ] 

Sean R. Owen commented on SPARK-31043:
--

Hm, I'm not seeing failures for the 3.0 branch or master after this change, for 
Hadoop 2.7. It sure does look suspiciously related, as that class is XML-related. 
However, that class is not in Xerces but in xml-apis. You don't by chance have 
both old and new Xerces in your deployment somehow?

Anyway, this makes me nervous enough relative to the gain that, unless you have a 
reason to think it's a fluke, I think I'm going to revert it.
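A quick diagnostic sketch along those lines (assumption: run in spark-shell on the affected build) to see where each suspect class is coming from, if anywhere:

{code}
import scala.util.Try

Seq("org.w3c.dom.ElementTraversal", "org.apache.xerces.parsers.DOMParser").foreach { name =>
  val where = Try {
    val src = Class.forName(name).getProtectionDomain.getCodeSource
    Option(src).map(_.getLocation.toString).getOrElse("bootstrap classpath")
  }.getOrElse("not found on the classpath")
  println(s"$name -> $where")
}
{code}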

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more






[jira] [Commented] (SPARK-27651) Avoid the network when block manager fetches shuffle blocks from the same host

2020-03-04 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051630#comment-17051630
 ] 

Thomas Graves commented on SPARK-27651:
---

It looks like this only works when using the external shuffle service, is that 
correct? The way I read the description, it implies it works both from an executor 
and from the external shuffle service, so perhaps we should clarify that, in both 
this Jira and the config descriptions.

Also, were there any technical reasons we didn't support it for executor-to-executor 
shuffle?

> Avoid the network when block manager fetches shuffle blocks from the same host
> --
>
> Key: SPARK-27651
> URL: https://issues.apache.org/jira/browse/SPARK-27651
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.0.0
>
>
> When a shuffle block (content) is fetched the network is always used even 
> when it is fetched from an executor (or the external shuffle service) running 
> on the same host.






[jira] [Created] (SPARK-31044) Support foldable input by `schema_of_json`

2020-03-04 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31044:
--

 Summary: Support foldable input by `schema_of_json`
 Key: SPARK-31044
 URL: https://issues.apache.org/jira/browse/SPARK-31044
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Currently, the `schema_of_json()` function allows only a string literal as its 
input. This ticket aims to support any foldable string expression.
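A minimal spark-shell sketch of the distinction (the concat example is an assumption about what currently fails: it is foldable, but not a plain literal):

{code}
import org.apache.spark.sql.functions.{concat, lit, schema_of_json}

// Works today: the argument is a plain string literal.
spark.range(1).select(schema_of_json(lit("""{"a": 1, "b": [1.5]}"""))).show(false)

// Foldable but not a literal; this is what the ticket would additionally allow.
// spark.range(1).select(schema_of_json(concat(lit("""{"a": """), lit("1}")))).show(false)
{code}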






[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-04 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051557#comment-17051557
 ] 

Sean R. Owen commented on SPARK-31043:
--

Weird, but I think I have to revert it. It isn't essential enough as an update. 
I messed up the change in a way that meant it didn't get tested properly by the 
PR builder.

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more






[jira] [Commented] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-04 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051554#comment-17051554
 ] 

Thomas Graves commented on SPARK-31043:
---

I'm working on tracking down what broke this.

[~srowen] It looks like [SPARK-30994][CORE] "Update xerces to 2.12.0" broke this. 
When I revert that, it works again.

> Spark 3.0 built against hadoop2.7 can't start standalone master
> ---
>
> Key: SPARK-31043
> URL: https://issues.apache.org/jira/browse/SPARK-31043
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Priority: Critical
>
> trying to start a standalone master when building spark branch 3.0 with 
> hadoop2.7 fails with:
>  
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/w3c/dom/ElementTraversal
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> ...
> Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> ... 42 more






[jira] [Resolved] (SPARK-30784) Hive 2.3 profile should still use orc-nohive

2020-03-04 Thread Yin Huai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-30784.
--
Resolution: Not A Bug

Resolving this because, with Hive 2.3, using regular orc is required.

> Hive 2.3 profile should still use orc-nohive
> 
>
> Key: SPARK-30784
> URL: https://issues.apache.org/jira/browse/SPARK-30784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yin Huai
>Priority: Critical
>
> Originally reported at 
> [https://github.com/apache/spark/pull/26619#issuecomment-583802901]
>  
> Right now, the Hive 2.3 profile pulls in regular orc, which depends on 
> hive-storage-api. However, hive-storage-api and hive-common contain the 
> following class files in common:
>  
> org/apache/hadoop/hive/common/ValidReadTxnList.class
>  org/apache/hadoop/hive/common/ValidTxnList.class
>  org/apache/hadoop/hive/common/ValidTxnList$RangeResponse.class
> For example, 
> [https://github.com/apache/hive/blob/rel/storage-release-2.6.0/storage-api/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java]
>  (pulled in by orc 1.5.8) and 
> [https://github.com/apache/hive/blob/rel/release-2.3.6/common/src/java/org/apache/hadoop/hive/common/ValidReadTxnList.java]
>  (from hive-common 2.3.6) both are in the classpath and they are different. 
> Having both versions in the classpath can cause unexpected behavior due to 
> classloading order. We should still use orc-nohive, which has 
> hive-storage-api shaded.






[jira] [Created] (SPARK-31043) Spark 3.0 built against hadoop2.7 can't start standalone master

2020-03-04 Thread Thomas Graves (Jira)
Thomas Graves created SPARK-31043:
-

 Summary: Spark 3.0 built against hadoop2.7 can't start standalone 
master
 Key: SPARK-31043
 URL: https://issues.apache.org/jira/browse/SPARK-31043
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: Thomas Graves


trying to start a standalone master when building spark branch 3.0 with 
hadoop2.7 fails with:

 
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/w3c/dom/ElementTraversal
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
at 
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
...
Caused by: java.lang.ClassNotFoundException: org.w3c.dom.ElementTraversal
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 42 more






[jira] [Updated] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown

2020-03-04 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-31027:

Fix Version/s: (was: 3.1.0)
   3.0.0

> Refactor `DataSourceStrategy.scala` to minimize the changes to support nested 
> predicate pushdown
> 
>
> Key: SPARK-31027
> URL: https://issues.apache.org/jira/browse/SPARK-31027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-31029) Occasional class not found error in user's Future code using global ExecutionContext

2020-03-04 Thread shanyu zhao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shanyu zhao updated SPARK-31029:

Description: 
*Problem:*
When running the TPC-DS test (https://github.com/databricks/spark-sql-perf), we 
occasionally see an error related to a class not being found:

2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw 
exception: scala.ScalaReflectionException: class 
com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with 
sun.misc.Launcher$AppClassLoader@28ba21f3 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] 
and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class 
sun.misc.Launcher$ExtClassLoader with classpath [...] 
and parent being primordial classloader with boot classpath [...] not found.

*Root cause:*
The Spark driver starts ApplicationMaster in the main thread, which starts a user 
thread and sets MutableURLClassLoader as that thread's ContextClassLoader:
userClassThread = startUserApplication()

The main thread then sets up the YarnSchedulerBackend RPC endpoints, which handle 
these calls using Scala Futures with the default global ExecutionContext:
- doRequestTotalExecutors
- doKillExecutors

If main thread starts a future to handle doKillExecutors() before user thread 
does then the default thread pool thread's ContextClassLoader would be the 
default (AppClassLoader). 
If user thread starts a future first then the thread pool thread will have 
MutableURLClassLoader.

So if user's code uses a future which references a user provided class (only 
MutableURLClassLoader can load), and before the future if there are executor 
lost, you will see errors related to class not found.

*Proposed Solution:*
We can potentially solve this problem in one of two ways:
1) Set the same class loader (userClassLoader) to both the main thread and user 
thread in ApplicationMaster.scala

2) Do not use "ExecutionContext.Implicits.global" in YarnSchedulerBackend
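
To make the race concrete, here is a minimal, standalone sketch (illustration only, not Spark code; the object and variable names are made up) showing that the class loader observed inside a Future on the global ExecutionContext depends on which thread warms up the pool first:

{code:scala}
import java.net.URLClassLoader
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

object GlobalEcClassLoaderDemo {
  def main(args: Array[String]): Unit = {
    // Stand-in for Spark's MutableURLClassLoader: any URLClassLoader that is
    // distinct from the application class loader works for the illustration.
    val userClassLoader =
      new URLClassLoader(Array.empty[java.net.URL], Thread.currentThread().getContextClassLoader)

    val userThread = new Thread(new Runnable {
      override def run(): Unit = {
        // Mirrors what ApplicationMaster does for the user application thread.
        Thread.currentThread().setContextClassLoader(userClassLoader)
        val seenInsideFuture = Future {
          Thread.currentThread().getContextClassLoader
        }(ExecutionContext.Implicits.global)
        // If another thread has already warmed up the global pool, this typically
        // prints the default AppClassLoader instead of userClassLoader, which is
        // the situation that leads to the ScalaReflectionException above.
        println(Await.result(seenInsideFuture, 10.seconds))
      }
    })
    userThread.start()
    userThread.join()
  }
}
{code}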

  was:
*Problem:*
When running tpc-ds test (https://github.com/databricks/spark-sql-perf), 
occasionally we see error related to class not found:

2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw 
exception: scala.ScalaReflectionException: class 
com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with 
sun.misc.Launcher$AppClassLoader@28ba21f3 of type class 
sun.misc.Launcher$AppClassLoader with classpath [...] 
and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class 
sun.misc.Launcher$ExtClassLoader with classpath [...] 
and parent being primordial classloader with boot classpath [...] not found.

*Root cause:*
Spark driver starts ApplicationMaster in the main thread, which starts a user 
thread and set MutableURLClassLoader to that thread's ContextClassLoader.
userClassThread = startUserApplication()

The main thread then setup YarnSchedulerBackend RPC endpoints, which handles 
these calls using scala Future with the default global ExecutionContext:
- doRequestTotalExecutors
- doKillExecutors

If main thread starts a future to handle doKillExecutors() before user thread 
does then the default thread pool thread's ContextClassLoader would be the 
default (AppClassLoader). 
If user thread starts a future first then the thread pool thread will have 
MutableURLClassLoader.

So if user's code uses a future which references a user provided class (only 
MutableURLClassLoader can load), and before the future if there are executor 
lost, you will see errors related to class not found.

*Proposed Solution:*
Set the same class loader (userClassLoader) to both the main thread and user 
thread in ApplicationMaster.scala


> Occasional class not found error in user's Future code using global 
> ExecutionContext
> 
>
> Key: SPARK-31029
> URL: https://issues.apache.org/jira/browse/SPARK-31029
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.4.5
>Reporter: shanyu zhao
>Priority: Major
>
> *Problem:*
> When running tpc-ds test (https://github.com/databricks/spark-sql-perf), 
> occasionally we see error related to class not found:
> 2020-02-04 20:00:26,673 ERROR yarn.ApplicationMaster: User class threw 
> exception: scala.ScalaReflectionException: class 
> com.databricks.spark.sql.perf.ExperimentRun in JavaMirror with 
> sun.misc.Launcher$AppClassLoader@28ba21f3 of type class 
> sun.misc.Launcher$AppClassLoader with classpath [...] 
> and parent being sun.misc.Launcher$ExtClassLoader@3ff5d147 of type class 
> sun.misc.Launcher$ExtClassLoader with classpath [...] 
> and parent being primordial classloader with boot classpath [...] not found.
> *Root cause:*
> Spark driver starts ApplicationMaster in the main thread, which starts a user 
> thread and set 

[jira] [Commented] (SPARK-29058) Reading csv file with DROPMALFORMED showing incorrect record count

2020-03-04 Thread Suchintak Patnaik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051524#comment-17051524
 ] 

Suchintak Patnaik commented on SPARK-29058:
---

[~hyukjin.kwon] Any update on this issue?

> Reading csv file with DROPMALFORMED showing incorrect record count
> --
>
> Key: SPARK-29058
> URL: https://issues.apache.org/jira/browse/SPARK-29058
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Suchintak Patnaik
>Priority: Minor
>
> The spark sql csv reader is dropping malformed records as expected, but the 
> record count is showing as incorrect.
> Consider this file (fruit.csv)
> {code}
> apple,red,1,3
> banana,yellow,2,4.56
> orange,orange,3,5
> {code}
> Defining schema as follows:
> {code}
> schema = "Fruit string,color string,price int,quantity int"
> {code}
> Notice that the "quantity" field is defined as integer type, but the 2nd row 
> in the file contains a floating point value, hence it is a corrupt record.
> {code}
> >>> df = spark.read.csv(path="fruit.csv",mode="DROPMALFORMED",schema=schema)
> >>> df.show()
> +------+------+-----+--------+
> | Fruit| color|price|quantity|
> +------+------+-----+--------+
> | apple|   red|    1|       3|
> |orange|orange|    3|       5|
> +------+------+-----+--------+
> >>> df.count()
> 3
> {code}
> The malformed record is getting dropped as expected, but an incorrect record 
> count is being displayed.
> Here df.count() should return 2.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31042) Error in writing a pyspark streaming dataframe created from Kafka source to a csv file

2020-03-04 Thread Suchintak Patnaik (Jira)
Suchintak Patnaik created SPARK-31042:
-

 Summary: Error in writing a pyspark streaming dataframe created 
from Kafka source to a csv file 
 Key: SPARK-31042
 URL: https://issues.apache.org/jira/browse/SPARK-31042
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Structured Streaming
Affects Versions: 2.4.5
Reporter: Suchintak Patnaik


Writing a streaming dataframe created from a Kafka source to a csv file gives 
the following error in PySpark.

NOTE : The same streaming dataframe is getting displayed in the console.

sdf.writeStream.format("console").start().awaitTermination()  // Working

sdf.writeStream\
.format("csv")\
.option("path", "C://output")\
.option("checkpointLocation", "C://Checkpoint")\
.outputMode("append")\
.start().awaitTermination()// Not working


Error
-
 *File "C:\Spark\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
  File "C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py", line 328, in 
get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling 
o63.awaitTermination.
: org.apache.spark.sql.streaming.StreamingQueryException: Expected e.g. 
{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}, got {"logOffset":1}
=== Streaming Query ===
Identifier: [id = 6718625c-489e-44c8-b273-0da3429e97a8, runId = 
b64887ba-ca32-499e-9ab5-f839fd44ec26]
Current Committed Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}}
Current Available Offsets: {KafkaV2[Subscribe[test1]]: {"logOffset":1}}

Current State: ACTIVE
Thread State: RUNNABLE*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:
{code:java}
./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud {code}
 
 But this doesn't:
{code:java}
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip{code}
 

The latter invocation yields the following, confusing output:
{code:java}
 + VERSION=' -X,--debug Produce execution debug output'{code}
 

  was:
This works:

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud

```
  
 But this doesn't:

```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```


> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> {code:java}
> ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud {code}
>  
>  But this doesn't:
> {code:java}
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip{code}
>  
> The latter invocation yields the following, confusing output:
> {code:java}
>  + VERSION=' -X,--debug Produce execution debug output'{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud

```
  
 But this doesn't:

```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```

  was:
This works:

 

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud
 ```
  
 But this doesn't:
  
 ```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```


> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
> ```
>  ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud
> ```
>   
>  But this doesn't:
> ```
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Description: 
This works:

 

```
 ./dev/make-distribution.sh \
 --pip \
 -Phadoop-2.7 -Phive -Phadoop-cloud
 ```
  
 But this doesn't:
  
 ```
 ./dev/make-distribution.sh \
 -Phadoop-2.7 -Phive -Phadoop-cloud \
 --pip

```

  was:
This works:

 

```
./dev/make-distribution.sh \
--pip \
-Phadoop-2.7 -Phive -Phadoop-cloud
```
 
But this doesn't:
 
```
./dev/make-distribution.sh \
-Phadoop-2.7 -Phive -Phadoop-cloud \
--pip```


> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
>  
> ```
>  ./dev/make-distribution.sh \
>  --pip \
>  -Phadoop-2.7 -Phive -Phadoop-cloud
>  ```
>   
>  But this doesn't:
>   
>  ```
>  ./dev/make-distribution.sh \
>  -Phadoop-2.7 -Phive -Phadoop-cloud \
>  --pip
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31041) Make arguments to make-distribution.sh position-independent

2020-03-04 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-31041:
-
Summary: Make arguments to make-distribution.sh position-independent  (was: 
Make argument to make-distribution position-independent)

> Make arguments to make-distribution.sh position-independent
> ---
>
> Key: SPARK-31041
> URL: https://issues.apache.org/jira/browse/SPARK-31041
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Trivial
>
> This works:
>  
> ```
> ./dev/make-distribution.sh \
> --pip \
> -Phadoop-2.7 -Phive -Phadoop-cloud
> ```
>  
> But this doesn't:
>  
> ```
> ./dev/make-distribution.sh \
> -Phadoop-2.7 -Phive -Phadoop-cloud \
> --pip```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31041) Make argument to make-distribution position-independent

2020-03-04 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-31041:


 Summary: Make argument to make-distribution position-independent
 Key: SPARK-31041
 URL: https://issues.apache.org/jira/browse/SPARK-31041
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Nicholas Chammas


This works:

 

```
./dev/make-distribution.sh \
--pip \
-Phadoop-2.7 -Phive -Phadoop-cloud
```
 
But this doesn't:
 
```
./dev/make-distribution.sh \
-Phadoop-2.7 -Phive -Phadoop-cloud \
--pip```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31009) Support json_object_keys function

2020-03-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31009:
---
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> Mysql -> 
> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]
> MariaDB -> [https://mariadb.com/kb/en/json-functions/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31008) Support json_array_length function

2020-03-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31008:
---
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support json_array_length function
> --
>
> Key: SPARK-31008
> URL: https://issues.apache.org/jira/browse/SPARK-31008
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> At the moment we don't support json_array_length function in spark.
> This function is supported by
> a.) PostgreSQL -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> b.) Presto -> [https://prestodb.io/docs/current/functions/json.html]
> c.) redshift -> 
> [https://docs.aws.amazon.com/redshift/latest/dg/JSON_ARRAY_LENGTH.html]
>  
> This allows naive users to directly get the array length with a well-defined 
> JSON function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31035) Show assigned resource information for local mode

2020-03-04 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-31035.

Resolution: Invalid

Resource-aware scheduling doesn't support local mode for now, so I'll close this 
ticket.

> Show assigned resource information for local mode
> -
>
> Key: SPARK-31035
> URL: https://issues.apache.org/jira/browse/SPARK-31035
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> ExecutorsPage shows resource information like GPUs and FPGAs for each 
> Executor.
> But for local mode, resource information is not shown.
> It's useful during application development if we can confirm the information 
> from WebUI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31040) Offsets are only logged for partitions which had data this causes next batch to read the partitions that were not included from the beginning when using kafka

2020-03-04 Thread Richard Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Gilmore updated SPARK-31040:

Description: 
Each batch should either log all offsets for each partition or should scan back 
across offset logs.

[https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala]

offset log 23615

 
{code:java}
{"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}%
{code}
 

 

offset log 23616

 
{code:java}
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%
{code}
 

 
{code:java}
/0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with 
committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets 
{KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit
 log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 
INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 
26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 
26533608, myTopic.myTopic.orders-5 -> 26533486)
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have 
been missed.

{code}
 

 

  was:
Each batch should either log all offsets for each partition or should scan back 
across commit logs.

[https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala]

offset log 23615

 
{code:java}
{"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}%
{code}
 

 

offset log 23616

 
{code:java}
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%
{code}
 

 
{code:java}
/0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with 
committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets 
{KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit
 log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 
INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 
26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 
26533608, myTopic.myTopic.orders-5 -> 26533486)
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have 
been missed.

{code}
 

 


> Offsets are only logged for partitions which had data this causes next batch 
> to read the partitions that were not included from the beginning when using 
> kafka
> --
>
> Key: SPARK-31040
> URL: https://issues.apache.org/jira/browse/SPARK-31040
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.4, 2.4.5
>Reporter: Richard Gilmore
>Priority: Major
>
> Each batch should either log all offsets for each partition or should scan 
> back across offset logs.
> [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala]
> offset log 23615
>  
> {code:java}
> {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}%
> {code}
>  
>  
> offset log 23616
>  
> {code:java}
> {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%
> {code}

[jira] [Updated] (SPARK-31040) Offsets are only logged for partitions which had data this causes next batch to read the partitions that were not included from the beginning when using kafka

2020-03-04 Thread Richard Gilmore (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Gilmore updated SPARK-31040:

Description: 
Each batch should either log all offsets for each partition or should scan back 
across commit logs.

[https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala]

offset log 23615

 
{code:java}
{"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}%
{code}
 

 

offset log 23616

 
{code:java}
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%
{code}
 

 
{code:java}
/0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with 
committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets 
{KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit
 log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 
INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 
26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 
26533608, myTopic.myTopic.orders-5 -> 26533486)
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have 
been missed.

{code}
 

 

  was:
Each batch should either log all offsets for each partition or should scan back 
across commit logs.

[https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala]

offset log 23615

 
{code:java}
{"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}%
{code}
 

 

offset log 23616

 
{code:java}
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%Topic
{code}
 

 
{code:java}
/0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with 
committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets 
{KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit
 log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 
INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 
26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 
26533608, myTopic.myTopic.orders-5 -> 26533486)
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have 
been missed.

{code}
 

 


> Offsets are only logged for partitions which had data this causes next batch 
> to read the partitions that were not included from the beginning when using 
> kafka
> --
>
> Key: SPARK-31040
> URL: https://issues.apache.org/jira/browse/SPARK-31040
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0, 2.4.4, 2.4.5
>Reporter: Richard Gilmore
>Priority: Major
>
> Each batch should either log all offsets for each partition or should scan 
> back across commit logs.
> [https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala]
> offset log 23615
>  
> {code:java}
> {"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}%
> {code}
>  
>  
> offset log 23616
>  
> {code:java}
> {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%
> 

[jira] [Created] (SPARK-31040) Offsets are only logged for partitions which had data this causes next batch to read the partitions that were not included from the beginning when using kafka

2020-03-04 Thread Richard Gilmore (Jira)
Richard Gilmore created SPARK-31040:
---

 Summary: Offsets are only logged for partitions which had data 
this causes next batch to read the partitions that were not included from the 
beginning when using kafka
 Key: SPARK-31040
 URL: https://issues.apache.org/jira/browse/SPARK-31040
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.5, 2.4.4, 2.4.0
Reporter: Richard Gilmore


Each batch should either log all offsets for each partition or should scan back 
across commit logs.

[https://github.com/apache/spark/blob/v2.4.4/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala]

offset log 23615

 
{code:java}
{"myTopic.myTopic.orders":{"2":27531503,"5":27562423,"4":27528794,"1":27514991,"3":27528899,"0":27504949}}%
{code}
 

 

offset log 23616

 
{code:java}
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%Topic
{code}
 

 
{code:java}
/0/03/04 13:49:05 INFO MicroBatchExecution: Resuming at batch 26317 with 
committed offsets {KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}} and available offsets 
{KafkaV2[Subscribe[myTopic.myTopic.orders]]: 
{"myTopic.myTopic.orders":{"2":27531625,"5":27562568,"4":27528990,"1":27515131,"3":27529075,"0":27505141}}}commit
 log: {"myTopic.myTopic.orders":{"1":27515130,"0":27505140}}%0/03/04 13:50:24 
INFO KafkaMicroBatchReader: Partitions added: Map(myTopic.myTopic.orders-3 -> 
26533520, myTopic.myTopic.orders-2 -> 26533730, myTopic.myTopic.orders-4 -> 
26533608, myTopic.myTopic.orders-5 -> 26533486)
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-3 starts from 26533520 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-2 starts from 26533730 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-4 starts from 26533608 instead of 0. Some data may have 
been missed.
20/03/04 13:50:24 WARN KafkaMicroBatchReader: Added partition 
myTopic.myTopic.orders-5 starts from 26533486 instead of 0. Some data may have 
been missed.

{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31039) Unable to use vendor specific datatypes with JDBC

2020-03-04 Thread Frank Oosterhuis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051327#comment-17051327
 ] 

Frank Oosterhuis commented on SPARK-31039:
--

As a workaround I have created the table manually and am using the option 
"truncate" with saveMode "overwrite".

You can then just insert "13:17:00" strings :)
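
A rough sketch of that workaround in Scala (illustration only; the JDBC URL, table name, and function name are placeholders, and credentials are omitted). Because the table is pre-created with the vendor-specific time(7) column and only truncated on overwrite, createTableColumnTypes is not needed on the Spark side:

{code:scala}
import org.apache.spark.sql.{DataFrame, SaveMode}

def reloadCalls(df: DataFrame): Unit =
  df.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://dbhost;databaseName=mydb") // placeholder URL
    .option("dbtable", "dbo.Calls")                              // table created manually with time(7)
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("truncate", "true")  // keep the existing table definition on overwrite
    .mode(SaveMode.Overwrite)
    .save()
{code}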

> Unable to use vendor specific datatypes with JDBC
> -
>
> Key: SPARK-31039
> URL: https://issues.apache.org/jira/browse/SPARK-31039
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: Frank Oosterhuis
>Priority: Major
>
> I'm trying to create a table in MSSQL with a time(7) type.
> For this I'm using the createTableColumnTypes option like "CallStartTime 
> time(7)", with driver 
> "{color:#212121}com.microsoft.sqlserver.jdbc.SQLServerDriver"{color}
> I'm getting an error:  
> {color:#212121}org.apache.spark.sql.catalyst.parser.ParseException: DataType 
> time(7) is not supported.(line 1, pos 43){color}
> {color:#212121}What is then the point of using this option?{color}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31039) Unable to use vendor specific datatypes with JDBC

2020-03-04 Thread Frank Oosterhuis (Jira)
Frank Oosterhuis created SPARK-31039:


 Summary: Unable to use vendor specific datatypes with JDBC
 Key: SPARK-31039
 URL: https://issues.apache.org/jira/browse/SPARK-31039
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
Reporter: Frank Oosterhuis


I'm trying to create a table in MSSQL with a time(7) type.

For this I'm using the createTableColumnTypes option like "CallStartTime time(7)", 
with driver "com.microsoft.sqlserver.jdbc.SQLServerDriver".

I'm getting an error:
org.apache.spark.sql.catalyst.parser.ParseException: DataType time(7) is not supported.(line 1, pos 43)

What is then the point of using this option?

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31038) Add checkValue for spark.sql.session.timeZone

2020-03-04 Thread Kent Yao (Jira)
Kent Yao created SPARK-31038:


 Summary: Add checkValue for spark.sql.session.timeZone
 Key: SPARK-31038
 URL: https://issues.apache.org/jira/browse/SPARK-31038
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


The `spark.sql.session.timeZone` config currently accepts any string value, 
including invalid time zone ids, which then fails other queries that rely on the 
time zone. We should validate the value when it is set and fail fast if the zone 
id is invalid.
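
A minimal sketch of the kind of eager check this proposes (illustration only, not Spark's actual checkValue; Spark's real validation may accept additional forms such as short three-letter ids):

{code:scala}
import java.time.ZoneId
import scala.util.Try

// Fail fast with an IllegalArgumentException at set time instead of failing
// later queries that rely on the session time zone.
def checkTimeZone(id: String): Unit =
  require(
    Try(ZoneId.of(id)).isSuccess,
    s"Cannot resolve the given time zone id '$id'; use a region-based id such as " +
      "'America/Los_Angeles' or a fixed offset such as '+08:00'.")

// checkTimeZone("America/Los_Angeles")   // passes
// checkTimeZone("Mars/Olympus_Mons")     // throws IllegalArgumentException
{code}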



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31037) refine AQE config names

2020-03-04 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31037:
---

 Summary: refine AQE config names
 Key: SPARK-31037
 URL: https://issues.apache.org/jira/browse/SPARK-31037
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31009) Support json_object_keys function

2020-03-04 Thread Rakesh Raushan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rakesh Raushan updated SPARK-31009:
---
Description: 
This function will return all the keys from the outer JSON object.

 

PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]

Mysql -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]

MariaDB -> [https://mariadb.com/kb/en/json-functions/]

  was:
This function will return all the keys from outer json object.

 

PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]

Mysql -> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]


> Support json_object_keys function
> -
>
> Key: SPARK-31009
> URL: https://issues.apache.org/jira/browse/SPARK-31009
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Rakesh Raushan
>Priority: Major
>
> This function will return all the keys from outer json object.
>  
> PostgreSQL  -> [https://www.postgresql.org/docs/9.3/functions-json.html]
> Mysql -> 
> [https://dev.mysql.com/doc/refman/8.0/en/json-function-reference.html]
> MariaDB -> [https://mariadb.com/kb/en/json-functions/]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown

2020-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31027.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27778
[https://github.com/apache/spark/pull/27778]

> Refactor `DataSourceStrategy.scala` to minimize the changes to support nested 
> predicate pushdown
> 
>
> Key: SPARK-31027
> URL: https://issues.apache.org/jira/browse/SPARK-31027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31027) Refactor `DataSourceStrategy.scala` to minimize the changes to support nested predicate pushdown

2020-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31027:


Assignee: DB Tsai

> Refactor `DataSourceStrategy.scala` to minimize the changes to support nested 
> predicate pushdown
> 
>
> Key: SPARK-31027
> URL: https://issues.apache.org/jira/browse/SPARK-31027
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.5
>Reporter: DB Tsai
>Assignee: DB Tsai
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30992) Arrange scattered config of streaming module

2020-03-04 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-30992:
---
Summary: Arrange scattered config of streaming module  (was:  Arrange 
scattered config for streaming)

> Arrange scattered config of streaming module
> 
>
> Key: SPARK-30992
> URL: https://issues.apache.org/jira/browse/SPARK-30992
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> I found a lot of scattered configs in the Streaming module.
> I think we should arrange these configs in a unified place.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30992) Arrange scattered config for streaming

2020-03-04 Thread Gabor Somogyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabor Somogyi updated SPARK-30992:
--
Component/s: (was: Structured Streaming)
 DStreams

>  Arrange scattered config for streaming
> ---
>
> Key: SPARK-30992
> URL: https://issues.apache.org/jira/browse/SPARK-30992
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> I found a lot of scattered configs in the Streaming module.
> I think we should arrange these configs in a unified place.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31006) Mark Spark streaming as deprecated and add warnings.

2020-03-04 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051179#comment-17051179
 ] 

Gabor Somogyi commented on SPARK-31006:
---

I see many users are still using DStreams at the moment. Not sure that dropping 
support is a good message.

In general I always suggest using Structured Streaming, but I think it would be 
good to wait for users to migrate...

> Mark Spark streaming as deprecated and add warnings.
> 
>
> Key: SPARK-31006
> URL: https://issues.apache.org/jira/browse/SPARK-31006
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Prashant Sharma
>Priority: Major
>
> It has been noticed that some users of Spark Streaming do not immediately 
> realise that it is a deprecated component, and it would be scary if they end 
> up with it in production. Now that we are in a position to release 
> Spark 3.0.0, maybe we should discuss: should Spark Streaming carry an 
> explicit notice that it is not under active development?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31017) Test for shuffle requests packaging with different size and numBlocks limit

2020-03-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31017.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27767
[https://github.com/apache/spark/pull/27767]

> Test for shuffle requests packaging with different size and numBlocks limit
> ---
>
> Key: SPARK-31017
> URL: https://issues.apache.org/jira/browse/SPARK-31017
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
> When packaging shuffle fetch requests in ShuffleBlockFetcherIterator, there 
> are two limitations: maxBytesInFlight and maxBlocksInFlightPerAddress. 
> However, we don’t have test cases to test them both, e.g. the size limitation 
> is hit before the numBlocks limitation.
> We should add test cases in ShuffleBlockFetcherIteratorSuite to test:
>  # the size limitation is hit before the numBlocks limitation
>  # the numBlocks limitation is hit before the size limitation
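
To make the two limits concrete, here is a small, self-contained sketch of the packaging rule those test cases exercise (illustration only, not Spark's ShuffleBlockFetcherIterator; the function name and thresholds are made up):

{code:scala}
// Group (blockId, size) pairs into fetch requests, cutting a request as soon as
// either the accumulated size or the block count reaches its limit.
def packRequests(
    blocks: Seq[(String, Long)],
    maxBytesPerRequest: Long,
    maxBlocksPerRequest: Int): Seq[Seq[(String, Long)]] = {
  val requests = Seq.newBuilder[Seq[(String, Long)]]
  var current = Vector.empty[(String, Long)]
  var currentBytes = 0L
  blocks.foreach { case block @ (_, size) =>
    current = current :+ block
    currentBytes += size
    if (currentBytes >= maxBytesPerRequest || current.size >= maxBlocksPerRequest) {
      requests += current
      current = Vector.empty
      currentBytes = 0L
    }
  }
  if (current.nonEmpty) requests += current
  requests.result()
}

// With maxBytesPerRequest = 100 and maxBlocksPerRequest = 3:
//   packRequests(Seq("a" -> 60L, "b" -> 60L, "c" -> 10L, "d" -> 10L, "e" -> 10L, "f" -> 10L), 100L, 3)
// cuts the first request on the size limit ("a", "b"), the second on the block
// count limit ("c", "d", "e"), and leaves ("f") as the final request.
{code}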



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31017) Test for shuffle requests packaging with different size and numBlocks limit

2020-03-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31017:
---

Assignee: wuyi

> Test for shuffle requests packaging with different size and numBlocks limit
> ---
>
> Key: SPARK-31017
> URL: https://issues.apache.org/jira/browse/SPARK-31017
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core, Tests
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
> When packaging shuffle fetch requests in ShuffleBlockFetcherIterator, there 
> are two limitations: maxBytesInFlight and maxBlocksInFlightPerAddress. 
> However, we don’t have test cases to test them both, e.g. the size limitation 
> is hit before the numBlocks limitation.
> We should add test cases in ShuffleBlockFetcherIteratorSuite to test:
>  # the size limitation is hit before the numBlocks limitation
>  # the numBlocks limitation is hit before the size limitation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar

2020-03-04 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051094#comment-17051094
 ] 

Wenchen Fan commented on SPARK-30951:
-

[~bersprockets] You are making a good point here. It's very hard to roll out 
the calendar switch smoothly, but we should at least give users a way to read 
their legacy data.

The Hive approach looks good to me. [~maxgekk] can we implement something like 
that?

> Potential data loss for legacy applications after switch to proleptic 
> Gregorian calendar
> 
>
> Key: SPARK-30951
> URL: https://issues.apache.org/jira/browse/SPARK-30951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Bruce Robbins
>Priority: Major
>
> tl;dr: We recently discovered some Spark 2.x sites that have lots of data 
> containing dates before October 15, 1582. This could be an issue when such 
> sites try to upgrade to Spark 3.0.
> From SPARK-26651:
> {quote}"The changes might impact on the results for dates and timestamps 
> before October 15, 1582 (Gregorian)
> {quote}
> We recently discovered that some large scale Spark 2.x applications rely on 
> dates before October 15, 1582.
> Two cases came up recently:
>  * An application that uses a commercial third-party library to encode 
> sensitive dates. On insert, the library encodes the actual date as some other 
> date. On select, the library decodes the date back to the original date. The 
> encoded value could be any date, including one before October 15, 1582 (e.g., 
> "0602-04-04").
>  * An application that uses a specific unlikely date (e.g., "1200-01-01") as 
> a marker to indicate "unknown date" (in lieu of null)
> Both sites ran into problems after another component in their system was 
> upgraded to use the proleptic Gregorian calendar. Spark applications that 
> read files created by the upgraded component were interpreting encoded or 
> marker dates incorrectly, and vice versa. Also, their data now had a mix of 
> calendars (hybrid and proleptic Gregorian) with no metadata to indicate which 
> file used which calendar.
> Both sites had enormous amounts of existing data, so re-encoding the dates 
> using some other scheme was not a feasible solution.
> This is relevant to Spark 3:
> Any Spark 2 application that uses such date-encoding schemes may run into 
> trouble when run on Spark 3. The application may not properly interpret the 
> dates previously written by Spark 2. Also, once the Spark 3 version of the 
> application writes data, the tables will have a mix of calendars (hybrid and 
> proleptic gregorian) with no metadata to indicate which file uses which 
> calendar.
> Similarly, sites might run with mixed Spark versions, resulting in data 
> written by one version that cannot be interpreted by the other. And as above, 
> the tables will now have a mix of calendars with no way to detect which file 
> uses which calendar.
> As with the two real-life example cases, these applications may have enormous 
> amounts of legacy data, so re-encoding the dates using some other scheme may 
> not be feasible.
> We might want to consider a configuration setting to allow the user to 
> specify the calendar for storing and retrieving date and timestamp values 
> (not sure how such a flag would affect other date and timestamp-related 
> functions). I realize the change is far bigger than just adding a 
> configuration setting.
> Here's a quick example of where trouble may happen, using the real-life case 
> of the marker date.
> In Spark 2.4:
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 1
> scala>
> {noformat}
> In Spark 3.0 (reading from the same legacy file):
> {noformat}
> scala> spark.read.orc(s"$home/data/datefile").filter("dt == 
> '1200-01-01'").count
> res0: Long = 0
> scala> 
> {noformat}
> By the way, Hive had a similar problem. Hive switched from hybrid calendar to 
> proleptic Gregorian calendar between 2.x and 3.x. After some upgrade 
> headaches related to dates before 1582, the Hive community made the following 
> changes:
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> checks a configuration setting to determine which calendar to use.
>  * When writing date or timestamp data to ORC, Parquet, and Avro files, Hive 
> stores the calendar type in the metadata.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files, 
> Hive checks the metadata for the calendar type.
>  * When reading date or timestamp data from ORC, Parquet, and Avro files that 
> lack calendar metadata, Hive's behavior is determined by a configuration 
> setting. This allows Hive to read legacy data (note: if the data already 
> 

[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks

2020-03-04 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051076#comment-17051076
 ] 

Maxim Gekk commented on SPARK-30563:


[~petertoth] If you think it is possible to avoid some of the overhead of the 
NoOp datasource, please open a PR.

> Regressions in Join benchmarks
> --
>
> Key: SPARK-30563
> URL: https://issues.apache.org/jira/browse/SPARK-30563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Regenerated benchmark results in the 
> https://github.com/apache/spark/pull/27078 shows many regressions in 
> JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see
> old results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10
> new results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10
> One of the difference in queries is using the `NoOp` datasource in new 
> queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks

2020-03-04 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051073#comment-17051073
 ] 

Maxim Gekk commented on SPARK-30563:


> we spend a lot of time in this loop even

The loop just forces materialization of the joined rows. With 
df.groupBy().count(), it seems you skip some steps of the join. I think in most 
cases users need the results of the join, not just a count on top of it.

> Regressions in Join benchmarks
> --
>
> Key: SPARK-30563
> URL: https://issues.apache.org/jira/browse/SPARK-30563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Regenerated benchmark results in the 
> https://github.com/apache/spark/pull/27078 shows many regressions in 
> JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see
> old results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10
> new results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10
> One of the difference in queries is using the `NoOp` datasource in new 
> queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30960) add back the legacy date/timestamp format support in CSV/JSON parser

2020-03-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30960.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27710
[https://github.com/apache/spark/pull/27710]

> add back the legacy date/timestamp format support in CSV/JSON parser
> 
>
> Key: SPARK-30960
> URL: https://issues.apache.org/jira/browse/SPARK-30960
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31036) Use stringArgs in Expression.toString to respect hidden parameters

2020-03-04 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-31036:


 Summary: Use stringArgs in Expression.toString to respect hidden 
parameters
 Key: SPARK-31036
 URL: https://issues.apache.org/jira/browse/SPARK-31036
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Hyukjin Kwon


Currently, on top of https://github.com/apache/spark/pull/27657, running

{code}
val identify = udf((input: Seq[Int]) => input)
spark.range(10).select(identify(array("id"))).show()
{code}

shows the hidden parameter `useStringTypeWhenEmpty`.

{code}
+-+
|UDF(array(id, false))|
+-+
|  [0]|
|  [1]|
...
{code}

This is a general problem and we should respect hidden parameters.
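
A small standalone sketch of the idea (illustration only; this mimics the pattern rather than reproducing Catalyst's TreeNode, and the class names here are made up):

{code:scala}
abstract class Node extends Product {
  // Mirrors the role of stringArgs: by default every constructor argument is
  // printed, but subclasses can drop hidden parameters from the string form.
  protected def stringArgs: Iterator[Any] = productIterator
  override def toString: String =
    s"${getClass.getSimpleName}(${stringArgs.mkString(", ")})"
}

// `useStringTypeWhenEmpty` plays the role of the hidden parameter from the example above.
case class CreateArray(children: Seq[String], useStringTypeWhenEmpty: Boolean) extends Node {
  override protected def stringArgs: Iterator[Any] = Iterator(children)
}

// CreateArray(Seq("id"), useStringTypeWhenEmpty = false).toString
// prints "CreateArray(List(id))" instead of leaking the boolean flag.
{code}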



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30563) Regressions in Join benchmarks

2020-03-04 Thread Peter Toth (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17051014#comment-17051014
 ] 

Peter Toth commented on SPARK-30563:


[~maxgekk], [~dongjoon], [~hyukjin.kwon] it looks like the change in the 
{{JoinBenchmark}} 
([https://github.com/apache/spark/commit/f5118f81e395bde0cd8253dbef6a9e6455c3958a#diff-da1033f4d10b6046046202dd8f85e3f7L49-R49])
 causes this regression. If we used {{df.groupBy().count().noop()}} and measured 
the same thing as previously, there would be no regression in this suite. Please 
see the results of running the fixed benchmark on my machine: 
[https://github.com/peter-toth/spark/commit/207d15d1801cfcf9c40635a481d4aa7192911548]

This is because lots of rows are returned and we spend a lot of time in [this 
loop|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L438-L442]
 even if {{NoopWriter}} does nothing. 

Another very minor improvement regarding the `NoOp` datasource could be to turn 
off the commit coordinator in {{NoopBatchWrite}}.

Shall I open a PR with these changes (excluding the non-official benchmark 
result)?

 

> Regressions in Join benchmarks
> --
>
> Key: SPARK-30563
> URL: https://issues.apache.org/jira/browse/SPARK-30563
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Regenerated benchmark results in the 
> https://github.com/apache/spark/pull/27078 shows many regressions in 
> JoinBenchmark. The benchmarked queries slowed down by up to 3 times, see
> old results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dL10
> new results:
> https://github.com/apache/spark/pull/27078/files#diff-d5cbaab2b49ee9fddfa0e229de8f607dR10
> One of the difference in queries is using the `NoOp` datasource in new 
> queries.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org