[jira] [Updated] (SPARK-30120) LSH approxNearestNeighbors should use TopByKeyAggregator when numNearestNeighbors is small

2019-12-03 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30120:
-
Description: ping [~huaxingao]

> LSH approxNearestNeighbors should use TopByKeyAggregator when 
> numNearestNeighbors is small
> --
>
> Key: SPARK-30120
> URL: https://issues.apache.org/jira/browse/SPARK-30120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> ping [~huaxingao]
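
For context, a minimal PySpark sketch of the API whose internals would change (assuming an active SparkSession named {{spark}}); the proposal only affects how the top {{numNearestNeighbors}} rows are selected internally, not the user-facing call:

{code:python}
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 1.0])), (1, Vectors.dense([1.0, -1.0])),
     (2, Vectors.dense([-1.0, -1.0])), (3, Vectors.dense([-1.0, 1.0]))],
    ["id", "features"])

model = BucketedRandomProjectionLSH(
    inputCol="features", outputCol="hashes",
    bucketLength=2.0, numHashTables=3).fit(df)

# With a small numNearestNeighbors, the ticket proposes selecting the top k
# rows by distance with a TopByKeyAggregator-style aggregation instead of the
# current sort-and-limit over the whole distance column.
model.approxNearestNeighbors(df, Vectors.dense([1.0, 0.0]),
                             numNearestNeighbors=2).show()
{code}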






[jira] [Created] (SPARK-30120) LSH approxNearestNeighbors should use TopByKeyAggregator when numNearestNeighbors is small

2019-12-03 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30120:


 Summary: LSH approxNearestNeighbors should use TopByKeyAggregator 
when numNearestNeighbors is small
 Key: SPARK-30120
 URL: https://issues.apache.org/jira/browse/SPARK-30120
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng









[jira] [Commented] (SPARK-28921) Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 1.12.10, 1.11.10)

2019-12-03 Thread jugosag (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987593#comment-16987593
 ] 

jugosag commented on SPARK-28921:
-

We are also observing this on two of our clusters, both set up with Rancher,
one on Kubernetes version 1.15.5, the other on 1.14.6. Interestingly, it works
on Minikube with version 1.15.4.

The problem is that the Spark version that has the fix (2.4.5) has not been
released yet, and the 3.0.0 preview from Nov 6th has the same problem.

When will 2.4.5 be released?

 

> Spark jobs failing on latest versions of Kubernetes (1.15.3, 1.14.6, 1,13.10, 
> 1.12.10, 1.11.10)
> ---
>
> Key: SPARK-28921
> URL: https://issues.apache.org/jira/browse/SPARK-28921
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0, 2.3.1, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: Paul Schweigert
>Assignee: Andy Grove
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Spark jobs are failing on latest versions of Kubernetes when jobs attempt to 
> provision executor pods (jobs like Spark-Pi that do not launch executors run 
> without a problem):
>  
> Here's an example error message:
>  
> {code:java}
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.
> 19/08/30 01:29:09 INFO ExecutorPodsAllocator: Going to request 2 executors 
> from Kubernetes.19/08/30 01:29:09 WARN WatchConnectionManager: Exec Failure: 
> HTTP 403, Status: 403 - 
> java.net.ProtocolException: Expected HTTP 101 response but was '403 
> Forbidden' 
> at 
> okhttp3.internal.ws.RealWebSocket.checkResponse(RealWebSocket.java:216) 
> at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:183) 
> at okhttp3.RealCall$AsyncCall.execute(RealCall.java:141) 
> at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) 
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  
> at java.lang.Thread.run(Thread.java:748)
> {code}
>  
> Looks like the issue is caused by fixes for a recent CVE : 
> CVE: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-14809]
> Fix: [https://github.com/fabric8io/kubernetes-client/pull/1669]
>  
> Looks like upgrading kubernetes-client to 4.4.2 would solve this issue.






[jira] [Resolved] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30099.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26734
[https://github.com/apache/spark/pull/26734]

> Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
> 
>
> Key: SPARK-30099
> URL: https://issues.apache.org/jira/browse/SPARK-30099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Minor
> Fix For: 3.0.0
>
>
> Spark SQL 
>  explain extended select * from any non-existent table shows duplicate 
> AnalysisExceptions.
> {code:java}
>  spark-sql>explain extended select * from wrong
> == Parsed Logical Plan ==
>  'Project [*]
>  +- 'UnresolvedRelation `wrong`
> == Analyzed Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Optimized Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Physical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  Time taken: 6.0 seconds, Fetched 1 row(s)
>  19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 
> row
>  (s)
>  spark-sql>
> {code}






[jira] [Assigned] (SPARK-30099) Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30099:
---

Assignee: jobit mathew

> Improve Analyzed Logical Plan as duplicate AnalysisExceptions are coming
> 
>
> Key: SPARK-30099
> URL: https://issues.apache.org/jira/browse/SPARK-30099
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Minor
>
> Spark SQL 
>  explain extended select * from any non-existent table shows duplicate 
> AnalysisExceptions.
> {code:java}
>  spark-sql>explain extended select * from wrong
> == Parsed Logical Plan ==
>  'Project [*]
>  +- 'UnresolvedRelation `wrong`
> == Analyzed Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Optimized Logical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  == Physical Plan ==
>  org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 
> 1 p
>  os 31
>  Time taken: 6.0 seconds, Fetched 1 row(s)
>  19/12/02 14:33:32 INFO SparkSQLCLIDriver: Time taken: 6.0 seconds, Fetched 1 
> row
>  (s)
>  spark-sql>
> {code}






[jira] [Updated] (SPARK-30082) Zeros are being treated as NaNs

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30082:

Fix Version/s: 2.4.5

> Zeros are being treated as NaNs
> ---
>
> Key: SPARK-30082
> URL: https://issues.apache.org/jira/browse/SPARK-30082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: John Ayad
>Assignee: John Ayad
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> If you attempt to run
> {code:java}
> df = df.replace(float('nan'), somethingToReplaceWith)
> {code}
> It will replace all {{0}} s in columns of type {{Integer}}
> Example code snippet to repro this:
> {code:java}
> from pyspark.sql import SQLContext
> spark = SQLContext(sc).sparkSession
> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value"))
> df.show()
> df = df.replace(float('nan'), 5)
> df.show()
> {code}
> Here's the output I get when I run this code:
> {code:java}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /__ / .__/\_,_/_/ /_/\_\   version 2.4.4
>       /_/
> Using Python version 3.7.5 (default, Nov  1 2019 02:16:32)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SQLContext
> >>> spark = SQLContext(sc).sparkSession
> >>> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value"))
> >>> df.show()
> +-----+-----+
> |index|value|
> +-----+-----+
> |    1|    0|
> |    2|    3|
> |    3|    0|
> +-----+-----+
> >>> df = df.replace(float('nan'), 5)
> >>> df.show()
> +-----+-----+
> |index|value|
> +-----+-----+
> |    1|    5|
> |    2|    3|
> |    3|    5|
> +-----+-----+
> >>>
> {code}
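
Until the fix is picked up, a possible workaround on 2.4.x is to restrict the replacement to floating-point columns; a sketch against the example frame above:

{code:python}
# Only float/double columns can actually contain NaN, so limiting the
# replacement to them avoids touching integer zeros.
float_cols = [c for c, t in df.dtypes if t in ("float", "double")]
df = df.replace(float("nan"), 5, subset=float_cols)
{code}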






[jira] [Updated] (SPARK-30119) Support pagination for spark streaming tab

2019-12-03 Thread jobit mathew (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jobit mathew updated SPARK-30119:
-
Affects Version/s: 3.0.0

> Support pagination for  spark streaming tab
> ---
>
> Key: SPARK-30119
> URL: https://issues.apache.org/jira/browse/SPARK-30119
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.4, 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> Support pagination for spark streaming tab






[jira] [Commented] (SPARK-30119) Support pagination for spark streaming tab

2019-12-03 Thread Rakesh Raushan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987522#comment-16987522
 ] 

Rakesh Raushan commented on SPARK-30119:


I will raise PR for this one.

> Support pagination for  spark streaming tab
> ---
>
> Key: SPARK-30119
> URL: https://issues.apache.org/jira/browse/SPARK-30119
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>
> Support pagination for spark streaming tab






[jira] [Created] (SPARK-30119) Support pagination for spark streaming tab

2019-12-03 Thread jobit mathew (Jira)
jobit mathew created SPARK-30119:


 Summary: Support pagination for  spark streaming tab
 Key: SPARK-30119
 URL: https://issues.apache.org/jira/browse/SPARK-30119
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.4
Reporter: jobit mathew


Support pagination for spark streaming tab






[jira] [Resolved] (SPARK-27547) fix DataFrame self-join problems

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-27547.
-
Resolution: Duplicate

> fix DataFrame self-join problems
> 
>
> Key: SPARK-27547
> URL: https://issues.apache.org/jira/browse/SPARK-27547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
>







[jira] [Comment Edited] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987491#comment-16987491
 ] 

Lantao Jin edited comment on SPARK-30118 at 12/4/19 3:42 AM:
-

I think it has been fixed. I cannot reproduce it on the master code base.


was (Author: cltlfcjin):
I think it has been fixed in the latest version. Could you try it in 2.4 or 
3.0.0-preview?

> ALTER VIEW QUERY does not work
> --
>
> Key: SPARK-30118
> URL: https://issues.apache.org/jira/browse/SPARK-30118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
> state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>+- Project [foo AS foo1#34]
>   +- OneRowRelation
> {code}






[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread Lantao Jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987491#comment-16987491
 ] 

Lantao Jin commented on SPARK-30118:


I think it has been fixed in the latest version. Could you try it in 2.4 or 
3.0.0-preview?

> ALTER VIEW QUERY does not work
> --
>
> Key: SPARK-30118
> URL: https://issues.apache.org/jira/browse/SPARK-30118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
> state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>+- Project [foo AS foo1#34]
>   +- OneRowRelation
> {code}






[jira] [Resolved] (SPARK-30113) Document mergeSchema option in Python Orc APIs

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30113.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26755
[https://github.com/apache/spark/pull/26755]

> Document mergeSchema option in Python Orc APIs
> --
>
> Key: SPARK-30113
> URL: https://issues.apache.org/jira/browse/SPARK-30113
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-30113) Document mergeSchema option in Python Orc APIs

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30113:


Assignee: Nicholas Chammas

> Document mergeSchema option in Python Orc APIs
> --
>
> Key: SPARK-30113
> URL: https://issues.apache.org/jira/browse/SPARK-30113
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>







[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-12-03 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987456#comment-16987456
 ] 

huangtianhua commented on SPARK-29106:
--

[~shaneknapp],:)

So it seems you're back :) Could you please add the new ARM instance to 
amplab? It has high performance, and the maven test takes about 5 hours; I 
sent the instance info to your email a few days ago. If you have any problem, 
please contact me. Thank you very much!

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, 
> SparkR-and-pyspark36-testing.txt, arm-python36.txt
>
>
> Add ARM test jobs to the amplab Jenkins for Spark.
> So far we have made two periodic ARM test jobs for Spark in OpenLab: one is 
> based on master with hadoop 2.7 (similar to the QA test on the amplab 
> Jenkins), the other is based on a new branch we made on 09-09, see 
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>  and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the ARM test with 
> the amplab Jenkins.
> About the k8s test on ARM, we have already tried it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later.
> We also plan to test other stable branches, and we can integrate them into 
> amplab when they are ready.
> We have offered an ARM instance and sent the info to Shane Knapp; thanks to 
> Shane for adding the first ARM job to the amplab Jenkins :)
> The other important thing is leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  Spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  and there is no arm64 support there. So we built an arm64-supporting 
> release of leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the Spark pom.xml directly with something like a 
> 'property'/'profile' to choose the correct jar on ARM or x86, because Spark 
> depends on some Hadoop packages like hadoop-hdfs that depend on 
> leveldbjni-all-1.8 too, unless Hadoop releases a new ARM-supporting 
> leveldbjni jar. For now we download the leveldbjni-all-1.8 from 
> openlabtesting and 'mvn install' it when testing Spark on ARM.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  






[jira] [Assigned] (SPARK-30091) Document mergeSchema option directly in the Python Parquet APIs

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30091:


Assignee: Nicholas Chammas

> Document mergeSchema option directly in the Python Parquet APIs
> ---
>
> Key: SPARK-30091
> URL: https://issues.apache.org/jira/browse/SPARK-30091
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet]
> Strangely, the `mergeSchema` option is mentioned in the docstring but not 
> implemented in the method signature.






[jira] [Resolved] (SPARK-30091) Document mergeSchema option directly in the Python Parquet APIs

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30091.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26730
[https://github.com/apache/spark/pull/26730]

> Document mergeSchema option directly in the Python Parquet APIs
> ---
>
> Key: SPARK-30091
> URL: https://issues.apache.org/jira/browse/SPARK-30091
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.0.0
>
>
> [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet]
> Strangely, the `mergeSchema` option is mentioned in the docstring but not 
> implemented in the method signature.
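
Until the named parameter is documented and exposed, the option can still be passed through the generic {{.option()}} path; a minimal sketch (the path is hypothetical):

{code:python}
# Merge Parquet schemas across files/partitions; same mergeSchema behaviour
# the docstring describes, supplied as a generic read option.
df = (spark.read
      .option("mergeSchema", "true")
      .parquet("/data/events"))   # hypothetical path
df.printSchema()
{code}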






[jira] [Commented] (SPARK-27547) fix DataFrame self-join problems

2019-12-03 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987444#comment-16987444
 ] 

Nicholas Chammas commented on SPARK-27547:
--

Should this be marked as resolved by 
[#25107|https://github.com/apache/spark/pull/25107]?

 

> fix DataFrame self-join problems
> 
>
> Key: SPARK-27547
> URL: https://issues.apache.org/jira/browse/SPARK-27547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: correctness
>







[jira] [Resolved] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29667.
--
Resolution: Duplicate

I am resolving this as a duplicate of SPARK-29860.

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL.
> Mismatched columns:
> {code}
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> {code}
> The SQL AND clause:
> {code}
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> {code}
> Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the SQL 
> ran just fine. Can the SQL engine cast implicitly in this case?
>  
>  
>  
>  
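
Until an implicit cast is added, a workaround is to cast the subquery side explicitly so both sides of {{IN}} share one decimal type; a sketch using the names from the report (assuming a SparkSession named {{spark}}):

{code:python}
# Explicitly widen the subquery's id to decimal(28,0) to match a.id.
result = spark.sql("""
    SELECT *
    FROM a
    WHERE a.id IN (SELECT CAST(id AS DECIMAL(28,0))
                   FROM db1.table1
                   WHERE col1 = 1
                   GROUP BY id)
""")
{code}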






[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987440#comment-16987440
 ] 

Hyukjin Kwon commented on SPARK-29667:
--

I think this issue is a duplicate of SPARK-29860. [The PR against 
SPARK-29860|https://github.com/apache/spark/pull/26485] fixes this issue.
There are also related JIRAs and PRs such as 
[SPARK-25056|https://github.com/apache/spark/pull/22038] and 
[SPARK-22413|https://github.com/apache/spark/pull/19635].

It seems we should actively review these rather than leaving them open 
indefinitely. CC [~yumwang], [~mgaido], [~smilegator], [~cloud_fan].

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL.
> Mismatched columns:
> {code}
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> {code}
> The SQL AND clause:
> {code}
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> {code}
> Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the SQL 
> ran just fine. Can the SQL engine cast implicitly in this case?
>  
>  
>  
>  






[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-12-03 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987435#comment-16987435
 ] 

Shane Knapp commented on SPARK-29106:
-

nice catch...  fixed and relaunched.

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, 
> SparkR-and-pyspark36-testing.txt, arm-python36.txt
>
>
> Add ARM test jobs to the amplab Jenkins for Spark.
> So far we have made two periodic ARM test jobs for Spark in OpenLab: one is 
> based on master with hadoop 2.7 (similar to the QA test on the amplab 
> Jenkins), the other is based on a new branch we made on 09-09, see 
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>  and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the ARM test with 
> the amplab Jenkins.
> About the k8s test on ARM, we have already tried it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later.
> We also plan to test other stable branches, and we can integrate them into 
> amplab when they are ready.
> We have offered an ARM instance and sent the info to Shane Knapp; thanks to 
> Shane for adding the first ARM job to the amplab Jenkins :)
> The other important thing is leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  Spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  and there is no arm64 support there. So we built an arm64-supporting 
> release of leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the Spark pom.xml directly with something like a 
> 'property'/'profile' to choose the correct jar on ARM or x86, because Spark 
> depends on some Hadoop packages like hadoop-hdfs that depend on 
> leveldbjni-all-1.8 too, unless Hadoop releases a new ARM-supporting 
> leveldbjni jar. For now we download the leveldbjni-all-1.8 from 
> openlabtesting and 'mvn install' it when testing Spark on ARM.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  






[jira] [Resolved] (SPARK-30110) Support type judgment for ArrayData

2019-12-03 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng resolved SPARK-30110.

Resolution: Won't Fix

> Support type judgment for ArrayData
> ---
>
> Key: SPARK-30110
> URL: https://issues.apache.org/jira/browse/SPARK-30110
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: jiaan.geng
>Priority: Major
> Fix For: 3.0.0
>
>
> ArrayData only exposes interfaces for getting data, such as getBoolean and 
> getByte.
> When the element type of the array is unknown, I want to determine the 
> element type first.
> I have a PR in progress:
> [https://github.com/beliefer/spark/commit/5787c6f062795aa7931c58ac2302ba607d3a97aa]
> The array function ArrayNDims is used to get the dimension of an array.
> If ArrayData could support interfaces like isBoolean, isByte, and isArray, my 
> job would be easier.






[jira] [Commented] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)

2019-12-03 Thread Guy Huinen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987432#comment-16987432
 ] 

Guy Huinen commented on SPARK-30062:


ok

> bug with DB2Driver using mode("overwrite") option("truncate",True)
> --
>
> Key: SPARK-30062
> URL: https://issues.apache.org/jira/browse/SPARK-30062
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Guy Huinen
>Priority: Major
>  Labels: db2, pyspark
>
> Using DB2Driver with mode("overwrite") and option("truncate", True) gives a 
> SQL error.
>  
> {code:java}
> dfClient.write\
>  .format("jdbc")\
>  .mode("overwrite")\
>  .option('driver', 'com.ibm.db2.jcc.DB2Driver')\
>  .option("url","jdbc:db2://")\
>  .option("user","xxx")\
>  .option("password","")\
>  .option("dbtable","")\
>  .option("truncate",True)\{code}
>  
>  This gives the error below.
> In summary, I believe the semicolon is misplaced or malformed:
>  
> {code:java}
> EXPO.EXPO#CMR_STG;IMMEDIATE{code}
>  
>  
> full error
> {code:java}
> An error occurred while calling o47.save. : 
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, 
> SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, 
> DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at 
> com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) 
> at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at 
> com.ibm.db2.jcc.am.kh.d(kh.java:2776) at 
> com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) 
> at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) 
> at com.ibm.db2.jcc.t4.av.h(av.java:124) at 
> com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at 
> com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) 
> at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:282) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748){code}
>  




[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987429#comment-16987429
 ] 

John Zhuge commented on SPARK-30118:


I am running Spark master with Hive 1.2.1. Same issue in Spark 2.3.

> ALTER VIEW QUERY does not work
> --
>
> Key: SPARK-30118
> URL: https://issues.apache.org/jira/browse/SPARK-30118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
> state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>+- Project [foo AS foo1#34]
>   +- OneRowRelation
> {code}






[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987428#comment-16987428
 ] 

John Zhuge commented on SPARK-30118:


{code:java}
spark-sql> DESC FORMATTED jzhuge.v1;
foo1                        string  NULL

# Detailed Table Information
Database                    jzhuge
Table                       v1
Owner                       jzhuge
Created Time                Tue Dec 03 17:53:59 PST 2019
Last Access                 UNKNOWN
Created By                  Spark 3.0.0-SNAPSHOT
Type                        VIEW
View Text                   SELECT 'foo' foo1
View Original Text          SELECT 'foo' foo1
View Default Database       default
View Query Output Columns   [foo2]
Table Properties            [transient_lastDdlTime=1575424439, 
                            view.query.out.col.0=foo2, view.query.out.numCols=1, 
                            view.default.database=default]
Location                    file://tmp/jzhuge/v1
Serde Library               org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat                 org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat                org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties          [serialization.format=1]
{code}


> ALTER VIEW QUERY does not work
> --
>
> Key: SPARK-30118
> URL: https://issues.apache.org/jira/browse/SPARK-30118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
> state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>+- Project [foo AS foo1#34]
>   +- OneRowRelation
> {code}






[jira] [Created] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread John Zhuge (Jira)
John Zhuge created SPARK-30118:
--

 Summary: ALTER VIEW QUERY does not work
 Key: SPARK-30118
 URL: https://issues.apache.org/jira/browse/SPARK-30118
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


`ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
state.
{code:sql}
spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
spark-sql> SHOW CREATE TABLE jzhuge.v1;
CREATE VIEW `jzhuge`.`v1`(foo1) AS
SELECT 'foo' foo1

spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
spark-sql> SHOW CREATE TABLE jzhuge.v1;
CREATE VIEW `jzhuge`.`v1`(foo1) AS
SELECT 'foo' foo1

spark-sql> TABLE jzhuge.v1;
Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
SubqueryAlias `jzhuge`.`v1`
+- View (`jzhuge`.`v1`, [foo1#33])
   +- Project [foo AS foo1#34]
  +- OneRowRelation
{code}
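
Until this is fixed, one workaround sketch is to rebuild the view instead of altering it, since CREATE OR REPLACE rewrites the stored column metadata and the view text together (shown here through PySpark, assuming a SparkSession named spark):

{code:python}
# Re-create the view rather than ALTER VIEW ... AS, so the stored output
# columns stay consistent with the view text.
spark.sql("CREATE OR REPLACE VIEW jzhuge.v1 AS SELECT 'foo' foo2")
spark.sql("SELECT * FROM jzhuge.v1").show()
{code}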






[jira] [Resolved] (SPARK-30111) spark R dockerfile fails to build

2019-12-03 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp resolved SPARK-30111.
-
Resolution: Fixed

> spark R dockerfile fails to build
> -
>
> Key: SPARK-30111
> URL: https://issues.apache.org/jira/browse/SPARK-30111
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Ilan Filonenko
>Priority: Major
>
> all recent k8s builds have been failing when trying to build the sparkR 
> dockerfile:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console]
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/]
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/]
> [~ifilonenko]






[jira] [Assigned] (SPARK-30111) spark R dockerfile fails to build

2019-12-03 Thread Shane Knapp (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shane Knapp reassigned SPARK-30111:
---

Assignee: Ilan Filonenko

> spark R dockerfile fails to build
> -
>
> Key: SPARK-30111
> URL: https://issues.apache.org/jira/browse/SPARK-30111
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Assignee: Ilan Filonenko
>Priority: Major
>
> all recent k8s builds have been failing when trying to build the sparkR 
> dockerfile:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console]
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/]
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/]
> [~ifilonenko]






[jira] [Resolved] (SPARK-27025) Speed up toLocalIterator

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27025.
--
Resolution: Duplicate

Actually, I think this was done by SPARK-27659 ..

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, any required computation for the 
> yet-to-be-fetched-partitions is not kicked off until it is fetched. 
> Effectively only one partition is being computed at the same time. 
> Desired behavior: immediately start calculation of all partitions while 
> retaining the download-a-partition at a time behavior.
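
The resolution above points to SPARK-27659 for the built-in change; on older versions, a workaround sketch is to force all partitions to be computed up front and then stream them back one at a time (the functions expensive_fn and handle below are hypothetical placeholders):

{code:python}
rdd = spark.sparkContext.parallelize(range(100000), 16).map(expensive_fn)
rdd.persist()
rdd.count()                           # materializes every partition in parallel
for record in rdd.toLocalIterator():  # still pulls one partition at a time
    handle(record)
rdd.unpersist()
{code}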






[jira] [Reopened] (SPARK-27025) Speed up toLocalIterator

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-27025:
--

> Speed up toLocalIterator
> 
>
> Key: SPARK-27025
> URL: https://issues.apache.org/jira/browse/SPARK-27025
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.3.3
>Reporter: Erik van Oosten
>Priority: Major
>
> Method {{toLocalIterator}} fetches the partitions to the driver one by one. 
> However, as far as I can see, any required computation for the 
> yet-to-be-fetched-partitions is not kicked off until it is fetched. 
> Effectively only one partition is being computed at the same time. 
> Desired behavior: immediately start calculation of all partitions while 
> retaining the download-a-partition at a time behavior.






[jira] [Commented] (SPARK-30062) bug with DB2Driver using mode("overwrite") option("truncate",True)

2019-12-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987424#comment-16987424
 ] 

Hyukjin Kwon commented on SPARK-30062:
--

Can you open a PR to fix DB2 dialect then?

> bug with DB2Driver using mode("overwrite") option("truncate",True)
> --
>
> Key: SPARK-30062
> URL: https://issues.apache.org/jira/browse/SPARK-30062
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Guy Huinen
>Priority: Major
>  Labels: db2, pyspark
>
> Using DB2Driver with mode("overwrite") and option("truncate", True) gives a 
> SQL error.
>  
> {code:java}
> dfClient.write\
>  .format("jdbc")\
>  .mode("overwrite")\
>  .option('driver', 'com.ibm.db2.jcc.DB2Driver')\
>  .option("url","jdbc:db2://")\
>  .option("user","xxx")\
>  .option("password","")\
>  .option("dbtable","")\
>  .option("truncate",True)\{code}
>  
>  This gives the error below.
> In summary, I believe the semicolon is misplaced or malformed:
>  
> {code:java}
> EXPO.EXPO#CMR_STG;IMMEDIATE{code}
>  
>  
> full error
> {code:java}
> An error occurred while calling o47.save. : 
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: DB2 SQL Error: SQLCODE=-104, 
> SQLSTATE=42601, SQLERRMC=END-OF-STATEMENT;LE EXPO.EXPO#CMR_STG;IMMEDIATE, 
> DRIVER=4.19.77 at com.ibm.db2.jcc.am.b4.a(b4.java:747) at 
> com.ibm.db2.jcc.am.b4.a(b4.java:66) at com.ibm.db2.jcc.am.b4.a(b4.java:135) 
> at com.ibm.db2.jcc.am.kh.c(kh.java:2788) at 
> com.ibm.db2.jcc.am.kh.d(kh.java:2776) at 
> com.ibm.db2.jcc.am.kh.b(kh.java:2143) at com.ibm.db2.jcc.t4.ab.i(ab.java:226) 
> at com.ibm.db2.jcc.t4.ab.c(ab.java:48) at com.ibm.db2.jcc.t4.p.b(p.java:38) 
> at com.ibm.db2.jcc.t4.av.h(av.java:124) at 
> com.ibm.db2.jcc.am.kh.ak(kh.java:2138) at 
> com.ibm.db2.jcc.am.kh.a(kh.java:3325) at com.ibm.db2.jcc.am.kh.c(kh.java:765) 
> at com.ibm.db2.jcc.am.kh.executeUpdate(kh.java:744) at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.truncateTable(JdbcUtils.scala:113)
>  at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:56)
>  at 
> org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>  at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) 
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:676)
>  at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at 
> org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:676) at 
> org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:285)
>  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:271) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:498) at 
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at 
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at 
> py4j.Gateway.invoke(Gateway.java:282) at 
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at 
> py4j.commands.CallCommand.execute(CallCommand.java:79) at 
> py4j.GatewayConnection.run(GatewayConnection.java:238) at 
> java.lang.Thread.run(Thread.java:748){code}
>  
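
Until the DB2 dialect generates a valid TRUNCATE statement, a workaround sketch is to drop the truncate option so the overwrite path drops and recreates the table instead of truncating it (connection details elided as in the report):

{code:python}
(dfClient.write
    .format("jdbc")
    .mode("overwrite")    # without option("truncate", True), overwrite
                          # drops and recreates the target table
    .option("driver", "com.ibm.db2.jcc.DB2Driver")
    .option("url", "jdbc:db2://...")
    .option("user", "xxx")
    .option("password", "...")
    .option("dbtable", "...")
    .save())
{code}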




[jira] [Updated] (SPARK-29591) Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgresql

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29591:
-
Description: 
Support inserting data in a different order, or even omitting some columns, in 
Spark SQL as well, like PostgreSQL does.

*In postgre sql*
{code:java}
CREATE TABLE weather (
 city varchar(80),
 temp_lo int, – low temperature
 temp_hi int, – high temperature
 prcp real, – precipitation
 date date
 );
{code}
*You can list the columns in a different order if you wish or even omit some 
columns,*
{code:java}
INSERT INTO weather (date, city, temp_hi, temp_lo)
 VALUES ('1994-11-29', 'Hayward', 54, 37);
{code}
 

*Spark SQL*

But Spark SQL does not allow inserting data in a different order or omitting 
any column. It would be better to support this, as it can save time when we 
cannot predict a specific column value or when some value is always fixed.
{code:java}
create table jobit(id int,name string);

> insert into jobit values(1,"Ankit");
 Time taken: 0.548 seconds
 spark-sql> *insert into jobit (id) values(1);*
 *Error in query:*
 mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (id) values(1)
 ---^^^

spark-sql> *insert into jobit (name,id) values("Ankit",1);*
 *Error in query:*
 mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (name,id) values("Ankit",1)
 ---^^^

spark-sql>
{code}
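
For reference, the usual workaround in current Spark SQL is to go through a SELECT that supplies every column in table order; a minimal sketch built on the example table above (assuming a SparkSession named spark):

{code:python}
# Equivalent of: INSERT INTO jobit (name, id) VALUES ('Ankit', 1)
spark.sql("INSERT INTO jobit SELECT 1 AS id, 'Ankit' AS name")

# Equivalent of: INSERT INTO jobit (id) VALUES (1); the omitted column is NULL
spark.sql("INSERT INTO jobit SELECT 1 AS id, CAST(NULL AS STRING) AS name")
{code}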
 

  was:
Support data insertion in a different order if you wish or even omit some 
columns in spark sql also like postgre sql.

*In postgre sql*
{code:java}
CREATE TABLE weather (
 city varchar(80),
 temp_lo int, – low temperature
 temp_hi int, – high temperature
 prcp real, – precipitation
 date date
 );
{code}
*You can list the columns in a different order if you wish or even omit some 
columns,*
{code:java}
INSERT INTO weather (date, city, temp_hi, temp_lo)
 VALUES ('1994-11-29', 'Hayward', 54, 37);
{code}
*Spark SQL*

But in spark sql is not allowing to insert data in different order or omit any 
column.Better to support this as it can save time if we can not predict any 
specific column value or if some value is fixed always.
{code:java}
create table jobit(id int,name string);

> insert into jobit values(1,"Ankit");
 Time taken: 0.548 seconds
 spark-sql> *insert into jobit (id) values(1);*
 *Error in query:*
 mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (id) values(1)
 ---^^^

spark-sql> *insert into jobit (name,id) values("Ankit",1);*
 *Error in query:*
 mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (name,id) values("Ankit",1)
 ---^^^

spark-sql>

 {code}


> Support data insertion in a different order if you wish or even omit some 
> columns in spark sql also like postgresql
> ---
>
> Key: SPARK-29591
> URL: https://issues.apache.org/jira/browse/SPARK-29591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Major
>
> Support inserting data in a different order, or even omitting some columns, 
> in Spark SQL as well, like PostgreSQL does.
> *In postgre sql*
> {code:java}
> CREATE TABLE weather (
>  city varchar(80),
>  temp_lo int, – low temperature
>  temp_hi int, – high temperature
>  prcp real, – precipitation
>  date date
>  );
> {code}
> *You can list the columns in a different order if you wish or even omit some 
> columns,*
> {code:java}
> INSERT INTO weather (date, city, temp_hi, temp_lo)
>  VALUES ('1994-11-29', 'Hayward', 54, 37);
> {code}
>  
> *Spark SQL*
> But Spark SQL does not allow inserting data in a different order or omitting 
> any column. It would be better to support this, as it can save time when we 
> cannot predict a specific column value or when some value is always fixed.
> {code:java}
> create table jobit(id int,name string);
> > insert into jobit values(1,"Ankit");
>  Time taken: 0.548 seconds
>  spark-sql> *insert into jobit (id) values(1);*
>  *Error in query:*
>  mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
> 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)
> == SQL ==
>  insert into jobit (id) values(1)
>  ---^^^
> spark-sql> *insert into jobit (name,id) values("Ankit",1);*
>  *Error in query:*
>  mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 
> 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)
> 

[jira] [Updated] (SPARK-29591) Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgresql

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29591:
-
Description: 
Support inserting data in a different order, or even omitting some columns, in 
Spark SQL as well, like PostgreSQL does.

*In postgre sql*

{code}
CREATE TABLE weather (
 city varchar(80),
 temp_lo int, – low temperature
 temp_hi int, – high temperature
 prcp real, – precipitation
 date date
 );
{code}

*You can list the columns in a different order if you wish or even omit some 
columns,*

{code}
INSERT INTO weather (date, city, temp_hi, temp_lo)
 VALUES ('1994-11-29', 'Hayward', 54, 37);
{code}

*Spark SQL*

But Spark SQL does not allow inserting data in a different order or omitting 
any column. It would be better to support this, as it can save time when we 
cannot predict a specific column value or when some value is always fixed.

{code}
create table jobit(id int,name string);

> insert into jobit values(1,"Ankit");
 Time taken: 0.548 seconds
 spark-sql> *insert into jobit (id) values(1);*
 *Error in query:*
 mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (id) values(1)
 ---^^^

spark-sql> *insert into jobit (name,id) values("Ankit",1);*
 *Error in query:*
 mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (name,id) values("Ankit",1)
 ---^^^

spark-sql>
{code}

 

  was:
Support inserting data in a different column order, or omitting some columns, in 
Spark SQL as well, like PostgreSQL does.

*In PostgreSQL*

CREATE TABLE weather (
 city varchar(80),
 temp_lo int, -- low temperature
 temp_hi int, -- high temperature
 prcp real, -- precipitation
 date date
 );

*You can list the columns in a different order if you wish or even omit some 
columns,*

INSERT INTO weather (date, city, temp_hi, temp_lo)
 VALUES ('1994-11-29', 'Hayward', 54, 37);

*Spark SQL*

But Spark SQL does not allow inserting data in a different order or omitting any 
column. It would be better to support this, since it can save time when we cannot 
predict a specific column value or when some value is always fixed.

create table jobit(id int,name string);

> insert into jobit values(1,"Ankit");
 Time taken: 0.548 seconds
 spark-sql> *insert into jobit (id) values(1);*
 *Error in query:*
 mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (id) values(1)
 ---^^^

spark-sql> *insert into jobit (name,id) values("Ankit",1);*
 *Error in query:*
 mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (name,id) values("Ankit",1)
 ---^^^

spark-sql>

 


> Support data insertion in a different order if you wish or even omit some 
> columns in spark sql also like postgresql
> ---
>
> Key: SPARK-29591
> URL: https://issues.apache.org/jira/browse/SPARK-29591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Major
>
> Support inserting data in a different column order, or omitting some columns, 
> in Spark SQL as well, like PostgreSQL does.
> *In PostgreSQL*
> {code}
> CREATE TABLE weather (
>  city varchar(80),
>  temp_lo int, -- low temperature
>  temp_hi int, -- high temperature
>  prcp real, -- precipitation
>  date date
>  );
> {code}
> *You can list the columns in a different order if you wish or even omit some 
> columns,*
> {code}
> INSERT INTO weather (date, city, temp_hi, temp_lo)
>  VALUES ('1994-11-29', 'Hayward', 54, 37);
> {code}
> *Spark SQL*
> But Spark SQL does not allow inserting data in a different order or omitting 
> any column. It would be better to support this, since it can save time when we 
> cannot predict a specific column value or when some value is always fixed.
> {code}
> create table jobit(id int,name string);
> > insert into jobit values(1,"Ankit");
>  Time taken: 0.548 seconds
>  spark-sql> *insert into jobit (id) values(1);*
>  *Error in query:*
>  mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
> 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)
> == SQL ==
>  insert into jobit (id) values(1)
>  ---^^^
> spark-sql> *insert into jobit (name,id) values("Ankit",1);*
>  *Error in query:*
>  mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 
> 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)
> == SQL ==
>  insert into jobit (name,id) values("Ankit",1)
>  ---^^^
> spark-sql>
> 

[jira] [Updated] (SPARK-29591) Support data insertion in a different order if you wish or even omit some columns in spark sql also like postgresql

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29591:
-
Description: 
Support inserting data in a different column order, or omitting some columns, in 
Spark SQL as well, like PostgreSQL does.

*In PostgreSQL*
{code:java}
CREATE TABLE weather (
 city varchar(80),
 temp_lo int, -- low temperature
 temp_hi int, -- high temperature
 prcp real, -- precipitation
 date date
 );
{code}
*You can list the columns in a different order if you wish or even omit some 
columns,*
{code:java}
INSERT INTO weather (date, city, temp_hi, temp_lo)
 VALUES ('1994-11-29', 'Hayward', 54, 37);
{code}
*Spark SQL*

But Spark SQL does not allow inserting data in a different order or omitting any 
column. It would be better to support this, since it can save time when we cannot 
predict a specific column value or when some value is always fixed.
{code:java}
create table jobit(id int,name string);

> insert into jobit values(1,"Ankit");
 Time taken: 0.548 seconds
 spark-sql> *insert into jobit (id) values(1);*
 *Error in query:*
 mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (id) values(1)
 ---^^^

spark-sql> *insert into jobit (name,id) values("Ankit",1);*
 *Error in query:*
 mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (name,id) values("Ankit",1)
 ---^^^

spark-sql>

 {code}
(In the error output above, "pos" is the character position reported by the parser, e.g. "pos 19".)

  was:
Support inserting data in a different column order, or omitting some columns, in 
Spark SQL as well, like PostgreSQL does.

*In PostgreSQL*

{code}
CREATE TABLE weather (
 city varchar(80),
 temp_lo int, -- low temperature
 temp_hi int, -- high temperature
 prcp real, -- precipitation
 date date
 );
{code}

*You can list the columns in a different order if you wish or even omit some 
columns,*

{code}
INSERT INTO weather (date, city, temp_hi, temp_lo)
 VALUES ('1994-11-29', 'Hayward', 54, 37);
{code}

*Spark SQL*

But Spark SQL does not allow inserting data in a different order or omitting any 
column. It would be better to support this, since it can save time when we cannot 
predict a specific column value or when some value is always fixed.

{code}
create table jobit(id int,name string);

> insert into jobit values(1,"Ankit");
 Time taken: 0.548 seconds
 spark-sql> *insert into jobit (id) values(1);*
 *Error in query:*
 mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (id) values(1)
 ---^^^

spark-sql> *insert into jobit (name,id) values("Ankit",1);*
 *Error in query:*
 mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)

== SQL ==
 insert into jobit (name,id) values("Ankit",1)
 ---^^^

spark-sql>
{code}

 


> Support data insertion in a different order if you wish or even omit some 
> columns in spark sql also like postgresql
> ---
>
> Key: SPARK-29591
> URL: https://issues.apache.org/jira/browse/SPARK-29591
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Major
>
> Support inserting data in a different column order, or omitting some columns, 
> in Spark SQL as well, like PostgreSQL does.
> *In PostgreSQL*
> {code:java}
> CREATE TABLE weather (
>  city varchar(80),
>  temp_lo int, -- low temperature
>  temp_hi int, -- high temperature
>  prcp real, -- precipitation
>  date date
>  );
> {code}
> *You can list the columns in a different order if you wish or even omit some 
> columns,*
> {code:java}
> INSERT INTO weather (date, city, temp_hi, temp_lo)
>  VALUES ('1994-11-29', 'Hayward', 54, 37);
> {code}
> *Spark SQL*
> But Spark SQL does not allow inserting data in a different order or omitting 
> any column. It would be better to support this, since it can save time when we 
> cannot predict a specific column value or when some value is always fixed.
> {code:java}
> create table jobit(id int,name string);
> > insert into jobit values(1,"Ankit");
>  Time taken: 0.548 seconds
>  spark-sql> *insert into jobit (id) values(1);*
>  *Error in query:*
>  mismatched input 'id' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 
> 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)
> == SQL ==
>  insert into jobit (id) values(1)
>  ---^^^
> spark-sql> *insert into jobit (name,id) values("Ankit",1);*
>  *Error in query:*
>  mismatched input 'name' expecting \{'(', 'SELECT', 'FROM', 'VALUES', 
> 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 1, pos 19)
> == SQL ==
>  
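
Until a column list on INSERT is supported, one workaround is to spell out every 
column with an INSERT ... SELECT and fill the omitted ones with NULL. A minimal 
sketch in Scala, assuming the jobit(id INT, name STRING) table from the report 
(the object name and table format are illustrative only):

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming the jobit(id INT, name STRING) table from the report.
// INSERT ... SELECT lets us fill the omitted column with NULL until
// "INSERT INTO tbl (col, ...) VALUES ..." is supported.
object InsertSubsetWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("insert-subset").getOrCreate()
    spark.sql("CREATE TABLE IF NOT EXISTS jobit (id INT, name STRING) USING parquet")
    // Equivalent of the requested: INSERT INTO jobit (id) VALUES (1)
    spark.sql("INSERT INTO jobit SELECT 1 AS id, CAST(NULL AS STRING) AS name")
    spark.sql("SELECT * FROM jobit").show()
    spark.stop()
  }
}
{code}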

[jira] [Assigned] (SPARK-30109) PCA use BLAS.gemv with sparse vector

2019-12-03 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30109:


Assignee: zhengruifeng

> PCA use BLAS.gemv with sparse vector
> 
>
> Key: SPARK-30109
> URL: https://issues.apache.org/jira/browse/SPARK-30109
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> When PCA was first implemented in 
> [SPARK-5521|https://issues.apache.org/jira/browse/SPARK-5521], 
> Matrix.multiply (BLAS.gemv internally) did not support sparse vectors, so it 
> worked around this by applying a more complex matrix multiplication.
> Since [SPARK-7681|https://issues.apache.org/jira/browse/SPARK-7681], 
> BLAS.gemv has supported sparse vectors, so we can use Matrix.multiply directly now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30109) PCA use BLAS.gemv with sparse vector

2019-12-03 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30109.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26745
[https://github.com/apache/spark/pull/26745]

> PCA use BLAS.gemv with sparse vector
> 
>
> Key: SPARK-30109
> URL: https://issues.apache.org/jira/browse/SPARK-30109
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.0.0
>
>
> When PCA was first implemented in 
> [SPARK-5521|https://issues.apache.org/jira/browse/SPARK-5521], 
> Matrix.multiply (BLAS.gemv internally) did not support sparse vectors, so it 
> worked around this by applying a more complex matrix multiplication.
> Since [SPARK-7681|https://issues.apache.org/jira/browse/SPARK-7681], 
> BLAS.gemv has supported sparse vectors, so we can use Matrix.multiply directly now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
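
For readers who want to see what the change relies on, here is a small 
self-contained sketch (not the PCA patch itself) showing that an ml.linalg 
Matrix.multiply call, which dispatches to BLAS.gemv internally, accepts a sparse 
vector directly:

{code:scala}
import org.apache.spark.ml.linalg.{Matrices, Vectors}

// Small illustration (not the PCA patch itself): Matrix.multiply dispatches to
// BLAS.gemv and, since SPARK-7681, accepts sparse vectors directly.
object SparseGemvExample {
  def main(args: Array[String]): Unit = {
    // 2 x 3 dense matrix in column-major order: [[1, 3, 5], [2, 4, 6]]
    val pc = Matrices.dense(2, 3, Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0))
    // Sparse vector (0.0, 2.0, 0.0)
    val x = Vectors.sparse(3, Array(1), Array(2.0))
    // Prints the resulting DenseVector [6.0, 8.0]
    println(pc.multiply(x))
  }
}
{code}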



[jira] [Created] (SPARK-30117) Improve limit only query on Hive table and view

2019-12-03 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-30117:
--

 Summary: Improve limit only query on Hive table and view
 Key: SPARK-30117
 URL: https://issues.apache.org/jira/browse/SPARK-30117
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30116) Improve limit only query on views

2019-12-03 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-30116:
--

 Summary: Improve limit only query on views
 Key: SPARK-30116
 URL: https://issues.apache.org/jira/browse/SPARK-30116
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30115) Improve limit only query on datasource table

2019-12-03 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-30115:
--

 Summary: Improve limit only query on datasource table
 Key: SPARK-30115
 URL: https://issues.apache.org/jira/browse/SPARK-30115
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30114) Improve LIMIT only query by partial listing files

2019-12-03 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-30114:
---
Summary: Improve LIMIT only query by partial listing files  (was: Optimize 
LIMIT only query by partial listing files)

> Improve LIMIT only query by partial listing files
> -
>
> Key: SPARK-30114
> URL: https://issues.apache.org/jira/browse/SPARK-30114
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> We use Spark as an ad-hoc query engine. Most users' SELECT queries use a LIMIT 
> operation, like
> 1) SELECT * FROM TABLE_A LIMIT N
> 2) SELECT colA FROM TABLE_A LIMIT N
> 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
> If TABLE_A is a large table (an RDD with many thousands of partitions), the 
> execution time can be very long, since Spark has to list all files to build an 
> RDD before execution. But most of the time N is small, like 10, 100, or 1000, 
> so we don't need to scan all files. This optimization will create a 
> *SinglePartitionReadRDD* to address it.
> In our production environment, this optimization helps a lot: the duration of 
> a simple LIMIT-only query can be reduced 5~10 times. For example, before this 
> optimization a query on a table with about one hundred thousand files ran for 
> over 30 seconds; after applying this optimization, the time decreased to 5 
> seconds.
> Should support both Spark DataSource Table and Hive Table which can be 
> converted to DataSource table.
> Should support bucket table, partition table, normal table.
> Should support different file formats like parquet, orc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29600) array_contains built in function is not backward compatible in 3.0

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29600:
-
Description: 
{code}
SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); 
{code}

throws an exception in 3.0, whereas it works fine in 2.3.2.

Spark 3.0 output:

{code}
0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
 Error: org.apache.spark.sql.AnalysisException: cannot resolve 
'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), 
CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS 
DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS DECIMAL(13,3))), 
0.2BD)' due to data type mismatch: Input to function array_contains should have 
been array followed by a value with same element type, but it's 
[array, decimal(1,1)].; line 1 pos 7;
 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), 
cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as 
decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), 
cast(0.033 as decimal(13,3))), 0.2), None)]
{code}

Spark 2.3.2 output

{code}
0: jdbc:hive2://10.18.18.214:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
|array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), 
CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS 
DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), 
CAST(0.2 AS DECIMAL(13,3)))|
|true|

1 row selected (0.18 seconds)
{code}

 

 

  was:
SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); throws an 
exception in 3.0, whereas it works fine in 2.3.2.

Spark 3.0 output:

0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
 Error: org.apache.spark.sql.AnalysisException: cannot resolve 
'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), 
CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS 
DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS DECIMAL(13,3))), 
0.2BD)' due to data type mismatch: Input to function array_contains should have 
been array followed by a value with same element type, but it's 
[array, decimal(1,1)].; line 1 pos 7;
 'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), 
cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as 
decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), 
cast(0.033 as decimal(13,3))), 0.2), None)]

Spark 2.3.2 output

0: jdbc:hive2://10.18.18.214:23040/default> SELECT 
array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
|array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), 
CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS 
DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), 
CAST(0.2 AS DECIMAL(13,3)))|
|true|

1 row selected (0.18 seconds)

 

 


> array_contains built in function is not backward compatible in 3.0
> --
>
> Key: SPARK-29600
> URL: https://issues.apache.org/jira/browse/SPARK-29600
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Major
>
> {code}
> SELECT array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2); 
> {code}
> throws an exception in 3.0, whereas it works fine in 2.3.2.
> Spark 3.0 output:
> {code}
> 0: jdbc:hive2://10.18.19.208:23040/default> SELECT 
> array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
>  Error: org.apache.spark.sql.AnalysisException: cannot resolve 
> 'array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1BD AS DECIMAL(13,3)), 
> CAST(0.2BD AS DECIMAL(13,3)), CAST(0.3BD AS DECIMAL(13,3)), CAST(0.5BD AS 
> DECIMAL(13,3)), CAST(0.02BD AS DECIMAL(13,3)), CAST(0.033BD AS 
> DECIMAL(13,3))), 0.2BD)' due to data type mismatch: Input to function 
> array_contains should have been array followed by a value with same element 
> type, but it's [array, decimal(1,1)].; line 1 pos 7;
>  'Project [unresolvedalias(array_contains(array(cast(0 as decimal(13,3)), 
> cast(0.1 as decimal(13,3)), cast(0.2 as decimal(13,3)), cast(0.3 as 
> decimal(13,3)), cast(0.5 as decimal(13,3)), cast(0.02 as decimal(13,3)), 
> cast(0.033 as decimal(13,3))), 0.2), None)]
> {code}
> Spark 2.3.2 output
> {code}
> 0: jdbc:hive2://10.18.18.214:23040/default> SELECT 
> array_contains(array(0,0.1,0.2,0.3,0.5,0.02,0.033), .2);
> |array_contains(array(CAST(0 AS DECIMAL(13,3)), CAST(0.1 AS DECIMAL(13,3)), 
> CAST(0.2 AS DECIMAL(13,3)), CAST(0.3 AS DECIMAL(13,3)), CAST(0.5 AS 
> DECIMAL(13,3)), CAST(0.02 AS DECIMAL(13,3)), CAST(0.033 AS DECIMAL(13,3))), 
> CAST(0.2 
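
A possible workaround for the 3.0 behaviour reported above (an assumption drawn 
from the error message, not a confirmed fix) is to cast the search value to the 
array's element type explicitly:

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the cast workaround; whether this resolves the 3.0 error is an
// assumption based on the "array<decimal(13,3)> vs decimal(1,1)" mismatch above.
object ArrayContainsCastWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("array-contains-cast").getOrCreate()
    spark.sql(
      """SELECT array_contains(
        |  array(0, 0.1, 0.2, 0.3, 0.5, 0.02, 0.033),
        |  CAST(0.2 AS DECIMAL(13,3)))""".stripMargin
    ).show()
    spark.stop()
  }
}
{code}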

[jira] [Updated] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29667:
-
Description: 
Ran into error on this sql
Mismatched columns:

{code}
[(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
the sql and clause
  AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
{code}

Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql 
ran just fine. Can the sql engine cast implicitly in this case?
 
 

 

 

  was:
Ran into error on this sql
Mismatched columns:
[(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
the sql and clause
  AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just 
fine. Can the sql engine cast implicitly in this case?
 
 

 

 


> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into error on this sql
> Mismatched columns:
> {code}
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> the sql and clause
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> {code}
> Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql 
> ran just fine. Can the sql engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30114) Optimize LIMIT only query by partial listing files

2019-12-03 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-30114:
---
Description: 
We use Spark as an ad-hoc query engine. Most users' SELECT queries use a LIMIT 
operation, like
1) SELECT * FROM TABLE_A LIMIT N
2) SELECT colA FROM TABLE_A LIMIT N
3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
If TABLE_A is a large table (an RDD with many thousands of partitions), the 
execution time can be very long, since Spark has to list all files to build an 
RDD before execution. But most of the time N is small, like 10, 100, or 1000, so 
we don't need to scan all files. This optimization will create a 
*SinglePartitionReadRDD* to address it.

In our production environment, this optimization helps a lot: the duration of a 
simple LIMIT-only query can be reduced 5~10 times. For example, before this 
optimization a query on a table with about one hundred thousand files ran for 
over 30 seconds; after applying this optimization, the time decreased to 5 
seconds.


Should support both Spark DataSource Table and Hive Table which can be 
converted to DataSource table.
Should support bucket table, partition table, normal table.
Should support different file formats like parquet, orc.

  was:
We use Spark as an ad-hoc query engine. Most users' SELECT queries use a LIMIT 
operation. When we execute queries like
1) SELECT * FROM TABLE_A LIMIT N
2) SELECT colA FROM TABLE_A LIMIT N
3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
and TABLE_A is a large table (an RDD with many thousands of partitions), the 
execution time can be very long, since Spark has to list all files to build an 
RDD before execution. But most of the time N is small, like 10, 100, or 1000, so 
we don't need to scan all files. This optimization will create a 
*SinglePartitionReadRDD* to address it.

In our production environment, this optimization helps a lot: the duration of a 
simple LIMIT-only query can be reduced 5~10 times. For example, before this 
optimization a query on a table with about one hundred thousand files ran for 
over 30 seconds; after applying this optimization, the time decreased to 5 
seconds.


Should support both Spark DataSource Table and Hive Table which can be 
converted to DataSource table.
Should support bucket table, partition table, normal table.
Should support different file formats like parquet, orc.


> Optimize LIMIT only query by partial listing files
> --
>
> Key: SPARK-30114
> URL: https://issues.apache.org/jira/browse/SPARK-30114
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> We use Spark as an ad-hoc query engine. Most users' SELECT queries use a LIMIT 
> operation, like
> 1) SELECT * FROM TABLE_A LIMIT N
> 2) SELECT colA FROM TABLE_A LIMIT N
> 3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
> If TABLE_A is a large table (an RDD with many thousands of partitions), the 
> execution time can be very long, since Spark has to list all files to build an 
> RDD before execution. But most of the time N is small, like 10, 100, or 1000, 
> so we don't need to scan all files. This optimization will create a 
> *SinglePartitionReadRDD* to address it.
> In our production environment, this optimization helps a lot: the duration of 
> a simple LIMIT-only query can be reduced 5~10 times. For example, before this 
> optimization a query on a table with about one hundred thousand files ran for 
> over 30 seconds; after applying this optimization, the time decreased to 5 
> seconds.
> Should support both Spark DataSource Table and Hive Table which can be 
> converted to DataSource table.
> Should support bucket table, partition table, normal table.
> Should support different file formats like parquet, orc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29667:
-
Description: 
Ran into error on this sql
Mismatched columns:

{code}
[(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
{code}

The SQL AND clause:

{code}
  AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
{code}

Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql 
ran just fine. Can the sql engine cast implicitly in this case?
 
 

 

 

  was:
Ran into error on this sql
Mismatched columns:

{code}
[(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
the sql and clause
  AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
{code}

Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql 
ran just fine. Can the sql engine cast implicitly in this case?
 
 

 

 


> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into error on this sql
> Mismatched columns:
> {code}
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> {code}
> The SQL AND clause:
> {code}
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> {code}
> Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the sql 
> ran just fine. Can the sql engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
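
For context, a minimal sketch of the explicit-cast workaround the reporter 
describes (the subquery table and columns are taken from the report; the outer 
table name some_table is a placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of the explicit cast mentioned in the report: widen the subquery side
// to decimal(28,0) so both sides of the IN predicate agree.
object InSubqueryCastWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("in-subquery-cast").getOrCreate()
    spark.sql(
      """SELECT a.*
        |FROM some_table a
        |WHERE a.id IN (SELECT CAST(id AS DECIMAL(28,0))
        |               FROM db1.table1
        |               WHERE col1 = 1
        |               GROUP BY id)""".stripMargin
    ).show()
    spark.stop()
  }
}
{code}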



[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs

2019-12-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987418#comment-16987418
 ] 

Hyukjin Kwon commented on SPARK-30063:
--

Closing this assuming the issue was resolved.

> Failure when returning a value from multiple Pandas UDFs
> 
>
> Key: SPARK-30063
> URL: https://issues.apache.org/jira/browse/SPARK-30063
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 2.4.4
> Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on 
> both 2.4.3 and 2.4.4
>Reporter: Tim Kellogg
>Priority: Major
> Attachments: spark-debug.txt, variety-of-schemas.ipynb
>
>
> I have 20 Pandas UDFs that I'm trying to evaluate all at the same time.
>  * PandasUDFType.GROUPED_AGG
>  * 3 columns in the input data frame being serialized over Arrow to Python 
> worker. See below for clarification.
>  * All functions take 2 parameters, some combination of the 3 received as 
> Arrow input.
>  * Varying return types, see details below.
> _*I get an IllegalArgumentException on the Scala side of the worker when 
> deserializing from Python.*_
> h2. Exception & Stack Trace
> {code:java}
> 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> 

[jira] [Resolved] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30063.
--
Resolution: Not A Problem

> Failure when returning a value from multiple Pandas UDFs
> 
>
> Key: SPARK-30063
> URL: https://issues.apache.org/jira/browse/SPARK-30063
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 2.4.4
> Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on 
> both 2.4.3 and 2.4.4
>Reporter: Tim Kellogg
>Priority: Major
> Attachments: spark-debug.txt, variety-of-schemas.ipynb
>
>
> I have 20 Pandas UDFs that I'm trying to evaluate all at the same time.
>  * PandasUDFType.GROUPED_AGG
>  * 3 columns in the input data frame being serialized over Arrow to Python 
> worker. See below for clarification.
>  * All functions take 2 parameters, some combination of the 3 received as 
> Arrow input.
>  * Varying return types, see details below.
> _*I get an IllegalArgumentException on the Scala side of the worker when 
> deserializing from Python.*_
> h2. Exception & Stack Trace
> {code:java}
> 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> 

[jira] [Updated] (SPARK-30091) Document mergeSchema option directly in the Python Parquet APIs

2019-12-03 Thread Nicholas Chammas (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-30091:
-
Summary: Document mergeSchema option directly in the Python Parquet APIs  
(was: Document mergeSchema option directly in the Python API)

> Document mergeSchema option directly in the Python Parquet APIs
> ---
>
> Key: SPARK-30091
> URL: https://issues.apache.org/jira/browse/SPARK-30091
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> [http://spark.apache.org/docs/2.4.4/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader.parquet]
> Strangely, the `mergeSchema` option is mentioned in the docstring but not 
> implemented in the method signature.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
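
For context on the option being documented, a minimal Scala sketch of passing 
mergeSchema through the DataFrameReader (the path is a placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch; /tmp/example/parquet_table is a placeholder path.
object MergeSchemaExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("merge-schema").getOrCreate()
    val df = spark.read
      .option("mergeSchema", "true")          // merge Parquet schemas across part-files
      .parquet("/tmp/example/parquet_table")
    df.printSchema()
    spark.stop()
  }
}
{code}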



[jira] [Created] (SPARK-30114) Optimize LIMIT only query by partial listing files

2019-12-03 Thread Lantao Jin (Jira)
Lantao Jin created SPARK-30114:
--

 Summary: Optimize LIMIT only query by partial listing files
 Key: SPARK-30114
 URL: https://issues.apache.org/jira/browse/SPARK-30114
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Lantao Jin


We use Spark as an ad-hoc query engine. Most users' SELECT queries use a LIMIT 
operation. When we execute queries like
1) SELECT * FROM TABLE_A LIMIT N
2) SELECT colA FROM TABLE_A LIMIT N
3) CREATE TAB_B as SELECT * FROM TABLE_A LIMIT N
and TABLE_A is a large table (an RDD with many thousands of partitions), the 
execution time can be very long, since Spark has to list all files to build an 
RDD before execution. But most of the time N is small, like 10, 100, or 1000, so 
we don't need to scan all files. This optimization will create a 
*SinglePartitionReadRDD* to address it.

In our production environment, this optimization helps a lot: the duration of a 
simple LIMIT-only query can be reduced 5~10 times. For example, before this 
optimization a query on a table with about one hundred thousand files ran for 
over 30 seconds; after applying this optimization, the time decreased to 5 
seconds.


Should support both Spark DataSource Table and Hive Table which can be 
converted to DataSource table.
Should support bucket table, partition table, normal table.
Should support different file formats like parquet, orc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30113) Document mergeSchema option in Python Orc APIs

2019-12-03 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30113:


 Summary: Document mergeSchema option in Python Orc APIs
 Key: SPARK-30113
 URL: https://issues.apache.org/jira/browse/SPARK-30113
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Nicholas Chammas






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs

2019-12-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987417#comment-16987417
 ] 

Hyukjin Kwon commented on SPARK-30063:
--

{quote}
Set the PYTHONHASHSEED environment variable such that the whole cluster has the 
same value. I suggest that the original value is still random, for the same 
reason why it's random in the first place. This would alleviate a lot of crazy 
PySpark bugs that appear like concurrency bugs.
{quote}
{{PYTHONHASHSEED}} I think we already do it.

{quote}
Make Spark friendlier to different Arrow versions, or force override the 
PyArrow version so that it's harder to get it wrong.
{quote}
Yeah, we're putting in a lot of effort. 
https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x
 is one of those efforts. For clarification, this incompatibility came from 
Arrow itself.

{quote}
Consider upgrading Arrow. Newer versions are even friendlier to use
{quote}
We already did SPARK-29376.

> Failure when returning a value from multiple Pandas UDFs
> 
>
> Key: SPARK-30063
> URL: https://issues.apache.org/jira/browse/SPARK-30063
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 2.4.4
> Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on 
> both 2.4.3 and 2.4.4
>Reporter: Tim Kellogg
>Priority: Major
> Attachments: spark-debug.txt, variety-of-schemas.ipynb
>
>
> I have 20 Pandas UDFs that I'm trying to evaluate all at the same time.
>  * PandasUDFType.GROUPED_AGG
>  * 3 columns in the input data frame being serialized over Arrow to Python 
> worker. See below for clarification.
>  * All functions take 2 parameters, some combination of the 3 received as 
> Arrow input.
>  * Varying return types, see details below.
> _*I get an IllegalArgumentException on the Scala side of the worker when 
> deserializing from Python.*_
> h2. Exception & Stack Trace
> {code:java}
> 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver): java.lang.IllegalArgumentException
>   at 

[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination

2019-12-03 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987413#comment-16987413
 ] 

Hyukjin Kwon commented on SPARK-29988:
--

Sure, take your time. Just as a reminder, the Hive version can be configured via 
the env variable AMPLAB_JENKINS_BUILD_HIVE_PROFILE. (see also 
https://github.com/apache/spark/commit/4a73bed3180aeb79c92bb19aea2ac5a97899731a#
 )

> Adjust Jenkins jobs for `hive-1.2/2.3` combination
> --
>
> Key: SPARK-29988
> URL: https://issues.apache.org/jira/browse/SPARK-29988
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Major
>
> We need to rename the following Jenkins jobs first.
> spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2
> spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3
> spark-master-test-maven-hadoop-2.7 -> 
> spark-master-test-maven-hadoop-2.7-hive-1.2
> spark-master-test-maven-hadoop-3.2 -> 
> spark-master-test-maven-hadoop-3.2-hive-2.3
> Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs.
> {code}
> -Phive \
> +-Phive-1.2 \
> {code}
> And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs.
> {code}
> -Phive \
> +-Phive-2.3 \
> {code}
> For now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins jobs 
> manually. (This should be added to the SCM of AmpLab Jenkins.)
> After SPARK-29981, we need to create two new jobs.
> - spark-master-test-sbt-hadoop-2.7-hive-2.3
> - spark-master-test-maven-hadoop-2.7-hive-2.3
> This is for preparation for Apache Spark 3.0.0.
> We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-12-03 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987407#comment-16987407
 ] 

huangtianhua commented on SPARK-29106:
--

[~shaneknapp], ok, thanks. Sorry, I found the mvn command of the arm test is still 
"./build/mvn test -Paarch64 -Phadoop-2.7 {color:#de350b}-Phive1.2{color} 
-Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos"; it should be 
{color:#de350b}-Phive-1.2{color}?

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, 
> SparkR-and-pyspark36-testing.txt, arm-python36.txt
>
>
> Add arm test jobs to amplab jenkins for spark.
> Till now we made two arm test periodic jobs for spark in OpenLab, one is 
> based on master with hadoop 2.7(similar with QA test of amplab jenkins), 
> other one is based on a new branch which we made on date 09-09, see  
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>   and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrate arm test with amplab 
> jenkins.
> About the k8s test on arm, we have tested it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later. 
> And we plan test on other stable branches too, and we can integrate them to 
> amplab when they are ready.
> We have offered an arm instance and sent the infos to shane knapp, thanks 
> shane to add the first arm job to amplab jenkins :) 
> The other important thing is about the leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  we can see there is no arm64 support. So we built an arm64-supporting 
> release of leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the spark pom.xml directly with something like a 
> 'property'/'profile' to choose the correct jar package on the arm or x86 
> platform, because spark depends on some hadoop packages like hadoop-hdfs which 
> depend on leveldbjni-all-1.8 too, unless hadoop releases with a new 
> arm-supporting leveldbjni jar. For now we download the leveldbjni-all-1.8 of 
> openlabtesting and 'mvn install' it when testing spark on arm.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29903) Add documentation for recursiveFileLookup

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29903.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/26718

> Add documentation for recursiveFileLookup
> -
>
> Key: SPARK-29903
> URL: https://issues.apache.org/jira/browse/SPARK-29903
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
> Fix For: 3.0.0
>
>
> SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively 
> loading data from a source directory. There is currently no documentation for 
> this option.
> We should document this both for the DataFrame API as well as for SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
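
For context on the option being documented, a minimal Scala sketch 
(recursiveFileLookup was added by SPARK-27990 for Spark 3.0; the path is a 
placeholder):

{code:scala}
import org.apache.spark.sql.SparkSession

// Minimal sketch; /data/logs is a placeholder path.
object RecursiveFileLookupExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("recursive-lookup").getOrCreate()
    val df = spark.read
      .option("recursiveFileLookup", "true")  // descend into nested sub-directories
      .json("/data/logs")
    println(df.count())
    spark.stop()
  }
}
{code}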



[jira] [Assigned] (SPARK-29903) Add documentation for recursiveFileLookup

2019-12-03 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-29903:


Assignee: Nicholas Chammas

> Add documentation for recursiveFileLookup
> -
>
> Key: SPARK-29903
> URL: https://issues.apache.org/jira/browse/SPARK-29903
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.0.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> SPARK-27990 added a new option, {{recursiveFileLookup}}, for recursively 
> loading data from a source directory. There is currently no documentation for 
> this option.
> We should document this both for the DataFrame API as well as for SQL.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30112) Insert overwrite should be able to overwrite to same table under dynamic partition overwrite

2019-12-03 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-30112:
---

 Summary: Insert overwrite should be able to overwrite to same 
table under dynamic partition overwrite
 Key: SPARK-30112
 URL: https://issues.apache.org/jira/browse/SPARK-30112
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Currently, INSERT OVERWRITE cannot overwrite the same table, even under dynamic 
partition overwrite. But for dynamic partition overwrite we do not delete 
partition directories ahead of time; we write to staging directories and then 
move data to the final partition directories. So we should be able to INSERT 
OVERWRITE into the same table under dynamic partition overwrite.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
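
For context, a minimal sketch of dynamic partition overwrite as it works today 
(the table name t_dyn is illustrative; whether overwriting the *same* table is 
allowed is exactly what this ticket proposes to change):

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch of dynamic partition overwrite; t_dyn is an illustrative table name.
object DynamicPartitionOverwriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dynamic-overwrite").getOrCreate()
    // Only the partitions present in the incoming data are replaced.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    spark.sql("CREATE TABLE IF NOT EXISTS t_dyn (id BIGINT, part INT) USING parquet PARTITIONED BY (part)")
    spark.range(10)
      .selectExpr("id", "CAST(id % 2 AS INT) AS part")
      .write.mode("overwrite")
      .insertInto("t_dyn")
    spark.stop()
  }
}
{code}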



[jira] [Commented] (SPARK-30111) spark R dockerfile fails to build

2019-12-03 Thread Ilan Filonenko (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987372#comment-16987372
 ] 

Ilan Filonenko commented on SPARK-30111:


The error seems to be from:

{code:yaml}
Step 6/12 : RUN apt install -y python python-pip && apt install -y python3 
python3-pip && rm -r /usr/lib/python*/ensurepip && pip install 
--upgrade pip setuptools && rm -r /root/.cache && rm -rf /var/cache/apt/*
 ---> Running in f3d520c3435b
{code}

so this relates to the pyspark dockerfile.
The error is caused by:  *404 Not Found [IP: 151.101.188.204 80]* 

> spark R dockerfile fails to build
> -
>
> Key: SPARK-30111
> URL: https://issues.apache.org/jira/browse/SPARK-30111
> Project: Spark
>  Issue Type: Bug
>  Components: Build, jenkins, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Shane Knapp
>Priority: Major
>
> all recent k8s builds have been failing when trying to build the sparkR 
> dockerfile:
> [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console]
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/]
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/]
> [~ifilonenko]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30051) Clean up hadoop-3.2 transitive dependencies

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30051.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26742
[https://github.com/apache/spark/pull/26742]

> Clean up hadoop-3.2 transitive dependencies 
> 
>
> Key: SPARK-30051
> URL: https://issues.apache.org/jira/browse/SPARK-30051
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30051) Clean up hadoop-3.2 transitive dependencies

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30051:
-

Assignee: Dongjoon Hyun

> Clean up hadoop-3.2 transitive dependencies 
> 
>
> Key: SPARK-30051
> URL: https://issues.apache.org/jira/browse/SPARK-30051
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30060) Uniform naming for Spark Metrics configuration parameters

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30060.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26692
[https://github.com/apache/spark/pull/26692]

> Uniform naming for Spark Metrics configuration parameters
> -
>
> Key: SPARK-30060
> URL: https://issues.apache.org/jira/browse/SPARK-30060
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently Spark has a few parameters to enable/disable metrics reporting. 
> Their naming pattern is not uniform and this can create confusion.  Currently 
> we have:
>  {{spark.metrics.static.sources.enabled}}
>  {{spark.app.status.metrics.enabled}}
>  {{spark.sql.streaming.metricsEnabled}}
> This proposes to introduce a naming convention for such parameters: 
> spark.metrics.sourceNameCamelCase.enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30060) Uniform naming for Spark Metrics configuration parameters

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30060:
-

Assignee: Luca Canali

> Uniform naming for Spark Metrics configuration parameters
> -
>
> Key: SPARK-30060
> URL: https://issues.apache.org/jira/browse/SPARK-30060
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
>
> Currently Spark has a few parameters to enable/disable metrics reporting. 
> Their naming pattern is not uniform and this can create confusion.  Currently 
> we have:
>  {{spark.metrics.static.sources.enabled}}
>  {{spark.app.status.metrics.enabled}}
>  {{spark.sql.streaming.metricsEnabled}}
> This proposes to introduce a naming convention for such parameters: 
> spark.metrics.sourceNameCamelCase.enabled



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30106) DynamicPartitionPruningSuite#"no predicate on the dimension table" is not be tested

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30106:
-

Assignee: deshanxiao

> DynamicPartitionPruningSuite#"no predicate on the dimension table" is not be 
> tested
> ---
>
> Key: SPARK-30106
> URL: https://issues.apache.org/jira/browse/SPARK-30106
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: deshanxiao
>Assignee: deshanxiao
>Priority: Minor
>
> The test "no predicate on the dimension table is not be tested" has no 
> partiton key. We can change the sql to test it.
> {code:java}
>   Given("no predicate on the dimension table")
>   withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") {
> val df = sql(
>   """
> |SELECT * FROM fact_sk f
> |JOIN dim_store s
> |ON f.date_id = s.store_id
>   """.stripMargin)
> checkPartitionPruningPredicate(df, false, false)
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30106) DynamicPartitionPruningSuite#"no predicate on the dimension table" is not be tested

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30106.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26744
[https://github.com/apache/spark/pull/26744]

> DynamicPartitionPruningSuite#"no predicate on the dimension table" is not be 
> tested
> ---
>
> Key: SPARK-30106
> URL: https://issues.apache.org/jira/browse/SPARK-30106
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: deshanxiao
>Assignee: deshanxiao
>Priority: Minor
> Fix For: 3.0.0
>
>
> The test "no predicate on the dimension table is not be tested" has no 
> partiton key. We can change the sql to test it.
> {code:java}
>   Given("no predicate on the dimension table")
>   withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") {
> val df = sql(
>   """
> |SELECT * FROM fact_sk f
> |JOIN dim_store s
> |ON f.date_id = s.store_id
>   """.stripMargin)
> checkPartitionPruningPredicate(df, false, false)
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29748) Remove sorting of fields in PySpark SQL Row creation

2019-12-03 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987252#comment-16987252
 ] 

Bryan Cutler commented on SPARK-29748:
--

[~zero323] I made some updates to the PR to remove the _LegacyRow and the 
OrderedDict option, and, as you suggested, Python < 3.6 will automatically fall 
back to the legacy behavior of sorting and print a warning to the user.
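
For reference, a minimal sketch (not the PR itself) of the pre-3.0 kwargs behaviour being discussed: fields get sorted alphabetically, which is what surprises users when a schema is applied later.

{code:python}
# Hedged illustration of the legacy kwargs behaviour (Spark 2.x).
from pyspark.sql import Row

r = Row(value=1, index=2)
print(r)         # Row(index=2, value=1) -- fields were sorted alphabetically
print(r.value)   # 1 -- access by name still works
{code}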

> Remove sorting of fields in PySpark SQL Row creation
> 
>
> Key: SPARK-29748
> URL: https://issues.apache.org/jira/browse/SPARK-29748
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Currently, when a PySpark Row is created with keyword arguments, the fields 
> are sorted alphabetically. This has created a lot of confusion with users 
> because it is not obvious (although it is stated in the pydocs) that they 
> will be sorted alphabetically, and then an error can occur later when 
> applying a schema and the field order does not match.
> The original reason for sorting fields is because kwargs in python < 3.6 are 
> not guaranteed to be in the same order that they were entered. Sorting 
> alphabetically would ensure a consistent order.  Matters are further 
> complicated with the flag {{__from_dict__}} that allows the {{Row}} fields to 
> be referenced by name when made by kwargs, but this flag is not serialized 
> with the Row and leads to inconsistent behavior.
> This JIRA proposes that any sorting of the Fields is removed. Users with 
> Python 3.6+ creating Rows with kwargs can continue to do so since Python will 
> ensure the order is the same as entered. Users with Python < 3.6 will have to 
> create Rows with an OrderedDict or by using the Row class as a factory 
> (explained in the pydoc).  If kwargs are used, an error will be raised or 
> based on a conf setting it can fall back to a LegacyRow that will sort the 
> fields as before. This LegacyRow will be immediately deprecated and removed 
> once support for Python < 3.6 is dropped.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs

2019-12-03 Thread Tim Kellogg (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987230#comment-16987230
 ] 

Tim Kellogg commented on SPARK-30063:
-

Improvement suggestions:
 * Set the PYTHONHASHSEED environment variable so that the whole cluster shares 
the same value. I suggest the value remain random, for the same reason it is 
random in the first place. This would alleviate a lot of confusing PySpark bugs 
that look like concurrency bugs.
 * Make Spark friendlier to different Arrow versions, or force-override the 
PyArrow version so that it is harder to get wrong.
 * Consider upgrading Arrow. Newer versions are even friendlier to use.

I'd love to help out, even contribute if possible.
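
To illustrate the first suggestion only (a hedged sketch, not a patch): the variable has to reach every Python worker, and for the driver process it normally has to be exported before the interpreter starts (e.g. in spark-env.sh or the shell); spark.executorEnv.* is one way to propagate it to executor-side workers. The value 0 below is an arbitrary assumption.

{code:python}
# Hedged sketch: give executor-side Python workers the same PYTHONHASHSEED.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executorEnv.PYTHONHASHSEED", "0")  # arbitrary value
         .getOrCreate())
{code}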

> Failure when returning a value from multiple Pandas UDFs
> 
>
> Key: SPARK-30063
> URL: https://issues.apache.org/jira/browse/SPARK-30063
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 2.4.4
> Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on 
> both 2.4.3 and 2.4.4
>Reporter: Tim Kellogg
>Priority: Major
> Attachments: spark-debug.txt, variety-of-schemas.ipynb
>
>
> I have 20 Pandas UDFs that I'm trying to evaluate all at the same time.
>  * PandasUDFType.GROUPED_AGG
>  * 3 columns in the input data frame being serialized over Arrow to Python 
> worker. See below for clarification.
>  * All functions take 2 parameters, some combination of the 3 received as 
> Arrow input.
>  * Varying return types, see details below.
> _*I get an IllegalArgumentException on the Scala side of the worker when 
> deserializing from Python.*_
> h2. Exception & Stack Trace
> {code:java}
> 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> 

[jira] [Comment Edited] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-03 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987120#comment-16987120
 ] 

Aman Omer edited comment on SPARK-29667 at 12/3/19 7:03 PM:


Actually there is another JIRA for decimal type mismatching due to precision
https://issues.apache.org/jira/browse/SPARK-29600


was (Author: aman_omer):
Actually there is also JIRA for decimal type mismatching due to precision
https://issues.apache.org/jira/browse/SPARK-29600

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into error on this sql
> Mismatched columns:
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> the sql and clause
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just 
> fine. Can the sql engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs

2019-12-03 Thread Bryan Cutler (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987154#comment-16987154
 ] 

Bryan Cutler commented on SPARK-30063:
--

I haven't looked at your bug report in detail, but you are right that there was 
a change in the Arrow message format in 0.15.0+. To maintain compatibility 
with current versions of Spark, an environment variable can be set by adding it 
to conf/spark-env.sh:

{{ARROW_PRE_0_15_IPC_FORMAT=1}}

It's described a little more in the docs here 
[https://github.com/apache/spark/blob/master/docs/sql-pyspark-pandas-with-arrow.md#compatibiliy-setting-for-pyarrow--0150-and-spark-23x-24x]

Could you try this out and see if it solves your issue?
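
For reference, a hedged sketch of what the workaround looks like in practice; the spark-env.sh route is the one documented in the link above, and propagating the variable via spark.executorEnv.* is shown only as an illustration, not as the documented method:

{code:python}
# Documented route (per the linked docs), in conf/spark-env.sh:
#   ARROW_PRE_0_15_IPC_FORMAT=1
#
# Illustration only: propagating the same variable to executor-side Python
# workers from the application itself.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executorEnv.ARROW_PRE_0_15_IPC_FORMAT", "1")
         .getOrCreate())
{code}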

> Failure when returning a value from multiple Pandas UDFs
> 
>
> Key: SPARK-30063
> URL: https://issues.apache.org/jira/browse/SPARK-30063
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 2.4.4
> Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on 
> both 2.4.3 and 2.4.4
>Reporter: Tim Kellogg
>Priority: Major
> Attachments: spark-debug.txt, variety-of-schemas.ipynb
>
>
> I have 20 Pandas UDFs that I'm trying to evaluate all at the same time.
>  * PandasUDFType.GROUPED_AGG
>  * 3 columns in the input data frame being serialized over Arrow to Python 
> worker. See below for clarification.
>  * All functions take 2 parameters, some combination of the 3 received as 
> Arrow input.
>  * Varying return types, see details below.
> _*I get an IllegalArgumentException on the Scala side of the worker when 
> deserializing from Python.*_
> h2. Exception & Stack Trace
> {code:java}
> 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> 

[jira] [Commented] (SPARK-29988) Adjust Jenkins jobs for `hive-1.2/2.3` combination

2019-12-03 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987148#comment-16987148
 ] 

Shane Knapp commented on SPARK-29988:
-

I probably won't get around to this until next week...

> Adjust Jenkins jobs for `hive-1.2/2.3` combination
> --
>
> Key: SPARK-29988
> URL: https://issues.apache.org/jira/browse/SPARK-29988
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Shane Knapp
>Priority: Major
>
> We need to rename the following Jenkins jobs first.
> spark-master-test-sbt-hadoop-2.7 -> spark-master-test-sbt-hadoop-2.7-hive-1.2
> spark-master-test-sbt-hadoop-3.2 -> spark-master-test-sbt-hadoop-3.2-hive-2.3
> spark-master-test-maven-hadoop-2.7 -> 
> spark-master-test-maven-hadoop-2.7-hive-1.2
> spark-master-test-maven-hadoop-3.2 -> 
> spark-master-test-maven-hadoop-3.2-hive-2.3
> Also, we need to add `-Phive-1.2` for the existing `hadoop-2.7` jobs.
> {code}
> -Phive \
> +-Phive-1.2 \
> {code}
> And, we need to add `-Phive-2.3` for the existing `hadoop-3.2` jobs.
> {code}
> -Phive \
> +-Phive-2.3 \
> {code}
> For now, I added the above `-Phive-1.2` and `-Phive-2.3` to the Jenkins 
> jobs manually. (This should be added to the SCM of AmpLab Jenkins.)
> After SPARK-29981, we need to create two new jobs.
> - spark-master-test-sbt-hadoop-2.7-hive-2.3
> - spark-master-test-maven-hadoop-2.7-hive-2.3
> This is for preparation for Apache Spark 3.0.0.
> We may drop all `*-hive-1.2` jobs at Apache Spark 3.1.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30111) spark R dockerfile fails to build

2019-12-03 Thread Shane Knapp (Jira)
Shane Knapp created SPARK-30111:
---

 Summary: spark R dockerfile fails to build
 Key: SPARK-30111
 URL: https://issues.apache.org/jira/browse/SPARK-30111
 Project: Spark
  Issue Type: Bug
  Components: Build, jenkins, Kubernetes
Affects Versions: 3.0.0
Reporter: Shane Knapp


all recent k8s builds have been failing when trying to build the sparkR 
dockerfile:

[https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/19565/console]

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/426/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s/]

[https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/76/console|https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-jdk11/]

[~ifilonenko]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-12-03 Thread Shane Knapp (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987143#comment-16987143
 ] 

Shane Knapp commented on SPARK-29106:
-

got it, thanks!

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, 
> SparkR-and-pyspark36-testing.txt, arm-python36.txt
>
>
> Add arm test jobs to amplab jenkins for spark.
> So far we have made two periodic ARM test jobs for Spark in OpenLab: one is 
> based on master with hadoop 2.7 (similar to the QA test of amplab jenkins), the 
> other is based on a new branch which we made on 09-09, see  
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>   and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the ARM test with 
> amplab jenkins.
> About the k8s test on ARM, we have tested it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later. 
> We also plan to test other stable branches, and we can integrate them into 
> amplab when they are ready.
> We have offered an ARM instance and sent the info to shane knapp; thanks 
> shane for adding the first ARM job to amplab jenkins :) 
> The other important thing is about the leveldbjni 
> [https://github.com/fusesource/leveldbjni,|https://github.com/fusesource/leveldbjni/issues/80]
>  spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  we can see there is no arm64 support. So we built an arm64-supporting 
> release of leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the spark pom.xml directly with something like 
> 'property'/'profile' to choose the correct jar package on the arm or x86 platform, 
> because spark depends on some hadoop packages like hadoop-hdfs, and those packages 
> depend on leveldbjni-all-1.8 too, unless hadoop releases with a new arm-supporting 
> leveldbjni jar. For now we download the leveldbjni-all-1.8 of 
> openlabtesting and 'mvn install' it when testing spark on arm.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-03 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987120#comment-16987120
 ] 

Aman Omer commented on SPARK-29667:
---

Actually there is also JIRA for decimal type mismatching due to precision
https://issues.apache.org/jira/browse/SPARK-29600

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into error on this sql
> Mismatched columns:
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> the sql and clause
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just 
> fine. Can the sql engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-03 Thread Aman Omer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987119#comment-16987119
 ] 

Aman Omer commented on SPARK-29667:
---

Hi [~jessielin], did you observe the same issue with any other data type?

{code:java}
SELECT * FROM (VALUES (1.0), (2.0) AS t1(col1))
WHERE col1 IN
(SELECT * FROM (VALUES (1.0), (2.0), (3.00) AS t2(col2)));
{code}
The above query will fail with the following error:

{noformat}
Mismatched columns:
[(__auto_generated_subquery_name.`col1`:decimal(2,1), 
__auto_generated_subquery_name.`col2`:decimal(3,2))]
Left side:
[decimal(2,1)].
Right side:
[decimal(3,2)].
{noformat}

This is because `3.00` in the subquery upcasts all elements in the column to decimal(3,2).
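
For completeness, an untested sketch of the explicit cast that makes the example above type-check (the names follow the snippet above):

{code:python}
# Hedged, untested sketch: casting the left side to the wider decimal type
# avoids the IN-subquery type mismatch shown above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sql("""
  SELECT * FROM (VALUES (1.0), (2.0) AS t1(col1))
  WHERE CAST(col1 AS DECIMAL(3,2)) IN
    (SELECT * FROM (VALUES (1.0), (2.0), (3.00) AS t2(col2)))
""").show()
{code}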

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into error on this sql
> Mismatched columns:
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> the sql and clause
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> Once I cast decimal(18,0) to decimal(28,0) explicitly above, the sql ran just 
> fine. Can the sql engine cast implicitly in this case?
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30012) Change classes extending scala collection classes to work with 2.13

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30012.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26728
[https://github.com/apache/spark/pull/26728]

> Change classes extending scala collection classes to work with 2.13
> ---
>
> Key: SPARK-30012
> URL: https://issues.apache.org/jira/browse/SPARK-30012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.0.0
>
>
> Several classes in the code base extend Scala collection classes, like:
> - ExternalAppendOnlyMap
> - CaseInsensitiveMap
> - AttributeMap
> - (probably more I haven't gotten to yet)
> The collection hierarchy changed in 2.13, and like elsewhere, I don't see 
> that it's possible to use a compat library or clever coding to bridge the 
> difference. I think we need to maintain two implementations in parallel 
> source trees, possibly with a common ancestor with the commonalities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30012) Change classes extending scala collection classes to work with 2.13

2019-12-03 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30012:
-

Assignee: Sean R. Owen

> Change classes extending scala collection classes to work with 2.13
> ---
>
> Key: SPARK-30012
> URL: https://issues.apache.org/jira/browse/SPARK-30012
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Minor
>
> Several classes in the code base extend Scala collection classes, like:
> - ExternalAppendOnlyMap
> - CaseInsensitiveMap
> - AttributeMap
> - (probably more I haven't gotten to yet)
> The collection hierarchy changed in 2.13, and like elsewhere, I don't see 
> that it's possible to use a compat library or clever coding to bridge the 
> difference. I think we need to maintain two implementations in parallel 
> source trees, possibly with a common ancestor with the commonalities.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29477) Improve tooltip information for Streaming Tab

2019-12-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29477:
-
Priority: Minor  (was: Major)

> Improve tooltip information for Streaming  Tab
> --
>
> Key: SPARK-29477
> URL: https://issues.apache.org/jira/browse/SPARK-29477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Rakesh Raushan
>Priority: Minor
> Fix For: 3.0.0
>
>
> The Active Batches and Completed Batches tables can be revisited to add proper 
> tooltips for the Batch Time and Record columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29477) Improve tooltip information for Streaming Tab

2019-12-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-29477:


Assignee: Rakesh Raushan

> Improve tooltip information for Streaming  Tab
> --
>
> Key: SPARK-29477
> URL: https://issues.apache.org/jira/browse/SPARK-29477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Rakesh Raushan
>Priority: Major
>
> The Active Batches and Completed Batches tables can be revisited to add proper 
> tooltips for the Batch Time and Record columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29477) Improve tooltip information for Streaming Tab

2019-12-03 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-29477.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26467
[https://github.com/apache/spark/pull/26467]

> Improve tooltip information for Streaming  Tab
> --
>
> Key: SPARK-29477
> URL: https://issues.apache.org/jira/browse/SPARK-29477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Rakesh Raushan
>Priority: Major
> Fix For: 3.0.0
>
>
> The Active Batches and Completed Batches tables can be revisited to add proper 
> tooltips for the Batch Time and Record columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30082) Zeros are being treated as NaNs

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30082.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26738
[https://github.com/apache/spark/pull/26738]

> Zeros are being treated as NaNs
> ---
>
> Key: SPARK-30082
> URL: https://issues.apache.org/jira/browse/SPARK-30082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: John Ayad
>Priority: Major
> Fix For: 3.0.0
>
>
> If you attempt to run
> {code:java}
> df = df.replace(float('nan'), somethingToReplaceWith)
> {code}
> It will replace all {{0}} s in columns of type {{Integer}}
> Example code snippet to repro this:
> {code:java}
> from pyspark.sql import SQLContext
> spark = SQLContext(sc).sparkSession
> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value"))
> df.show()
> df = df.replace(float('nan'), 5)
> df.show()
> {code}
> Here's the output I get when I run this code:
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.4.4
>   /_/
> Using Python version 3.7.5 (default, Nov  1 2019 02:16:32)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SQLContext
> >>> spark = SQLContext(sc).sparkSession
> >>> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value"))
> >>> df.show()
> +-+-+
> |index|value|
> +-+-+
> |1|0|
> |2|3|
> |3|0|
> +-+-+
> >>> df = df.replace(float('nan'), 5)
> >>> df.show()
> +-+-+
> |index|value|
> +-+-+
> |1|5|
> |2|3|
> |3|5|
> +-+-+
> >>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout

2019-12-03 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987037#comment-16987037
 ] 

Thomas Graves commented on SPARK-18886:
---

Note there is discussion on this subject in these PRs:

[https://github.com/apache/spark/pull/26633] (hack to work around it for a 
particular RDD)

PR with a proposed solution (really more of a discussion starter):

[https://github.com/apache/spark/pull/26696]

 

I believe my proposal is similar to Kay's, where we use slots and track the 
delay per slot. I haven't looked at the code in specific detail, especially in 
the FairScheduler, where most of the issues in the conversations above were 
mentioned. One way around this is to have different policies and allow users to 
configure them, or have one for the FairScheduler and one for FIFO.
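
For anyone hitting this in the meantime, a hedged sketch of workaround (1) from the ticket description quoted below; the concrete values are assumptions, not recommendations:

{code:python}
# Hedged sketch: shorten the per-level locality waits so that short tasks do
# not pin the whole task set to a single preferred executor. Tune the values
# to your own task durations (see the workaround notes in the description).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.locality.wait.process", "1s")
         .config("spark.locality.wait.node", "0")
         .config("spark.locality.wait.rack", "0")
         .getOrCreate())
{code}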

> Delay scheduling should not delay some executors indefinitely if one task is 
> scheduled before delay timeout
> ---
>
> Key: SPARK-18886
> URL: https://issues.apache.org/jira/browse/SPARK-18886
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> Delay scheduling can introduce an unbounded delay and underutilization of 
> cluster resources under the following circumstances:
> 1. Tasks have locality preferences for a subset of available resources
> 2. Tasks finish in less time than the delay scheduling.
> Instead of having *one* delay to wait for resources with better locality, 
> spark waits indefinitely.
> As an example, consider a cluster with 100 executors, and a taskset with 500 
> tasks.  Say all tasks have a preference for one executor, which is by itself 
> on one host.  Given the default locality wait of 3s per level, we end up with 
> a 6s delay till we schedule on other hosts (process wait + host wait).
> If each task takes 5 seconds (under the 6 second delay), then _all 500_ tasks 
> get scheduled on _only one_ executor.  This means you're only using a 1% of 
> your cluster, and you get a ~100x slowdown.  You'd actually be better off if 
> tasks took 7 seconds.
> *WORKAROUNDS*: 
> (1) You can change the locality wait times so that it is shorter than the 
> task execution time.  You need to take into account the sum of all wait times 
> to use all the resources on your cluster.  For example, if you have resources 
> on different racks, this will include the sum of 
> "spark.locality.wait.process" + "spark.locality.wait.node" + 
> "spark.locality.wait.rack".  Those each default to "3s".  The simplest way to 
> be to set "spark.locality.wait.process" to your desired wait interval, and 
> set both "spark.locality.wait.node" and "spark.locality.wait.rack" to "0".  
> For example, if your tasks take ~3 seconds on average, you might set 
> "spark.locality.wait.process" to "1s".  *NOTE*: due to SPARK-18967, avoid 
> setting the {{spark.locality.wait=0}} -- instead, use 
> {{spark.locality.wait=1ms}}.
> Note that this workaround isn't perfect --with less delay scheduling, you may 
> not get as good resource locality.  After this issue is fixed, you'd most 
> likely want to undo these configuration changes.
> (2) The worst case here will only happen if your tasks have extreme skew in 
> their locality preferences.  Users may be able to modify their job to 
> controlling the distribution of the original input data.
> (2a) A shuffle may end up with very skewed locality preferences, especially 
> if you do a repartition starting from a small number of partitions.  (Shuffle 
> locality preference is assigned if any node has more than 20% of the shuffle 
> input data -- by chance, you may have one node just above that threshold, and 
> all other nodes just below it.)  In this case, you can turn off locality 
> preference for shuffle data by setting 
> {{spark.shuffle.reduceLocality.enabled=false}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30082) Zeros are being treated as NaNs

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30082:
---

Assignee: John Ayad

> Zeros are being treated as NaNs
> ---
>
> Key: SPARK-30082
> URL: https://issues.apache.org/jira/browse/SPARK-30082
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: John Ayad
>Assignee: John Ayad
>Priority: Major
> Fix For: 3.0.0
>
>
> If you attempt to run
> {code:java}
> df = df.replace(float('nan'), somethingToReplaceWith)
> {code}
> It will replace all {{0}} s in columns of type {{Integer}}
> Example code snippet to repro this:
> {code:java}
> from pyspark.sql import SQLContext
> spark = SQLContext(sc).sparkSession
> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value"))
> df.show()
> df = df.replace(float('nan'), 5)
> df.show()
> {code}
> Here's the output I get when I run this code:
> {code:java}
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/__ / .__/\_,_/_/ /_/\_\   version 2.4.4
>   /_/
> Using Python version 3.7.5 (default, Nov  1 2019 02:16:32)
> SparkSession available as 'spark'.
> >>> from pyspark.sql import SQLContext
> >>> spark = SQLContext(sc).sparkSession
> >>> df = spark.createDataFrame([(1, 0), (2, 3), (3, 0)], ("index", "value"))
> >>> df.show()
> +-+-+
> |index|value|
> +-+-+
> |1|0|
> |2|3|
> |3|0|
> +-+-+
> >>> df = df.replace(float('nan'), 5)
> >>> df.show()
> +-+-+
> |index|value|
> +-+-+
> |1|5|
> |2|3|
> |3|5|
> +-+-+
> >>>
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30083) visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30083:
---

Assignee: Kent Yao

> visitArithmeticUnary should wrap PLUS case with UnaryPositive for type 
> checking
> ---
>
> Key: SPARK-30083
> URL: https://issues.apache.org/jira/browse/SPARK-30083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> For the PLUS case, visitArithmeticUnary does not wrap the expr with UnaryPositive, 
> so it escapes type checking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30083) visitArithmeticUnary should wrap PLUS case with UnaryPositive for type checking

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30083.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26716
[https://github.com/apache/spark/pull/26716]

> visitArithmeticUnary should wrap PLUS case with UnaryPositive for type 
> checking
> ---
>
> Key: SPARK-30083
> URL: https://issues.apache.org/jira/browse/SPARK-30083
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> For the PLUS case, visitArithmeticUnary does not wrap the expr with UnaryPositive, 
> so it escapes type checking.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30048.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26681
[https://github.com/apache/spark/pull/26681]

> Enable aggregates with interval type values for RelationalGroupedDataset 
> -
>
> Key: SPARK-30048
> URL: https://issues.apache.org/jira/browse/SPARK-30048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> Now that min/max/sum/avg are supported for intervals, we should also enable them 
> in RelationalGroupedDataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30048) Enable aggregates with interval type values for RelationalGroupedDataset

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30048:
---

Assignee: Kent Yao

> Enable aggregates with interval type values for RelationalGroupedDataset 
> -
>
> Key: SPARK-30048
> URL: https://issues.apache.org/jira/browse/SPARK-30048
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> Now that min/max/sum/avg are supported for intervals, we should also enable them 
> in RelationalGroupedDataset.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter

2019-12-03 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986859#comment-16986859
 ] 

Jungtaek Lim commented on SPARK-30101:
--

I'm not sure how the configuration page is constructed, but since that page only 
guides readers to retrieve the list of configurations for Spark SQL, it doesn't 
enumerate these configs in the global configuration page; Spark SQL has too many 
configurations by itself.

[https://spark.apache.org/docs/latest/configuration.html#spark-sql]

There's already a doc describing `spark.sql.shuffle.partitions`, though it is 
introduced under "other configuration options". We may deal with this if we 
strongly agree about the need to prioritize it.

[http://spark.apache.org/docs/latest/sql-performance-tuning.html]

Btw, `coalesce` may cover the need for an optional numPartitions parameter, but I 
guess some people may not find that convenient.

> spark.sql.shuffle.partitions is not in Configuration docs, but a very 
> critical parameter
> 
>
> Key: SPARK-30101
> URL: https://issues.apache.org/jira/browse/SPARK-30101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: sam
>Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> when I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
> in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives 
> `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.  
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
> correctly.
> Finally I notice that the good old `RDD` interface has a `distinct` that 
> accepts `numPartitions` partitions, while `Dataset` does not.
> ...
> According to below comments, it uses spark.sql.shuffle.partitions, which 
> needs documenting in configuration.
> > Default number of partitions in RDDs returned by transformations like join, 
> > reduceByKey, and parallelize when not set by user.
> in https://spark.apache.org/docs/latest/configuration.html should say
> > Default number of partitions in RDDs, but not DS/DF (see 
> > spark.sql.shuffle.partitions) returned by transformations like join, 
> > reduceByKey, and parallelize when not set by user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30063) Failure when returning a value from multiple Pandas UDFs

2019-12-03 Thread Ruben Berenguel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986817#comment-16986817
 ] 

Ruben Berenguel commented on SPARK-30063:
-

Wow, this looks bad for now (since grouped_aggs are one of the "super cool 
things you can do now in Arrow/Pandas"); it means we may need some kind of 
regression testing for Arrow usage in PySpark, or some stronger version 
pinning.

Tagging more knowledgeable people in this area of Spark so they are aware too: 
[~hyukjin.kwon] [~bryanc]

Thanks [~tkellogg], this definitely seems to be enough to move it forward (way 
more than enough, big thanks!)

> Failure when returning a value from multiple Pandas UDFs
> 
>
> Key: SPARK-30063
> URL: https://issues.apache.org/jira/browse/SPARK-30063
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3, 2.4.4
> Environment: Happens on Mac & Ubuntu (Docker). Seems to happen on 
> both 2.4.3 and 2.4.4
>Reporter: Tim Kellogg
>Priority: Major
> Attachments: spark-debug.txt, variety-of-schemas.ipynb
>
>
> I have 20 Pandas UDFs that I'm trying to evaluate all at the same time.
>  * PandasUDFType.GROUPED_AGG
>  * 3 columns in the input data frame being serialized over Arrow to Python 
> worker. See below for clarification.
>  * All functions take 2 parameters, some combination of the 3 received as 
> Arrow input.
>  * Varying return types, see details below.
> _*I get an IllegalArgumentException on the Scala side of the worker when 
> deserializing from Python.*_
> h2. Exception & Stack Trace
> {code:java}
> 19/11/27 11:38:36 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
> java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
>   at 
> org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:162)
>   at 
> org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:122)
>   at 
> org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:410)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:123)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> 19/11/27 11:38:36 WARN TaskSetManager: Lost task 0.0 in stage 5.0 (TID 5, 
> localhost, executor driver): java.lang.IllegalArgumentException
>   at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
>   at 
> org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
>   at 
> org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
>   at 
> org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)
>   at 
> 

[jira] [Created] (SPARK-30110) Support type judgment for ArrayData

2019-12-03 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-30110:
--

 Summary: Support type judgment for ArrayData
 Key: SPARK-30110
 URL: https://issues.apache.org/jira/browse/SPARK-30110
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: jiaan.geng
 Fix For: 3.0.0


ArrayData only exposes interfaces for getting data, such as getBoolean and 
getByte.

When the element type of an array is unknown, I want to determine the element 
type first.

I have a PR in progress:

[https://github.com/beliefer/spark/commit/5787c6f062795aa7931c58ac2302ba607d3a97aa]

The array function ArrayNDims is used to get the dimension of an array.

If ArrayData can support interfaces such as isBoolean, isByte, and isArray, my 
job will be easier.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30109) PCA use BLAS.gemv with sparse vector

2019-12-03 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30109:


 Summary: PCA use BLAS.gemv with sparse vector
 Key: SPARK-30109
 URL: https://issues.apache.org/jira/browse/SPARK-30109
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.0.0
Reporter: zhengruifeng


When PCA was first implemented in 
[SPARK-5521|https://issues.apache.org/jira/browse/SPARK-5521], Matrix.multiply 
(BLAS.gemv internally) did not support sparse vectors, so the implementation 
worked around it by applying a more complex matrix multiplication.

Since [SPARK-7681|https://issues.apache.org/jira/browse/SPARK-7681], BLAS.gemv 
has supported sparse vectors, so we can use Matrix.multiply directly now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30108) Add robust accumulator for observable metrics

2019-12-03 Thread Jira
Herman van Hövell created SPARK-30108:
-

 Summary: Add robust accumulator for observable metrics
 Key: SPARK-30108
 URL: https://issues.apache.org/jira/browse/SPARK-30108
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Herman van Hövell






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29348) Add observable metrics

2019-12-03 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-29348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-29348.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Add observable metrics
> --
>
> Key: SPARK-29348
> URL: https://issues.apache.org/jira/browse/SPARK-29348
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30107) Expose nested schema pruning to all V2 sources

2019-12-03 Thread Anton Okolnychyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986756#comment-16986756
 ] 

Anton Okolnychyi commented on SPARK-30107:
--

I'll submit a PR

> Expose nested schema pruning to all V2 sources
> --
>
> Key: SPARK-30107
> URL: https://issues.apache.org/jira/browse/SPARK-30107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> I think it would be great to expose the existing logic for nested schema 
> pruning to all V2 sources, which is in line with the description of 
> {{SupportsPushDownRequiredColumns}} . That way, all sources that are capable 
> of pruning nested columns will benefit from this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30107) Expose nested schema pruning to all V2 sources

2019-12-03 Thread Anton Okolnychyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Okolnychyi updated SPARK-30107:
-
Description: I think it would be great to expose the existing logic for 
nested schema pruning to all V2 sources, which is in line with the description 
of {{SupportsPushDownRequiredColumns}} . That way, all sources that are capable 
of pruning nested columns will benefit from this.  (was: I think it would be 
great to expose the existing logic for nested schema pruning to all V2 sources, 
which is in line with the description of `SupportsPushDownRequiredColumns` . 
That way, all sources that are capable of pruning nested columns will benefit 
from this.)

> Expose nested schema pruning to all V2 sources
> --
>
> Key: SPARK-30107
> URL: https://issues.apache.org/jira/browse/SPARK-30107
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Anton Okolnychyi
>Priority: Major
>
> I think it would be great to expose the existing logic for nested schema 
> pruning to all V2 sources, which is in line with the description of 
> {{SupportsPushDownRequiredColumns}} . That way, all sources that are capable 
> of pruning nested columns will benefit from this.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30107) Expose nested schema pruning to all V2 sources

2019-12-03 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-30107:


 Summary: Expose nested schema pruning to all V2 sources
 Key: SPARK-30107
 URL: https://issues.apache.org/jira/browse/SPARK-30107
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Anton Okolnychyi


I think it would be great to expose the existing logic for nested schema 
pruning to all V2 sources, which is in line with the description of 
`SupportsPushDownRequiredColumns` . That way, all sources that are capable of 
pruning nested columns will benefit from this.
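
A minimal sketch of the hand-off point, assuming the Spark 3.0 connector API 
(the source itself is made up; only the interfaces are real):

```scala
import org.apache.spark.sql.connector.read.{Scan, SupportsPushDownRequiredColumns}
import org.apache.spark.sql.types.StructType

// Illustrative scan builder for a custom V2 source. With nested schema pruning
// exposed to all V2 sources, the requiredSchema passed to pruneColumns can
// contain pruned nested structs (e.g. only person.name out of a person struct),
// not just pruned top-level columns.
class ExampleScanBuilder(fullSchema: StructType) extends SupportsPushDownRequiredColumns {
  private var requiredSchema: StructType = fullSchema

  override def pruneColumns(requiredSchema: StructType): Unit = {
    this.requiredSchema = requiredSchema
  }

  override def build(): Scan = new Scan {
    // The source then reads only what Spark asked for.
    override def readSchema(): StructType = requiredSchema
  }
}
```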



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29106) Add jenkins arm test for spark

2019-12-03 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986743#comment-16986743
 ] 

huangtianhua commented on SPARK-29106:
--

[~shaneknapp], we no longer have to install leveldbjni-all manually; the PR has 
been merged ([https://github.com/apache/spark/pull/26636]). Please add the 
profile -Paarch64 when running the Maven commands for the ARM testing jobs, 
thanks.

> Add jenkins arm test for spark
> --
>
> Key: SPARK-29106
> URL: https://issues.apache.org/jira/browse/SPARK-29106
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Priority: Minor
> Attachments: R-ansible.yml, R-libs.txt, 
> SparkR-and-pyspark36-testing.txt, arm-python36.txt
>
>
> Add ARM test jobs to AMPLab Jenkins for Spark.
> So far we have set up two periodic ARM test jobs for Spark in OpenLab: one is 
> based on master with Hadoop 2.7 (similar to the QA test on AMPLab Jenkins), 
> the other is based on a new branch we cut on 09-09, see 
> [http://status.openlabtesting.org/builds/job/spark-master-unit-test-hadoop-2.7-arm64]
>  and 
> [http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64.|http://status.openlabtesting.org/builds/job/spark-unchanged-branch-unit-test-hadoop-2.7-arm64]
>  We only have to care about the first one when integrating the ARM test with 
> AMPLab Jenkins.
> As for the k8s test on ARM, we have already tested it, see 
> [https://github.com/theopenlab/spark/pull/17]; maybe we can integrate it 
> later. 
> We also plan to test other stable branches and can integrate them into 
> AMPLab when they are ready.
> We have offered an ARM instance and sent the details to Shane Knapp; thanks 
> Shane for adding the first ARM job to AMPLab Jenkins :) 
> The other important thing is about leveldbjni 
> [https://github.com/fusesource/leveldbjni|https://github.com/fusesource/leveldbjni/issues/80].
>  Spark depends on leveldbjni-all-1.8 
> [https://mvnrepository.com/artifact/org.fusesource.leveldbjni/leveldbjni-all/1.8],
>  which has no arm64 support. So we built an arm64-supporting release of 
> leveldbjni, see 
> [https://mvnrepository.com/artifact/org.openlabtesting.leveldbjni/leveldbjni-all/1.8],
>  but we can't modify the Spark pom.xml directly with something like a 
> 'property'/'profile' to choose the correct jar on ARM or x86, because Spark 
> depends on Hadoop packages like hadoop-hdfs that also depend on 
> leveldbjni-all-1.8, unless Hadoop releases with a new ARM-supporting 
> leveldbjni jar. For now we download the leveldbjni-all-1.8 from 
> openlabtesting and 'mvn install' it when testing Spark on ARM.
> PS: The issues found and fixed:
>  SPARK-28770
>  [https://github.com/apache/spark/pull/25673]
>   
>  SPARK-28519
>  [https://github.com/apache/spark/pull/25279]
>   
>  SPARK-28433
>  [https://github.com/apache/spark/pull/25186]
>  
> SPARK-28467
> [https://github.com/apache/spark/pull/25864]
>  
> SPARK-29286
> [https://github.com/apache/spark/pull/26021]
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29537) throw exception when user defined a wrong base path

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29537:
---

Assignee: wuyi

> throw exception when user defined a wrong base path
> ---
>
> Key: SPARK-29537
> URL: https://issues.apache.org/jira/browse/SPARK-29537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>
>  
> When the user gives a base path that is not an ancestor directory of the 
> input paths, we should throw an exception to let the user know.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29537) throw exception when user defined a wrong base path

2019-12-03 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29537.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26195
[https://github.com/apache/spark/pull/26195]

> throw exception when user defined a wrong base path
> ---
>
> Key: SPARK-29537
> URL: https://issues.apache.org/jira/browse/SPARK-29537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
> Fix For: 3.0.0
>
>
>  
> When the user gives a base path that is not an ancestor directory of the 
> input paths, we should throw an exception to let the user know.
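
A minimal sketch of the situation, assuming Parquet data partitioned by year 
under /data/events (the paths are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession

object BasePathSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("basePath-check").master("local[*]").getOrCreate()

    // OK: basePath is an ancestor of the input path, so the partition column
    // "year" is resolved relative to /data/events.
    val ok = spark.read
      .option("basePath", "/data/events")
      .parquet("/data/events/year=2019")

    // Wrong: /tmp/other is not an ancestor of the input path. Per this ticket,
    // Spark should throw an exception here to let the user know, instead of
    // proceeding with a confusing partition layout.
    val bad = spark.read
      .option("basePath", "/tmp/other")
      .parquet("/data/events/year=2019")
  }
}
```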



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter

2019-12-03 Thread sam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-30101:

Description: 
I'm creating a `SparkSession` like this:

```
SparkSession
  .builder().appName("foo").master("local")
  .config("spark.default.parallelism", 2).getOrCreate()
```

when I run

```
((1 to 10) ++ (1 to 10)).toDS().distinct().count()
```

I get 200 partitions

```
19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
...
19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
in 46 ms on localhost (executor driver) (1/200)
```

It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives `2`, 
while `ds.distinct().rdd.getNumPartitions` gives `200`.  
`ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
correctly.

Finally I notice that the good old `RDD` interface has a `distinct` that 
accepts `numPartitions` partitions, while `Dataset` does not.

...

According to the comments below, it uses spark.sql.shuffle.partitions, which 
needs documenting in the configuration docs.

> Default number of partitions in RDDs returned by transformations like join, 
> reduceByKey, and parallelize when not set by user.

in https://spark.apache.org/docs/latest/configuration.html should say

> Default number of partitions in RDDs, but not DS/DF (see 
> spark.sql.shuffle.partitions) returned by transformations like join, 
> reduceByKey, and parallelize when not set by user.



  was:
I'm creating a `SparkSession` like this:

```
SparkSession
  .builder().appName("foo").master("local")
  .config("spark.default.parallelism", 2).getOrCreate()
```

when I run

```
((1 to 10) ++ (1 to 10)).toDS().distinct().count()
```

I get 200 partitions

```
19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
...
19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
in 46 ms on localhost (executor driver) (1/200)
```

It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives `2`, 
while `ds.distinct().rdd.getNumPartitions` gives `200`.  
`ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
correctly.

Finally I notice that the good old `RDD` interface has a `distinct` that 
accepts `numPartitions` partitions, while `Dataset` does not.





> spark.sql.shuffle.partitions is not in Configuration docs, but a very 
> critical parameter
> 
>
> Key: SPARK-30101
> URL: https://issues.apache.org/jira/browse/SPARK-30101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: sam
>Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> when I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
> in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives 
> `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.  
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
> correctly.
> Finally I notice that the good old `RDD` interface has a `distinct` that 
> accepts `numPartitions` partitions, while `Dataset` does not.
> ...
> According to the comments below, it uses spark.sql.shuffle.partitions, which 
> needs documenting in the configuration docs.
> > Default number of partitions in RDDs returned by transformations like join, 
> > reduceByKey, and parallelize when not set by user.
> in https://spark.apache.org/docs/latest/configuration.html should say
> > Default number of partitions in RDDs, but not DS/DF (see 
> > spark.sql.shuffle.partitions) returned by transformations like join, 
> > reduceByKey, and parallelize when not set by user.
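
A minimal sketch of the behaviour and the knob that actually controls it 
(assuming Spark 2.4.x defaults): `distinct` shuffles with 
`spark.sql.shuffle.partitions` partitions, not `spark.default.parallelism`.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("foo").master("local")
      .config("spark.default.parallelism", 2)    // sizes RDD shuffles
      .config("spark.sql.shuffle.partitions", 2) // sizes Dataset/DataFrame shuffles
      .getOrCreate()
    import spark.implicits._

    val ds = ((1 to 10) ++ (1 to 10)).toDS()
    println(ds.rdd.getNumPartitions)            // follows spark.default.parallelism
    println(ds.distinct().rdd.getNumPartitions) // follows spark.sql.shuffle.partitions
                                                // (200 by default, 2 with the config above)
  }
}
```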



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30101) spark.sql.shuffle.partitions is not in Configuration docs, but a very critical parameter

2019-12-03 Thread sam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam updated SPARK-30101:

Summary: spark.sql.shuffle.partitions is not in Configuration docs, but a 
very critical parameter  (was: Dataset distinct does not respect 
spark.default.parallelism)

> spark.sql.shuffle.partitions is not in Configuration docs, but a very 
> critical parameter
> 
>
> Key: SPARK-30101
> URL: https://issues.apache.org/jira/browse/SPARK-30101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: sam
>Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> when I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
> in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives 
> `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.  
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
> correctly.
> Finally I notice that the good old `RDD` interface has a `distinct` that 
> accepts `numPartitions` partitions, while `Dataset` does not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30101) Dataset distinct does not respect spark.default.parallelism

2019-12-03 Thread sam (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sam reopened SPARK-30101:
-

What is expected is what is documented.

> Dataset distinct does not respect spark.default.parallelism
> ---
>
> Key: SPARK-30101
> URL: https://issues.apache.org/jira/browse/SPARK-30101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: sam
>Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> when I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
> in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives 
> `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.  
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
> correctly.
> Finally I notice that the good old `RDD` interface has a `distinct` that 
> accepts `numPartitions` partitions, while `Dataset` does not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30101) Dataset distinct does not respect spark.default.parallelism

2019-12-03 Thread sam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986722#comment-16986722
 ] 

sam commented on SPARK-30101:
-

[~cloud_fan] [~kabhwan] Well, this is at least a documentation error, since 
`spark.sql.shuffle.partitions` isn't even in the configuration documentation: 
https://spark.apache.org/docs/latest/configuration.html

Also "Returns a new Dataset that contains only the unique rows from this 
Dataset. This is an alias for dropDuplicates." in 
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset
 should be updated to say "... and repartitions this with 
`spark.sql.shuffle.partitions` partitions".

Do you agree we need a feature ticket to add `numPartitions` as an optional 
param to `distinct` since most shuffle operations have this?

> Dataset distinct does not respect spark.default.parallelism
> ---
>
> Key: SPARK-30101
> URL: https://issues.apache.org/jira/browse/SPARK-30101
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 2.4.4
>Reporter: sam
>Priority: Major
>
> I'm creating a `SparkSession` like this:
> ```
> SparkSession
>   .builder().appName("foo").master("local")
>   .config("spark.default.parallelism", 2).getOrCreate()
> ```
> when I run
> ```
> ((1 to 10) ++ (1 to 10)).toDS().distinct().count()
> ```
> I get 200 partitions
> ```
> 19/12/02 10:29:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 200 tasks
> ...
> 19/12/02 10:29:34 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) 
> in 46 ms on localhost (executor driver) (1/200)
> ```
> It is the `distinct` that is broken since `ds.rdd.getNumPartitions` gives 
> `2`, while `ds.distinct().rdd.getNumPartitions` gives `200`.  
> `ds.rdd.groupBy(identity).map(_._2.head)` and `ds.rdd.distinct()` work 
> correctly.
> Finally I notice that the good old `RDD` interface has a `distinct` that 
> accepts `numPartitions` partitions, while `Dataset` does not.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org