[jira] [Created] (SPARK-23039) Fix the bug in alter table set location.

2018-01-10 Thread xubo245 (JIRA)
xubo245 created SPARK-23039:
---

 Summary:  Fix the bug in alter table set location.
 Key: SPARK-23039
 URL: https://issues.apache.org/jira/browse/SPARK-23039
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: xubo245
Priority: Critical


TODO work: fix the bug in alter table set location, exercised by
org.apache.spark.sql.execution.command.DDLSuite#testSetLocation:

{code:java}
// TODO(gatorsmile): fix the bug in alter table set location.
// if (isUsingHiveMetastore) {
//   assert(storageFormat.properties.get("path") === expected)
// }
{code}

Analysis:

The assertion fails because the table restore sets locationUri and erases the path storage property by passing
{code:java}
newPath = None
{code}
in org.apache.spark.sql.hive.HiveExternalCatalog#restoreDataSourceTable:

{code:java}
val storageWithLocation = {
  val tableLocation = getLocationFromStorageProps(table)
  // We pass None as `newPath` here, to remove the path option in storage properties.
  updateLocationInStorageProps(table, newPath = None).copy(
    locationUri = tableLocation.map(CatalogUtils.stringToURI(_)))
}
{code}

because " We pass None as `newPath` here, to remove the path option in storage 
properties."

And locationUri is obtain from path in storage properties

{code:java}
private def getLocationFromStorageProps(table: CatalogTable): Option[String] = {
  CaseInsensitiveMap(table.storage.properties).get("path")
}
{code}

So the test can assert against locationUri instead of the path property.
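
A minimal sketch of how the commented-out assertion in testSetLocation could be rewritten against the restored locationUri; the conversion back to a string and the type of `expected` are assumptions that mirror the original path-based check:

{code:java}
// Hypothetical adjustment inside DDLSuite#testSetLocation: compare the restored
// locationUri (converted back to a string) instead of the removed "path" property.
if (isUsingHiveMetastore) {
  assert(storageFormat.locationUri.map(CatalogUtils.URIToString(_)) === expected)
}
{code}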




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.

2018-01-10 Thread Abhishek Soni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321818#comment-16321818
 ] 

Abhishek Soni commented on SPARK-21179:
---

Thanks [~mwalton_mstr], I was only overriding the "getJDBCType" method prior to 
this. Overriding "quoteIdentifier" as well solved the issue.
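
For reference, a minimal sketch of this kind of custom dialect registered through Spark's `JdbcDialects` API; the URL prefix, backtick quoting, and type mapping below are illustrative assumptions, not the exact dialect used here:

{code:java}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

// Illustrative dialect for a HiveServer2 JDBC driver (URL prefix is an assumption).
object HiveJdbcDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")

  // Hive quotes identifiers with backticks rather than the default double quotes.
  override def quoteIdentifier(colName: String): String = s"`$colName`"

  // Optional: map Catalyst types to database types when writing.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("STRING", java.sql.Types.VARCHAR))
    case _ => None
  }
}

// Register the dialect before reading through the JDBC data source.
JdbcDialects.registerDialect(HiveJdbcDialect)
{code}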

> Unable to return Hive INT data type into Spark via Hive JDBC driver:  Caused 
> by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> -
>
> Key: SPARK-21179
> URL: https://issues.apache.org/jira/browse/SPARK-21179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> HDP version 2.5.0.1-60
> Hive version: 1.2.1
> Spark  version 2.0.0.2.5.0.1-60
> JDBC:  Download the latest Hortonworks JDBC driver
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive.  
> Unfortunately, when I try to query data that resides in an INT column I get 
> the following error:  
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> Steps to reproduce:
> 1) On Hive create a simple table with an INT column and insert some data (I 
> used SQuirreL Client with the Hortonworks JDBC driver):
> create table wh2.hivespark (country_id int, country_name string)
> insert into wh2.hivespark values (1, 'USA')
> 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files
> ./spark-shell --jars 
> /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar
> 4) In Spark shell load the data from Hive using the JDBC driver
> val hivespark = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable"
>  -> 
> "wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load()
> 5) In Spark shell try to display the data
> hivespark.show()
> At this point you should see the error:
> scala> hivespark.show()
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
> at 
> com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at 
> com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source)
> at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown 
> Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)

[jira] [Assigned] (SPARK-23038) Update docker/spark-test (JDK/OS)

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23038:


Assignee: Apache Spark

> Update docker/spark-test (JDK/OS)
> -
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> This issue aims to update the following in `docker/spark-test`.
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life 
> of `precise` was April 28, 2017.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23038) Update docker/spark-test (JDK/OS)

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321812#comment-16321812
 ] 

Apache Spark commented on SPARK-23038:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20230

> Update docker/spark-test (JDK/OS)
> -
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to update the following in `docker/spark-test`.
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life 
> of `precise` was April 28, 2017.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23038) Update docker/spark-test (JDK/OS)

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23038:


Assignee: (was: Apache Spark)

> Update docker/spark-test (JDK/OS)
> -
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to update the following in `docker/spark-test`.
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life 
> of `precise` was April 28, 2017.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23038) Update docker/spark-test (JDK/OS)

2018-01-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23038:
--
Summary: Update docker/spark-test (JDK/OS)  (was: Update docker/spark-test)

> Update docker/spark-test (JDK/OS)
> -
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to update the following in `docker/spark-test`.
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life 
> of `precise` was April 28, 2017.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23038) Update docker/spark-test

2018-01-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23038:
--
Description: 
This issue aims to update the following in `docker/spark-test`.
- JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
- Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life of 
`precise` was April 28, 2017.

  was:This PR updates JDK version in `docker/spark-test` because we support 
only JDK8 from Spark 2.2.


> Update docker/spark-test
> 
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This issue aims to update the following in `docker/spark-test`.
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): The end of life 
> of `precise` was April 28, 2017.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23038) Update docker/spark-test

2018-01-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23038:
--
Summary: Update docker/spark-test  (was: Update docker/spark-test to use 
JDK8)

> Update docker/spark-test
> 
>
> Key: SPARK-23038
> URL: https://issues.apache.org/jira/browse/SPARK-23038
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.1
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> This PR updates JDK version in `docker/spark-test` because we support only 
> JDK8 from Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23038) Update docker/spark-test to use JDK8

2018-01-10 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23038:
-

 Summary: Update docker/spark-test to use JDK8
 Key: SPARK-23038
 URL: https://issues.apache.org/jira/browse/SPARK-23038
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.2.1
Reporter: Dongjoon Hyun
Priority: Minor


This PR updates the JDK version in `docker/spark-test` because we support only JDK8 
from Spark 2.2.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23037:


Assignee: Apache Spark

> RFormula should not use deprecated OneHotEncoder and should include 
> VectorSizeHint in pipeline
> --
>
> Key: SPARK-23037
> URL: https://issues.apache.org/jira/browse/SPARK-23037
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23037:


Assignee: (was: Apache Spark)

> RFormula should not use deprecated OneHotEncoder and should include 
> VectorSizeHint in pipeline
> --
>
> Key: SPARK-23037
> URL: https://issues.apache.org/jira/browse/SPARK-23037
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321740#comment-16321740
 ] 

Apache Spark commented on SPARK-23037:
--

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/20229

> RFormula should not use deprecated OneHotEncoder and should include 
> VectorSizeHint in pipeline
> --
>
> Key: SPARK-23037
> URL: https://issues.apache.org/jira/browse/SPARK-23037
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline

2018-01-10 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23037:
---

 Summary: RFormula should not use deprecated OneHotEncoder and 
should include VectorSizeHint in pipeline
 Key: SPARK-23037
 URL: https://issues.apache.org/jira/browse/SPARK-23037
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22921) Merge script should prompt for assigning jiras

2018-01-10 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321669#comment-16321669
 ] 

Saisai Shao commented on SPARK-22921:
-

Hi [~irashid], it looks like the changes will throw an exception when the assignee 
is not yet a contributor. Please see the stack trace below.

{code}
Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 501, in 
main()
  File "./dev/merge_spark_pr.py", line 487, in main
resolve_jira_issues(title, merged_refs, jira_comment)
  File "./dev/merge_spark_pr.py", line 327, in resolve_jira_issues
resolve_jira_issue(merge_branches, comment, jira_id)
  File "./dev/merge_spark_pr.py", line 245, in resolve_jira_issue
cur_assignee = choose_jira_assignee(issue, asf_jira)
  File "./dev/merge_spark_pr.py", line 317, in choose_jira_assignee
asf_jira.assign_issue(issue.key, assignee.key)
  File "/Library/Python/2.7/site-packages/jira/client.py", line 108, in wrapper
result = func(*arg_list, **kwargs)
{code}

> Merge script should prompt for assigning jiras
> --
>
> Key: SPARK-22921
> URL: https://issues.apache.org/jira/browse/SPARK-22921
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 2.3.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Trivial
> Fix For: 2.3.0
>
>
> It's a bit of a nuisance to have to go into JIRA to assign the issue when you 
> merge a PR.  In general you assign it to either the original reporter or a 
> commenter; it would be nice if the merge script made that easy to do.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url

2018-01-10 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao resolved SPARK-22587.
-
Resolution: Fixed

> Spark job fails if fs.defaultFS and application jar are different url
> -
>
> Key: SPARK-22587
> URL: https://issues.apache.org/jira/browse/SPARK-22587
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Assignee: Mingjie Tang
>
> Spark job fails if fs.defaultFS and the URL where the application jar resides are 
> different but have the same scheme, e.g.
> spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> with fs.defaultFS in core-site.xml set to wasb:///YYY. Hadoop listing (hadoop 
> fs -ls) works for both URLs, XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
> wasb://XXX/tmp/test.py, expected: wasb://YYY 
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) 
> at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
>  
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) 
> at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) 
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
>  
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) 
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
>  
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) 
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) 
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) 
> at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
>  
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) 
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) 
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
> {code}
> Client.copyFileToRemote tries to resolve the path of the application jar 
> (XXX) against the FileSystem object created from the fs.defaultFS URL (YYY) instead 
> of the filesystem of the application jar itself:
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> getFileSystem creates the filesystem based on the URL of the path, so this part is 
> fine. But the lines below try to qualify srcPath (XXX URL) against destFs (YYY URL), 
> and that is where it fails:
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)
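
A minimal sketch of the mismatch described above (not the actual patch): qualifying a path against the filesystem derived from its own URL passes Hadoop's checkPath, while qualifying it against the fs.defaultFS filesystem raises the "Wrong FS" error.

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative only: the source path should be qualified with its own filesystem.
val hadoopConf = new Configuration()
val srcPath = new Path("wasb://XXX/tmp/test.py")
val srcFs = srcPath.getFileSystem(hadoopConf)

val ok = srcFs.makeQualified(srcPath)      // same authority, passes checkPath

// destFs below is built from a directory under fs.defaultFS (wasb://YYY), so
// destFs.makeQualified(srcPath) throws IllegalArgumentException: Wrong FS.
val destFs = new Path("wasb://YYY/tmp").getFileSystem(hadoopConf)
// destFs.makeQualified(srcPath)           // would fail as reported above
{code}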



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url

2018-01-10 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-22587:

Fix Version/s: 2.3.0

> Spark job fails if fs.defaultFS and application jar are different url
> -
>
> Key: SPARK-22587
> URL: https://issues.apache.org/jira/browse/SPARK-22587
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Assignee: Mingjie Tang
> Fix For: 2.3.0
>
>
> Spark job fails if fs.defaultFS and the URL where the application jar resides are 
> different but have the same scheme, e.g.
> spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> with fs.defaultFS in core-site.xml set to wasb:///YYY. Hadoop listing (hadoop 
> fs -ls) works for both URLs, XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
> wasb://XXX/tmp/test.py, expected: wasb://YYY 
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) 
> at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
>  
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) 
> at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) 
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
>  
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) 
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
>  
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) 
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) 
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) 
> at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
>  
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) 
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) 
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
> {code}
> Client.copyFileToRemote tries to resolve the path of the application jar 
> (XXX) against the FileSystem object created from the fs.defaultFS URL (YYY) instead 
> of the filesystem of the application jar itself:
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> getFileSystem creates the filesystem based on the URL of the path, so this part is 
> fine. But the lines below try to qualify srcPath (XXX URL) against destFs (YYY URL), 
> and that is where it fails:
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url

2018-01-10 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-22587:
---

Assignee: Mingjie Tang

> Spark job fails if fs.defaultFS and application jar are different url
> -
>
> Key: SPARK-22587
> URL: https://issues.apache.org/jira/browse/SPARK-22587
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.3
>Reporter: Prabhu Joseph
>Assignee: Mingjie Tang
>
> Spark job fails if fs.defaultFS and the URL where the application jar resides are 
> different but have the same scheme, e.g.
> spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> with fs.defaultFS in core-site.xml set to wasb:///YYY. Hadoop listing (hadoop 
> fs -ls) works for both URLs, XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
> wasb://XXX/tmp/test.py, expected: wasb://YYY 
> at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665) 
> at 
> org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
>  
> at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485) 
> at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396) 
> at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
>  
> at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660) 
> at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
>  
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172) 
> at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248) 
> at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307) 
> at org.apache.spark.deploy.yarn.Client.main(Client.scala) 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  
> at java.lang.reflect.Method.invoke(Method.java:498) 
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
>  
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187) 
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212) 
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126) 
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 
> {code}
> Client.copyFileToRemote tries to resolve the path of the application jar 
> (XXX) against the FileSystem object created from the fs.defaultFS URL (YYY) instead 
> of the filesystem of the application jar itself:
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> getFileSystem creates the filesystem based on the URL of the path, so this part is 
> fine. But the lines below try to qualify srcPath (XXX URL) against destFs (YYY URL), 
> and that is where it fails:
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23027) optimizer a simple query using a non-existent data is too slow

2018-01-10 Thread wangminfeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321645#comment-16321645
 ] 

wangminfeng commented on SPARK-23027:
-

I read the doc you gave; it looks useful. I will try Spark 2.1 and then tell 
you about the performance.

> optimizer a simple query using a non-existent data is too slow
> --
>
> Key: SPARK-23027
> URL: https://issues.apache.org/jira/browse/SPARK-23027
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.1
>Reporter: wangminfeng
>
> When I use Spark SQL for ad-hoc queries, I have data partitioned by 
> event_day, event_minute, event_hour. The data is large enough: one day of data is 
> about 3 TB, and we keep 3 months of data.
> But when a query uses a non-existent day, getting the optimizedPlan is too slow.
> I use "sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan" to get the 
> optimized plan, and after five minutes I still cannot get it. The query is simple 
> enough, like:
> SELECT
> event_day
> FROM db.table t1
> WHERE (t1.event_day='20170104' and t1.event_hour='23' and 
> t1.event_minute='55')
> LIMIT 1
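
A minimal sketch of the timing check described above, using the public `queryExecution.optimizedPlan` accessor; the table, columns, and partition values are the reporter's placeholders:

{code:java}
// Time how long it takes to obtain the optimized plan for a query over a
// non-existent partition value (names below are placeholders).
val query =
  """SELECT event_day
    |FROM db.table t1
    |WHERE t1.event_day = '20170104' AND t1.event_hour = '23' AND t1.event_minute = '55'
    |LIMIT 1""".stripMargin

val start = System.nanoTime()
val optimizedPlan = spark.sql(query).queryExecution.optimizedPlan
println(s"optimization took ${(System.nanoTime() - start) / 1e9} s")
{code}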



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23036:


Assignee: Apache Spark

> Add withGlobalTempView for testing and correct some improper with view 
> related method usage
> ---
>
> Key: SPARK-23036
> URL: https://issues.apache.org/jira/browse/SPARK-23036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>Assignee: Apache Spark
>
> Add a withGlobalTempView helper for tests that create global temp views, like the 
> existing withTempView and withView.
> And correct some improper usage such as: 
> {code:java}
>  test("list global temp views") {
> try {
>   sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
>   sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")
>   checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
> Row(globalTempDB, "v1", true) ::
> Row("", "v2", true) :: Nil)
>   
> assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == 
> Seq("v1", "v2"))
> } finally {
>   spark.catalog.dropTempView("v1")
>   spark.catalog.dropGlobalTempView("v2")
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23036:


Assignee: (was: Apache Spark)

> Add withGlobalTempView for testing and correct some improper with view 
> related method usage
> ---
>
> Key: SPARK-23036
> URL: https://issues.apache.org/jira/browse/SPARK-23036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>
> Add a withGlobalTempView helper for tests that create global temp views, like the 
> existing withTempView and withView.
> And correct some improper usage such as: 
> {code:java}
>  test("list global temp views") {
> try {
>   sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
>   sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")
>   checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
> Row(globalTempDB, "v1", true) ::
> Row("", "v2", true) :: Nil)
>   
> assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == 
> Seq("v1", "v2"))
> } finally {
>   spark.catalog.dropTempView("v1")
>   spark.catalog.dropGlobalTempView("v2")
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321638#comment-16321638
 ] 

Apache Spark commented on SPARK-23036:
--

User 'xubo245' has created a pull request for this issue:
https://github.com/apache/spark/pull/20228

> Add withGlobalTempView for testing and correct some improper with view 
> related method usage
> ---
>
> Key: SPARK-23036
> URL: https://issues.apache.org/jira/browse/SPARK-23036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>
> Add a withGlobalTempView helper for tests that create global temp views, like the 
> existing withTempView and withView.
> And correct some improper usage such as: 
> {code:java}
>  test("list global temp views") {
> try {
>   sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
>   sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")
>   checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
> Row(globalTempDB, "v1", true) ::
> Row("", "v2", true) :: Nil)
>   
> assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == 
> Seq("v1", "v2"))
> } finally {
>   spark.catalog.dropTempView("v1")
>   spark.catalog.dropGlobalTempView("v2")
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage

2018-01-10 Thread xubo245 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated SPARK-23036:

Summary: Add withGlobalTempView for testing and correct some improper with 
view related method usage  (was: Add withGlobalTempView for testing)

> Add withGlobalTempView for testing and correct some improper with view 
> related method usage
> ---
>
> Key: SPARK-23036
> URL: https://issues.apache.org/jira/browse/SPARK-23036
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>
> Add a withGlobalTempView helper for tests that create global temp views, like the 
> existing withTempView and withView.
> And correct some improper usage such as: 
> {code:java}
>  test("list global temp views") {
> try {
>   sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
>   sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")
>   checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
> Row(globalTempDB, "v1", true) ::
> Row("", "v2", true) :: Nil)
>   
> assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == 
> Seq("v1", "v2"))
> } finally {
>   spark.catalog.dropTempView("v1")
>   spark.catalog.dropGlobalTempView("v2")
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23036) Add withGlobalTempView for testing

2018-01-10 Thread xubo245 (JIRA)
xubo245 created SPARK-23036:
---

 Summary: Add withGlobalTempView for testing
 Key: SPARK-23036
 URL: https://issues.apache.org/jira/browse/SPARK-23036
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: xubo245


Add a withGlobalTempView helper for tests that create global temp views, like the 
existing withTempView and withView.

And correct some improper usage such as: 
{code:java}
 test("list global temp views") {
try {
  sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
  sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")

  checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
Row(globalTempDB, "v1", true) ::
Row("", "v2", true) :: Nil)

  assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) 
== Seq("v1", "v2"))
} finally {
  spark.catalog.dropTempView("v1")
  spark.catalog.dropGlobalTempView("v2")
}
  }
{code}
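
A minimal sketch of what such a helper could look like, following the pattern of the existing withTempView helper in the SQL test utilities (the exact signature and placement are assumptions):

{code:java}
/**
 * Sketch of a test helper, modeled on the existing withTempView: drops the given
 * global temp views after `f` runs, even if `f` throws.
 */
protected def withGlobalTempView(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
    // dropGlobalTempView returns false (rather than failing) if the view is already gone.
    viewNames.foreach(spark.catalog.dropGlobalTempView)
  }
}
{code}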




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23024) Spark ui about the contents of the form need to have hidden and show features, when the table records very much.

2018-01-10 Thread guoxiaolongzte (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guoxiaolongzte updated SPARK-23024:
---
Attachment: 1.png
2.png

> Spark ui about the contents of the form need to have hidden and show 
> features, when the table records very much. 
> -
>
> Key: SPARK-23024
> URL: https://issues.apache.org/jira/browse/SPARK-23024
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: guoxiaolongzte
>Priority: Minor
> Attachments: 1.png, 2.png
>
>
> Tables in the Spark UI need hide and show controls when they contain many rows. 
> Sometimes you do not care about a table's rows and just want to see the contents of 
> the next table, but you have to scroll for a long time to reach it.
> Currently we have about 500 workers, but I just wanted to see the logs in the 
> running applications table; I had to scroll for a long time before reaching that 
> table.
> In order to ensure functional consistency, I modified the Master Page, Worker 
> Page, Job Page, Stage Page, Task Page, Configuration Page, Storage Page, Pool 
> Page.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23035:


Assignee: Apache Spark

> Fix warning: TEMPORARY TABLE ... USING ... is deprecated  and use 
> TempViewAlreadyExistsException when create temp view
> --
>
> Key: SPARK-23035
> URL: https://issues.apache.org/jira/browse/SPARK-23035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>Assignee: Apache Spark
>
> Fix the warning "TEMPORARY TABLE ... USING ... is deprecated" and use 
> TempViewAlreadyExistsException when creating a temp view.
> There is a warning when running the test test("rename temporary view - destination 
> table with database name"): 
> {code:java}
> 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE 
> TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW 
> ... USING ... instead
> {code}
> Other test cases also produce this warning.
> Another problem: it throws TempTableAlreadyExistsException with the message 
> "Temporary table '$table' already exists" when we create a temp view through 
> org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is 
> improper.
> {code:java}
>  /**
>* Creates a global temp view, or issue an exception if the view already 
> exists and
>* `overrideIfExists` is false.
>*/
>   def create(
>   name: String,
>   viewDefinition: LogicalPlan,
>   overrideIfExists: Boolean): Unit = synchronized {
> if (!overrideIfExists && viewDefinitions.contains(name)) {
>   throw new TempTableAlreadyExistsException(name)
> }
> viewDefinitions.put(name, viewDefinition)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321553#comment-16321553
 ] 

Apache Spark commented on SPARK-23035:
--

User 'xubo245' has created a pull request for this issue:
https://github.com/apache/spark/pull/20227

> Fix warning: TEMPORARY TABLE ... USING ... is deprecated  and use 
> TempViewAlreadyExistsException when create temp view
> --
>
> Key: SPARK-23035
> URL: https://issues.apache.org/jira/browse/SPARK-23035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>
> Fix the warning "TEMPORARY TABLE ... USING ... is deprecated" and use 
> TempViewAlreadyExistsException when creating a temp view.
> There is a warning when running the test test("rename temporary view - destination 
> table with database name"): 
> {code:java}
> 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE 
> TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW 
> ... USING ... instead
> {code}
> Other test cases also produce this warning.
> Another problem: it throws TempTableAlreadyExistsException with the message 
> "Temporary table '$table' already exists" when we create a temp view through 
> org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is 
> improper.
> {code:java}
>  /**
>* Creates a global temp view, or issue an exception if the view already 
> exists and
>* `overrideIfExists` is false.
>*/
>   def create(
>   name: String,
>   viewDefinition: LogicalPlan,
>   overrideIfExists: Boolean): Unit = synchronized {
> if (!overrideIfExists && viewDefinitions.contains(name)) {
>   throw new TempTableAlreadyExistsException(name)
> }
> viewDefinitions.put(name, viewDefinition)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23035:


Assignee: (was: Apache Spark)

> Fix warning: TEMPORARY TABLE ... USING ... is deprecated  and use 
> TempViewAlreadyExistsException when create temp view
> --
>
> Key: SPARK-23035
> URL: https://issues.apache.org/jira/browse/SPARK-23035
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>
> Fix the warning "TEMPORARY TABLE ... USING ... is deprecated" and use 
> TempViewAlreadyExistsException when creating a temp view.
> There is a warning when running the test test("rename temporary view - destination 
> table with database name"): 
> {code:java}
> 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE 
> TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW 
> ... USING ... instead
> {code}
> Other test cases also produce this warning.
> Another problem: it throws TempTableAlreadyExistsException with the message 
> "Temporary table '$table' already exists" when we create a temp view through 
> org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is 
> improper.
> {code:java}
>  /**
>* Creates a global temp view, or issue an exception if the view already 
> exists and
>* `overrideIfExists` is false.
>*/
>   def create(
>   name: String,
>   viewDefinition: LogicalPlan,
>   overrideIfExists: Boolean): Unit = synchronized {
> if (!overrideIfExists && viewDefinitions.contains(name)) {
>   throw new TempTableAlreadyExistsException(name)
> }
> viewDefinitions.put(name, viewDefinition)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view

2018-01-10 Thread xubo245 (JIRA)
xubo245 created SPARK-23035:
---

 Summary: Fix warning: TEMPORARY TABLE ... USING ... is deprecated  
and use TempViewAlreadyExistsException when create temp view
 Key: SPARK-23035
 URL: https://issues.apache.org/jira/browse/SPARK-23035
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.1
Reporter: xubo245


Fix the warning "TEMPORARY TABLE ... USING ... is deprecated" and use 
TempViewAlreadyExistsException when creating a temp view.

There is a warning when running the test test("rename temporary view - destination 
table with database name"): 

{code:java}
02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE 
TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW 
... USING ... instead
{code}

Other test cases also produce this warning.

Another problem: it throws TempTableAlreadyExistsException with the message "Temporary 
table '$table' already exists" when we create a temp view through 
org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, which is improper.

{code:java}
/**
 * Creates a global temp view, or issue an exception if the view already exists and
 * `overrideIfExists` is false.
 */
def create(
    name: String,
    viewDefinition: LogicalPlan,
    overrideIfExists: Boolean): Unit = synchronized {
  if (!overrideIfExists && viewDefinitions.contains(name)) {
    throw new TempTableAlreadyExistsException(name)
  }
  viewDefinitions.put(name, viewDefinition)
}
{code}
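
A minimal sketch of the direction proposed in the summary: a view-specific exception mirroring how TempTableAlreadyExistsException is defined inside Spark's catalyst analysis package (the message wording is an assumption):

{code:java}
// Sketch only: a view-specific counterpart to TempTableAlreadyExistsException,
// so GlobalTempViewManager#create no longer reports a "Temporary table" message.
class TempViewAlreadyExistsException(view: String)
  extends AnalysisException(s"Temporary view '$view' already exists")

// GlobalTempViewManager#create would then throw the view-specific exception:
//   throw new TempViewAlreadyExistsException(name)
{code}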




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23034) Display tablename for `HiveTableScan` node in UI

2018-01-10 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321529#comment-16321529
 ] 

Tejas Patil commented on SPARK-23034:
-

[~dongjoon] recommended that the scope of this JIRA be extended to also handle 
scans that go through codegen. As pointed out in the JIRA description, for Spark 
native tables the table scan node is abstracted out as a `WholeStageCodegen` node 
in the DAG. I suspect some people might object: a codegen node might be doing more 
things besides the table scan, so it is not good to label it `tablescan-tablename`, 
unlike `HiveTableScan`, where we know it does nothing besides the table scan.

We would like to get feedback from the community about this.

> Display tablename for `HiveTableScan` node in UI
> 
>
> Key: SPARK-23034
> URL: https://issues.apache.org/jira/browse/SPARK-23034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.2.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> For queries which scan multiple tables, it will be convenient if the DAG 
> shown in Spark UI also showed which table is being scanned. This will make 
> debugging easier. For this JIRA, I am scoping those for hive table scans 
> only. For Spark native tables, the table scan node is abstracted out as a 
> `WholeStageCodegen` node (which might be doing more things besides table 
> scan).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23027) optimizer a simple query using a non-existent data is too slow

2018-01-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321487#comment-16321487
 ] 

Takeshi Yamamuro commented on SPARK-23027:
--

Have you tried v2.1? See 
https://databricks.com/blog/2016/12/15/scalable-partition-handling-for-cloud-native-architecture-in-apache-spark-2-1.html
Either way, you should first ask questions on the spark-user mailing list.

> optimizer a simple query using a non-existent data is too slow
> --
>
> Key: SPARK-23027
> URL: https://issues.apache.org/jira/browse/SPARK-23027
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.1
>Reporter: wangminfeng
>
> When I use Spark SQL for ad-hoc queries, I have data partitioned by 
> event_day, event_minute, event_hour. The data is large enough: one day of data is 
> about 3 TB, and we keep 3 months of data.
> But when a query uses a non-existent day, getting the optimizedPlan is too slow.
> I use "sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan" to get the 
> optimized plan, and after five minutes I still cannot get it. The query is simple 
> enough, like:
> SELECT
> event_day
> FROM db.table t1
> WHERE (t1.event_day='20170104' and t1.event_hour='23' and 
> t1.event_minute='55')
> LIMIT 1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23034) Display tablename for `HiveTableScan` node in UI

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23034:


Assignee: (was: Apache Spark)

> Display tablename for `HiveTableScan` node in UI
> 
>
> Key: SPARK-23034
> URL: https://issues.apache.org/jira/browse/SPARK-23034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.2.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> For queries which scan multiple tables, it will be convenient if the DAG 
> shown in Spark UI also showed which table is being scanned. This will make 
> debugging easier. For this JIRA, I am scoping those for hive table scans 
> only. For Spark native tables, the table scan node is abstracted out as a 
> `WholeStageCodegen` node (which might be doing more things besides table 
> scan).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23034) Display tablename for `HiveTableScan` node in UI

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23034:


Assignee: Apache Spark

> Display tablename for `HiveTableScan` node in UI
> 
>
> Key: SPARK-23034
> URL: https://issues.apache.org/jira/browse/SPARK-23034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.2.1
>Reporter: Tejas Patil
>Assignee: Apache Spark
>Priority: Trivial
>
> For queries which scan multiple tables, it will be convenient if the DAG 
> shown in Spark UI also showed which table is being scanned. This will make 
> debugging easier. For this JIRA, I am scoping those for hive table scans 
> only. For Spark native tables, the table scan node is abstracted out as a 
> `WholeStageCodegen` node (which might be doing more things besides table 
> scan).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23034) Display tablename for `HiveTableScan` node in UI

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321419#comment-16321419
 ] 

Apache Spark commented on SPARK-23034:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/20226

> Display tablename for `HiveTableScan` node in UI
> 
>
> Key: SPARK-23034
> URL: https://issues.apache.org/jira/browse/SPARK-23034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.2.1
>Reporter: Tejas Patil
>Priority: Trivial
>
> For queries which scan multiple tables, it will be convenient if the DAG 
> shown in Spark UI also showed which table is being scanned. This will make 
> debugging easier. For this JIRA, I am scoping those for hive table scans 
> only. For Spark native tables, the table scan node is abstracted out as a 
> `WholeStageCodegen` node (which might be doing more things besides table 
> scan).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23034) Display tablename for `HiveTableScan` node in UI

2018-01-10 Thread Tejas Patil (JIRA)
Tejas Patil created SPARK-23034:
---

 Summary: Display tablename for `HiveTableScan` node in UI
 Key: SPARK-23034
 URL: https://issues.apache.org/jira/browse/SPARK-23034
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 2.2.1
Reporter: Tejas Patil
Priority: Trivial


For queries which scan multiple tables, it will be convenient if the DAG shown 
in Spark UI also showed which table is being scanned. This will make debugging 
easier. For this JIRA, I am scoping those for hive table scans only. For Spark 
native tables, the table scan node is abstracted out as a `WholeStageCodegen` 
node (which might be doing more things besides table scan).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23033) disable task-level retry for continuous execution

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23033:


Assignee: Apache Spark

> disable task-level retry for continuous execution
> -
>
> Key: SPARK-23033
> URL: https://issues.apache.org/jira/browse/SPARK-23033
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>
> Continuous execution tasks shouldn't retry independently, but rather use the 
> global restart mechanism. Retrying will execute from the initial offset at 
> the time the task was created, which is very bad.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-10 Thread Kris Mok (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kris Mok updated SPARK-23032:
-
Description: 
Proposing to add a per-query ID to the codegen stages as represented by 
{{WholeStageCodegenExec}} operators. This ID will be used in
* the explain output of the physical plan, and in
* the generated class name.

Specifically, this ID will be stable within a query, counting up from 1 in 
depth-first post-order for all the {{WholeStageCodegenExec}} operators inserted 
into a plan.
The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
objects, which may have been created for one-off purposes, e.g. for fallback 
handling of codegen stages that failed to codegen the whole stage and wish to 
codegen only a subset of the child operators.

Example: for the following query:
{code:none}
scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)

scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
'y).orderBy('x).select('x + 1 as 'z, 'y)
df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]

scala> val df2 = spark.range(5)
df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val query = df1.join(df2, 'z === 'id)
query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field]
{code}

The explain output before the change is:
{code:none}
scala> query.explain
== Physical Plan ==
*SortMergeJoin [z#9L], [id#13L], Inner
:- *Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *Project [(x#3L + 1) AS z#9L, y#4L]
:+- *Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *Range (0, 10, step=1, splits=8)
+- *Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *Range (0, 5, step=1, splits=8)
{code}
Note how codegen'd operators are annotated with a prefix {{"*"}}.

and after this change it'll be:
{code:none}
scala> query.explain
== Physical Plan ==
*(6) SortMergeJoin [z#9L], [id#13L], Inner
:- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
:+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *(1) Range (0, 10, step=1, splits=8)
+- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *(4) Range (0, 5, step=1, splits=8)
{code}
Note that the annotated prefix becomes {{"*(id) "}}.

It'll also show up in the name of the generated class, which now carries the 
codegen stage ID in the format of
{code:none}
GeneratedClass$GeneratedIteratorForCodegenStage<id>
{code}

For example, note how {{GeneratedClass$GeneratedIteratorForCodegenStage3}} and 
{{GeneratedClass$GeneratedIteratorForCodegenStage6}} in the following stack 
trace correspond to the IDs shown in the explain output above:
{code:none}
"Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 nid=NA 
runnable
  java.lang.Thread.State: RUNNABLE
  at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
  at 
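To make the numbering rule concrete, here is a minimal standalone sketch of the 
depth-first post-order ID assignment described above; the node types are toy 
stand-ins, not Spark's plan classes.

{code}
// Minimal sketch of the numbering rule: children are visited before their
// parent, and each codegen-stage node gets the next ID, counting up from 1.
sealed trait Node { def children: Seq[Node] }
final case class CodegenStage(children: Seq[Node], var codegenStageId: Int = 0) extends Node
final case class Plain(children: Seq[Node]) extends Node

def assignCodegenStageIds(root: Node): Unit = {
  var nextId = 0
  def visit(node: Node): Unit = {
    node.children.foreach(visit)        // depth-first post-order
    node match {
      case stage: CodegenStage =>
        nextId += 1
        stage.codegenStageId = nextId   // IDs start at 1; 0 stays for free-floating stages
      case _ => ()
    }
  }
  visit(root)
}
{code}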

[jira] [Assigned] (SPARK-23033) disable task-level retry for continuous execution

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23033:


Assignee: (was: Apache Spark)

> disable task-level retry for continuous execution
> -
>
> Key: SPARK-23033
> URL: https://issues.apache.org/jira/browse/SPARK-23033
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Continuous execution tasks shouldn't retry independently, but rather use the 
> global restart mechanism. Retrying will execute from the initial offset at 
> the time the task was created, which is very bad.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23033) disable task-level retry for continuous execution

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321404#comment-16321404
 ] 

Apache Spark commented on SPARK-23033:
--

User 'jose-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/20225

> disable task-level retry for continuous execution
> -
>
> Key: SPARK-23033
> URL: https://issues.apache.org/jira/browse/SPARK-23033
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Jose Torres
>
> Continuous execution tasks shouldn't retry independently, but rather use the 
> global restart mechanism. Retrying will execute from the initial offset at 
> the time the task was created, which is very bad.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23033) disable task-level retry for continuous execution

2018-01-10 Thread Jose Torres (JIRA)
Jose Torres created SPARK-23033:
---

 Summary: disable task-level retry for continuous execution
 Key: SPARK-23033
 URL: https://issues.apache.org/jira/browse/SPARK-23033
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Jose Torres


Continuous execution tasks shouldn't retry independently, but rather use the 
global restart mechanism. Retrying will execute from the initial offset at the 
time the task was created, which is very bad.
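A minimal sketch of how this could be enforced (an assumption for illustration, 
not necessarily the merged change): detect a retried attempt via 
TaskContext.attemptNumber and fail fast, so the query-level restart mechanism 
takes over instead of replaying from a stale start offset.

{code}
import org.apache.spark.TaskContext

// Refuse to run a retried continuous-processing task; any non-zero attempt
// number means the task is being retried individually.
def failIfRetriedAttempt(): Unit = {
  val ctx = TaskContext.get()
  if (ctx != null && ctx.attemptNumber() != 0) {
    throw new IllegalStateException(
      "Continuous execution does not support task-level retries; " +
        "restart the query instead.")
  }
}
{code}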



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321398#comment-16321398
 ] 

Apache Spark commented on SPARK-23032:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/20224

> Add a per-query codegenStageId to WholeStageCodegenExec
> ---
>
> Key: SPARK-23032
> URL: https://issues.apache.org/jira/browse/SPARK-23032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>
> Proposing to add a per-query ID to the codegen stages as represented by 
> {{WholeStageCodegenExec}} operators. This ID will be used in
> * the explain output of the physical plan, and in
> * the generated class name.
> Specifically, this ID will be stable within a query, counting up from 1 in 
> depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a 
> plan.
> The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
> objects, which may have been created for one-off purposes, e.g. for fallback 
> handling of codegen stages that failed to codegen the whole stage and wishes 
> to codegen a subset of the children operators.
> Example: for the following query:
> {code:none}
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
> scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
> 'y).orderBy('x).select('x + 1 as 'z, 'y)
> df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]
> scala> val df2 = spark.range(5)
> df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val query = df1.join(df2, 'z === 'id)
> query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more 
> field]
> {code}
> The explain output before the change is:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *SortMergeJoin [z#9L], [id#13L], Inner
> :- *Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *Range (0, 10, step=1, splits=8)
> +- *Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *Range (0, 5, step=1, splits=8)
> {code}
> Note how codegen'd operators are annotated with a prefix {{"*"}}.
> and after this change it'll be:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *(6) SortMergeJoin [z#9L], [id#13L], Inner
> :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *(1) Range (0, 10, step=1, splits=8)
> +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *(4) Range (0, 5, step=1, splits=8)
> {code}
> Note that the annotated prefix becomes {{"*(id) "}}
> It'll also show up in the name of the generated class, as a suffix in the 
> format of
> {code:none}
> GeneratedClass$GeneratedIterator$id
> {code}
> for example, note how {{GeneratedClass$GeneratedIterator$3}} and 
> {{GeneratedClass$GeneratedIterator$6}} in the following stack trace 
> corresponds to the IDs shown in the explain output above:
> {code:none}
> "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 
> nid=NA runnable
>   java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.sort_addToSorter$(generated.java:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.processNext(generated.java:41)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.findNextInnerJoinRows$(generated.java:42)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.processNext(generated.java:101)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
> 

[jira] [Assigned] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23032:


Assignee: (was: Apache Spark)

> Add a per-query codegenStageId to WholeStageCodegenExec
> ---
>
> Key: SPARK-23032
> URL: https://issues.apache.org/jira/browse/SPARK-23032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>
> Proposing to add a per-query ID to the codegen stages as represented by 
> {{WholeStageCodegenExec}} operators. This ID will be used in
> * the explain output of the physical plan, and in
> * the generated class name.
> Specifically, this ID will be stable within a query, counting up from 1 in 
> depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a 
> plan.
> The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
> objects, which may have been created for one-off purposes, e.g. for fallback 
> handling of codegen stages that failed to codegen the whole stage and wishes 
> to codegen a subset of the children operators.
> Example: for the following query:
> {code:none}
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
> scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
> 'y).orderBy('x).select('x + 1 as 'z, 'y)
> df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]
> scala> val df2 = spark.range(5)
> df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val query = df1.join(df2, 'z === 'id)
> query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more 
> field]
> {code}
> The explain output before the change is:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *SortMergeJoin [z#9L], [id#13L], Inner
> :- *Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *Range (0, 10, step=1, splits=8)
> +- *Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *Range (0, 5, step=1, splits=8)
> {code}
> Note how codegen'd operators are annotated with a prefix {{"*"}}.
> and after this change it'll be:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *(6) SortMergeJoin [z#9L], [id#13L], Inner
> :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *(1) Range (0, 10, step=1, splits=8)
> +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *(4) Range (0, 5, step=1, splits=8)
> {code}
> Note that the annotated prefix becomes {{"*(id) "}}
> It'll also show up in the name of the generated class, as a suffix in the 
> format of
> {code:none}
> GeneratedClass$GeneratedIterator$id
> {code}
> for example, note how {{GeneratedClass$GeneratedIterator$3}} and 
> {{GeneratedClass$GeneratedIterator$6}} in the following stack trace 
> corresponds to the IDs shown in the explain output above:
> {code:none}
> "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 
> nid=NA runnable
>   java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.sort_addToSorter$(generated.java:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.processNext(generated.java:41)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.findNextInnerJoinRows$(generated.java:42)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.processNext(generated.java:101)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
> at 
> 

[jira] [Assigned] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23032:


Assignee: Apache Spark

> Add a per-query codegenStageId to WholeStageCodegenExec
> ---
>
> Key: SPARK-23032
> URL: https://issues.apache.org/jira/browse/SPARK-23032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Kris Mok
>Assignee: Apache Spark
>
> Proposing to add a per-query ID to the codegen stages as represented by 
> {{WholeStageCodegenExec}} operators. This ID will be used in
> * the explain output of the physical plan, and in
> * the generated class name.
> Specifically, this ID will be stable within a query, counting up from 1 in 
> depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a 
> plan.
> The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
> objects, which may have been created for one-off purposes, e.g. for fallback 
> handling of codegen stages that failed to codegen the whole stage and wishes 
> to codegen a subset of the children operators.
> Example: for the following query:
> {code:none}
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
> scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
> 'y).orderBy('x).select('x + 1 as 'z, 'y)
> df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]
> scala> val df2 = spark.range(5)
> df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val query = df1.join(df2, 'z === 'id)
> query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more 
> field]
> {code}
> The explain output before the change is:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *SortMergeJoin [z#9L], [id#13L], Inner
> :- *Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *Range (0, 10, step=1, splits=8)
> +- *Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *Range (0, 5, step=1, splits=8)
> {code}
> Note how codegen'd operators are annotated with a prefix {{"*"}}.
> and after this change it'll be:
> {code:none}
> scala> query.explain
> == Physical Plan ==
> *(6) SortMergeJoin [z#9L], [id#13L], Inner
> :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(z#9L, 200)
> : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
> :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
> :   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
> :  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
> : +- *(1) Range (0, 10, step=1, splits=8)
> +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
>+- Exchange hashpartitioning(id#13L, 200)
>   +- *(4) Range (0, 5, step=1, splits=8)
> {code}
> Note that the annotated prefix becomes {{"*(id) "}}
> It'll also show up in the name of the generated class, as a suffix in the 
> format of
> {code:none}
> GeneratedClass$GeneratedIterator$id
> {code}
> for example, note how {{GeneratedClass$GeneratedIterator$3}} and 
> {{GeneratedClass$GeneratedIterator$6}} in the following stack trace 
> corresponds to the IDs shown in the explain output above:
> {code:none}
> "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 
> nid=NA runnable
>   java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.sort_addToSorter$(generated.java:32)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.processNext(generated.java:41)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.findNextInnerJoinRows$(generated.java:42)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.processNext(generated.java:101)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
> at 
> 

[jira] [Created] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec

2018-01-10 Thread Kris Mok (JIRA)
Kris Mok created SPARK-23032:


 Summary: Add a per-query codegenStageId to WholeStageCodegenExec
 Key: SPARK-23032
 URL: https://issues.apache.org/jira/browse/SPARK-23032
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Kris Mok


Proposing to add a per-query ID to the codegen stages as represented by 
{{WholeStageCodegenExec}} operators. This ID will be used in
* the explain output of the physical plan, and in
* the generated class name.

Specifically, this ID will be stable within a query, counting up from 1 in 
depth-first post-order for all the {{WholeStageCodegenExec}} operators inserted 
into a plan.
The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} 
objects, which may have been created for one-off purposes, e.g. for fallback 
handling of codegen stages that failed to codegen the whole stage and wish to 
codegen only a subset of the child operators.

Example: for the following query:
{code:none}
scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)

scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 
'y).orderBy('x).select('x + 1 as 'z, 'y)
df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint]

scala> val df2 = spark.range(5)
df2: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val query = df1.join(df2, 'z === 'id)
query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field]
{code}

The explain output before the change is:
{code:none}
scala> query.explain
== Physical Plan ==
*SortMergeJoin [z#9L], [id#13L], Inner
:- *Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *Project [(x#3L + 1) AS z#9L, y#4L]
:+- *Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *Range (0, 10, step=1, splits=8)
+- *Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *Range (0, 5, step=1, splits=8)
{code}
Note how codegen'd operators are annotated with a prefix {{"*"}}.

and after this change it'll be:
{code:none}
scala> query.explain
== Physical Plan ==
*(6) SortMergeJoin [z#9L], [id#13L], Inner
:- *(3) Sort [z#9L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(z#9L, 200)
: +- *(2) Project [(x#3L + 1) AS z#9L, y#4L]
:+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0
:   +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200)
:  +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L]
: +- *(1) Range (0, 10, step=1, splits=8)
+- *(5) Sort [id#13L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#13L, 200)
  +- *(4) Range (0, 5, step=1, splits=8)
{code}
Note that the annotated prefix becomes {{"*(id) "}}.

It'll also show up in the name of the generated class, as a suffix in the 
format of
{code:none}
GeneratedClass$GeneratedIterator$id
{code}

For example, note how {{GeneratedClass$GeneratedIterator$3}} and 
{{GeneratedClass$GeneratedIterator$6}} in the following stack trace correspond 
to the IDs shown in the explain output above:
{code:none}
"Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 nid=NA 
runnable
  java.lang.Thread.State: RUNNABLE
  at 
org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.sort_addToSorter$(generated.java:32)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.processNext(generated.java:41)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.findNextInnerJoinRows$(generated.java:42)
  at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.processNext(generated.java:101)
  at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
  at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828)
  at 

[jira] [Resolved] (SPARK-22989) sparkstreaming ui show 0 records when spark-streaming-kafka application restore from checkpoint

2018-01-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-22989.
--
Resolution: Duplicate

> sparkstreaming ui show 0 records when spark-streaming-kafka application 
> restore from checkpoint 
> 
>
> Key: SPARK-22989
> URL: https://issues.apache.org/jira/browse/SPARK-22989
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: zhaoshijie
>Priority: Minor
>
> When a spark-streaming-kafka application restores from a checkpoint, I find 
> that the Spark Streaming UI shows 0 records for each batch.
> !https://raw.githubusercontent.com/smdfj/picture/master/spark/batch.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22991) High read latency with spark streaming 2.2.1 and kafka 0.10.0.1

2018-01-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22991:
-
Component/s: (was: Structured Streaming)
 (was: Spark Core)
 DStreams

> High read latency with spark streaming 2.2.1 and kafka 0.10.0.1
> ---
>
> Key: SPARK-22991
> URL: https://issues.apache.org/jira/browse/SPARK-22991
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.2.1
>Reporter: Kiran Shivappa Japannavar
>Priority: Critical
>
> Spark 2.2.1 + Kafka 0.10 + Spark Streaming.
> Batch duration is 1 s, max rate per partition is 500, poll interval is 120 
> seconds, max poll records is 500, the number of partitions in Kafka is 500, 
> and the cached consumer is enabled.
> While trying to read data from Kafka we intermittently observe very high read 
> latencies. The high latencies result in Kafka consumer session expiration, and 
> hence the Kafka brokers remove the consumer from the group. The consumer keeps 
> retrying and finally fails with:
> [org.apache.kafka.clients.NetworkClient] - Disconnecting from node 12 due to 
> request timeout
> [org.apache.kafka.clients.NetworkClient] - Cancelled request ClientRequest
> [org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient] - 
> Cancelled FETCH request ClientRequest.
> Due to this a lot of batches end up in the queued state.
> The high read latencies occur whenever multiple clients try to read data from 
> the same Kafka cluster in parallel. The Kafka cluster has a large number of 
> brokers and can support high network bandwidth.
> When running Spark 1.5 and the Kafka 0.8 consumer client against the same 
> Kafka cluster we do not see any read latencies.
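For reference, the setup described above would look roughly like the following 
sketch (the configuration keys are the usual DStreams + kafka-0-10 ones and are 
assumptions to be checked against the actual deployment):

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of the reported setup: 1 s batches, 500 records/s/partition cap,
// 120 s poll timeout, 500 max poll records, cached consumers enabled.
val conf = new SparkConf()
  .setAppName("kafka-read-latency")
  .set("spark.streaming.kafka.maxRatePerPartition", "500")
  .set("spark.streaming.kafka.consumer.cache.enabled", "true")
  .set("spark.streaming.kafka.consumer.poll.ms", "120000")
val ssc = new StreamingContext(conf, Seconds(1))

// Kafka consumer-side setting passed to the direct stream:
val kafkaParams = Map[String, Object]("max.poll.records" -> "500")
{code}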



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22975) MetricsReporter producing NullPointerException when there was no progress reported

2018-01-10 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-22975:
-
Component/s: (was: SQL)
 Structured Streaming

> MetricsReporter producing NullPointerException when there was no progress 
> reported
> --
>
> Key: SPARK-22975
> URL: https://issues.apache.org/jira/browse/SPARK-22975
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Yuriy Bondaruk
>
> The exception occurs in MetricsReporter when it tries to register gauges 
> using lastProgress of each stream.
> {code:java}
>   registerGauge("inputRate-total", () => 
> stream.lastProgress.inputRowsPerSecond)
>   registerGauge("processingRate-total", () => 
> stream.lastProgress.inputRowsPerSecond)
>   registerGauge("latency", () => 
> stream.lastProgress.durationMs.get("triggerExecution").longValue())
> {code}
> If a stream hasn't reported any progress yet, the following exception 
> occurs:
> {noformat}
> 18/01/05 17:45:57 ERROR ScheduledReporter: RuntimeException thrown from 
> CloudwatchReporter#report. Exception was suppressed.
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.execution.streaming.MetricsReporter$$anonfun$1.apply$mcD$sp(MetricsReporter.scala:42)
>   at 
> org.apache.spark.sql.execution.streaming.MetricsReporter$$anonfun$1.apply(MetricsReporter.scala:42)
>   at 
> org.apache.spark.sql.execution.streaming.MetricsReporter$$anonfun$1.apply(MetricsReporter.scala:42)
>   at 
> org.apache.spark.sql.execution.streaming.MetricsReporter$$anon$1.getValue(MetricsReporter.scala:49)
>   at 
> amazon.nexus.spark.metrics.cloudwatch.CloudwatchReporter.lambda$createNumericGaugeMetricDatumStream$0(CloudwatchReporter.java:146)
>   at 
> java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174)
>   at 
> java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet.lambda$entryConsumer$0(Collections.java:1575)
>   at 
> java.util.TreeMap$EntrySpliterator.forEachRemaining(TreeMap.java:2969)
>   at 
> java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet$UnmodifiableEntrySetSpliterator.forEachRemaining(Collections.java:1600)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>   at 
> java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:312)
>   at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
>   at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
>   at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
>   at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
>   at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
>   at 
> java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:510)
>   at 
> amazon.nexus.spark.metrics.cloudwatch.CloudwatchReporter.partitionIntoSublists(CloudwatchReporter.java:390)
>   at 
> amazon.nexus.spark.metrics.cloudwatch.CloudwatchReporter.report(CloudwatchReporter.java:137)
>   at 
> com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162)
>   at 
> com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
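A minimal sketch of the kind of guard that avoids the NPE (an illustration, not 
the actual fix that was merged): lastProgress is null until the first trigger 
completes, so treat a missing progress as a default value instead of 
dereferencing it.

{code}
import org.apache.spark.sql.streaming.StreamingQuery

// Wrap the possibly-null lastProgress in an Option and fall back to 0.0.
def inputRateOrZero(query: StreamingQuery): Double =
  Option(query.lastProgress).map(_.inputRowsPerSecond).getOrElse(0.0)
{code}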



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Resolved] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-22951.

   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 20174
[https://github.com/apache/spark/pull/20174]

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>Assignee: Feng Liu
>  Labels: correctness
> Fix For: 2.3.0
>
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2018-01-10 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321234#comment-16321234
 ] 

Fernando Pereira edited comment on SPARK-17998 at 1/10/18 10:20 PM:


It says spark.sql.files.maxPartitionBytes in this very JIRA ticket. Try it 
yourself. spark.files.maxPartitionBytes doesn't work, at least not for 2.2.1.


was (Author: ferdonline):
It says spark.sql.files.maxPartitionBytes in this very Jira ticket. Try 
yourself. spark.files.maxPartitionBytes doesnt work

> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>Reporter: Shea Parkes
>
> Reading a parquet ~file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven inadequate.  I'd be happy to dig 
> further if someone could point me in the right direction...
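For anyone hitting this, a sketch of the workaround discussed in the comments, 
assuming the SQL reader path where spark.sql.files.maxPartitionBytes controls 
how input files are packed into read partitions (the path below is 
illustrative):

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("partition-count-demo").getOrCreate()

// Lower the per-partition byte target (default 128 MB) so small Parquet part
// files are no longer coalesced into just a handful of read partitions.
spark.conf.set("spark.sql.files.maxPartitionBytes", (8 * 1024 * 1024).toString)

val df = spark.read.parquet("/path/to/scrap.parquet")
println(df.rdd.getNumPartitions)
{code}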



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2018-01-10 Thread Fernando Pereira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321234#comment-16321234
 ] 

Fernando Pereira commented on SPARK-17998:
--

It says spark.sql.files.maxPartitionBytes in this very JIRA ticket. Try it 
yourself. spark.files.maxPartitionBytes doesn't work.

> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>Reporter: Shea Parkes
>
> Reading a parquet ~file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven inadequate.  I'd be happy to dig 
> further if someone could point me in the right direction...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23031) Merge script should allow arbitrary assignees

2018-01-10 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-23031:
--

 Summary: Merge script should allow arbitrary assignees
 Key: SPARK-23031
 URL: https://issues.apache.org/jira/browse/SPARK-23031
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Minor


SPARK-22921 added a prompt in the merge script to ask for the assignee of the 
bug being resolved. But it only provides you with a list containing the 
reporter and commenters (plus the "Apache Spark" bot).

Sometimes the author of the PR is not on that list, and right now there doesn't 
seem to be a way to enter an arbitrary JIRA user name there.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23020:


Assignee: Apache Spark

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16321087#comment-16321087
 ] 

Apache Spark commented on SPARK-23020:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20223

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23020:


Assignee: (was: Apache Spark)

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-01-10 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320942#comment-16320942
 ] 

Bryan Cutler commented on SPARK-23030:
--

I'm looking into this and will submit a WIP PR if I see an improvement.

> Decrease memory consumption with toPandas() collection using Arrow
> --
>
> Key: SPARK-23030
> URL: https://issues.apache.org/jira/browse/SPARK-23030
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> Currently with Arrow enabled, calling {{toPandas()}} results in a collection 
> of all partitions in the JVM in the form of batches of Arrow file format.  
> Once collected in the JVM, they are served to the Python driver process. 
> I believe using the Arrow stream format can help to optimize this and reduce 
> memory consumption in the JVM by only loading one record batch at a time 
> before sending it to Python.  This might also reduce the latency between 
> making the initial call in Python and receiving the first batch of records.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow

2018-01-10 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-23030:


 Summary: Decrease memory consumption with toPandas() collection 
using Arrow
 Key: SPARK-23030
 URL: https://issues.apache.org/jira/browse/SPARK-23030
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 2.3.0
Reporter: Bryan Cutler


Currently with Arrow enabled, calling {{toPandas()}} results in a collection of 
all partitions in the JVM in the form of batches of Arrow file format.  Once 
collected in the JVM, they are served to the Python driver process. 

I believe using the Arrow stream format can help to optimize this and reduce 
memory consumption in the JVM by only loading one record batch at a time before 
sending it to Python.  This might also reduce the latency between making the 
initial call in Python and receiving the first batch of records.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters

2018-01-10 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-21187:
-
Description: 
This is to track adding the remaining type support in Arrow Converters.  
Currently, only primitive data types are supported.

Remaining types:

* -*Date*-
* -*Timestamp*-
* *Complex*: Struct, -Array-, Map
* -*Decimal*-


Some things to do before closing this out:

* -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)-
* Need to add some user docs
* -Make sure Python tests are thorough-
* Check into complex type support mentioned in comments by [~leif], should we 
support multi-indexing?

  was:
This is to track adding the remaining type support in Arrow Converters.  
Currently, only primitive data types are supported.  '

Remaining types:

* -*Date*-
* -*Timestamp*-
* *Complex*: Struct, Array, Map
* *Decimal*


Some things to do before closing this out:

* Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
values as BigDecimal)
* Need to add some user docs
* Make sure Python tests are thorough
* Check into complex type support mentioned in comments by [~leif], should we 
support mulit-indexing?


> Complete support for remaining Spark data types in Arrow Converters
> ---
>
> Key: SPARK-21187
> URL: https://issues.apache.org/jira/browse/SPARK-21187
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 2.3.0
>Reporter: Bryan Cutler
>
> This is to track adding the remaining type support in Arrow Converters.  
> Currently, only primitive data types are supported.
> Remaining types:
> * -*Date*-
> * -*Timestamp*-
> * *Complex*: Struct, -Array-, Map
> * -*Decimal*-
> Some things to do before closing this out:
> * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write 
> values as BigDecimal)-
> * Need to add some user docs
> * -Make sure Python tests are thorough-
> * Check into complex type support mentioned in comments by [~leif], should we 
> support multi-indexing?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16060) Vectorized Orc reader

2018-01-10 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-16060:
--
Affects Version/s: 1.6.3
   2.0.2
   2.1.2
   2.2.1

> Vectorized Orc reader
> -
>
> Key: SPARK-16060
> URL: https://issues.apache.org/jira/browse/SPARK-16060
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1
>Reporter: Liang-Chi Hsieh
>Assignee: Dongjoon Hyun
>  Labels: release-notes
> Fix For: 2.3.0
>
>
> Currently Orc reader in Spark SQL doesn't support vectorized reading. As Hive 
> Orc already support vectorization, we should add this support to improve Orc 
> reading performance.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian reassigned SPARK-22951:
--

Assignee: Feng Liu

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>Assignee: Feng Liu
>  Labels: correctness
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23029) Setting spark.shuffle.file.buffer will make the shuffle fail

2018-01-10 Thread Fernando Pereira (JIRA)
Fernando Pereira created SPARK-23029:


 Summary: Setting spark.shuffle.file.buffer will make the shuffle 
fail
 Key: SPARK-23029
 URL: https://issues.apache.org/jira/browse/SPARK-23029
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Fernando Pereira


When the spark.shuffle.file.buffer setting is set explicitly, even to its 
default value, shuffles fail.
This appears to affect small to medium sized partitions. Strangely, the error 
message is OutOfMemoryError, yet it works with large partitions (at least 
>32 MB).

{code}
pyspark --conf "spark.shuffle.file.buffer=$((32*1024))"
/gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit
 pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768
version 2.2.1

>>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", 
>>> mode="overwrite")

[Stage 1:>(0 + 10) / 
10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 11)
java.lang.OutOfMemoryError: Java heap space
at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:75)
at 
org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.<init>(DiskBlockObjectWriter.scala:107)
at 
org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108)
at 
org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at 
org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
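A likely contributing factor, worth verifying: spark.shuffle.file.buffer is 
documented as being in KiB unless a unit suffix is given, so a bare {{32768}} 
requests a 32 MiB buffer for every open shuffle output stream rather than the 
32 KiB default. A minimal sketch of an unambiguous setting:

{code}
import org.apache.spark.SparkConf

// Give the size an explicit unit: "32k" keeps the default 32 KiB buffer,
// whereas a bare "32768" is read as 32768 KiB (32 MiB) per DiskBlockObjectWriter.
val conf = new SparkConf().set("spark.shuffle.file.buffer", "32k")
{code}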



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-22951:
---
Target Version/s: 2.3.0

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>  Labels: correctness
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value

2018-01-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-22951:
---
Labels: correctness  (was: )

> count() after dropDuplicates() on emptyDataFrame returns incorrect value
> 
>
> Key: SPARK-22951
> URL: https://issues.apache.org/jira/browse/SPARK-22951
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.2.0, 2.3.0
>Reporter: Michael Dreibelbis
>  Labels: correctness
>
> here is a minimal Spark Application to reproduce:
> {code}
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkConf, SparkContext}
> object DropDupesApp extends App {
>   
>   override def main(args: Array[String]): Unit = {
> val conf = new SparkConf()
>   .setAppName("test")
>   .setMaster("local")
> val sc = new SparkContext(conf)
> val sql = SQLContext.getOrCreate(sc)
> assert(sql.emptyDataFrame.count == 0) // expected
> assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected
>   }
>   
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher

2018-01-10 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320753#comment-16320753
 ] 

Marcelo Vanzin commented on SPARK-23020:


I think I found the race in the code; now I need to figure out how to fix it... :-/

> Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
> --
>
> Key: SPARK-23020
> URL: https://issues.apache.org/jira/browse/SPARK-23020
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD

2018-01-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23019.

   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.3.0

> Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
> -
>
> Key: SPARK-23019
> URL: https://issues.apache.org/jira/browse/SPARK-23019
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Gengliang Wang
>Priority: Blocker
> Fix For: 2.3.0
>
>
> {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to 
> multiple spark contexts: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/
> {code}
> Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore 
> this error, set spark.driver.allowMultipleContexts = true. The currently 
> running SparkContext was created at:
> org.apache.spark.SparkContext.(SparkContext.scala:116)
> org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63)
> java.lang.Thread.run(Thread.java:748)
> {code}
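For context, a minimal standalone sketch (not the test code) of the constraint the error above describes; the suite presumably trips it because the SparkContext created by SparkLauncherSuite's in-process app, shown in the stack trace above, is still alive:

{code:scala}
// Sketch only: Spark allows a single active SparkContext per JVM by default.
import org.apache.spark.{SparkConf, SparkContext}

val first = new SparkContext(new SparkConf().setAppName("first").setMaster("local"))

// Creating a second context here would throw:
//   "Only one SparkContext may be running in this JVM (see SPARK-2243)"
// unless spark.driver.allowMultipleContexts is set, as the message says.
// val second = new SparkContext(new SparkConf().setAppName("second").setMaster("local"))

first.stop()  // stop the existing context before creating another one
{code}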



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320667#comment-16320667
 ] 

Sean Owen commented on SPARK-22982:
---

Agreed. java.nio should be OK as that was introduced in Java 7. Spark 2.2 
requires Java 8, so that much is safe.

> Remove unsafe asynchronous close() call from FileDownloadChannel
> 
>
> Key: SPARK-22982
> URL: https://issues.apache.org/jira/browse/SPARK-22982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Spark's Netty-based file transfer code contains an asynchronous IO bug which 
> may lead to incorrect query results.
> At a high-level, the problem is that an unsafe asynchronous `close()` of a 
> pipe's source channel creates a race condition where file transfer code 
> closes a file descriptor then attempts to read from it. If the closed file 
> descriptor's number has been reused by an `open()` call then this invalid 
> read may cause unrelated file operations to return incorrect results due to 
> reading different data than intended.
> I have a small, surgical fix for this bug and will submit a PR with more 
> description on the specific race condition / underlying bug.
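As an illustration of the hazard described above, a standalone sketch (not Spark's code): closing a pipe's source channel from another thread while a reader is blocked on it aborts the read, and in the worst case the freed descriptor can be reused by an unrelated open() before the reader notices.

{code:scala}
// Sketch only: an asynchronous close() racing with a blocked read on the same
// Pipe.SourceChannel.
import java.nio.ByteBuffer
import java.nio.channels.Pipe

val pipe = Pipe.open()
val source = pipe.source()

val reader = new Thread(new Runnable {
  override def run(): Unit = {
    try source.read(ByteBuffer.allocate(16))  // blocks: nothing was written to the sink
    catch { case e: Exception => println(s"read aborted: ${e.getClass.getSimpleName}") }
  }
})
reader.start()

Thread.sleep(100)  // give the reader time to block inside read()
source.close()     // asynchronous close from another thread
reader.join()      // reader typically sees java.nio.channels.AsynchronousCloseException
{code}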



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-10 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320664#comment-16320664
 ] 

Josh Rosen commented on SPARK-22982:


In theory this affects all 1.6.0+ versions.

It's going to be much harder to trigger in 1.6.0 because we shouldn't have as 
many Janino-induced remote ClassNotFoundExceptions exercising the error path.

We should probably port to other 2.x branches, with one caveat: we need to make 
sure that my fix isn't relying on JRE 8.x functionality because I think at 
least some of those older branches still have support for Java 7.

> Remove unsafe asynchronous close() call from FileDownloadChannel
> 
>
> Key: SPARK-22982
> URL: https://issues.apache.org/jira/browse/SPARK-22982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Spark's Netty-based file transfer code contains an asynchronous IO bug which 
> may lead to incorrect query results.
> At a high-level, the problem is that an unsafe asynchronous `close()` of a 
> pipe's source channel creates a race condition where file transfer code 
> closes a file descriptor then attempts to read from it. If the closed file 
> descriptor's number has been reused by an `open()` call then this invalid 
> read may cause unrelated file operations to return incorrect results due to 
> reading different data than intended.
> I have a small, surgical fix for this bug and will submit a PR with more 
> description on the specific race condition / underlying bug.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22972) Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc.

2018-01-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-22972:

Fix Version/s: 2.2.2

> Couldn't find corresponding Hive SerDe for data source provider 
> org.apache.spark.sql.hive.orc.
> --
>
> Key: SPARK-22972
> URL: https://issues.apache.org/jira/browse/SPARK-22972
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: xubo245
>Assignee: xubo245
> Fix For: 2.2.2, 2.3.0
>
>
> *There is error when running test code:*
> {code:java}
>   test("create orc table") {
> spark.sql(
>   s"""CREATE TABLE normal_orc_as_source_hive
>  |USING org.apache.spark.sql.hive.orc
>  |OPTIONS (
>  |  PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}'
>  |)
>""".stripMargin)
> val df = spark.sql("select * from normal_orc_as_source_hive")
> spark.sql("desc formatted normal_orc_as_source_hive").show()
>   }
> {code}
> *warning:*
> {code:java}
> 05:00:44.038 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: 
> Couldn't find corresponding Hive SerDe for data source provider 
> org.apache.spark.sql.hive.orc. Persisting data source table 
> `default`.`normal_orc_as_source_hive` into Hive metastore in Spark SQL 
> specific format, which is NOT compatible with Hive.
> {code}
> Root cause analysis:   
> ORC related code is incorrect in HiveSerDe :
> {code:java}
> org.apache.spark.sql.internal.HiveSerDe#sourceToSerDe
> {code}
> {code:java}
>   def sourceToSerDe(source: String): Option[HiveSerDe] = {
> val key = source.toLowerCase(Locale.ROOT) match {
>   case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet"
>   case s if s.startsWith("org.apache.spark.sql.orc") => "orc"
>   case s if s.equals("orcfile") => "orc"
>   case s if s.equals("parquetfile") => "parquet"
>   case s if s.equals("avrofile") => "avro"
>   case s => s
> }
> {code}
> Solution:
> change "org.apache.spark.sql.orc“  to  "org.apache.spark.sql.hive.orc"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2018-01-10 Thread Neil Alexander McQuarrie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320439#comment-16320439
 ] 

Neil Alexander McQuarrie commented on SPARK-21727:
--

Okay great. Implemented and tests are passing. I'll submit a PR shortly. Thanks 
for the help.

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil Alexander McQuarrie
>Assignee: Neil Alexander McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-22946) Recursive withColumn calls cause org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-10 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-22946:

Comment: was deleted

(was: I am unable to reproduce on master. If I remember correctly, this should 
have been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not 
doing it myself since I don't remember the ticket it duplicates...)

> Recursive withColumn calls cause 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22946
> URL: https://issues.apache.org/jira/browse/SPARK-22946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Harpreet Chopra
>
> Recursive calls to withColumn, for updating the same column causes 
> _org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> _
> This can be reproduced in Spark 1.x and also in the latest version we are 
> using (Spark 2.2.0) 
> code to reproduce: 
> import org.apache.spark.sql.functions._
> var df 
> =sc.parallelize(Seq(("123","CustOne"),("456","CustTwo"))).toDF("ID","CustName")
> (1 to 20).foreach(x => {
> df = df.withColumn("ID", when(col("ID") === "123", 
> lit("678")).otherwise(col("ID")))
> println(" "+x)
> df.show
> })
> Stack dump:
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$cata
> lyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
> r.scala:575)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
> r.scala:572)
> at 
> org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java
> :3599)
> at 
> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> ... 28 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/cataly
> st/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
> " of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows
>  beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
> at 
> org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
> at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9912)
> at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9897)
> at 
> org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3317)
> at org.codehaus.janino.UnitCompiler.access$8500(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$10.visitLocalVariableAccess(UnitCompiler.java:3285)
> at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:3189)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> at 
> org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3313)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$10.visitAmbiguousName(UnitCompiler.java:3280)
> at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669)
> at org.codehaus.janino.UnitCompiler.access$4500(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$7.visitAssignment(UnitCompiler.java:2619)
> at org.codehaus.janino.Java$Assignment.accept(Java.java:3405)
> at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
> at org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936)
> at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097)
> at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
> at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1007)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:993)
> at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:185)
> at 
> 

[jira] [Comment Edited] (SPARK-22946) Recursive withColumn calls cause org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-10 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320402#comment-16320402
 ] 

Marco Gaido edited comment on SPARK-22946 at 1/10/18 3:13 PM:
--

I am unable to reproduce on master. If I remember correctly, this should have 
been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not doing 
it myself since I don't remember the ticket it duplicates...


was (Author: mgaido):
I am unable to reproduce on master. If I remember correctly, this should have 
been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not doing 
it myself since I don't remember the ticket it duplicates...

> Recursive withColumn calls cause 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22946
> URL: https://issues.apache.org/jira/browse/SPARK-22946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Harpreet Chopra
>
> Recursive calls to withColumn, for updating the same column causes 
> _org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> _
> This can be reproduced in Spark 1.x and also in the latest version we are 
> using (Spark 2.2.0) 
> code to reproduce: 
> import org.apache.spark.sql.functions._
> var df 
> =sc.parallelize(Seq(("123","CustOne"),("456","CustTwo"))).toDF("ID","CustName")
> (1 to 20).foreach(x => {
> df = df.withColumn("ID", when(col("ID") === "123", 
> lit("678")).otherwise(col("ID")))
> println(" "+x)
> df.show
> })
> Stack dump:
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$cata
> lyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
> r.scala:575)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
> r.scala:572)
> at 
> org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java
> :3599)
> at 
> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> ... 28 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/cataly
> st/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
> " of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows
>  beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
> at 
> org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
> at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9912)
> at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9897)
> at 
> org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3317)
> at org.codehaus.janino.UnitCompiler.access$8500(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$10.visitLocalVariableAccess(UnitCompiler.java:3285)
> at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:3189)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> at 
> org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3313)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$10.visitAmbiguousName(UnitCompiler.java:3280)
> at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669)
> at org.codehaus.janino.UnitCompiler.access$4500(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$7.visitAssignment(UnitCompiler.java:2619)
> at org.codehaus.janino.Java$Assignment.accept(Java.java:3405)
> at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
> at org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936)
> at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097)
> at 

[jira] [Commented] (SPARK-22946) Recursive withColumn calls cause org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB

2018-01-10 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320402#comment-16320402
 ] 

Marco Gaido commented on SPARK-22946:
-

I am unable to reproduce on master. If I remember correctly, this should have 
been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not doing 
it myself since I don't remember the ticket it duplicates...

> Recursive withColumn calls cause 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> --
>
> Key: SPARK-22946
> URL: https://issues.apache.org/jira/browse/SPARK-22946
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Harpreet Chopra
>
> Recursive calls to withColumn, for updating the same column causes 
> _org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows beyond 64 KB
> _
> This can be reproduced in Spark 1.x and also in the latest version we are 
> using (Spark 2.2.0) 
> code to reproduce: 
> import org.apache.spark.sql.functions._
> var df 
> =sc.parallelize(Seq(("123","CustOne"),("456","CustTwo"))).toDF("ID","CustName")
> (1 to 20).foreach(x => {
> df = df.withColumn("ID", when(col("ID") === "123", 
> lit("678")).otherwise(col("ID")))
> println(" "+x)
> df.show
> })
> Stack dump:
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$cata
> lyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
> r.scala:575)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato
> r.scala:572)
> at 
> org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java
> :3599)
> at 
> org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
> ... 28 more
> Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/cataly
> st/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V
> " of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection"
>  grows
>  beyond 64 KB
> at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941)
> at org.codehaus.janino.CodeContext.write(CodeContext.java:854)
> at 
> org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242)
> at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9912)
> at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9897)
> at 
> org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3317)
> at org.codehaus.janino.UnitCompiler.access$8500(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$10.visitLocalVariableAccess(UnitCompiler.java:3285)
> at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:3189)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> at 
> org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3313)
> at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$10.visitAmbiguousName(UnitCompiler.java:3280)
> at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138)
> at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290)
> at 
> org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669)
> at org.codehaus.janino.UnitCompiler.access$4500(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$7.visitAssignment(UnitCompiler.java:2619)
> at org.codehaus.janino.Java$Assignment.accept(Java.java:3405)
> at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643)
> at org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185)
> at 
> org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936)
> at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097)
> at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958)
> at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1007)
> at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:993)
> at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:185)
> at 

[jira] [Commented] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320393#comment-16320393
 ] 

Apache Spark commented on SPARK-23028:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20222

> Bump master branch version to 2.4.0-SNAPSHOT
> 
>
> Key: SPARK-23028
> URL: https://issues.apache.org/jira/browse/SPARK-23028
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23028:


Assignee: Apache Spark  (was: Xiao Li)

> Bump master branch version to 2.4.0-SNAPSHOT
> 
>
> Key: SPARK-23028
> URL: https://issues.apache.org/jira/browse/SPARK-23028
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23028:


Assignee: Xiao Li  (was: Apache Spark)

> Bump master branch version to 2.4.0-SNAPSHOT
> 
>
> Key: SPARK-23028
> URL: https://issues.apache.org/jira/browse/SPARK-23028
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT

2018-01-10 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23028:

Component/s: (was: SQL)
 Build

> Bump master branch version to 2.4.0-SNAPSHOT
> 
>
> Key: SPARK-23028
> URL: https://issues.apache.org/jira/browse/SPARK-23028
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT

2018-01-10 Thread Xiao Li (JIRA)
Xiao Li created SPARK-23028:
---

 Summary: Bump master branch version to 2.4.0-SNAPSHOT
 Key: SPARK-23028
 URL: https://issues.apache.org/jira/browse/SPARK-23028
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Xiao Li
Assignee: Xiao Li






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22991) High read latency with spark streaming 2.2.1 and kafka 0.10.0.1

2018-01-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320365#comment-16320365
 ] 

Sean Owen commented on SPARK-22991:
---

But that would also mean you were using Kafka 0.8 rather than 0.10, and a lot 
changed in Kafka between those versions. So far this still sounds like a network 
or Kafka issue. What change are you proposing, or where do you think the issue 
is in Spark?

> High read latency with spark streaming 2.2.1 and kafka 0.10.0.1
> ---
>
> Key: SPARK-22991
> URL: https://issues.apache.org/jira/browse/SPARK-22991
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Structured Streaming
>Affects Versions: 2.2.1
>Reporter: Kiran Shivappa Japannavar
>Priority: Critical
>
> Spark 2.2.1 + Kafka 0.10 + Spark streaming.
> Batch duration is 1s, Max rate per partition is 500, poll interval is 120 
> seconds, max poll records is 500 and no of partitions in Kafka is 500, 
> enabled cache consumer.
> While trying to read data from Kafka we are observing very high read 
> latencies intermittently.The high latencies results in Kafka consumer session 
> expiration and hence the Kafka brokers removes the consumer from the group. 
> The consumer keeps retrying and finally fails with the
> [org.apache.kafka.clients.NetworkClient] - Disconnecting from node 12 due to 
> request timeout
> [org.apache.kafka.clients.NetworkClient] - Cancelled request ClientRequest
> [org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient] - 
> Cancelled FETCH request ClientRequest.**
> Due to this a lot of batches are in the queued state.
> The high read latencies are occurring whenever multiple clients are 
> parallelly trying to read the data from the same Kafka cluster. The Kafka 
> cluster is having a large number of brokers and can support high network 
> bandwidth.
> When running with spark 1.5 and Kafka 0.8 consumer client against the same 
> Kafka cluster we are not seeing any read latencies.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23026) Add RegisterUDF to PySpark

2018-01-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-23026:
--
Issue Type: Improvement  (was: Bug)

> Add RegisterUDF to PySpark
> --
>
> Key: SPARK-23026
> URL: https://issues.apache.org/jira/browse/SPARK-23026
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add a new API for registering row-at-a-time or scalar vectorized UDFs. The 
> registered UDFs can be used in the SQL statement.
> {noformat}
> >>> from pyspark.sql.types import IntegerType
> >>> from pyspark.sql.functions import udf
> >>> slen = udf(lambda s: len(s), IntegerType())
> >>> _ = spark.udf.registerUDF("slen", slen)
> >>> spark.sql("SELECT slen('test')").collect()
> [Row(slen(test)=4)]
> >>> import random
> >>> from pyspark.sql.functions import udf
> >>> from pyspark.sql.types import IntegerType
> >>> random_udf = udf(lambda: random.randint(0, 100), 
> >>> IntegerType()).asNondeterministic()
> >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf)
> >>> spark.sql("SELECT random_udf()").collect()  
> [Row(random_udf()=82)]
> >>> spark.range(1).select(newRandom_udf()).collect()  
> [Row(random_udf()=62)]
> >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
> >>> @pandas_udf("integer", PandasUDFType.SCALAR)  
> ... def add_one(x):
> ... return x + 1
> ...
> >>> _ = spark.udf.registerUDF("add_one", add_one)  
> >>> spark.sql("SELECT add_one(id) FROM range(10)").collect()  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21157) Report Total Memory Used by Spark Executors

2018-01-10 Thread assia ydroudj (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320293#comment-16320293
 ] 

assia ydroudj commented on SPARK-21157:
---



I am a beginner with Apache Spark and have installed a prebuilt distribution of 
Spark with Hadoop. I want to measure memory consumption while running the 
PageRank example that ships with Spark. My cluster runs in standalone mode with 
1 master and 4 workers (virtual machines).

I have tried external tools like Ganglia and Graphite, but they report memory 
usage at the host or system level, which is more general than what I need. What 
I am after is the behavior of Spark's memory (storage and execution) while the 
algorithm runs, i.e. the memory usage for a given Spark application ID. Is there 
any way to write this to a text file for further analysis? Please help me with 
this. Thanks
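One possible approach (an assumption on my part, not something proposed in this ticket) is to poll the driver's monitoring REST API, which exposes per-executor memory figures, and append each response to a file. The host, port and application ID below are placeholders:

{code:scala}
// Sketch only: poll Spark's monitoring REST API for executor summaries and log
// them to a text file for later analysis.
import java.io.{FileWriter, PrintWriter}
import scala.io.Source

val appId = "app-20180110120000-0000"   // hypothetical application ID
val url   = s"http://driver-host:4040/api/v1/applications/$appId/executors"

val json = Source.fromURL(url).mkString  // JSON array with fields such as memoryUsed, maxMemory

val out = new PrintWriter(new FileWriter("executor-memory.log", true))  // append mode
try out.println(s"${System.currentTimeMillis()} $json")
finally out.close()
{code}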


> Report Total Memory Used by Spark Executors
> ---
>
> Key: SPARK-21157
> URL: https://issues.apache.org/jira/browse/SPARK-21157
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Affects Versions: 2.1.1
>Reporter: Jose Soltren
> Attachments: TotalMemoryReportingDesignDoc.pdf
>
>
> Building on some of the core ideas of SPARK-9103, this JIRA proposes tracking 
> total memory used by Spark executors, and a means of broadcasting, 
> aggregating, and reporting memory usage data in the Spark UI.
> Here, "total memory used" refers to memory usage that is visible outside of 
> Spark, to an external observer such as YARN, Mesos, or the operating system. 
> The goal of this enhancement is to give Spark users more information about 
> how Spark clusters are using memory. Total memory will include non-Spark JVM 
> memory and all off-heap memory.
> Please consult the attached design document for further details.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23019:


Assignee: Apache Spark

> Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
> -
>
> Key: SPARK-23019
> URL: https://issues.apache.org/jira/browse/SPARK-23019
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Assignee: Apache Spark
>Priority: Blocker
>
> {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to 
> multiple spark contexts: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/
> {code}
> Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore 
> this error, set spark.driver.allowMultipleContexts = true. The currently 
> running SparkContext was created at:
> org.apache.spark.SparkContext.(SparkContext.scala:116)
> org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63)
> java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23019:


Assignee: (was: Apache Spark)

> Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
> -
>
> Key: SPARK-23019
> URL: https://issues.apache.org/jira/browse/SPARK-23019
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Blocker
>
> {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to 
> multiple spark contexts: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/
> {code}
> Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore 
> this error, set spark.driver.allowMultipleContexts = true. The currently 
> running SparkContext was created at:
> org.apache.spark.SparkContext.(SparkContext.scala:116)
> org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63)
> java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320225#comment-16320225
 ] 

Apache Spark commented on SPARK-23019:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/20221

> Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
> -
>
> Key: SPARK-23019
> URL: https://issues.apache.org/jira/browse/SPARK-23019
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Tests
>Affects Versions: 2.3.0
>Reporter: Sameer Agarwal
>Priority: Blocker
>
> {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to 
> multiple spark contexts: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/
> {code}
> Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore 
> this error, set spark.driver.allowMultipleContexts = true. The currently 
> running SparkContext was created at:
> org.apache.spark.SparkContext.(SparkContext.scala:116)
> org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879)
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197)
> org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227)
> org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136)
> org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:498)
> org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63)
> java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel

2018-01-10 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320203#comment-16320203
 ] 

Sean Owen commented on SPARK-22982:
---

Does this affect earlier branches in the same way? Seems important to back port 
if so. 

> Remove unsafe asynchronous close() call from FileDownloadChannel
> 
>
> Key: SPARK-22982
> URL: https://issues.apache.org/jira/browse/SPARK-22982
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
>  Labels: correctness
> Fix For: 2.3.0
>
>
> Spark's Netty-based file transfer code contains an asynchronous IO bug which 
> may lead to incorrect query results.
> At a high-level, the problem is that an unsafe asynchronous `close()` of a 
> pipe's source channel creates a race condition where file transfer code 
> closes a file descriptor then attempts to read from it. If the closed file 
> descriptor's number has been reused by an `open()` call then this invalid 
> read may cause unrelated file operations to return incorrect results due to 
> reading different data than intended.
> I have a small, surgical fix for this bug and will submit a PR with more 
> description on the specific race condition / underlying bug.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2018-01-10 Thread Ken Tore Tallakstad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320201#comment-16320201
 ] 

Ken Tore Tallakstad commented on SPARK-21396:
-

Does the bug originate from this function inside
[https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala]
 ?

```
def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) {
dataTypes(ordinal) match {
  case StringType =>
to += from.getString(ordinal)
  case IntegerType =>
to += from.getInt(ordinal)
  case BooleanType =>
to += from.getBoolean(ordinal)
  case DoubleType =>
to += from.getDouble(ordinal)
  case FloatType =>
to += from.getFloat(ordinal)
  case DecimalType() =>
to += from.getDecimal(ordinal)
  case LongType =>
to += from.getLong(ordinal)
  case ByteType =>
to += from.getByte(ordinal)
  case ShortType =>
to += from.getShort(ordinal)
  case DateType =>
to += from.getAs[Date](ordinal)
  case TimestampType =>
to += from.getAs[Timestamp](ordinal)
  case BinaryType =>
to += from.getAs[Array[Byte]](ordinal)
  case _: ArrayType | _: StructType | _: MapType =>
val hiveString = HiveUtils.toHiveString((from.get(ordinal), 
dataTypes(ordinal)))
to += hiveString
}
  }
```

A UDT will not match any of these right??
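To make the concern concrete, a standalone sketch (not the Thriftserver code) of a match with the same shape: a UDT such as VectorUDT matches none of the listed cases, so Scala throws scala.MatchError at runtime, which is what the stack trace in the issue shows.

{code:scala}
// Sketch only: a reduced version of the match above, returning a label instead
// of appending a value. There is no case for UserDefinedType, so a UDT falls through.
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types._

def label(dt: DataType): String = dt match {
  case StringType  => "string"
  case IntegerType => "int"
  case _: ArrayType | _: StructType | _: MapType => "complex"
}

// label(StringType)               // "string"
// label(SQLDataTypes.VectorType)  // throws scala.MatchError(...VectorUDT...)
{code}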

> Spark Hive Thriftserver doesn't return UDT field
> 
>
> Key: SPARK-21396
> URL: https://issues.apache.org/jira/browse/SPARK-21396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Haopu Wang
>  Labels: Hive, ThriftServer2, user-defined-type
>
> I want to query a table with a MLLib Vector field and get below exception.
> Can Spark Hive Thriftserver be enhanced to return UDT field?
> ==
> 2017-07-13 13:14:25,435 WARN  
> [org.apache.hive.service.cli.thrift.ThriftCLIService] 
> (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: 
> java.lang.RuntimeException: scala.MatchError: 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class 
> org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 
> (of class org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144)
>   at 
> org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
>   at 
> org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685)
>   at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  

[jira] [Comment Edited] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2018-01-10 Thread Ken Tore Tallakstad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320201#comment-16320201
 ] 

Ken Tore Tallakstad edited comment on SPARK-21396 at 1/10/18 1:15 PM:
--

Does the bug originate from this function inside
[https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala]
 ?


{code:scala}

def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) {
dataTypes(ordinal) match {
  case StringType =>
to += from.getString(ordinal)
  case IntegerType =>
to += from.getInt(ordinal)
  case BooleanType =>
to += from.getBoolean(ordinal)
  case DoubleType =>
to += from.getDouble(ordinal)
  case FloatType =>
to += from.getFloat(ordinal)
  case DecimalType() =>
to += from.getDecimal(ordinal)
  case LongType =>
to += from.getLong(ordinal)
  case ByteType =>
to += from.getByte(ordinal)
  case ShortType =>
to += from.getShort(ordinal)
  case DateType =>
to += from.getAs[Date](ordinal)
  case TimestampType =>
to += from.getAs[Timestamp](ordinal)
  case BinaryType =>
to += from.getAs[Array[Byte]](ordinal)
  case _: ArrayType | _: StructType | _: MapType =>
val hiveString = HiveUtils.toHiveString((from.get(ordinal), 
dataTypes(ordinal)))
to += hiveString
}
  }

{code}

A UDT will not match any of these right??


was (Author: kentore82):
Does the bug originate from this function inside
[https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala]
 ?

```
def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) {
dataTypes(ordinal) match {
  case StringType =>
to += from.getString(ordinal)
  case IntegerType =>
to += from.getInt(ordinal)
  case BooleanType =>
to += from.getBoolean(ordinal)
  case DoubleType =>
to += from.getDouble(ordinal)
  case FloatType =>
to += from.getFloat(ordinal)
  case DecimalType() =>
to += from.getDecimal(ordinal)
  case LongType =>
to += from.getLong(ordinal)
  case ByteType =>
to += from.getByte(ordinal)
  case ShortType =>
to += from.getShort(ordinal)
  case DateType =>
to += from.getAs[Date](ordinal)
  case TimestampType =>
to += from.getAs[Timestamp](ordinal)
  case BinaryType =>
to += from.getAs[Array[Byte]](ordinal)
  case _: ArrayType | _: StructType | _: MapType =>
val hiveString = HiveUtils.toHiveString((from.get(ordinal), 
dataTypes(ordinal)))
to += hiveString
}
  }
```

A UDT will not match any of these right??

> Spark Hive Thriftserver doesn't return UDT field
> 
>
> Key: SPARK-21396
> URL: https://issues.apache.org/jira/browse/SPARK-21396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Haopu Wang
>  Labels: Hive, ThriftServer2, user-defined-type
>
> I want to query a table with a MLLib Vector field and get below exception.
> Can Spark Hive Thriftserver be enhanced to return UDT field?
> ==
> 2017-07-13 13:14:25,435 WARN  
> [org.apache.hive.service.cli.thrift.ThriftCLIService] 
> (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: 
> java.lang.RuntimeException: scala.MatchError: 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class 
> org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)

[jira] [Comment Edited] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field

2018-01-10 Thread Ken Tore Tallakstad (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320201#comment-16320201
 ] 

Ken Tore Tallakstad edited comment on SPARK-21396 at 1/10/18 1:15 PM:
--

Does the bug originate from this function inside
[https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala]
 ?


{code:java}

def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) {
dataTypes(ordinal) match {
  case StringType =>
to += from.getString(ordinal)
  case IntegerType =>
to += from.getInt(ordinal)
  case BooleanType =>
to += from.getBoolean(ordinal)
  case DoubleType =>
to += from.getDouble(ordinal)
  case FloatType =>
to += from.getFloat(ordinal)
  case DecimalType() =>
to += from.getDecimal(ordinal)
  case LongType =>
to += from.getLong(ordinal)
  case ByteType =>
to += from.getByte(ordinal)
  case ShortType =>
to += from.getShort(ordinal)
  case DateType =>
to += from.getAs[Date](ordinal)
  case TimestampType =>
to += from.getAs[Timestamp](ordinal)
  case BinaryType =>
to += from.getAs[Array[Byte]](ordinal)
  case _: ArrayType | _: StructType | _: MapType =>
val hiveString = HiveUtils.toHiveString((from.get(ordinal), 
dataTypes(ordinal)))
to += hiveString
}
  }

{code}

A UDT will not match any of these, right?


was (Author: kentore82):
Does the bug originate from this function inside
[https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala]
 ?


{code:scala}

def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) {
dataTypes(ordinal) match {
  case StringType =>
to += from.getString(ordinal)
  case IntegerType =>
to += from.getInt(ordinal)
  case BooleanType =>
to += from.getBoolean(ordinal)
  case DoubleType =>
to += from.getDouble(ordinal)
  case FloatType =>
to += from.getFloat(ordinal)
  case DecimalType() =>
to += from.getDecimal(ordinal)
  case LongType =>
to += from.getLong(ordinal)
  case ByteType =>
to += from.getByte(ordinal)
  case ShortType =>
to += from.getShort(ordinal)
  case DateType =>
to += from.getAs[Date](ordinal)
  case TimestampType =>
to += from.getAs[Timestamp](ordinal)
  case BinaryType =>
to += from.getAs[Array[Byte]](ordinal)
  case _: ArrayType | _: StructType | _: MapType =>
val hiveString = HiveUtils.toHiveString((from.get(ordinal), 
dataTypes(ordinal)))
to += hiveString
}
  }

{code}

A UDT will not match any of these, right?

> Spark Hive Thriftserver doesn't return UDT field
> 
>
> Key: SPARK-21396
> URL: https://issues.apache.org/jira/browse/SPARK-21396
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Haopu Wang
>  Labels: Hive, ThriftServer2, user-defined-type
>
> I want to query a table with a MLLib Vector field and get below exception.
> Can Spark Hive Thriftserver be enhanced to return UDT field?
> ==
> 2017-07-13 13:14:25,435 WARN  
> [org.apache.hive.service.cli.thrift.ThriftCLIService] 
> (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: 
> java.lang.RuntimeException: scala.MatchError: 
> org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class 
> org.apache.spark.ml.linalg.VectorUDT)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
>   at 
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
>   at com.sun.proxy.$Proxy29.fetchResults(Unknown Source)
>   at 
> org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454)
>   at 
> org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
>   at 
> org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
>   at 
> 

[jira] [Commented] (SPARK-18147) Broken Spark SQL Codegen

2018-01-10 Thread Alexander Chermenin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320159#comment-16320159
 ] 

Alexander Chermenin commented on SPARK-18147:
-

Is it resolved for the 2.2.1 version? I have a similar problem with PySpark and 
Structured Streaming:

{noformat}
18/01/10 15:12:37 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
174, Column 113: Expression "isNull3" is not an rvalue
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private UTF8String lastRegex;
/* 009 */   private java.util.regex.Pattern pattern;
/* 010 */   private String lastReplacement;
/* 011 */   private UTF8String lastReplacementInUTF8;
/* 012 */   private java.lang.StringBuffer result;
/* 013 */   private UTF8String lastRegex1;
/* 014 */   private java.util.regex.Pattern pattern1;
/* 015 */   private String lastReplacement1;
/* 016 */   private UTF8String lastReplacementInUTF81;
/* 017 */   private java.lang.StringBuffer result1;
/* 018 */   private UnsafeRow result2;
/* 019 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
/* 020 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
/* 021 */
/* 022 */   public SpecificUnsafeProjection(Object[] references) {
/* 023 */ this.references = references;
/* 024 */ lastRegex = null;
/* 025 */ pattern = null;
/* 026 */ lastReplacement = null;
/* 027 */ lastReplacementInUTF8 = null;
/* 028 */ result = new java.lang.StringBuffer();
/* 029 */ lastRegex1 = null;
/* 030 */ pattern1 = null;
/* 031 */ lastReplacement1 = null;
/* 032 */ lastReplacementInUTF81 = null;
/* 033 */ result1 = new java.lang.StringBuffer();
/* 034 */ result2 = new UnsafeRow(3);
/* 035 */ this.holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result2, 64);
/* 036 */ this.rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 3);
/* 037 */
/* 038 */   }
/* 039 */
/* 040 */   public void initialize(int partitionIndex) {
/* 041 */
/* 042 */   }
/* 043 */
/* 044 */
/* 045 */   private void apply_1(InternalRow i) {
/* 046 */
/* 047 */ UTF8String[] args1 = new UTF8String[3];
/* 048 */
/* 049 */
/* 050 */ if (!false) {
/* 051 */   args1[0] = ((UTF8String) references[6]);
/* 052 */ }
/* 053 */
/* 054 */
/* 055 */ if (!false) {
/* 056 */   args1[1] = ((UTF8String) references[7]);
/* 057 */ }
/* 058 */
/* 059 */
/* 060 */ boolean isNull13 = false;
/* 061 */
/* 062 */ UTF8String value14 = i.getUTF8String(2);
/* 063 */
/* 064 */
/* 065 */ UTF8String value13 = null;
/* 066 */
/* 067 */ if (!((UTF8String) references[8]).equals(lastRegex1)) {
/* 068 */   // regex value changed
/* 069 */   lastRegex1 = ((UTF8String) references[8]).clone();
/* 070 */   pattern1 = 
java.util.regex.Pattern.compile(lastRegex1.toString());
/* 071 */ }
/* 072 */ if (!((UTF8String) references[9]).equals(lastReplacementInUTF81)) 
{
/* 073 */   // replacement string changed
/* 074 */   lastReplacementInUTF81 = ((UTF8String) references[9]).clone();
/* 075 */   lastReplacement1 = lastReplacementInUTF81.toString();
/* 076 */ }
/* 077 */ result1.delete(0, result1.length());
/* 078 */ java.util.regex.Matcher matcher1 = 
pattern1.matcher(value14.toString());
/* 079 */
/* 080 */ while (matcher1.find()) {
/* 081 */   matcher1.appendReplacement(result1, lastReplacement1);
/* 082 */ }
/* 083 */ matcher1.appendTail(result1);
/* 084 */ value13 = UTF8String.fromString(result1.toString());
/* 085 */ if (!false) {
/* 086 */   args1[2] = value13;
/* 087 */ }
/* 088 */
/* 089 */ UTF8String value10 = UTF8String.concat(args1);
/* 090 */ boolean isNull10 = value10 == null;
/* 091 */   }
/* 092 */
/* 093 */
/* 094 */   private void apply1_1(InternalRow i) {
/* 095 */
/* 096 */
/* 097 */ final Object value20 = null;
/* 098 */ if (true) {
/* 099 */   rowWriter.setNullAt(2);
/* 100 */ } else {
/* 101 */
/* 102 */ }
/* 103 */
/* 104 */   }
/* 105 */
/* 106 */
/* 107 */   private void apply_0(InternalRow i) {
/* 108 */
/* 109 */ UTF8String[] args = new UTF8String[3];
/* 110 */
/* 111 */
/* 112 */ if (!false) {
/* 113 */   args[0] = ((UTF8String) references[2]);
/* 114 */ }
/* 115 */
/* 116 */
/* 117 */ if (!false) {
/* 118 */   args[1] = ((UTF8String) references[3]);
/* 119 */ }
/* 120 */
/* 121 */
/* 122 */ boolean isNull6 = true;
/* 123 */ UTF8String value6 = null;
/* 124 */
/* 125 

[jira] [Assigned] (SPARK-23025) DataSet with scala.Null causes Exception

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23025:


Assignee: Apache Spark

> DataSet with scala.Null causes Exception
> 
>
> Key: SPARK-23025
> URL: https://issues.apache.org/jira/browse/SPARK-23025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel Davis
>Assignee: Apache Spark
>
> When creating a DataSet over a case class containing a field of type 
> scala.Null, an exception is thrown. As far as I can see, Spark SQL 
> would support a Schema(NullType, true), but it fails inside the {{schemaFor}} 
> function with a {{MatchError}}.
> I would expect Spark to return a DataSet with a NullType for that field. 
> h5. Minimal Example
> {code}
> case class Foo(foo: Int, bar: Null)
> val ds = Seq(Foo(42, null)).toDS()
> {code}
> h5. Exception
> {code}
> scala.MatchError: scala.Null (of class 
> scala.reflect.internal.Types$ClassNoArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:713)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:391)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:390)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:390)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:148)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:148)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:136)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
>   at 
> org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
>   ... 42 elided 
> {code}
> h5. Background Info
> To handle our data in a type-safe fashion, we have generated AVRO schemas and 
> corresponding Scala case classes for our domain data. As some fields only 
> contain null values, this results in fields with scala.Null as a type. Moving 
> our pipeline to DataSets/structured streaming, case classes with Null types 
> begin to give problems, even though NullType is known to Spark SQL.
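
A hedged workaround sketch (my suggestion only, not taken from this ticket or the pull request created for it): avoid scala.Null in the case class and model the always-null field as an Option, which encodes as a nullable column.

{code:scala}
// Runs in spark-shell, or anywhere a SparkSession named `spark` is in scope.
import spark.implicits._

// Hypothetical replacement for Foo: Option[String] instead of scala.Null.
case class Foo2(foo: Int, bar: Option[String])

val ds = Seq(Foo2(42, None)).toDS()
ds.printSchema()   // foo: integer (nullable = false), bar: string (nullable = true)
{code}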



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23025) DataSet with scala.Null causes Exception

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320148#comment-16320148
 ] 

Apache Spark commented on SPARK-23025:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/20219

> DataSet with scala.Null causes Exception
> 
>
> Key: SPARK-23025
> URL: https://issues.apache.org/jira/browse/SPARK-23025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel Davis
>
> When creating a DataSet over a case class containing a field of type 
> scala.Null, an exception is thrown. As far as I can see, Spark SQL 
> would support a Schema(NullType, true), but it fails inside the {{schemaFor}} 
> function with a {{MatchError}}.
> I would expect Spark to return a DataSet with a NullType for that field. 
> h5. Minimal Example
> {code}
> case class Foo(foo: Int, bar: Null)
> val ds = Seq(Foo(42, null)).toDS()
> {code}
> h5. Exception
> {code}
> scala.MatchError: scala.Null (of class 
> scala.reflect.internal.Types$ClassNoArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:713)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:391)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:390)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:390)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:148)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:148)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:136)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
>   at 
> org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
>   ... 42 elided 
> {code}
> h5. Background Info
> To handle our data in a type-safe fashion, we have generated AVRO schemas and 
> corresponding Scala case classes for our domain data. As some fields only 
> contain null values, this results in fields with scala.Null as a type. Moving 
> our pipeline to DataSets/structured streaming, case classes with Null types 
> begin to give problems, even though NullType is known to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23025) DataSet with scala.Null causes Exception

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23025:


Assignee: (was: Apache Spark)

> DataSet with scala.Null causes Exception
> 
>
> Key: SPARK-23025
> URL: https://issues.apache.org/jira/browse/SPARK-23025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Daniel Davis
>
> When creating a DataSet over a case class containing a field of type 
> scala.Null, an exception is thrown. As far as I can see, Spark SQL 
> would support a Schema(NullType, true), but it fails inside the {{schemaFor}} 
> function with a {{MatchError}}.
> I would expect Spark to return a DataSet with a NullType for that field. 
> h5. Minimal Example
> {code}
> case class Foo(foo: Int, bar: Null)
> val ds = Seq(Foo(42, null)).toDS()
> {code}
> h5. Exception
> {code}
> scala.MatchError: scala.Null (of class 
> scala.reflect.internal.Types$ClassNoArgsTypeRef)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:713)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:391)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:390)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.immutable.List.map(List.scala:285)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:390)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:148)
>   at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:148)
>   at 
> org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:136)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72)
>   at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
>   at 
> org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
>   at 
> org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
>   ... 42 elided 
> {code}
> h5. Background Info
> To handle our data in a type-safe fashion, we have generated AVRO schemas and 
> corresponding Scala case classes for our domain data. As some fields only 
> contain null values, this results in fields with scala.Null as a type. Moving 
> our pipeline to DataSets/structured streaming, case classes with Null types 
> begin to give problems, even though NullType is known to Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23027) optimizer a simple query using a non-existent data is too slow

2018-01-10 Thread wangminfeng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangminfeng updated SPARK-23027:

Summary: optimizer a simple query using a non-existent data is too slow  
(was: optimizer a simple query)

> optimizer a simple query using a non-existent data is too slow
> --
>
> Key: SPARK-23027
> URL: https://issues.apache.org/jira/browse/SPARK-23027
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.1
>Reporter: wangminfeng
>
> When I use Spark SQL for ad-hoc queries, I have data partitioned by 
> event_day, event_minute, and event_hour. The data is large: one day is 
> about 3 TB, and we keep 3 months of data.
> But when the query uses a non-existent day, getting the optimizedPlan is too slow.
> I use “sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan” to get 
> the optimized plan, and after five minutes it still has not returned. The query 
> is simple enough, like:
> SELECT
> event_day
> FROM db.table t1
> WHERE (t1.event_day='20170104' and t1.event_hour='23' and 
> t1.event_minute='55')
> LIMIT 1



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23027) optimizer a simple query

2018-01-10 Thread wangminfeng (JIRA)
wangminfeng created SPARK-23027:
---

 Summary: optimizer a simple query
 Key: SPARK-23027
 URL: https://issues.apache.org/jira/browse/SPARK-23027
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.0.1
Reporter: wangminfeng


When I use Spark SQL for ad-hoc queries, I have data partitioned by event_day, 
event_minute, and event_hour. The data is large: one day is about 3 TB, and we 
keep 3 months of data.
But when the query uses a non-existent day, getting the optimizedPlan is too slow.
I use “sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan” to get the 
optimized plan, and after five minutes it still has not returned. The query is simple 
enough, like:
SELECT
event_day
FROM db.table t1
WHERE (t1.event_day='20170104' and t1.event_hour='23' and t1.event_minute='55')
LIMIT 1
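
To make the symptom easier to measure, here is an illustrative sketch of timing just the optimization step for such a query; db.table and the partition values are the placeholders from this report, and the approach mirrors the optimizedPlan call quoted above.

{code:scala}
val df = spark.sql(
  """SELECT event_day
    |FROM db.table t1
    |WHERE t1.event_day = '20170104' AND t1.event_hour = '23'
    |      AND t1.event_minute = '55'
    |LIMIT 1""".stripMargin)

// optimizedPlan is a lazy val, so the first access below is where the time goes.
val t0 = System.nanoTime()
val optimized = df.queryExecution.optimizedPlan
println(s"optimization took ${(System.nanoTime() - t0) / 1e9} s")
{code}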



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320110#comment-16320110
 ] 

Apache Spark commented on SPARK-23000:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20218

> Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
> -
>
> Key: SPARK-23000
> URL: https://issues.apache.org/jira/browse/SPARK-23000
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Blocker
> Fix For: 2.3.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/
> The test suite DataSourceWithHiveMetastoreCatalogSuite on branch-2.3 always 
> fails with Hadoop 2.6.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.

2018-01-10 Thread Matthew Walton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320100#comment-16320100
 ] 

Matthew Walton edited comment on SPARK-21179 at 1/10/18 11:29 AM:
--

I ended up getting a resolution from Simba:

"I apologize for the delay, after thorough investigation we have confirmed that 
this is indeed a shell issue. We will be opening a JIRA with them. The reason 
it took a bit longer is because we were trying to collect more information to 
prove that this is a shell issue. The shell should be using a JDBC api to 
determine the correct string identifier which is not called so default " " is 
used... "

There is a workaround that can be used. Spark provides a JdbcDialect class that 
can be used to set the identifier quoting to the wanted one. 

For Hive you can use this before or after establishing a connection (but before 
calling show()): 


import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
import org.apache.spark.sql.types._

val HiveDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")

  override def quoteIdentifier(colName: String): String = s"$colName"
}

JdbcDialects.registerDialect(HiveDialect)
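
A hedged usage sketch to go with the workaround above; the host and port are placeholders, while the driver class, database, table, and credentials are taken from the reproduction steps quoted below.

{code:scala}
// The dialect must be registered before the DataFrame is materialized.
JdbcDialects.registerDialect(HiveDialect)

val hivespark = spark.read.format("jdbc")
  .option("url", "jdbc:hive2://<host>:<port>/wh2;AuthMech=3;UseNativeQuery=1")
  .option("dbtable", "wh2.hivespark")
  .option("driver", "com.simba.hive.jdbc41.HS2Driver")
  .option("user", "hdfs")
  .option("password", "hdfs")
  .load()

// With the dialect in place, the generated query no longer wraps column names
// in double quotes, so the INT column should convert without the 10140 error.
hivespark.show()
{code}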


was (Author: mwalton_mstr):
I ended up getting a resolution from Simba:

"I apologize for the delay, after through investigation we have confirmed that 
this is indeed a shell issue. We will be opening a JIRA with them. The reason 
it took a bit longer is because we were trying to collect more information to 
prove that this is a shell issue. The shell should be using a JDBC api to 
determine the correct string identifier which is not called so default " " is 
used... "

There is a workaround that can be used. Shell provides a JdbcDialect class that 
can be used to set the identifier to wanted one. 

For Hive you can use this before or after establishing a connection (but before 
calling show()): 


import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect} 
import org.apache.spark.sql.types._ 

val HiveDialect = new JdbcDialect { 
override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || 
url.contains("hive2") 

override def quoteIdentifier(colName: String): String = { s"$colName" 
} 
} 

JdbcDialects.registerDialect(HiveDialect) 

> Unable to return Hive INT data type into Spark via Hive JDBC driver:  Caused 
> by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> -
>
> Key: SPARK-21179
> URL: https://issues.apache.org/jira/browse/SPARK-21179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> HDP version 2.5.0.1-60
> Hive version: 1.2.1
> Spark  version 2.0.0.2.5.0.1-60
> JDBC:  Download the latest Hortonworks JDBC driver
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive.  
> Unfortunately, when I try to query data that resides in an INT column I get 
> the following error:  
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> Steps to reproduce:
> 1) On Hive create a simple table with an INT column and insert some data (I 
> used SQuirreL Client with the Hortonworks JDBC driver):
> create table wh2.hivespark (country_id int, country_name string)
> insert into wh2.hivespark values (1, 'USA')
> 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files
> ./spark-shell --jars 
> /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar
> 4) In Spark shell load the data from Hive using the JDBC driver
> val hivespark = spark.read.format("jdbc").options(Map("url" -> 
> 

[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t

2018-01-10 Thread Matthew Walton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320106#comment-16320106
 ] 

Matthew Walton commented on SPARK-21183:


I ended up getting an answer and a workaround from Simba. Their comments are 
below:

I apologize for the delay; after thorough investigation we have confirmed that 
this is indeed a shell issue. We will be opening a JIRA with them. The reason 
it took a bit longer is that we were trying to collect more information to 
prove that this is a shell issue. The shell should be using a JDBC API to 
determine the correct string identifier, but that API is not called, so the 
default " " is used instead of the back-tick, which is the identifier quote 
character for BigQuery. 

There is a workaround that can be used. Spark provides a JdbcDialect class that 
can be used to set the identifier quoting to the wanted one. 

For Hive and BigQuery you can use this before or after establishing a 
connection (but before calling show()): 

import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}

val GoogleDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:bigquery") || url.contains("bigquery")

  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

JdbcDialects.registerDialect(GoogleDialect)
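
A quick illustration of what the override produces (my_col is just a hypothetical column name):

{code:scala}
GoogleDialect.quoteIdentifier("my_col")   // returns "`my_col`", the back-tick quoting BigQuery expects
{code}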



> Unable to return Google BigQuery INTEGER data type into Spark via google 
> BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error 
> converting value to long.
> --
>
> Key: SPARK-21183
> URL: https://issues.apache.org/jira/browse/SPARK-21183
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> Spark  version 2.1.1
> JDBC:  Download the latest google BigQuery JDBC Driver from Google
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark using a JDBC connection to Google 
> BigQuery.  Unfortunately, when I try to query data that resides in an INTEGER 
> column I get the following error:  
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> long.  
> Steps to reproduce:
> 1) On Google BigQuery console create a simple table with an INT column and 
> insert some data 
> 2) Copy the Google BigQuery JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the GoogleBigQuery JDBC driver jar files
> ./spark-shell --jars 
> /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar
> 4) In Spark shell load the data from Google BigQuery using the JDBC driver
> val gbq = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.comAllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable;
>  -> 
> "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load()
> 5) In Spark shell try to display the data
> gbq.show()
> At this point you should see the error:
> scala> gbq.show()
> 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, 
> ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: 
> [Simba][JDBC](10140) Error converting value to long.
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source)
> at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
> at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> at 
> 

[jira] [Comment Edited] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.

2018-01-10 Thread Matthew Walton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320100#comment-16320100
 ] 

Matthew Walton edited comment on SPARK-21179 at 1/10/18 11:27 AM:
--

I ended up getting a resolution from Simba:

"I apologize for the delay, after through investigation we have confirmed that 
this is indeed a shell issue. We will be opening a JIRA with them. The reason 
it took a bit longer is because we were trying to collect more information to 
prove that this is a shell issue. The shell should be using a JDBC api to 
determine the correct string identifier which is not called so default " " is 
used... "

There is a workaround that can be used. Spark provides a JdbcDialect class that 
can be used to set the identifier quoting to the wanted one. 

For Hive you can use this before or after establishing a connection (but before 
calling show()): 


import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
import org.apache.spark.sql.types._

val HiveDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")

  override def quoteIdentifier(colName: String): String = s"$colName"
}

JdbcDialects.registerDialect(HiveDialect)


was (Author: mwalton_mstr):
I ended up getting a resolution from Simba:

"I apologize for the delay, after through investigation we have confirmed that 
this is indeed a shell issue. We will be opening a JIRA with them. The reason 
it took a bit longer is because we were trying to collect more information to 
prove that this is a shell issue. The shell should be using a JDBC api to 
determine the correct string identifier which is not called so default " " is 
used... "

There is a workaround that can be used. Shell provides a JdbcDialect class that 
can be used to set the identifier to wanted one. 

For Hive you can use this before or after establishing a connection (but before 
calling show()): 


import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect} 
import org.apache.spark.sql.types._ 

val HiveDialect = new JdbcDialect { 
override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || 
url.contains("hive2") 

override def quoteIdentifier(colName: String): String = { s"$colName" 
} 
} 

> Unable to return Hive INT data type into Spark via Hive JDBC driver:  Caused 
> by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> -
>
> Key: SPARK-21179
> URL: https://issues.apache.org/jira/browse/SPARK-21179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> HDP version 2.5.0.1-60
> Hive version: 1.2.1
> Spark  version 2.0.0.2.5.0.1-60
> JDBC:  Download the latest Hortonworks JDBC driver
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive.  
> Unfortunately, when I try to query data that resides in an INT column I get 
> the following error:  
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> Steps to reproduce:
> 1) On Hive create a simple table with an INT column and insert some data (I 
> used SQuirreL Client with the Hortonworks JDBC driver):
> create table wh2.hivespark (country_id int, country_name string)
> insert into wh2.hivespark values (1, 'USA')
> 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files
> ./spark-shell --jars 
> /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar
> 4) In Spark shell load the data from Hive using the JDBC driver
> val hivespark = spark.read.format("jdbc").options(Map("url" -> 
> 

[jira] [Commented] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.

2018-01-10 Thread Matthew Walton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320100#comment-16320100
 ] 

Matthew Walton commented on SPARK-21179:


I ended up getting a resolution from Simba:

"I apologize for the delay, after through investigation we have confirmed that 
this is indeed a shell issue. We will be opening a JIRA with them. The reason 
it took a bit longer is because we were trying to collect more information to 
prove that this is a shell issue. The shell should be using a JDBC api to 
determine the correct string identifier which is not called so default " " is 
used... "

There is a workaround that can be used. Spark provides a JdbcDialect class that 
can be used to set the identifier quoting to the wanted one. 

For Hive you can use this before or after establishing a connection (but before 
calling show()): 


import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect}
import org.apache.spark.sql.types._

val HiveDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")

  override def quoteIdentifier(colName: String): String = s"$colName"
}

> Unable to return Hive INT data type into Spark via Hive JDBC driver:  Caused 
> by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> -
>
> Key: SPARK-21179
> URL: https://issues.apache.org/jira/browse/SPARK-21179
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell, SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.1
> Environment: OS:  Linux
> HDP version 2.5.0.1-60
> Hive version: 1.2.1
> Spark  version 2.0.0.2.5.0.1-60
> JDBC:  Download the latest Hortonworks JDBC driver
>Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive.  
> Unfortunately, when I try to query data that resides in an INT column I get 
> the following error:  
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to 
> int.  
> Steps to reproduce:
> 1) On Hive create a simple table with an INT column and insert some data (I 
> used SQuirreL Client with the Hortonworks JDBC driver):
> create table wh2.hivespark (country_id int, country_name string)
> insert into wh2.hivespark values (1, 'USA')
> 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run 
> Spark Shell
> 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files
> ./spark-shell --jars 
> /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar
> 4) In Spark shell load the data from Hive using the JDBC driver
> val hivespark = spark.read.format("jdbc").options(Map("url" -> 
> "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable"
>  -> 
> "wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load()
> 5) In Spark shell try to display the data
> hivespark.show()
> At this point you should see the error:
> scala> hivespark.show()
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
> at 
> com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown 
> Source)
> at 
> com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source)
> at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown 
> Source)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437)
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> 

[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions

2018-01-10 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320041#comment-16320041
 ] 

sam commented on SPARK-17998:
-

[~srowen] Thanks, no idea where I got that from, cursed weakly typed silently 
failing config builders!!

Will try `spark.files.maxPartitionBytes`
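
For reference, a hedged sketch of the knob in question: in Spark 2.x SQLConf the key is spark.sql.files.maxPartitionBytes (note the sql segment), and smaller values yield more read partitions when loading the Parquet folder back.

{code:scala}
// Illustrative only; 8 MB is an arbitrary value and the path matches the repro below.
spark.conf.set("spark.sql.files.maxPartitionBytes", (8 * 1024 * 1024).toString)

val dfSecond = spark.read.parquet("c:/scratch/scrap.parquet")
println(dfSecond.rdd.getNumPartitions)   // should now be closer to the number of part files
{code}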

> Reading Parquet files coalesces parts into too few in-memory partitions
> ---
>
> Key: SPARK-17998
> URL: https://issues.apache.org/jira/browse/SPARK-17998
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.0.0, 2.0.1
> Environment: Spark Standalone Cluster (not "local mode")
> Windows 10 and Windows 7
> Python 3.x
>Reporter: Shea Parkes
>
> Reading a parquet ~file into a DataFrame is resulting in far too few 
> in-memory partitions.  In prior versions of Spark, the resulting DataFrame 
> would have a number of partitions often equal to the number of parts in the 
> parquet folder.
> Here's a minimal reproducible sample:
> {quote}
> df_first = session.range(start=1, end=1, numPartitions=13)
> assert df_first.rdd.getNumPartitions() == 13
> assert session._sc.defaultParallelism == 6
> path_scrap = r"c:\scratch\scrap.parquet"
> df_first.write.parquet(path_scrap)
> df_second = session.read.parquet(path_scrap)
> print(df_second.rdd.getNumPartitions())
> {quote}
> The above shows only 7 partitions in the DataFrame that was created by 
> reading the Parquet back into memory for me.  Why is it no longer just the 
> number of part files in the Parquet folder?  (Which is 13 in the example 
> above.)
> I'm filing this as a bug because it has gotten so bad that we can't work with 
> the underlying RDD without first repartitioning the DataFrame, which is 
> costly and wasteful.  I really doubt this was the intended effect of moving 
> to Spark 2.0.
> I've tried to research where the number of in-memory partitions is 
> determined, but my Scala skills have proven inadequate. I'd be happy to dig 
> further if someone could point me in the right direction...



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23026) Add RegisterUDF to PySpark

2018-01-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16320022#comment-16320022
 ] 

Apache Spark commented on SPARK-23026:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20217

> Add RegisterUDF to PySpark
> --
>
> Key: SPARK-23026
> URL: https://issues.apache.org/jira/browse/SPARK-23026
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add a new API for registering row-at-a-time or scalar vectorized UDFs. The 
> registered UDFs can be used in the SQL statement.
> {noformat}
> >>> from pyspark.sql.types import IntegerType
> >>> from pyspark.sql.functions import udf
> >>> slen = udf(lambda s: len(s), IntegerType())
> >>> _ = spark.udf.registerUDF("slen", slen)
> >>> spark.sql("SELECT slen('test')").collect()
> [Row(slen(test)=4)]
> >>> import random
> >>> from pyspark.sql.functions import udf
> >>> from pyspark.sql.types import IntegerType
> >>> random_udf = udf(lambda: random.randint(0, 100), 
> >>> IntegerType()).asNondeterministic()
> >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf)
> >>> spark.sql("SELECT random_udf()").collect()  
> [Row(random_udf()=82)]
> >>> spark.range(1).select(newRandom_udf()).collect()  
> [Row(random_udf()=62)]
> >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
> >>> @pandas_udf("integer", PandasUDFType.SCALAR)  
> ... def add_one(x):
> ... return x + 1
> ...
> >>> _ = spark.udf.registerUDF("add_one", add_one)  
> >>> spark.sql("SELECT add_one(id) FROM range(10)").collect()  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23026) Add RegisterUDF to PySpark

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23026:


Assignee: Xiao Li  (was: Apache Spark)

> Add RegisterUDF to PySpark
> --
>
> Key: SPARK-23026
> URL: https://issues.apache.org/jira/browse/SPARK-23026
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> Add a new API for registering row-at-a-time or scalar vectorized UDFs. The 
> registered UDFs can be used in the SQL statement.
> {noformat}
> >>> from pyspark.sql.types import IntegerType
> >>> from pyspark.sql.functions import udf
> >>> slen = udf(lambda s: len(s), IntegerType())
> >>> _ = spark.udf.registerUDF("slen", slen)
> >>> spark.sql("SELECT slen('test')").collect()
> [Row(slen(test)=4)]
> >>> import random
> >>> from pyspark.sql.functions import udf
> >>> from pyspark.sql.types import IntegerType
> >>> random_udf = udf(lambda: random.randint(0, 100), 
> >>> IntegerType()).asNondeterministic()
> >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf)
> >>> spark.sql("SELECT random_udf()").collect()  
> [Row(random_udf()=82)]
> >>> spark.range(1).select(newRandom_udf()).collect()  
> [Row(random_udf()=62)]
> >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
> >>> @pandas_udf("integer", PandasUDFType.SCALAR)  
> ... def add_one(x):
> ... return x + 1
> ...
> >>> _ = spark.udf.registerUDF("add_one", add_one)  
> >>> spark.sql("SELECT add_one(id) FROM range(10)").collect()  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23026) Add RegisterUDF to PySpark

2018-01-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23026:


Assignee: Apache Spark  (was: Xiao Li)

> Add RegisterUDF to PySpark
> --
>
> Key: SPARK-23026
> URL: https://issues.apache.org/jira/browse/SPARK-23026
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> Add a new API for registering row-at-a-time or scalar vectorized UDFs. The 
> registered UDFs can be used in the SQL statement.
> {noformat}
> >>> from pyspark.sql.types import IntegerType
> >>> from pyspark.sql.functions import udf
> >>> slen = udf(lambda s: len(s), IntegerType())
> >>> _ = spark.udf.registerUDF("slen", slen)
> >>> spark.sql("SELECT slen('test')").collect()
> [Row(slen(test)=4)]
> >>> import random
> >>> from pyspark.sql.functions import udf
> >>> from pyspark.sql.types import IntegerType
> >>> random_udf = udf(lambda: random.randint(0, 100), 
> >>> IntegerType()).asNondeterministic()
> >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf)
> >>> spark.sql("SELECT random_udf()").collect()  
> [Row(random_udf()=82)]
> >>> spark.range(1).select(newRandom_udf()).collect()  
> [Row(random_udf()=62)]
> >>> from pyspark.sql.functions import pandas_udf, PandasUDFType
> >>> @pandas_udf("integer", PandasUDFType.SCALAR)  
> ... def add_one(x):
> ... return x + 1
> ...
> >>> _ = spark.udf.registerUDF("add_one", add_one)  
> >>> spark.sql("SELECT add_one(id) FROM range(10)").collect()  
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


