[jira] [Created] (SPARK-23039) Fix the bug in alter table set location.
xubo245 created SPARK-23039:
-------------------------------

             Summary: Fix the bug in alter table set location.
                 Key: SPARK-23039
                 URL: https://issues.apache.org/jira/browse/SPARK-23039
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.1
            Reporter: xubo245
            Priority: Critical

TODO work: fix the bug in alter table set location, currently disabled in org.apache.spark.sql.execution.command.DDLSuite#testSetLocation:

{code:java}
// TODO(gatorsmile): fix the bug in alter table set location.
// if (isUsingHiveMetastore) {
//   assert(storageFormat.properties.get("path") === expected)
// }
{code}

Analysis: the assertion fails because org.apache.spark.sql.hive.HiveExternalCatalog#restoreDataSourceTable sets locationUri and erases the path entry via {{newPath = None}}:

{code:java}
val storageWithLocation = {
  val tableLocation = getLocationFromStorageProps(table)
  // We pass None as `newPath` here, to remove the path option in storage properties.
  updateLocationInStorageProps(table, newPath = None).copy(
    locationUri = tableLocation.map(CatalogUtils.stringToURI(_)))
}
{code}

As the comment says, "We pass None as `newPath` here, to remove the path option in storage properties." The locationUri itself is obtained from the path in the storage properties:

{code:java}
private def getLocationFromStorageProps(table: CatalogTable): Option[String] = {
  CaseInsensitiveMap(table.storage.properties).get("path")
}
{code}

So the test can assert on locationUri instead of the path property.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
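The restore logic described above can be sketched outside Spark with minimal stand-ins. Note the names here (StorageFormat, caseInsensitiveGet, restore) are simplified assumptions for illustration, not Spark's real CatalogStorageFormat/CaseInsensitiveMap API:

```scala
object RestoreLocationSketch {
  // Simplified stand-in for Spark's CatalogStorageFormat.
  final case class StorageFormat(properties: Map[String, String],
                                 locationUri: Option[String])

  // Mimics CaseInsensitiveMap(props).get("path"): case-insensitive key lookup.
  def caseInsensitiveGet(props: Map[String, String], key: String): Option[String] =
    props.collectFirst { case (k, v) if k.equalsIgnoreCase(key) => v }

  // Mirrors the restoreDataSourceTable behavior quoted above: move the path
  // property into locationUri and drop it from the storage properties.
  def restore(storage: StorageFormat): StorageFormat = {
    val tableLocation = caseInsensitiveGet(storage.properties, "path")
    StorageFormat(
      properties = storage.properties.filterNot { case (k, _) => k.equalsIgnoreCase("path") },
      locationUri = tableLocation)
  }
}
```

After `restore`, `properties.get("path")` is always empty, which is why the disabled assertion can never pass; asserting on `locationUri` checks the same underlying location.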
[jira] [Commented] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
[ https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321818#comment-16321818 ]

Abhishek Soni commented on SPARK-21179:
---------------------------------------

Thanks [~mwalton_mstr], I was only overriding the "getJDBCType" method prior to this. Overriding "quoteIdentifier" solved the issue.

> Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused
> by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to
> int.
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-21179
>                 URL: https://issues.apache.org/jira/browse/SPARK-21179
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Shell, SQL
>    Affects Versions: 1.6.0, 2.0.0, 2.1.1
>         Environment: OS: Linux
> HDP version 2.5.0.1-60
> Hive version: 1.2.1
> Spark version 2.0.0.2.5.0.1-60
> JDBC: Download the latest Hortonworks JDBC driver
>            Reporter: Matthew Walton
>
> I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive.
> Unfortunately, when I try to query data that resides in an INT column I get
> the following error:
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
> Steps to reproduce:
> 1) On Hive, create a simple table with an INT column and insert some data (I
> used SQuirreL Client with the Hortonworks JDBC driver):
> create table wh2.hivespark (country_id int, country_name string)
> insert into wh2.hivespark values (1, 'USA')
> 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run
> Spark shell.
> 3) Start Spark shell, loading the Hortonworks Hive JDBC driver jar files:
> ./spark-shell --jars /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar
> 4) In Spark shell, load the data from Hive using the JDBC driver:
> val hivespark = spark.read.format("jdbc").options(Map("url" ->
> "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable"
> -> "wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load()
> 5) In Spark shell, try to display the data:
> hivespark.show()
> At this point you should see the error:
> scala> hivespark.show()
> 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
> at > com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown > Source) > at > com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source) > at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown > Source) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(Resu
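The fix Abhishek describes above is a custom JDBC dialect. In Spark this would extend org.apache.spark.sql.jdbc.JdbcDialect and be registered via JdbcDialects.registerDialect; the sketch below uses a hypothetical `JdbcDialectStub` stand-in so it runs without Spark, and only shows the identifier-quoting override (backticks for Hive, so a projected column is not misread as a string literal):

```scala
// Stand-in for Spark's org.apache.spark.sql.jdbc.JdbcDialect (assumption:
// simplified to the two methods relevant here).
trait JdbcDialectStub {
  def canHandle(url: String): Boolean
  // Default Spark behavior: quote identifiers with double quotes.
  def quoteIdentifier(colName: String): String = "\"" + colName + "\""
}

// The workaround from the comment: Hive's SQL dialect quotes identifiers
// with backticks, so SELECT "country_id" would otherwise be interpreted as
// the literal string "country_id" (which then fails INT conversion).
object HiveDialectSketch extends JdbcDialectStub {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")
  override def quoteIdentifier(colName: String): String = "`" + colName + "`"
}
```

In real Spark code you would also override getJDBCType/getCatalystType as needed and register the dialect once before reading, e.g. `JdbcDialects.registerDialect(HiveDialect)`.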
[jira] [Assigned] (SPARK-23038) Update docker/spark-test (JDK/OS)
[ https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23038:
------------------------------------

    Assignee: Apache Spark

> Update docker/spark-test (JDK/OS)
> ---------------------------------
>
>                 Key: SPARK-23038
>                 URL: https://issues.apache.org/jira/browse/SPARK-23038
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.2.1
>            Reporter: Dongjoon Hyun
>            Assignee: Apache Spark
>            Priority: Minor
>
> This issue aims to update the following in `docker/spark-test`:
> - JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
> - Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): the end of life
> of `precise` was April 28, 2017.
[jira] [Commented] (SPARK-23038) Update docker/spark-test (JDK/OS)
[ https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321812#comment-16321812 ]

Apache Spark commented on SPARK-23038:
--------------------------------------

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20230
[jira] [Assigned] (SPARK-23038) Update docker/spark-test (JDK/OS)
[ https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23038:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-23038) Update docker/spark-test (JDK/OS)
[ https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23038:
----------------------------------

    Summary: Update docker/spark-test (JDK/OS)  (was: Update docker/spark-test)
[jira] [Updated] (SPARK-23038) Update docker/spark-test
[ https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23038:
----------------------------------

    Description:
This issue aims to update the following in `docker/spark-test`:
- JDK7 -> JDK8: Spark 2.2+ supports JDK8 only.
- Ubuntu 12.04.5 LTS (precise) -> Ubuntu 16.04.3 LTS (xenial): the end of life of `precise` was April 28, 2017.

  was: This PR updates the JDK version in `docker/spark-test` because we support only JDK8 from Spark 2.2.
[jira] [Updated] (SPARK-23038) Update docker/spark-test
[ https://issues.apache.org/jira/browse/SPARK-23038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-23038:
----------------------------------

    Summary: Update docker/spark-test  (was: Update docker/spark-test to use JDK8)
[jira] [Created] (SPARK-23038) Update docker/spark-test to use JDK8
Dongjoon Hyun created SPARK-23038:
-------------------------------------

             Summary: Update docker/spark-test to use JDK8
                 Key: SPARK-23038
                 URL: https://issues.apache.org/jira/browse/SPARK-23038
             Project: Spark
          Issue Type: Bug
          Components: Tests
    Affects Versions: 2.2.1
            Reporter: Dongjoon Hyun
            Priority: Minor

This PR updates the JDK version in `docker/spark-test` because Spark 2.2+ supports only JDK8.
[jira] [Assigned] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
[ https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23037:
------------------------------------

    Assignee: Apache Spark

> RFormula should not use deprecated OneHotEncoder and should include
> VectorSizeHint in pipeline
> -------------------------------------------------------------------
>
>                 Key: SPARK-23037
>                 URL: https://issues.apache.org/jira/browse/SPARK-23037
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>            Assignee: Apache Spark
[jira] [Assigned] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
[ https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23037:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
[ https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321740#comment-16321740 ]

Apache Spark commented on SPARK-23037:
--------------------------------------

User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/20229
[jira] [Created] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
Bago Amirbekian created SPARK-23037:
---------------------------------------

             Summary: RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
                 Key: SPARK-23037
                 URL: https://issues.apache.org/jira/browse/SPARK-23037
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.0
            Reporter: Bago Amirbekian
[jira] [Commented] (SPARK-22921) Merge script should prompt for assigning jiras
[ https://issues.apache.org/jira/browse/SPARK-22921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321669#comment-16321669 ]

Saisai Shao commented on SPARK-22921:
-------------------------------------

Hi [~irashid], it looks like the changes will throw an exception when the assignee is not yet a contributor. Please see the stack:

{code}
Traceback (most recent call last):
  File "./dev/merge_spark_pr.py", line 501, in <module>
    main()
  File "./dev/merge_spark_pr.py", line 487, in main
    resolve_jira_issues(title, merged_refs, jira_comment)
  File "./dev/merge_spark_pr.py", line 327, in resolve_jira_issues
    resolve_jira_issue(merge_branches, comment, jira_id)
  File "./dev/merge_spark_pr.py", line 245, in resolve_jira_issue
    cur_assignee = choose_jira_assignee(issue, asf_jira)
  File "./dev/merge_spark_pr.py", line 317, in choose_jira_assignee
    asf_jira.assign_issue(issue.key, assignee.key)
  File "/Library/Python/2.7/site-packages/jira/client.py", line 108, in wrapper
    result = func(*arg_list, **kwargs)
{code}

> Merge script should prompt for assigning jiras
> ----------------------------------------------
>
>                 Key: SPARK-22921
>                 URL: https://issues.apache.org/jira/browse/SPARK-22921
>             Project: Spark
>          Issue Type: Improvement
>          Components: Project Infra
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>            Priority: Trivial
>             Fix For: 2.3.0
>
> It's a bit of a nuisance to have to go into JIRA to assign the issue when you
> merge a PR. In general you assign to either the original reporter or a
> commenter; it would be nice if the merge script made that easy to do.
[jira] [Resolved] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao resolved SPARK-22587.
---------------------------------

    Resolution: Fixed

> Spark job fails if fs.defaultFS and application jar are different url
> ---------------------------------------------------------------------
>
>                 Key: SPARK-22587
>                 URL: https://issues.apache.org/jira/browse/SPARK-22587
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Submit
>    Affects Versions: 1.6.3
>            Reporter: Prabhu Joseph
>            Assignee: Mingjie Tang
>
> A Spark job fails if fs.defaultFS and the URL where the application jar
> resides are different but share the same scheme:
> spark-submit --conf spark.master=yarn-cluster wasb://XXX/tmp/test.py
> In core-site.xml, fs.defaultFS is set to wasb:///YYY. A Hadoop listing
> (hadoop fs -ls) works for both URLs XXX and YYY.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS:
> wasb://XXX/tmp/test.py, expected: wasb://YYY
>   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:665)
>   at org.apache.hadoop.fs.azure.NativeAzureFileSystem.checkPath(NativeAzureFileSystem.java:1251)
>   at org.apache.hadoop.fs.FileSystem.makeQualified(FileSystem.java:485)
>   at org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:396)
>   at org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:507)
>   at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:660)
>   at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:912)
>   at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:172)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:1248)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:1307)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:751)
>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
> Client.copyFileToRemote tries to resolve the path of the application jar
> (XXX) from the FileSystem object created with the fs.defaultFS URL (YYY)
> instead of the actual URL of the application jar:
> val destFs = destDir.getFileSystem(hadoopConf)
> val srcFs = srcPath.getFileSystem(hadoopConf)
> getFileSystem creates the filesystem based on the URL of the path, so this is
> fine. But the lines below try to qualify srcPath (the XXX URL) against destFs
> (the YYY URL), and so it fails:
> var destPath = srcPath
> val qualifiedDestPath = destFs.makeQualified(destPath)
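The mismatch described in the issue can be illustrated without Hadoop: a path can only be qualified against a filesystem whose URI has the same scheme and authority. The `sameFileSystem` helper below is hypothetical, a sketch of the check Hadoop's FileSystem.checkPath performs, not real Hadoop/Spark API:

```scala
import java.net.URI

object FsQualifySketch {
  // Hypothetical helper: a path with no scheme is relative to any default FS;
  // otherwise scheme and authority must match, which is roughly the condition
  // FileSystem.checkPath enforces before makeQualified succeeds.
  def sameFileSystem(defaultFs: URI, path: URI): Boolean =
    path.getScheme == null ||
      (path.getScheme == defaultFs.getScheme &&
        (path.getAuthority == null || path.getAuthority == defaultFs.getAuthority))
}
```

With fs.defaultFS pointing at wasb://YYY, the jar path wasb://XXX/... shares the scheme but not the authority, so qualifying it against destFs throws "Wrong FS"; qualifying it against srcFs (the FileSystem derived from the path's own URL) is the fix direction the report suggests.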
[jira] [Updated] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao updated SPARK-22587:
--------------------------------

    Fix Version/s: 2.3.0
[jira] [Assigned] (SPARK-22587) Spark job fails if fs.defaultFS and application jar are different url
[ https://issues.apache.org/jira/browse/SPARK-22587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Saisai Shao reassigned SPARK-22587:
-----------------------------------

    Assignee: Mingjie Tang
[jira] [Commented] (SPARK-23027) optimizer a simple query using a non-existent data is too slow
[ https://issues.apache.org/jira/browse/SPARK-23027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321645#comment-16321645 ]

wangminfeng commented on SPARK-23027:
-------------------------------------

I read the doc you gave; it looks useful. I will try Spark 2.1 and then report back on the performance.

> optimizer a simple query using a non-existent data is too slow
> --------------------------------------------------------------
>
>                 Key: SPARK-23027
>                 URL: https://issues.apache.org/jira/browse/SPARK-23027
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer
>    Affects Versions: 2.0.1
>            Reporter: wangminfeng
>
> When I use Spark SQL for ad-hoc queries, I have data partitioned by
> event_day, event_minute, event_hour. The data is large enough: one day is
> about 3 TB, and we keep 3 months of data.
> But when a query uses a non-existent day, getting the optimized plan is too
> slow. Using "sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan"
> to get the optimized plan, I cannot get it even after five minutes. The query
> is simple enough, like:
> SELECT
>   event_day
> FROM db.table t1
> WHERE (t1.event_day='20170104' and t1.event_hour='23' and
>   t1.event_minute='55')
> LIMIT 1
[jira] [Assigned] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23036:
------------------------------------

    Assignee: Apache Spark

> Add withGlobalTempView for testing and correct some improper with view
> related method usage
> ----------------------------------------------------------------------
>
>                 Key: SPARK-23036
>                 URL: https://issues.apache.org/jira/browse/SPARK-23036
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: xubo245
>            Assignee: Apache Spark
>
> Add withGlobalTempView for use when creating global temp views, like
> withTempView and withView, and correct some improper usage like the
> following, where the cleanup drops "v1" as a temp view and "v2" as a global
> temp view, the reverse of how they were created:
> {code:java}
> test("list global temp views") {
>   try {
>     sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
>     sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")
>     checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
>       Row(globalTempDB, "v1", true) ::
>       Row("", "v2", true) :: Nil)
>     assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == Seq("v1", "v2"))
>   } finally {
>     spark.catalog.dropTempView("v1")
>     spark.catalog.dropGlobalTempView("v2")
>   }
> }
> {code}
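The proposed withGlobalTempView helper is a loan pattern: run the test body, then drop the named views in a finally block so cleanup happens even when the body throws. The sketch below is self-contained and hypothetical; `catalog` is a mutable set standing in for the session catalog, where Spark's real helper would call spark.catalog.dropGlobalTempView:

```scala
object WithViewSketch {
  // Stand-in for the session catalog's set of global temp views.
  val catalog = scala.collection.mutable.Set.empty[String]

  def createGlobalTempView(name: String): Unit = catalog += name
  def dropGlobalTempView(name: String): Unit = catalog -= name

  // Loan pattern: views listed here are dropped even if `body` throws,
  // mirroring the existing withTempView / withView test helpers.
  def withGlobalTempView(names: String*)(body: => Unit): Unit =
    try body
    finally names.foreach(dropGlobalTempView)
}
```

Using the right drop method per view kind inside one such helper also removes the kind-mismatch bug shown in the quoted test, where a temp view and a global temp view are dropped with each other's API.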
[jira] [Assigned] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-23036:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321638#comment-16321638 ] Apache Spark commented on SPARK-23036: -- User 'xubo245' has created a pull request for this issue: https://github.com/apache/spark/pull/20228 > Add withGlobalTempView for testing and correct some improper with view > related method usage > --- > > Key: SPARK-23036 > URL: https://issues.apache.org/jira/browse/SPARK-23036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 > > Add withGlobalTempView when create global temp view, like withTempView and > withView. > And correct some improper usage like: > {code:java} > test("list global temp views") { > try { > sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4") > sql("CREATE TEMP VIEW v2 AS SELECT 1, 2") > checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"), > Row(globalTempDB, "v1", true) :: > Row("", "v2", true) :: Nil) > > assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == > Seq("v1", "v2")) > } finally { > spark.catalog.dropTempView("v1") > spark.catalog.dropGlobalTempView("v2") > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23036) Add withGlobalTempView for testing and correct some improper with view related method usage
[ https://issues.apache.org/jira/browse/SPARK-23036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xubo245 updated SPARK-23036: Summary: Add withGlobalTempView for testing and correct some improper with view related method usage (was: Add withGlobalTempView for testing) > Add withGlobalTempView for testing and correct some improper with view > related method usage > --- > > Key: SPARK-23036 > URL: https://issues.apache.org/jira/browse/SPARK-23036 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 > > Add withGlobalTempView when create global temp view, like withTempView and > withView. > And correct some improper usage like: > {code:java} > test("list global temp views") { > try { > sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4") > sql("CREATE TEMP VIEW v2 AS SELECT 1, 2") > checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"), > Row(globalTempDB, "v1", true) :: > Row("", "v2", true) :: Nil) > > assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == > Seq("v1", "v2")) > } finally { > spark.catalog.dropTempView("v1") > spark.catalog.dropGlobalTempView("v2") > } > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23036) Add withGlobalTempView for testing
xubo245 created SPARK-23036: --- Summary: Add withGlobalTempView for testing Key: SPARK-23036 URL: https://issues.apache.org/jira/browse/SPARK-23036 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: xubo245 Add a withGlobalTempView helper for tests that create global temp views, analogous to the existing withTempView and withView helpers, and correct some improper usage, such as the following test, which swaps its cleanup calls (dropTempView on the global view v1 and dropGlobalTempView on the local view v2):
{code:java}
test("list global temp views") {
  try {
    sql("CREATE GLOBAL TEMP VIEW v1 AS SELECT 3, 4")
    sql("CREATE TEMP VIEW v2 AS SELECT 1, 2")
    checkAnswer(sql(s"SHOW TABLES IN $globalTempDB"),
      Row(globalTempDB, "v1", true) ::
      Row("", "v2", true) :: Nil)
    assert(spark.catalog.listTables(globalTempDB).collect().toSeq.map(_.name) == Seq("v1", "v2"))
  } finally {
    spark.catalog.dropTempView("v1")
    spark.catalog.dropGlobalTempView("v2")
  }
}
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
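A helper along those lines might look like the following sketch (the real method eventually added to Spark's test utilities may differ in name and placement; `spark` is assumed to be the test suite's SparkSession):
{code:java}
// Sketch of a withGlobalTempView test helper, modeled on the existing
// withTempView/withView utilities. Cleanup always uses the global-view
// variant, avoiding the swapped drop calls shown in the test above.
protected def withGlobalTempView(viewNames: String*)(f: => Unit): Unit = {
  try f finally {
    viewNames.foreach(spark.catalog.dropGlobalTempView)
  }
}
{code}
A test would then wrap its body as withGlobalTempView("v1") { ... } and no longer need an explicit finally block.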
[jira] [Updated] (SPARK-23024) Spark ui about the contents of the form need to have hidden and show features, when the table records very much.
[ https://issues.apache.org/jira/browse/SPARK-23024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] guoxiaolongzte updated SPARK-23024: --- Attachment: 1.png 2.png > Spark ui about the contents of the form need to have hidden and show > features, when the table records very much. > - > > Key: SPARK-23024 > URL: https://issues.apache.org/jira/browse/SPARK-23024 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 2.3.0 >Reporter: guoxiaolongzte >Priority: Minor > Attachments: 1.png, 2.png > > > The tables on the Spark UI pages need collapse and expand controls for when a table has very many rows. Sometimes you do not care about a given table's rows and only want to see the contents of the next table, but you have to drag the scroll bar for a long time to reach it. > For example, we currently run about 500 workers, and to see the running applications table I had to scroll past the entire workers table. > To keep the behavior consistent, I modified the Master Page, Worker Page, Job Page, Stage Page, Task Page, Configuration Page, Storage Page, and Pool Page. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23035: Assignee: Apache Spark > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > -- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: Apache Spark > > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > {code:java} > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > {code} > other test cases also have this warning > Another problem, it throw TempTableAlreadyExistsException and output > "Temporary table '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321553#comment-16321553 ] Apache Spark commented on SPARK-23035: -- User 'xubo245' has created a pull request for this issue: https://github.com/apache/spark/pull/20227 > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > -- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 > > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > {code:java} > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > {code} > other test cases also have this warning > Another problem, it throw TempTableAlreadyExistsException and output > "Temporary table '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view
[ https://issues.apache.org/jira/browse/SPARK-23035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23035: Assignee: (was: Apache Spark) > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > -- > > Key: SPARK-23035 > URL: https://issues.apache.org/jira/browse/SPARK-23035 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 > > Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use > TempViewAlreadyExistsException when create temp view > There are warning when run test: test("rename temporary view - destination > table with database name") > {code:java} > 02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE > TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW > ... USING ... instead > {code} > other test cases also have this warning > Another problem, it throw TempTableAlreadyExistsException and output > "Temporary table '$table' already exists" when we create temp view by using > org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it's > improper. > {code:java} > /** >* Creates a global temp view, or issue an exception if the view already > exists and >* `overrideIfExists` is false. >*/ > def create( > name: String, > viewDefinition: LogicalPlan, > overrideIfExists: Boolean): Unit = synchronized { > if (!overrideIfExists && viewDefinitions.contains(name)) { > throw new TempTableAlreadyExistsException(name) > } > viewDefinitions.put(name, viewDefinition) > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23035) Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view
xubo245 created SPARK-23035: --- Summary: Fix warning: TEMPORARY TABLE ... USING ... is deprecated and use TempViewAlreadyExistsException when create temp view Key: SPARK-23035 URL: https://issues.apache.org/jira/browse/SPARK-23035 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.1 Reporter: xubo245 Fix the "TEMPORARY TABLE ... USING ... is deprecated" warning, and throw TempViewAlreadyExistsException when creating a temp view that already exists. The warning appears when running test("rename temporary view - destination table with database name"):
{code:java}
02:11:38.136 WARN org.apache.spark.sql.execution.SparkSqlAstBuilder: CREATE TEMPORARY TABLE ... USING ... is deprecated, please use CREATE TEMPORARY VIEW ... USING ... instead
{code}
Other test cases also emit this warning. A second problem: when we create a temp view via org.apache.spark.sql.catalyst.catalog.GlobalTempViewManager#create, it throws TempTableAlreadyExistsException with the message "Temporary table '$table' already exists", which is misleading for a view:
{code:java}
/**
 * Creates a global temp view, or issues an exception if the view already exists and
 * `overrideIfExists` is false.
 */
def create(
    name: String,
    viewDefinition: LogicalPlan,
    overrideIfExists: Boolean): Unit = synchronized {
  if (!overrideIfExists && viewDefinitions.contains(name)) {
    throw new TempTableAlreadyExistsException(name)
  }
  viewDefinitions.put(name, viewDefinition)
}
{code}
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
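The fix for the second problem could be as simple as switching the exception type in GlobalTempViewManager#create. In this sketch, TempViewAlreadyExistsException is the name proposed in the issue title, not an existing Spark class:
{code:java}
// Sketch: raise a view-specific exception so the error message reads
// "Temporary view ... already exists" rather than "Temporary table".
def create(
    name: String,
    viewDefinition: LogicalPlan,
    overrideIfExists: Boolean): Unit = synchronized {
  if (!overrideIfExists && viewDefinitions.contains(name)) {
    throw new TempViewAlreadyExistsException(name)  // proposed exception type
  }
  viewDefinitions.put(name, viewDefinition)
}
{code}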
[jira] [Commented] (SPARK-23034) Display tablename for `HiveTableScan` node in UI
[ https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321529#comment-16321529 ] Tejas Patil commented on SPARK-23034: - [~dongjoon] recommended extending the scope of this JIRA to also handle scans that go through codegen. As pointed out in the JIRA description, for Spark native tables the table scan node is abstracted out as a `WholeStageCodegen` node in the DAG. A likely objection: a codegen node might be doing more things besides the table scan, so it would be misleading to label it `tablescan-tablename`, unlike `HiveTableScan`, which we know does nothing besides the table scan. We would like to get feedback from the community about this. > Display tablename for `HiveTableScan` node in UI > > > Key: SPARK-23034 > URL: https://issues.apache.org/jira/browse/SPARK-23034 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.2.1 >Reporter: Tejas Patil >Priority: Trivial > > For queries which scan multiple tables, it will be convenient if the DAG > shown in Spark UI also showed which table is being scanned. This will make > debugging easier. For this JIRA, I am scoping those for hive table scans > only. For Spark native tables, the table scan node is abstracted out as a > `WholeStageCodegen` node (which might be doing more things besides table > scan). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23027) optimizer a simple query using a non-existent data is too slow
[ https://issues.apache.org/jira/browse/SPARK-23027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321487#comment-16321487 ] Takeshi Yamamuro commented on SPARK-23027: -- Have you tried v2.1? https://databricks.com/blog/2016/12/15/scalable-partition-handling-for-cloud-native-architecture-in-apache-spark-2-1.html Either way, please ask questions like this on the spark-user mailing list first. > optimizer a simple query using a non-existent data is too slow > -- > > Key: SPARK-23027 > URL: https://issues.apache.org/jira/browse/SPARK-23027 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.1 >Reporter: wangminfeng > > When I use Spark SQL for ad-hoc queries, my data is partitioned by > event_day, event_minute, and event_hour. The data is large enough: one day is > about 3T, and we keep 3 months of data. > But when a query uses a non-existent day, getting the optimized plan is too slow. > Using "sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan", I could not > obtain the optimized plan within five minutes. The query is simple enough, > like: > SELECT > event_day > FROM db.table t1 > WHERE (t1.event_day='20170104' and t1.event_hour='23' and > t1.event_minute='55') > LIMIT 1 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
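For reference, the slow step the reporter describes can be reproduced through public API without reaching into sessionState directly (table and partition values are the reporter's; this is a sketch, not a verified benchmark):
{code:java}
// Build the query over a partition value that does not exist, then ask
// for the optimized plan -- the step reported to take over five minutes.
val df = spark.sql(
  """SELECT event_day
    |FROM db.table t1
    |WHERE (t1.event_day = '20170104' AND t1.event_hour = '23' AND
    |       t1.event_minute = '55')
    |LIMIT 1""".stripMargin)
// Equivalent to the quoted executePlan(...).optimizedPlan call.
val optimized = df.queryExecution.optimizedPlan
println(optimized)
{code}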
[jira] [Assigned] (SPARK-23034) Display tablename for `HiveTableScan` node in UI
[ https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23034: Assignee: (was: Apache Spark) > Display tablename for `HiveTableScan` node in UI > > > Key: SPARK-23034 > URL: https://issues.apache.org/jira/browse/SPARK-23034 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.2.1 >Reporter: Tejas Patil >Priority: Trivial > > For queries which scan multiple tables, it will be convenient if the DAG > shown in Spark UI also showed which table is being scanned. This will make > debugging easier. For this JIRA, I am scoping those for hive table scans > only. For Spark native tables, the table scan node is abstracted out as a > `WholeStageCodegen` node (which might be doing more things besides table > scan). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23034) Display tablename for `HiveTableScan` node in UI
[ https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23034: Assignee: Apache Spark > Display tablename for `HiveTableScan` node in UI > > > Key: SPARK-23034 > URL: https://issues.apache.org/jira/browse/SPARK-23034 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.2.1 >Reporter: Tejas Patil >Assignee: Apache Spark >Priority: Trivial > > For queries which scan multiple tables, it will be convenient if the DAG > shown in Spark UI also showed which table is being scanned. This will make > debugging easier. For this JIRA, I am scoping those for hive table scans > only. For Spark native tables, the table scan node is abstracted out as a > `WholeStageCodegen` node (which might be doing more things besides table > scan). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23034) Display tablename for `HiveTableScan` node in UI
[ https://issues.apache.org/jira/browse/SPARK-23034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321419#comment-16321419 ] Apache Spark commented on SPARK-23034: -- User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/20226 > Display tablename for `HiveTableScan` node in UI > > > Key: SPARK-23034 > URL: https://issues.apache.org/jira/browse/SPARK-23034 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 2.2.1 >Reporter: Tejas Patil >Priority: Trivial > > For queries which scan multiple tables, it will be convenient if the DAG > shown in Spark UI also showed which table is being scanned. This will make > debugging easier. For this JIRA, I am scoping those for hive table scans > only. For Spark native tables, the table scan node is abstracted out as a > `WholeStageCodegen` node (which might be doing more things besides table > scan). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23034) Display tablename for `HiveTableScan` node in UI
Tejas Patil created SPARK-23034: --- Summary: Display tablename for `HiveTableScan` node in UI Key: SPARK-23034 URL: https://issues.apache.org/jira/browse/SPARK-23034 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 2.2.1 Reporter: Tejas Patil Priority: Trivial For queries which scan multiple tables, it will be convenient if the DAG shown in Spark UI also showed which table is being scanned. This will make debugging easier. For this JIRA, I am scoping those for hive table scans only. For Spark native tables, the table scan node is abstracted out as a `WholeStageCodegen` node (which might be doing more things besides table scan). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
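The gist of the proposed change can be illustrated with a toy node; the names below are illustrative, and the real patch would override the node name inside Spark's HiveTableScan operator:
{code:java}
// Toy illustration: give a scan node a table-specific label for the
// Spark UI DAG, instead of just the operator's class name.
trait PlanNode { def nodeName: String = getClass.getSimpleName }

case class HiveTableScan(tableName: String) extends PlanNode {
  // Label shown in the UI, e.g. "Scan hive db.sales"
  override def nodeName: String = s"Scan hive $tableName"
}
{code}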
[jira] [Assigned] (SPARK-23033) disable task-level retry for continuous execution
[ https://issues.apache.org/jira/browse/SPARK-23033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23033: Assignee: Apache Spark > disable task-level retry for continuous execution > - > > Key: SPARK-23033 > URL: https://issues.apache.org/jira/browse/SPARK-23033 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres >Assignee: Apache Spark > > Continuous execution tasks shouldn't retry independently, but rather use the > global restart mechanism. Retrying will execute from the initial offset at > the time the task was created, which is very bad. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec
[ https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kris Mok updated SPARK-23032: - Description: Proposing to add a per-query ID to the codegen stages as represented by {{WholeStageCodegenExec}} operators. This ID will be used in * the explain output of the physical plan, and in * the generated class name. Specifically, this ID will be stable within a query, counting up from 1 in depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a plan. The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} objects, which may have been created for one-off purposes, e.g. for fallback handling of codegen stages that failed to codegen the whole stage and wishes to codegen a subset of the children operators. Example: for the following query: {code:none} scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1) scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as 'y).orderBy('x).select('x + 1 as 'z, 'y) df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint] scala> val df2 = spark.range(5) df2: org.apache.spark.sql.Dataset[Long] = [id: bigint] scala> val query = df1.join(df2, 'z === 'id) query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 1 more field] {code} The explain output before the change is: {code:none} scala> query.explain == Physical Plan == *SortMergeJoin [z#9L], [id#13L], Inner :- *Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *Project [(x#3L + 1) AS z#9L, y#4L] :+- *Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *Range (0, 10, step=1, splits=8) +- *Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *Range (0, 5, step=1, splits=8) {code} Note how codegen'd operators are annotated with a prefix {{"*"}}. 
and after this change it'll be: {code:none} scala> query.explain == Physical Plan == *(6) SortMergeJoin [z#9L], [id#13L], Inner :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(z#9L, 200) : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L] :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0 : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) : +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] : +- *(1) Range (0, 10, step=1, splits=8) +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#13L, 200) +- *(4) Range (0, 5, step=1, splits=8) {code} Note that the annotated prefix becomes {{"*(id) "}} It'll also show up in the name of the generated class, as a suffix in the format of {code:none} GeneratedClass$GeneratedIterator$id {code} for example, note how {{GeneratedClass$GeneratedIteratorForCodegenStage3}} and {{GeneratedClass$GeneratedIteratorForCodegenStage6}} in the following stack trace corresponds to the IDs shown in the explain output above: {code:none} "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 nid=NA runnable java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.sort_addToSorter$(generated.java:32) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(generated.java:41) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.findNextInnerJoinRows$(generated.java:42) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(generated.java:101) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:513) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:828) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324) at org.apac
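The numbering scheme described above (IDs counting up from 1 in depth-first post-order over the codegen stages) can be sketched independently of Spark's planner; node names here are illustrative:
{code:java}
// Assign 1-based IDs to codegen nodes in depth-first post-order, so
// children are numbered before their parents -- matching the explain
// output above, where the leaf Range is *(1) and the top join is *(6).
sealed trait Node { def children: Seq[Node] }
case class CodegenStage(children: Seq[Node], var id: Int = 0) extends Node
case class Exchange(children: Seq[Node]) extends Node  // codegen boundary

def assignIds(root: Node): Unit = {
  var next = 0
  def visit(n: Node): Unit = {
    n.children.foreach(visit)
    n match {
      case c: CodegenStage => next += 1; c.id = next
      case _ =>
    }
  }
  visit(root)
}
{code}
Free-floating stages created outside this walk would simply keep the default ID 0, as the description reserves.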
[jira] [Assigned] (SPARK-23033) disable task-level retry for continuous execution
[ https://issues.apache.org/jira/browse/SPARK-23033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23033: Assignee: (was: Apache Spark) > disable task-level retry for continuous execution > - > > Key: SPARK-23033 > URL: https://issues.apache.org/jira/browse/SPARK-23033 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > > Continuous execution tasks shouldn't retry independently, but rather use the > global restart mechanism. Retrying will execute from the initial offset at > the time the task was created, which is very bad. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23033) disable task-level retry for continuous execution
[ https://issues.apache.org/jira/browse/SPARK-23033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321404#comment-16321404 ] Apache Spark commented on SPARK-23033: -- User 'jose-torres' has created a pull request for this issue: https://github.com/apache/spark/pull/20225 > disable task-level retry for continuous execution > - > > Key: SPARK-23033 > URL: https://issues.apache.org/jira/browse/SPARK-23033 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Jose Torres > > Continuous execution tasks shouldn't retry independently, but rather use the > global restart mechanism. Retrying will execute from the initial offset at > the time the task was created, which is very bad. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23033) disable task-level retry for continuous execution
Jose Torres created SPARK-23033: --- Summary: disable task-level retry for continuous execution Key: SPARK-23033 URL: https://issues.apache.org/jira/browse/SPARK-23033 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Jose Torres Continuous execution tasks shouldn't retry independently, but rather use the global restart mechanism. Retrying will execute from the initial offset at the time the task was created, which is very bad. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
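One way to enforce this is to fail fast when a continuous-processing task is a retried attempt; the sketch below uses the real TaskContext API but a hypothetical helper and error message, and the actual patch may hook in at a different layer:
{code:java}
import org.apache.spark.{SparkException, TaskContext}

// Sketch: refuse to run a retried task attempt inside continuous
// execution, so recovery goes through the global restart mechanism
// instead of replaying from the task's stale initial offset.
def assertFirstAttempt(): Unit = {
  val ctx = TaskContext.get()
  if (ctx != null && ctx.attemptNumber() != 0) {
    throw new SparkException(
      "Continuous execution does not support task-level retry; " +
        "restart the query to recover.")
  }
}
{code}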
[jira] [Commented] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec
[ https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321398#comment-16321398 ] Apache Spark commented on SPARK-23032: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/20224 > Add a per-query codegenStageId to WholeStageCodegenExec > --- > > Key: SPARK-23032 > URL: https://issues.apache.org/jira/browse/SPARK-23032 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Kris Mok > > Proposing to add a per-query ID to the codegen stages as represented by > {{WholeStageCodegenExec}} operators. This ID will be used in > * the explain output of the physical plan, and in > * the generated class name. > Specifically, this ID will be stable within a query, counting up from 1 in > depth-first post-order for all the {{WholeStageCodegenExec}} inserted into a > plan. > The ID value 0 is reserved for "free-floating" {{WholeStageCodegenExec}} > objects, which may have been created for one-off purposes, e.g. for fallback > handling of codegen stages that failed to codegen the whole stage and wishes > to codegen a subset of the children operators. > Example: for the following query: > {code:none} > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1) > scala> val df1 = spark.range(10).select('id as 'x, 'id + 1 as > 'y).orderBy('x).select('x + 1 as 'z, 'y) > df1: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint] > scala> val df2 = spark.range(5) > df2: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> val query = df1.join(df2, 'z === 'id) > query: org.apache.spark.sql.DataFrame = [z: bigint, y: bigint ... 
1 more > field] > {code} > The explain output before the change is: > {code:none} > scala> query.explain > == Physical Plan == > *SortMergeJoin [z#9L], [id#13L], Inner > :- *Sort [z#9L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(z#9L, 200) > : +- *Project [(x#3L + 1) AS z#9L, y#4L] > :+- *Sort [x#3L ASC NULLS FIRST], true, 0 > : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) > : +- *Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] > : +- *Range (0, 10, step=1, splits=8) > +- *Sort [id#13L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(id#13L, 200) > +- *Range (0, 5, step=1, splits=8) > {code} > Note how codegen'd operators are annotated with a prefix {{"*"}}. > and after this change it'll be: > {code:none} > scala> query.explain > == Physical Plan == > *(6) SortMergeJoin [z#9L], [id#13L], Inner > :- *(3) Sort [z#9L ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(z#9L, 200) > : +- *(2) Project [(x#3L + 1) AS z#9L, y#4L] > :+- *(2) Sort [x#3L ASC NULLS FIRST], true, 0 > : +- Exchange rangepartitioning(x#3L ASC NULLS FIRST, 200) > : +- *(1) Project [id#0L AS x#3L, (id#0L + 1) AS y#4L] > : +- *(1) Range (0, 10, step=1, splits=8) > +- *(5) Sort [id#13L ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(id#13L, 200) > +- *(4) Range (0, 5, step=1, splits=8) > {code} > Note that the annotated prefix becomes {{"*(id) "}} > It'll also show up in the name of the generated class, as a suffix in the > format of > {code:none} > GeneratedClass$GeneratedIterator$id > {code} > for example, note how {{GeneratedClass$GeneratedIterator$3}} and > {{GeneratedClass$GeneratedIterator$6}} in the following stack trace > corresponds to the IDs shown in the explain output above: > {code:none} > "Executor task launch worker for task 424@12957" daemon prio=5 tid=0x58 > nid=NA runnable > java.lang.Thread.State: RUNNABLE > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.sort_addToSorter$(generated.java:32) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$3.processNext(generated.java:41) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$9$$anon$1.hasNext(WholeStageCodegenExec.scala:494) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.findNextInnerJoinRows$(generated.java:42) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator$6.processNext(generated.java:101) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$2.hasNext(WholeStageCodegenExec.scala:
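The depth-first post-order numbering described in the issue can be sketched outside of Spark. The following is a hypothetical illustration only, not Spark's implementation: codegen stages are numbered children-first starting from 1, non-codegen operators (e.g. Exchange) are skipped, and 0 marks free-floating stages.

```python
class Node:
    """A toy plan node; is_codegen=False models operators like Exchange."""
    def __init__(self, name, children=(), is_codegen=True):
        self.name = name
        self.children = list(children)
        self.is_codegen = is_codegen
        self.stage_id = 0  # 0 is reserved for free-floating / unnumbered stages

def assign_stage_ids(root):
    """Number codegen stages 1, 2, ... in depth-first post-order."""
    counter = [0]
    def visit(node):
        for child in node.children:   # visit children before the parent
            visit(child)
        if node.is_codegen:
            counter[0] += 1
            node.stage_id = counter[0]
    visit(root)
    return root
```

Applied to the example plan above (join of two sorted sides), this numbering yields 1-3 for the left subtree, 4-5 for the right, and 6 for the SortMergeJoin root, matching the `*(id)` annotations in the post-change explain output.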
[jira] [Assigned] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec
[ https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23032: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec
[ https://issues.apache.org/jira/browse/SPARK-23032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23032: Assignee: Apache Spark
[jira] [Created] (SPARK-23032) Add a per-query codegenStageId to WholeStageCodegenExec
Kris Mok created SPARK-23032: Summary: Add a per-query codegenStageId to WholeStageCodegenExec Key: SPARK-23032 URL: https://issues.apache.org/jira/browse/SPARK-23032 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Kris Mok
[jira] [Resolved] (SPARK-22989) sparkstreaming ui show 0 records when spark-streaming-kafka application restore from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-22989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-22989. -- Resolution: Duplicate > sparkstreaming ui show 0 records when spark-streaming-kafka application > restore from checkpoint > > > Key: SPARK-22989 > URL: https://issues.apache.org/jira/browse/SPARK-22989 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.0 >Reporter: zhaoshijie >Priority: Minor > > when a spark-streaming-kafka application restore from checkpoint , I find > spark-streaming ui Each batch records is 0. > !https://raw.githubusercontent.com/smdfj/picture/master/spark/batch.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22991) High read latency with spark streaming 2.2.1 and kafka 0.10.0.1
[ https://issues.apache.org/jira/browse/SPARK-22991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22991: - Component/s: (was: Structured Streaming) (was: Spark Core) DStreams > High read latency with spark streaming 2.2.1 and kafka 0.10.0.1 > --- > > Key: SPARK-22991 > URL: https://issues.apache.org/jira/browse/SPARK-22991 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.2.1 >Reporter: Kiran Shivappa Japannavar >Priority: Critical > > Spark 2.2.1 + Kafka 0.10 + Spark streaming. > Batch duration is 1s, Max rate per partition is 500, poll interval is 120 > seconds, max poll records is 500 and no of partitions in Kafka is 500, > enabled cache consumer. > While trying to read data from Kafka we are observing very high read > latencies intermittently.The high latencies results in Kafka consumer session > expiration and hence the Kafka brokers removes the consumer from the group. > The consumer keeps retrying and finally fails with the > [org.apache.kafka.clients.NetworkClient] - Disconnecting from node 12 due to > request timeout > [org.apache.kafka.clients.NetworkClient] - Cancelled request ClientRequest > [org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient] - > Cancelled FETCH request ClientRequest.** > Due to this a lot of batches are in the queued state. > The high read latencies are occurring whenever multiple clients are > parallelly trying to read the data from the same Kafka cluster. The Kafka > cluster is having a large number of brokers and can support high network > bandwidth. > When running with spark 1.5 and Kafka 0.8 consumer client against the same > Kafka cluster we are not seeing any read latencies. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22975) MetricsReporter producing NullPointerException when there was no progress reported
[ https://issues.apache.org/jira/browse/SPARK-22975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-22975: - Component/s: (was: SQL) Structured Streaming > MetricsReporter producing NullPointerException when there was no progress > reported > -- > > Key: SPARK-22975 > URL: https://issues.apache.org/jira/browse/SPARK-22975 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Yuriy Bondaruk > > The exception occurs in MetricsReporter when it tries to register gauges > using lastProgress of each stream. > {code:java} > registerGauge("inputRate-total", () => > stream.lastProgress.inputRowsPerSecond) > registerGauge("processingRate-total", () => > stream.lastProgress.inputRowsPerSecond) > registerGauge("latency", () => > stream.lastProgress.durationMs.get("triggerExecution").longValue()) > {code} > In case if a stream doesn't have any progress reported than following > exception occurs: > {noformat} > 18/01/05 17:45:57 ERROR ScheduledReporter: RuntimeException thrown from > CloudwatchReporter#report. Exception was suppressed. 
> java.lang.NullPointerException > at > org.apache.spark.sql.execution.streaming.MetricsReporter$$anonfun$1.apply$mcD$sp(MetricsReporter.scala:42) > at > org.apache.spark.sql.execution.streaming.MetricsReporter$$anonfun$1.apply(MetricsReporter.scala:42) > at > org.apache.spark.sql.execution.streaming.MetricsReporter$$anonfun$1.apply(MetricsReporter.scala:42) > at > org.apache.spark.sql.execution.streaming.MetricsReporter$$anon$1.getValue(MetricsReporter.scala:49) > at > amazon.nexus.spark.metrics.cloudwatch.CloudwatchReporter.lambda$createNumericGaugeMetricDatumStream$0(CloudwatchReporter.java:146) > at > java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:174) > at > java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet.lambda$entryConsumer$0(Collections.java:1575) > at > java.util.TreeMap$EntrySpliterator.forEachRemaining(TreeMap.java:2969) > at > java.util.Collections$UnmodifiableMap$UnmodifiableEntrySet$UnmodifiableEntrySetSpliterator.forEachRemaining(Collections.java:1600) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at > java.util.stream.StreamSpliterators$WrappingSpliterator.forEachRemaining(StreamSpliterators.java:312) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at > java.util.stream.Streams$ConcatSpliterator.forEachRemaining(Streams.java:742) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at > 
java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:510) > at > amazon.nexus.spark.metrics.cloudwatch.CloudwatchReporter.partitionIntoSublists(CloudwatchReporter.java:390) > at > amazon.nexus.spark.metrics.cloudwatch.CloudwatchReporter.report(CloudwatchReporter.java:137) > at > com.codahale.metrics.ScheduledReporter.report(ScheduledReporter.java:162) > at > com.codahale.metrics.ScheduledReporter$1.run(ScheduledReporter.java:117) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For ad
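The failure mode above can be avoided by guarding the gauge callbacks. A minimal sketch of the idea follows; this is a hypothetical illustration, not the actual Spark patch, and `make_gauge`/`extract` are invented names standing in for the `registerGauge` lambdas over `stream.lastProgress`.

```python
def make_gauge(get_progress, extract, default=0.0):
    """Build a gauge callback that tolerates a missing lastProgress."""
    def gauge():
        progress = get_progress()  # may be None before the first completed trigger
        if progress is None:
            return default         # report a neutral value instead of raising
        return extract(progress)
    return gauge

# Hypothetical stand-in for stream.lastProgress.durationMs["triggerExecution"]
latency = make_gauge(lambda: None,
                     lambda p: p["durationMs"]["triggerExecution"])
```

Before any trigger completes, `latency()` returns the default instead of dereferencing a missing progress object, which is what the NullPointerException in the stack trace amounts to.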
[jira] [Resolved] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-22951. Resolution: Fixed Fix Version/s: 2.3.0 Issue resolved by pull request 20174 [https://github.com/apache/spark/pull/20174] > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis >Assignee: Feng Liu > Labels: correctness > Fix For: 2.3.0 > > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
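For reference, the semantics the fix restores can be shown with a plain-Python sketch of duplicate removal (an illustration only, unrelated to Spark's aggregate-based implementation): deduplicating zero rows must yield zero rows, whereas the reported bug turned the empty input into a one-row result.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each distinct row."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```

With this definition, `len(drop_duplicates([])) == 0`, matching the assertion the minimal Spark application above expects to hold.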
[jira] [Comment Edited] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321234#comment-16321234 ] Fernando Pereira edited comment on SPARK-17998 at 1/10/18 10:20 PM: It says spark.sql.files.maxPartitionBytes in this very Jira ticket. Try yourself. spark.files.maxPartitionBytes doesnt work, at least not for 2.2.1 was (Author: ferdonline): It says spark.sql.files.maxPartitionBytes in this very Jira ticket. Try yourself. spark.files.maxPartitionBytes doesnt work > Reading Parquet files coalesces parts into too few in-memory partitions > --- > > Key: SPARK-17998 > URL: https://issues.apache.org/jira/browse/SPARK-17998 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: Spark Standalone Cluster (not "local mode") > Windows 10 and Windows 7 > Python 3.x >Reporter: Shea Parkes > > Reading a parquet ~file into a DataFrame is resulting in far too few > in-memory partitions. In prior versions of Spark, the resulting DataFrame > would have a number of partitions often equal to the number of parts in the > parquet folder. > Here's a minimal reproducible sample: > {quote} > df_first = session.range(start=1, end=1, numPartitions=13) > assert df_first.rdd.getNumPartitions() == 13 > assert session._sc.defaultParallelism == 6 > path_scrap = r"c:\scratch\scrap.parquet" > df_first.write.parquet(path_scrap) > df_second = session.read.parquet(path_scrap) > print(df_second.rdd.getNumPartitions()) > {quote} > The above shows only 7 partitions in the DataFrame that was created by > reading the Parquet back into memory for me. Why is it no longer just the > number of part files in the Parquet folder? (Which is 13 in the example > above.) > I'm filing this as a bug because it has gotten so bad that we can't work with > the underlying RDD without first repartitioning the DataFrame, which is > costly and wasteful. 
I really doubt this was the intended effect of moving > to Spark 2.0. > I've tried to research where the number of in-memory partitions is > determined, but my Scala skills have proven inadequate. I'd be happy to dig > further if someone could point me in the right direction... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
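The knob under discussion, spark.sql.files.maxPartitionBytes, feeds Spark's file-split sizing. Roughly, per the Spark 2.x file-source code (treat the exact formula here as an approximation): small inputs are packed toward defaultParallelism partitions, while large inputs are capped at maxPartitionBytes per partition.

```python
def max_split_bytes(total_bytes, num_files, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes default
                    open_cost=4 * 1024 * 1024):             # spark.sql.files.openCostInBytes default
    """Approximate target split size used when packing files into read partitions."""
    # Each file is charged an open cost so many tiny files still spread over cores.
    bytes_per_core = (total_bytes + num_files * open_cost) // default_parallelism
    return min(max_partition_bytes, max(open_cost, bytes_per_core))
```

This is why a 13-part Parquet folder can come back as ~7 partitions: with a small total size, the splits are sized to fill roughly defaultParallelism cores rather than one partition per part file.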
[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321234#comment-16321234 ] Fernando Pereira commented on SPARK-17998: -- It says spark.sql.files.maxPartitionBytes in this very Jira ticket. Try yourself. spark.files.maxPartitionBytes doesnt work
[jira] [Created] (SPARK-23031) Merge script should allow arbitrary assignees
Marcelo Vanzin created SPARK-23031: -- Summary: Merge script should allow arbitrary assignees Key: SPARK-23031 URL: https://issues.apache.org/jira/browse/SPARK-23031 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 2.3.0 Reporter: Marcelo Vanzin Priority: Minor SPARK-22921 added a prompt in the merge script to ask for the assignee of the bug being resolved. But it only provides you with a list containing the reporter and commenters (plus the "Apache Spark" bot). Sometimes the author of the PR is not on that list, and right now there doesn't seem to be a way to enter an arbitrary JIRA user name there. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23020: Assignee: Apache Spark > Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > -- > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Apache Spark >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16321087#comment-16321087 ] Apache Spark commented on SPARK-23020: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/20223
[jira] [Assigned] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23020: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow
[ https://issues.apache.org/jira/browse/SPARK-23030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320942#comment-16320942 ] Bryan Cutler commented on SPARK-23030: -- I'm looking into this, will submit a WIP PR if I see an improvement > Decrease memory consumption with toPandas() collection using Arrow > -- > > Key: SPARK-23030 > URL: https://issues.apache.org/jira/browse/SPARK-23030 > Project: Spark > Issue Type: Sub-task > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > Currently with Arrow enabled, calling {{toPandas()}} results in a collection > of all partitions in the JVM in the form of batches of Arrow file format. > Once collected in the JVM, they are served to the Python driver process. > I believe using the Arrow stream format can help to optimize this and reduce > memory consumption in the JVM by only loading one record batch at a time > before sending it to Python. This might also reduce the latency between > making the initial call in Python and receiving the first batch of records. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23030) Decrease memory consumption with toPandas() collection using Arrow
Bryan Cutler created SPARK-23030: Summary: Decrease memory consumption with toPandas() collection using Arrow Key: SPARK-23030 URL: https://issues.apache.org/jira/browse/SPARK-23030 Project: Spark Issue Type: Sub-task Components: PySpark, SQL Affects Versions: 2.3.0 Reporter: Bryan Cutler Currently with Arrow enabled, calling {{toPandas()}} results in a collection of all partitions in the JVM in the form of batches of Arrow file format. Once collected in the JVM, they are served to the Python driver process. I believe using the Arrow stream format can help to optimize this and reduce memory consumption in the JVM by only loading one record batch at a time before sending it to Python. This might also reduce the latency between making the initial call in Python and receiving the first batch of records. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
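The memory argument above can be illustrated outside Spark. A minimal pure-Python sketch (not Spark's actual code; `batches`, `collect_then_serve`, and `stream_batches` are hypothetical names) of the difference between the Arrow *file* format approach (all record batches resident in the JVM at once) and the proposed *stream* format approach (one batch resident at a time):

```python
# Hypothetical sketch, not Spark's implementation: compare peak resident rows
# when collecting every batch before serving vs streaming one batch at a time.

def batches(n_batches, batch_size):
    """Simulate partitions arriving as record batches."""
    for i in range(n_batches):
        yield list(range(i * batch_size, (i + 1) * batch_size))

def collect_then_serve(n_batches, batch_size):
    # File-format style: hold all batches in memory simultaneously.
    all_batches = list(batches(n_batches, batch_size))
    return sum(len(b) for b in all_batches)  # peak rows resident at once

def stream_batches(n_batches, batch_size):
    # Stream-format style: only one batch is resident at a time.
    peak = 0
    for b in batches(n_batches, batch_size):
        peak = max(peak, len(b))  # "send" b to Python, then drop it
    return peak

print(collect_then_serve(10, 1000))  # 10000 rows resident
print(stream_batches(10, 1000))      # 1000 rows resident
```

The streaming variant's peak is one batch regardless of partition count, which is the memory reduction the issue proposes; it may also lower latency to the first batch.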
[jira] [Updated] (SPARK-21187) Complete support for remaining Spark data types in Arrow Converters
[ https://issues.apache.org/jira/browse/SPARK-21187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-21187: - Description: This is to track adding the remaining type support in Arrow Converters. Currently, only primitive data types are supported. Remaining types: * -*Date*- * -*Timestamp*- * *Complex*: Struct, -Array-, Map * -*Decimal*- Some things to do before closing this out: * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write values as BigDecimal)- * Need to add some user docs * -Make sure Python tests are thorough- * Check into complex type support mentioned in comments by [~leif], should we support multi-indexing? was: This is to track adding the remaining type support in Arrow Converters. Currently, only primitive data types are supported. Remaining types: * -*Date*- * -*Timestamp*- * *Complex*: Struct, Array, Map * *Decimal* Some things to do before closing this out: * Look to upgrading to Arrow 0.7 for better Decimal support (can now write values as BigDecimal) * Need to add some user docs * Make sure Python tests are thorough * Check into complex type support mentioned in comments by [~leif], should we support multi-indexing? > Complete support for remaining Spark data types in Arrow Converters > --- > > Key: SPARK-21187 > URL: https://issues.apache.org/jira/browse/SPARK-21187 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 2.3.0 >Reporter: Bryan Cutler > > This is to track adding the remaining type support in Arrow Converters. > Currently, only primitive data types are supported. 
> Remaining types: > * -*Date*- > * -*Timestamp*- > * *Complex*: Struct, -Array-, Map > * -*Decimal*- > Some things to do before closing this out: > * -Look to upgrading to Arrow 0.7 for better Decimal support (can now write > values as BigDecimal)- > * Need to add some user docs > * -Make sure Python tests are thorough- > * Check into complex type support mentioned in comments by [~leif], should we > support multi-indexing?
[jira] [Updated] (SPARK-16060) Vectorized Orc reader
[ https://issues.apache.org/jira/browse/SPARK-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-16060: -- Affects Version/s: 1.6.3 2.0.2 2.1.2 2.2.1 > Vectorized Orc reader > - > > Key: SPARK-16060 > URL: https://issues.apache.org/jira/browse/SPARK-16060 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1 >Reporter: Liang-Chi Hsieh >Assignee: Dongjoon Hyun > Labels: release-notes > Fix For: 2.3.0 > > > Currently Orc reader in Spark SQL doesn't support vectorized reading. As Hive > Orc already support vectorization, we should add this support to improve Orc > reading performance. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-22951: -- Assignee: Feng Liu > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis >Assignee: Feng Liu > Labels: correctness > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
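The invariant the repro's assertions encode can be stated without Spark: deduplicating an empty collection must yield an empty collection, so its count must be 0, never 1. A pure-Python analogue (`drop_duplicates` is an illustrative stand-in, not Spark's operator):

```python
# Pure-Python analogue of the invariant asserted in the repro: dedup of an
# empty input stays empty, so count must be 0 (the buggy Spark path returned 1).
def drop_duplicates(rows):
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

empty = []
assert len(empty) == 0                   # like sql.emptyDataFrame.count == 0
assert len(drop_duplicates(empty)) == 0  # must also be 0, not 1
```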
[jira] [Created] (SPARK-23029) Setting spark.shuffle.file.buffer will make the shuffle fail
Fernando Pereira created SPARK-23029: Summary: Setting spark.shuffle.file.buffer will make the shuffle fail Key: SPARK-23029 URL: https://issues.apache.org/jira/browse/SPARK-23029 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.2.1 Reporter: Fernando Pereira When setting the spark.shuffle.file.buffer setting, even to its default value, shuffles fail. This appears to affect small to medium size partitions. Strangely the error message is OutOfMemoryError, but it works with large partitions (at least >32MB). {code} pyspark --conf "spark.shuffle.file.buffer=$((32*1024))" /gpfs/bbp.cscs.ch/scratch/gss/spykfunc/_sparkenv/lib/python2.7/site-packages/pyspark/bin/spark-submit pyspark-shell-main --name PySparkShell --conf spark.shuffle.file.buffer=32768 version 2.2.1 >>> spark.range(1e7, numPartitions=10).sort("id").write.parquet("a", >>> mode="overwrite") [Stage 1:>(0 + 10) / 10]18/01/10 19:34:21 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 11) java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.<init>(BufferedOutputStream.java:75) at org.apache.spark.storage.DiskBlockObjectWriter$ManualCloseBufferedOutputStream$1.<init>(DiskBlockObjectWriter.scala:107) at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:108) at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:237) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code}
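One plausible reading of this report, worth verifying against the Spark configuration docs: `spark.shuffle.file.buffer` is parsed with a default unit of KiB, so a bare `32768` means 32768 KiB (32 MiB) per writer, not 32 KiB, and `BypassMergeSortShuffleWriter` (visible in the stack trace) opens one buffered writer per reduce partition. The arithmetic, as a hedged sketch:

```python
# Hedged analysis (assumption: a suffix-less spark.shuffle.file.buffer value is
# interpreted in KiB): setting it to 32*1024 allocates a 32 MiB buffer per
# DiskBlockObjectWriter, and BypassMergeSortShuffleWriter opens one writer per
# reduce partition, so the buffers alone can exhaust a small heap.
KIB = 1024

def buffer_bytes(conf_value_no_suffix):
    return conf_value_no_suffix * KIB  # suffix-less value read as KiB

intended = buffer_bytes(32)           # 32 KiB, the likely intent
actual = buffer_bytes(32 * 1024)      # 32 MiB per writer, what Spark sees
reduce_partitions = 10                # numPartitions=10 in the repro
total = actual * reduce_partitions

print(intended)  # 32768 bytes
print(actual)    # 33554432 bytes (32 MiB) per writer
print(total)     # 335544320 bytes (~320 MiB) just for shuffle buffers
```

If this reading is right, writing the value with an explicit suffix (e.g. `32k`) would avoid the 1024x inflation.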
[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-22951: --- Target Version/s: 2.3.0 > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis > Labels: correctness > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22951) count() after dropDuplicates() on emptyDataFrame returns incorrect value
[ https://issues.apache.org/jira/browse/SPARK-22951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-22951: --- Labels: correctness (was: ) > count() after dropDuplicates() on emptyDataFrame returns incorrect value > > > Key: SPARK-22951 > URL: https://issues.apache.org/jira/browse/SPARK-22951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.2.0, 2.3.0 >Reporter: Michael Dreibelbis > Labels: correctness > > here is a minimal Spark Application to reproduce: > {code} > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkConf, SparkContext} > object DropDupesApp extends App { > > override def main(args: Array[String]): Unit = { > val conf = new SparkConf() > .setAppName("test") > .setMaster("local") > val sc = new SparkContext(conf) > val sql = SQLContext.getOrCreate(sc) > assert(sql.emptyDataFrame.count == 0) // expected > assert(sql.emptyDataFrame.dropDuplicates.count == 1) // unexpected > } > > } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23020) Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-23020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320753#comment-16320753 ] Marcelo Vanzin commented on SPARK-23020: I think I found the race in the code, now need to figure out how to fix it... :-/ > Flaky Test: org.apache.spark.launcher.SparkLauncherSuite.testInProcessLauncher > -- > > Key: SPARK-23020 > URL: https://issues.apache.org/jira/browse/SPARK-23020 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Priority: Blocker > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.7/42/testReport/junit/org.apache.spark.launcher/SparkLauncherSuite/testInProcessLauncher/history/ -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23019. Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 2.3.0 > Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD > - > > Key: SPARK-23019 > URL: https://issues.apache.org/jira/browse/SPARK-23019 > Project: Spark > Issue Type: Bug > Components: Java API, Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Gengliang Wang >Priority: Blocker > Fix For: 2.3.0 > > > {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to > multiple spark contexts: > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/ > {code} > Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore > this error, set spark.driver.allowMultipleContexts = true. The currently > running SparkContext was created at: > org.apache.spark.SparkContext.<init>(SparkContext.scala:116) > org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63) > java.lang.Thread.run(Thread.java:748) > {code}
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320667#comment-16320667 ] Sean Owen commented on SPARK-22982: --- Agreed. java.nio should be OK as that was introduced in Java 7. Spark 2.2 requires Java 8, so that much is safe. > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > Fix For: 2.3.0 > > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320664#comment-16320664 ] Josh Rosen commented on SPARK-22982: In theory this affects all 1.6.0+ versions. It's going to be much harder to trigger in 1.6.0 because we shouldn't have as many Janino-induced remote ClassNotFoundExceptions to trigger the error path. We should probably port to other 2.x branches, with one caveat: we need to make sure that my fix isn't relying on JRE 8.x functionality because I think at least some of those older branches still have support for Java 7. > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > Fix For: 2.3.0 > > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
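The descriptor-reuse hazard described in the issue can be demonstrated in a few lines on a POSIX system, where `open()` returns the lowest available descriptor number (a minimal sketch assuming Linux fd semantics, unrelated to Spark's actual fix):

```python
# Minimal POSIX sketch of the fd-reuse hazard: after close(), the next open()
# hands back the lowest free descriptor number, so any code still holding the
# old number would silently read from the *new* file.
import os
import tempfile

fd_a, path_a = tempfile.mkstemp()
fd_b, path_b = tempfile.mkstemp()
os.close(fd_a)
os.close(fd_b)

fd1 = os.open(path_a, os.O_RDONLY)
os.close(fd1)                        # the "asynchronous close" in the race
fd2 = os.open(path_b, os.O_RDONLY)   # lowest free fd: the same number again
reused = (fd2 == fd1)                # stale reads via fd1 would now hit path_b

os.close(fd2)
os.unlink(path_a)
os.unlink(path_b)
assert reused
```

This is why closing a channel from another thread while a reader may still use it can corrupt unrelated file operations, as the issue describes.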
[jira] [Updated] (SPARK-22972) Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.hive.orc.
[ https://issues.apache.org/jira/browse/SPARK-22972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-22972: Fix Version/s: 2.2.2 > Couldn't find corresponding Hive SerDe for data source provider > org.apache.spark.sql.hive.orc. > -- > > Key: SPARK-22972 > URL: https://issues.apache.org/jira/browse/SPARK-22972 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: xubo245 >Assignee: xubo245 > Fix For: 2.2.2, 2.3.0 > > > *There is error when running test code:* > {code:java} > test("create orc table") { > spark.sql( > s"""CREATE TABLE normal_orc_as_source_hive > |USING org.apache.spark.sql.hive.orc > |OPTIONS ( > | PATH '${new File(orcTableAsDir.getAbsolutePath).toURI}' > |) >""".stripMargin) > val df = spark.sql("select * from normal_orc_as_source_hive") > spark.sql("desc formatted normal_orc_as_source_hive").show() > } > {code} > *warning:* > {code:java} > 05:00:44.038 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: > Couldn't find corresponding Hive SerDe for data source provider > org.apache.spark.sql.hive.orc. Persisting data source table > `default`.`normal_orc_as_source_hive` into Hive metastore in Spark SQL > specific format, which is NOT compatible with Hive. 
> {code} > Root cause analysis: > ORC related code is incorrect in HiveSerDe : > {code:java} > org.apache.spark.sql.internal.HiveSerDe#sourceToSerDe > {code} > {code:java} > def sourceToSerDe(source: String): Option[HiveSerDe] = { > val key = source.toLowerCase(Locale.ROOT) match { > case s if s.startsWith("org.apache.spark.sql.parquet") => "parquet" > case s if s.startsWith("org.apache.spark.sql.orc") => "orc" > case s if s.equals("orcfile") => "orc" > case s if s.equals("parquetfile") => "parquet" > case s if s.equals("avrofile") => "avro" > case s => s > } > {code} > Solution: > change "org.apache.spark.sql.orc“ to "org.apache.spark.sql.hive.orc" -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
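The proposed fix is easiest to see as a lookup function. An illustrative Python port of the corrected matching logic (the real code is Scala, in `org.apache.spark.sql.internal.HiveSerDe#sourceToSerDe`; this sketch only mirrors the match arms quoted above):

```python
# Illustrative port of the corrected sourceToSerDe matching: the ORC prefix
# must be "org.apache.spark.sql.hive.orc" (the provider's actual package),
# not "org.apache.spark.sql.orc".
def source_to_serde(source: str) -> str:
    s = source.lower()
    if s.startswith("org.apache.spark.sql.parquet"):
        return "parquet"
    if s.startswith("org.apache.spark.sql.hive.orc"):  # the fix
        return "orc"
    if s == "orcfile":
        return "orc"
    if s == "parquetfile":
        return "parquet"
    if s == "avrofile":
        return "avro"
    return s

# With the old prefix, this provider fell through to the default arm and no
# Hive SerDe was found, producing the warning above.
assert source_to_serde("org.apache.spark.sql.hive.orc") == "orc"
```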
[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error
[ https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320439#comment-16320439 ] Neil Alexander McQuarrie commented on SPARK-21727: -- Okay great. Implemented and tests are passing. I'll submit a PR shortly. Thanks for the help. > Operating on an ArrayType in a SparkR DataFrame throws error > > > Key: SPARK-21727 > URL: https://issues.apache.org/jira/browse/SPARK-21727 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Neil Alexander McQuarrie >Assignee: Neil Alexander McQuarrie > > Previously > [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements] > this as a stack overflow question but it seems to be a bug. > If I have an R data.frame where one of the column data types is an integer > *list* -- i.e., each of the elements in the column embeds an entire R list of > integers -- then it seems I can convert this data.frame to a SparkR DataFrame > just fine... SparkR treats the column as ArrayType(Double). > However, any subsequent operation on this SparkR DataFrame appears to throw > an error. > Create an example R data.frame: > {code} > indices <- 1:4 > myDf <- data.frame(indices) > myDf$data <- list(rep(0, 20)) > {code} > Examine it to make sure it looks okay: > {code} > > str(myDf) > 'data.frame': 4 obs. of 2 variables: > $ indices: int 1 2 3 4 > $ data :List of 4 >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... >..$ : num 0 0 0 0 0 0 0 0 0 0 ... 
> > head(myDf) > indices data > 1 1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 2 2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 3 3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > 4 4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 > {code} > Convert it to a SparkR DataFrame: > {code} > library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib")) > sparkR.session(master = "local[*]") > mySparkDf <- as.DataFrame(myDf) > {code} > Examine the SparkR DataFrame schema; notice that the list column was > successfully converted to ArrayType: > {code} > > schema(mySparkDf) > StructType > |-name = "indices", type = "IntegerType", nullable = TRUE > |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE > {code} > However, operating on the SparkR DataFrame throws an error: > {code} > > collect(mySparkDf) > 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 > (TID 1) > java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: > java.lang.Double is not a valid external type for schema of array > if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null > else validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0 > ... long stack trace ... > {code} > Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
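The error message is informative: the row encoder received a bare `java.lang.Double` where the schema promised `array<double>`, plausibly because the serialized cell arrived as a scalar rather than a sequence. A hedged sketch of the validation the encoder performs (`validate_array_double_cell` is a hypothetical stand-in, not Spark's encoder):

```python
# Hedged sketch of the encoder's complaint: for an ArrayType(Double) column,
# each external cell must be a sequence of doubles; a bare double fails with
# essentially the message seen in the stack trace.
def validate_array_double_cell(cell):
    if isinstance(cell, (list, tuple)) and all(isinstance(v, float) for v in cell):
        return True
    raise TypeError(f"{type(cell).__name__} is not a valid external type "
                    "for schema of array<double>")

assert validate_array_double_cell([0.0] * 20)  # what the schema expects
try:
    validate_array_double_cell(0.0)            # a bare double: rejected
    raise AssertionError("should have raised")
except TypeError:
    pass
```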
[jira] [Issue Comment Deleted] (SPARK-22946) Recursive withColumn calls cause org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-22946: Comment: was deleted (was: I am unable to reproduce on master. If I remember correctly, this should have been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not doing it myself since I don't remember the ticket it duplicates...) > Recursive withColumn calls cause > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > -- > > Key: SPARK-22946 > URL: https://issues.apache.org/jira/browse/SPARK-22946 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Harpreet Chopra > > Recursive calls to withColumn, for updating the same column causes > _org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > _ > This can be reproduced in Spark 1.x and also in the latest version we are > using (Spark 2.2.0) > code to reproduce: > import org.apache.spark.sql.functions._ > var df > =sc.parallelize(Seq(("123","CustOne"),("456","CustTwo"))).toDF("ID","CustName") > (1 to 20).foreach(x => { > df = df.withColumn("ID", when(col("ID") === "123", > lit("678")).otherwise(col("ID"))) > println(" "+x) > df.show > }) > Stack dump: > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$cata > lyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato > r.scala:575) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato > r.scala:572) > at > org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java > :3599) > at > org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 
28 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/cataly > st/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V > " of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows > beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) > at > org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242) > at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9912) > at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9897) > at > org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3317) > at org.codehaus.janino.UnitCompiler.access$8500(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitLocalVariableAccess(UnitCompiler.java:3285) > at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:3189) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3313) > at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitAmbiguousName(UnitCompiler.java:3280) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669) > at org.codehaus.janino.UnitCompiler.access$4500(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$7.visitAssignment(UnitCompiler.java:2619) > at org.codehaus.janino.Java$Assignment.accept(Java.java:3405) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643) > at 
org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1007) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:993) > at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:185) > at > org.co
[jira] [Comment Edited] (SPARK-22946) Recursive withColumn calls cause org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320402#comment-16320402 ] Marco Gaido edited comment on SPARK-22946 at 1/10/18 3:13 PM: -- I am unable to reproduce on master. If I remember correctly, this should have been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not doing it myself since I don't remember the ticket it duplicates... was (Author: mgaido): I am unable to reproduce on master. If I remember correctly, this should have been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not doing it myself since I don't remember the ticket it duplicates... > Recursive withColumn calls cause > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > -- > > Key: SPARK-22946 > URL: https://issues.apache.org/jira/browse/SPARK-22946 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Harpreet Chopra > > Recursive calls to withColumn, for updating the same column causes > _org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > _ > This can be reproduced in Spark 1.x and also in the latest version we are > using (Spark 2.2.0) > code to reproduce: > import org.apache.spark.sql.functions._ > var df > =sc.parallelize(Seq(("123","CustOne"),("456","CustTwo"))).toDF("ID","CustName") > (1 to 20).foreach(x => { > df = df.withColumn("ID", when(col("ID") === "123", > lit("678")).otherwise(col("ID"))) > println(" "+x) > df.show > }) > Stack dump: > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$cata > lyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato > r.scala:575) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato > 
r.scala:572) > at > org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java > :3599) > at > org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 28 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/cataly > st/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V > " of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows > beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) > at > org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242) > at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9912) > at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9897) > at > org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3317) > at org.codehaus.janino.UnitCompiler.access$8500(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitLocalVariableAccess(UnitCompiler.java:3285) > at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:3189) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3313) > at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitAmbiguousName(UnitCompiler.java:3280) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669) > at org.codehaus.janino.UnitCompiler.access$4500(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$7.visitAssignment(UnitCompiler.java:2619) > at 
org.codehaus.janino.Java$Assignment.accept(Java.java:3405) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643) > at org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097) > at org.codehaus.janin
[jira] [Commented] (SPARK-22946) Recursive withColumn calls cause org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-22946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320402#comment-16320402 ] Marco Gaido commented on SPARK-22946: - I am unable to reproduce on master. If I remember correctly, this should have been fixed by you, [~cloud_fan]. May you close this JIRA if so? I am not doing it myself since I don't remember the ticket it duplicates... > Recursive withColumn calls cause > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > -- > > Key: SPARK-22946 > URL: https://issues.apache.org/jira/browse/SPARK-22946 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Harpreet Chopra > > Recursive calls to withColumn, for updating the same column causes > _org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows beyond 64 KB > _ > This can be reproduced in Spark 1.x and also in the latest version we are > using (Spark 2.2.0) > code to reproduce: > import org.apache.spark.sql.functions._ > var df > =sc.parallelize(Seq(("123","CustOne"),("456","CustTwo"))).toDF("ID","CustName") > (1 to 20).foreach(x => { > df = df.withColumn("ID", when(col("ID") === "123", > lit("678")).otherwise(col("ID"))) > println(" "+x) > df.show > }) > Stack dump: > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.org$apache$spark$sql$cata > lyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:555) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato > r.scala:575) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerato > r.scala:572) > at > org.spark-project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java > :3599) > at > org.spark-project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) > ... 
28 more > Caused by: org.codehaus.janino.JaninoRuntimeException: Code of method > "(Lorg/apache/spark/sql/cataly > st/expressions/GeneratedClass$SpecificUnsafeProjection;Lorg/apache/spark/sql/catalyst/InternalRow;)V > " of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" > grows > beyond 64 KB > at org.codehaus.janino.CodeContext.makeSpace(CodeContext.java:941) > at org.codehaus.janino.CodeContext.write(CodeContext.java:854) > at > org.codehaus.janino.UnitCompiler.writeShort(UnitCompiler.java:10242) > at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9912) > at org.codehaus.janino.UnitCompiler.load(UnitCompiler.java:9897) > at > org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3317) > at org.codehaus.janino.UnitCompiler.access$8500(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitLocalVariableAccess(UnitCompiler.java:3285) > at org.codehaus.janino.Java$LocalVariableAccess.accept(Java.java:3189) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:3313) > at org.codehaus.janino.UnitCompiler.access$8000(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$10.visitAmbiguousName(UnitCompiler.java:3280) > at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3138) > at org.codehaus.janino.UnitCompiler.compileGet(UnitCompiler.java:3290) > at > org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4368) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2669) > at org.codehaus.janino.UnitCompiler.access$4500(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$7.visitAssignment(UnitCompiler.java:2619) > at org.codehaus.janino.Java$Assignment.accept(Java.java:3405) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2654) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1643) > at 
org.codehaus.janino.UnitCompiler.access$1100(UnitCompiler.java:185) > at > org.codehaus.janino.UnitCompiler$4.visitExpressionStatement(UnitCompiler.java:936) > at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2097) > at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:958) > at > org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1007) > at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:993) > at org.codehaus.janino.UnitCompiler.access$1000(UnitCompiler.java:18
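The failure mode in the reproduction above can be seen without Spark at all: each `withColumn("ID", when(col("ID") === "123", lit("678")).otherwise(col("ID")))` call substitutes the previous column expression into both the condition and the else branch, so the resolved expression (and the projection code generated from it) roughly doubles per iteration. A minimal plain-Python sketch of that growth — the SQL-ish string here is only an illustration of the expression tree, not Spark's actual generated code:

```python
# Sketch (no Spark needed): each withColumn("ID", ...) call wraps the
# previous column expression, so the expression that whole-stage codegen
# must compile grows geometrically. Once the generated method body of
# SpecificUnsafeProjection exceeds the JVM's 64 KB-per-method limit,
# Janino raises "grows beyond 64 KB".
def wrap_once(expr: str) -> str:
    # Mimics: when(col("ID") === "123", lit("678")).otherwise(<previous expr>)
    # Note the previous expression appears TWICE: in the condition and the else.
    return f"CASE WHEN ({expr}) = '123' THEN '678' ELSE ({expr}) END"

expr = 'ID'
sizes = []
for _ in range(11):
    expr = wrap_once(expr)
    sizes.append(len(expr))

# Each round doubles the length plus a constant: exponential growth.
print("expression length after each wrap:", sizes)
```

By the 11th wrap the expression text alone is already past 64 KB, which matches the stack trace's `CodeContext.makeSpace` failure point.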
[jira] [Commented] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320393#comment-16320393 ] Apache Spark commented on SPARK-23028: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20222 > Bump master branch version to 2.4.0-SNAPSHOT > > > Key: SPARK-23028 > URL: https://issues.apache.org/jira/browse/SPARK-23028 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23028: Assignee: Apache Spark (was: Xiao Li) > Bump master branch version to 2.4.0-SNAPSHOT > > > Key: SPARK-23028 > URL: https://issues.apache.org/jira/browse/SPARK-23028 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23028: Assignee: Xiao Li (was: Apache Spark) > Bump master branch version to 2.4.0-SNAPSHOT > > > Key: SPARK-23028 > URL: https://issues.apache.org/jira/browse/SPARK-23028 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT
[ https://issues.apache.org/jira/browse/SPARK-23028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-23028: Component/s: (was: SQL) Build > Bump master branch version to 2.4.0-SNAPSHOT > > > Key: SPARK-23028 > URL: https://issues.apache.org/jira/browse/SPARK-23028 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Xiao Li > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23028) Bump master branch version to 2.4.0-SNAPSHOT
Xiao Li created SPARK-23028: --- Summary: Bump master branch version to 2.4.0-SNAPSHOT Key: SPARK-23028 URL: https://issues.apache.org/jira/browse/SPARK-23028 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Xiao Li Assignee: Xiao Li -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22991) High read latency with spark streaming 2.2.1 and kafka 0.10.0.1
[ https://issues.apache.org/jira/browse/SPARK-22991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320365#comment-16320365 ] Sean Owen commented on SPARK-22991: --- But that would also mean you were using Kafka 0.8 and not 0.10, and a lot changed in Kafka there. So far this still says it's a network or Kafka issue. Or, what change are you proposing or where do you think the issue is in Spark? > High read latency with spark streaming 2.2.1 and kafka 0.10.0.1 > --- > > Key: SPARK-22991 > URL: https://issues.apache.org/jira/browse/SPARK-22991 > Project: Spark > Issue Type: Bug > Components: Spark Core, Structured Streaming >Affects Versions: 2.2.1 >Reporter: Kiran Shivappa Japannavar >Priority: Critical > > Spark 2.2.1 + Kafka 0.10 + Spark streaming. > Batch duration is 1s, Max rate per partition is 500, poll interval is 120 > seconds, max poll records is 500 and no of partitions in Kafka is 500, > enabled cache consumer. > While trying to read data from Kafka we are observing very high read > latencies intermittently.The high latencies results in Kafka consumer session > expiration and hence the Kafka brokers removes the consumer from the group. > The consumer keeps retrying and finally fails with the > [org.apache.kafka.clients.NetworkClient] - Disconnecting from node 12 due to > request timeout > [org.apache.kafka.clients.NetworkClient] - Cancelled request ClientRequest > [org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient] - > Cancelled FETCH request ClientRequest.** > Due to this a lot of batches are in the queued state. > The high read latencies are occurring whenever multiple clients are > parallelly trying to read the data from the same Kafka cluster. The Kafka > cluster is having a large number of brokers and can support high network > bandwidth. > When running with spark 1.5 and Kafka 0.8 consumer client against the same > Kafka cluster we are not seeing any read latencies. 
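For anyone tuning around similar intermittent poll timeouts with the 0.10 direct stream, the knobs usually involved sit on both sides: the poll timeout used by Spark's cached consumer, and the session/request timeouts of the underlying Kafka consumer. A hypothetical illustration of the relevant settings (the values are placeholders for tuning, not a confirmed fix for this ticket):

```
# Spark side (e.g. via spark-submit --conf ...):
spark.streaming.backpressure.enabled=true
spark.streaming.kafka.maxRatePerPartition=500
spark.streaming.kafka.consumer.poll.ms=120000   # how long the cached consumer waits per poll

# Kafka consumer side (kafkaParams passed to the direct stream):
session.timeout.ms=30000
heartbeat.interval.ms=10000
request.timeout.ms=40000
max.poll.records=500
```

Whether these help depends on where the latency actually originates (network, broker load, or consumer scheduling), which is what the comment above is asking the reporter to narrow down.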
[jira] [Updated] (SPARK-23026) Add RegisterUDF to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-23026: -- Issue Type: Improvement (was: Bug) > Add RegisterUDF to PySpark > -- > > Key: SPARK-23026 > URL: https://issues.apache.org/jira/browse/SPARK-23026 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add a new API for registering row-at-a-time or scalar vectorized UDFs. The > registered UDFs can be used in the SQL statement. > {noformat} > >>> from pyspark.sql.types import IntegerType > >>> from pyspark.sql.functions import udf > >>> slen = udf(lambda s: len(s), IntegerType()) > >>> _ = spark.udf.registerUDF("slen", slen) > >>> spark.sql("SELECT slen('test')").collect() > [Row(slen(test)=4)] > >>> import random > >>> from pyspark.sql.functions import udf > >>> from pyspark.sql.types import IntegerType > >>> random_udf = udf(lambda: random.randint(0, 100), > >>> IntegerType()).asNondeterministic() > >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf) > >>> spark.sql("SELECT random_udf()").collect() > [Row(random_udf()=82)] > >>> spark.range(1).select(newRandom_udf()).collect() > [Row(random_udf()=62)] > >>> from pyspark.sql.functions import pandas_udf, PandasUDFType > >>> @pandas_udf("integer", PandasUDFType.SCALAR) > ... def add_one(x): > ... return x + 1 > ... > >>> _ = spark.udf.registerUDF("add_one", add_one) > >>> spark.sql("SELECT add_one(id) FROM range(10)").collect() > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21157) Report Total Memory Used by Spark Executors
[ https://issues.apache.org/jira/browse/SPARK-21157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320293#comment-16320293 ] assia ydroudj commented on SPARK-21157: --- I'm a beginner in Apache Spark and have installed a prebuilt distribution of Apache Spark with Hadoop. I am trying to measure memory consumption while running the PageRank example that ships with Spark. My cluster runs in standalone mode with 1 master and 4 workers (virtual machines). I have tried external tools like Ganglia and Graphite, but they report memory usage at the resource or system level (more general); what I need is to track memory behavior (storage, execution) while the algorithm runs, i.e. memory usage for a specific Spark application ID. Is there any way to get it into a text file for further analysis? Please help me with this, thanks. > Report Total Memory Used by Spark Executors > --- > > Key: SPARK-21157 > URL: https://issues.apache.org/jira/browse/SPARK-21157 > Project: Spark > Issue Type: Improvement > Components: Input/Output >Affects Versions: 2.1.1 >Reporter: Jose Soltren > Attachments: TotalMemoryReportingDesignDoc.pdf > > > Building on some of the core ideas of SPARK-9103, this JIRA proposes tracking > total memory used by Spark executors, and a means of broadcasting, > aggregating, and reporting memory usage data in the Spark UI. > Here, "total memory used" refers to memory usage that is visible outside of > Spark, to an external observer such as YARN, Mesos, or the operating system. > The goal of this enhancement is to give Spark users more information about > how Spark clusters are using memory. Total memory will include non-Spark JVM > memory and all off-heap memory. > Please consult the attached design document for further details.
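One way to get per-application executor memory figures into a text file today, short of the UI work this JIRA proposes, is to poll the monitoring REST API the driver already exposes. A rough sketch, assuming the driver UI is reachable on its default port 4040; `memoryUsed` and `maxMemory` are fields of the v1 `/applications/{app-id}/executors` endpoint:

```python
# Sketch: dump executor memory figures from Spark's monitoring REST API
# into a tab-separated text file for later analysis. Assumes the driver
# UI is up (default port 4040) while the application runs.
import json
import urllib.request

def summarize_executors(executors: list) -> list:
    """Reduce the /executors JSON payload to (id, memoryUsed, maxMemory) rows."""
    return [(e["id"], e["memoryUsed"], e["maxMemory"]) for e in executors]

def dump_once(driver_host: str, app_id: str, out_path: str) -> None:
    """Fetch the current executor list and append one snapshot to out_path."""
    url = f"http://{driver_host}:4040/api/v1/applications/{app_id}/executors"
    with urllib.request.urlopen(url) as resp:
        executors = json.load(resp)
    with open(out_path, "a") as out:
        for row in summarize_executors(executors):
            out.write("%s\t%d\t%d\n" % row)
```

`dump_once` could be called on a schedule (cron, or a loop with `time.sleep`) to build a time series; note this captures the storage/execution memory Spark itself reports, not the total process memory this JIRA is about.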
[jira] [Assigned] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23019: Assignee: Apache Spark > Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD > - > > Key: SPARK-23019 > URL: https://issues.apache.org/jira/browse/SPARK-23019 > Project: Spark > Issue Type: Bug > Components: Java API, Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Assignee: Apache Spark >Priority: Blocker > > {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to > multiple spark contexts: > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/ > {code} > Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore > this error, set spark.driver.allowMultipleContexts = true. The currently > running SparkContext was created at: > org.apache.spark.SparkContext.(SparkContext.scala:116) > org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63) > java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23019: Assignee: (was: Apache Spark) > Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD > - > > Key: SPARK-23019 > URL: https://issues.apache.org/jira/browse/SPARK-23019 > Project: Spark > Issue Type: Bug > Components: Java API, Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Priority: Blocker > > {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to > multiple spark contexts: > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/ > {code} > Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore > this error, set spark.driver.allowMultipleContexts = true. The currently > running SparkContext was created at: > org.apache.spark.SparkContext.(SparkContext.scala:116) > org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > 
java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63) > java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23019) Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD
[ https://issues.apache.org/jira/browse/SPARK-23019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320225#comment-16320225 ] Apache Spark commented on SPARK-23019: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/20221 > Flaky Test: org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD > - > > Key: SPARK-23019 > URL: https://issues.apache.org/jira/browse/SPARK-23019 > Project: Spark > Issue Type: Bug > Components: Java API, Tests >Affects Versions: 2.3.0 >Reporter: Sameer Agarwal >Priority: Blocker > > {{org.apache.spark.JavaJdbcRDDSuite.testJavaJdbcRDD}} has been failing due to > multiple spark contexts: > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-maven-hadoop-2.6/ > {code} > Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore > this error, set spark.driver.allowMultipleContexts = true. The currently > running SparkContext was created at: > org.apache.spark.SparkContext.(SparkContext.scala:116) > org.apache.spark.launcher.SparkLauncherSuite$InProcessTestApp.main(SparkLauncherSuite.java:182) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:879) > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:197) > org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:227) > org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:136) > org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > java.lang.reflect.Method.invoke(Method.java:498) > org.apache.spark.launcher.InProcessAppHandle.lambda$start$0(InProcessAppHandle.java:63) > java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22982) Remove unsafe asynchronous close() call from FileDownloadChannel
[ https://issues.apache.org/jira/browse/SPARK-22982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320203#comment-16320203 ] Sean Owen commented on SPARK-22982: --- Does this affect earlier branches in the same way? Seems important to back port if so. > Remove unsafe asynchronous close() call from FileDownloadChannel > > > Key: SPARK-22982 > URL: https://issues.apache.org/jira/browse/SPARK-22982 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > Labels: correctness > Fix For: 2.3.0 > > > Spark's Netty-based file transfer code contains an asynchronous IO bug which > may lead to incorrect query results. > At a high-level, the problem is that an unsafe asynchronous `close()` of a > pipe's source channel creates a race condition where file transfer code > closes a file descriptor then attempts to read from it. If the closed file > descriptor's number has been reused by an `open()` call then this invalid > read may cause unrelated file operations to return incorrect results due to > reading different data than intended. > I have a small, surgical fix for this bug and will submit a PR with more > description on the specific race condition / underlying bug. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320201#comment-16320201 ] Ken Tore Tallakstad commented on SPARK-21396: - Does the bug originate from this function inside [https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala] ? ``` def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) { dataTypes(ordinal) match { case StringType => to += from.getString(ordinal) case IntegerType => to += from.getInt(ordinal) case BooleanType => to += from.getBoolean(ordinal) case DoubleType => to += from.getDouble(ordinal) case FloatType => to += from.getFloat(ordinal) case DecimalType() => to += from.getDecimal(ordinal) case LongType => to += from.getLong(ordinal) case ByteType => to += from.getByte(ordinal) case ShortType => to += from.getShort(ordinal) case DateType => to += from.getAs[Date](ordinal) case TimestampType => to += from.getAs[Timestamp](ordinal) case BinaryType => to += from.getAs[Array[Byte]](ordinal) case _: ArrayType | _: StructType | _: MapType => val hiveString = HiveUtils.toHiveString((from.get(ordinal), dataTypes(ordinal))) to += hiveString } } ``` A UDT will not match any of these right?? > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang > Labels: Hive, ThriftServer2, user-defined-type > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? 
> == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538) > at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39) > at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39) > at > org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56) > at > org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: scala.MatchError: org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 > (of class 
org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.addNonNullColumnValue(SparkExecuteStatementOperation.scala:80) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:144) > at > org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220) > at > org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:685) > at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
[jira] [Comment Edited] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320201#comment-16320201 ] Ken Tore Tallakstad edited comment on SPARK-21396 at 1/10/18 1:15 PM: -- Does the bug originate from this function inside [https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala] ? {code:scala} def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) { dataTypes(ordinal) match { case StringType => to += from.getString(ordinal) case IntegerType => to += from.getInt(ordinal) case BooleanType => to += from.getBoolean(ordinal) case DoubleType => to += from.getDouble(ordinal) case FloatType => to += from.getFloat(ordinal) case DecimalType() => to += from.getDecimal(ordinal) case LongType => to += from.getLong(ordinal) case ByteType => to += from.getByte(ordinal) case ShortType => to += from.getShort(ordinal) case DateType => to += from.getAs[Date](ordinal) case TimestampType => to += from.getAs[Timestamp](ordinal) case BinaryType => to += from.getAs[Array[Byte]](ordinal) case _: ArrayType | _: StructType | _: MapType => val hiveString = HiveUtils.toHiveString((from.get(ordinal), dataTypes(ordinal))) to += hiveString } } {code} A UDT will not match any of these right?? was (Author: kentore82): Does the bug originate from this function inside [https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala] ? 
``` def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) { dataTypes(ordinal) match { case StringType => to += from.getString(ordinal) case IntegerType => to += from.getInt(ordinal) case BooleanType => to += from.getBoolean(ordinal) case DoubleType => to += from.getDouble(ordinal) case FloatType => to += from.getFloat(ordinal) case DecimalType() => to += from.getDecimal(ordinal) case LongType => to += from.getLong(ordinal) case ByteType => to += from.getByte(ordinal) case ShortType => to += from.getShort(ordinal) case DateType => to += from.getAs[Date](ordinal) case TimestampType => to += from.getAs[Timestamp](ordinal) case BinaryType => to += from.getAs[Array[Byte]](ordinal) case _: ArrayType | _: StructType | _: MapType => val hiveString = HiveUtils.toHiveString((from.get(ordinal), dataTypes(ordinal))) to += hiveString } } ``` A UDT will not match any of these right?? > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang > Labels: Hive, ThriftServer2, user-defined-type > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? 
> == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLISer
[jira] [Comment Edited] (SPARK-21396) Spark Hive Thriftserver doesn't return UDT field
[ https://issues.apache.org/jira/browse/SPARK-21396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320201#comment-16320201 ] Ken Tore Tallakstad edited comment on SPARK-21396 at 1/10/18 1:15 PM: -- Does the bug originate from this function inside [https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala] ? {code:java} def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) { dataTypes(ordinal) match { case StringType => to += from.getString(ordinal) case IntegerType => to += from.getInt(ordinal) case BooleanType => to += from.getBoolean(ordinal) case DoubleType => to += from.getDouble(ordinal) case FloatType => to += from.getFloat(ordinal) case DecimalType() => to += from.getDecimal(ordinal) case LongType => to += from.getLong(ordinal) case ByteType => to += from.getByte(ordinal) case ShortType => to += from.getShort(ordinal) case DateType => to += from.getAs[Date](ordinal) case TimestampType => to += from.getAs[Timestamp](ordinal) case BinaryType => to += from.getAs[Array[Byte]](ordinal) case _: ArrayType | _: StructType | _: MapType => val hiveString = HiveUtils.toHiveString((from.get(ordinal), dataTypes(ordinal))) to += hiveString } } {code} A UDT will not match any of these right?? was (Author: kentore82): Does the bug originate from this function inside [https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala] ? 
{code:scala} def addNonNullColumnValue(from: SparkRow, to: ArrayBuffer[Any], ordinal: Int) { dataTypes(ordinal) match { case StringType => to += from.getString(ordinal) case IntegerType => to += from.getInt(ordinal) case BooleanType => to += from.getBoolean(ordinal) case DoubleType => to += from.getDouble(ordinal) case FloatType => to += from.getFloat(ordinal) case DecimalType() => to += from.getDecimal(ordinal) case LongType => to += from.getLong(ordinal) case ByteType => to += from.getByte(ordinal) case ShortType => to += from.getShort(ordinal) case DateType => to += from.getAs[Date](ordinal) case TimestampType => to += from.getAs[Timestamp](ordinal) case BinaryType => to += from.getAs[Array[Byte]](ordinal) case _: ArrayType | _: StructType | _: MapType => val hiveString = HiveUtils.toHiveString((from.get(ordinal), dataTypes(ordinal))) to += hiveString } } {code} A UDT will not match any of these right?? > Spark Hive Thriftserver doesn't return UDT field > > > Key: SPARK-21396 > URL: https://issues.apache.org/jira/browse/SPARK-21396 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.1 >Reporter: Haopu Wang > Labels: Hive, ThriftServer2, user-defined-type > > I want to query a table with a MLLib Vector field and get below exception. > Can Spark Hive Thriftserver be enhanced to return UDT field? 
> == > 2017-07-13 13:14:25,435 WARN > [org.apache.hive.service.cli.thrift.ThriftCLIService] > (HiveServer2-Handler-Pool: Thread-18537;) Error fetching results: > java.lang.RuntimeException: scala.MatchError: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 (of class > org.apache.spark.ml.linalg.VectorUDT) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83) > at > org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36) > at > org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556) > at > org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59) > at com.sun.proxy.$Proxy29.fetchResults(Unknown Source) > at > org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:454) > at > org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553) > at > org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.get
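The commenter's suspicion looks right: `addNonNullColumnValue` has no case for `UserDefinedType`, so a UDT column falls through the match and produces exactly the `scala.MatchError: org.apache.spark.ml.linalg.VectorUDT` seen in the stack trace. One possible extra branch, sketched below, would reuse the string fallback already used for `ArrayType`/`StructType`/`MapType`; this is a hypothetical sketch following the quoted function, not a merged patch:

```scala
// Hypothetical additional case for SparkExecuteStatementOperation.addNonNullColumnValue.
// A UDT has no native Thrift representation, so render it as a Hive-style string,
// the same fallback the function already applies to complex types.
case udt: UserDefinedType[_] =>
  val hiveString = HiveUtils.toHiveString((from.get(ordinal), udt))
  to += hiveString
```

Whether `HiveUtils.toHiveString` handles the UDT's value directly or needs it converted to `udt.sqlType` first would have to be verified against the actual implementation.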
[jira] [Commented] (SPARK-18147) Broken Spark SQL Codegen
[ https://issues.apache.org/jira/browse/SPARK-18147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320159#comment-16320159 ] Alexander Chermenin commented on SPARK-18147: - Is it resolved for the 2.2.1 version? I have a similar problem with PySpark and Structured Streaming: {noformat} 18/01/10 15:12:37 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 174, Column 113: Expression "isNull3" is not an rvalue /* 001 */ public java.lang.Object generate(Object[] references) { /* 002 */ return new SpecificUnsafeProjection(references); /* 003 */ } /* 004 */ /* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { /* 006 */ /* 007 */ private Object[] references; /* 008 */ private UTF8String lastRegex; /* 009 */ private java.util.regex.Pattern pattern; /* 010 */ private String lastReplacement; /* 011 */ private UTF8String lastReplacementInUTF8; /* 012 */ private java.lang.StringBuffer result; /* 013 */ private UTF8String lastRegex1; /* 014 */ private java.util.regex.Pattern pattern1; /* 015 */ private String lastReplacement1; /* 016 */ private UTF8String lastReplacementInUTF81; /* 017 */ private java.lang.StringBuffer result1; /* 018 */ private UnsafeRow result2; /* 019 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder; /* 020 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter; /* 021 */ /* 022 */ public SpecificUnsafeProjection(Object[] references) { /* 023 */ this.references = references; /* 024 */ lastRegex = null; /* 025 */ pattern = null; /* 026 */ lastReplacement = null; /* 027 */ lastReplacementInUTF8 = null; /* 028 */ result = new java.lang.StringBuffer(); /* 029 */ lastRegex1 = null; /* 030 */ pattern1 = null; /* 031 */ lastReplacement1 = null; /* 032 */ lastReplacementInUTF81 = null; /* 033 */ result1 = new java.lang.StringBuffer(); /* 034 
*/ result2 = new UnsafeRow(3); /* 035 */ this.holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(result2, 64); /* 036 */ this.rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(holder, 3); /* 037 */ /* 038 */ } /* 039 */ /* 040 */ public void initialize(int partitionIndex) { /* 041 */ /* 042 */ } /* 043 */ /* 044 */ /* 045 */ private void apply_1(InternalRow i) { /* 046 */ /* 047 */ UTF8String[] args1 = new UTF8String[3]; /* 048 */ /* 049 */ /* 050 */ if (!false) { /* 051 */ args1[0] = ((UTF8String) references[6]); /* 052 */ } /* 053 */ /* 054 */ /* 055 */ if (!false) { /* 056 */ args1[1] = ((UTF8String) references[7]); /* 057 */ } /* 058 */ /* 059 */ /* 060 */ boolean isNull13 = false; /* 061 */ /* 062 */ UTF8String value14 = i.getUTF8String(2); /* 063 */ /* 064 */ /* 065 */ UTF8String value13 = null; /* 066 */ /* 067 */ if (!((UTF8String) references[8]).equals(lastRegex1)) { /* 068 */ // regex value changed /* 069 */ lastRegex1 = ((UTF8String) references[8]).clone(); /* 070 */ pattern1 = java.util.regex.Pattern.compile(lastRegex1.toString()); /* 071 */ } /* 072 */ if (!((UTF8String) references[9]).equals(lastReplacementInUTF81)) { /* 073 */ // replacement string changed /* 074 */ lastReplacementInUTF81 = ((UTF8String) references[9]).clone(); /* 075 */ lastReplacement1 = lastReplacementInUTF81.toString(); /* 076 */ } /* 077 */ result1.delete(0, result1.length()); /* 078 */ java.util.regex.Matcher matcher1 = pattern1.matcher(value14.toString()); /* 079 */ /* 080 */ while (matcher1.find()) { /* 081 */ matcher1.appendReplacement(result1, lastReplacement1); /* 082 */ } /* 083 */ matcher1.appendTail(result1); /* 084 */ value13 = UTF8String.fromString(result1.toString()); /* 085 */ if (!false) { /* 086 */ args1[2] = value13; /* 087 */ } /* 088 */ /* 089 */ UTF8String value10 = UTF8String.concat(args1); /* 090 */ boolean isNull10 = value10 == null; /* 091 */ } /* 092 */ /* 093 */ /* 094 */ private void 
apply1_1(InternalRow i) { /* 095 */ /* 096 */ /* 097 */ final Object value20 = null; /* 098 */ if (true) { /* 099 */ rowWriter.setNullAt(2); /* 100 */ } else { /* 101 */ /* 102 */ } /* 103 */ /* 104 */ } /* 105 */ /* 106 */ /* 107 */ private void apply_0(InternalRow i) { /* 108 */ /* 109 */ UTF8String[] args = new UTF8String[3]; /* 110 */ /* 111 */ /* 112 */ if (!false) { /* 113 */ args[0] = ((UTF8String) references[2]); /* 114 */ } /* 115 */ /* 116 */ /* 117 */ if (!false) { /* 118 */ args[1] = ((UTF8String) references[3]); /* 119 */ } /* 120 */ /* 121 */ /* 122 */ boolean isNull6 = true; /* 123 */ UTF8String value6 = null; /*
[jira] [Assigned] (SPARK-23025) DataSet with scala.Null causes Exception
[ https://issues.apache.org/jira/browse/SPARK-23025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23025: Assignee: Apache Spark > DataSet with scala.Null causes Exception > > > Key: SPARK-23025 > URL: https://issues.apache.org/jira/browse/SPARK-23025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel Davis >Assignee: Apache Spark > > When creating a DataSet over a case class containing a field of type > scala.Null, there is an exception thrown. As far as I can see, spark sql > would support a Schema(NullType, true), but it fails inside the {{schemaFor}} > function with a {{MatchError}}. > I would expect spark to return a DataSet with a NullType for that field. > h5. Minimal Exampe > {code} > case class Foo(foo: Int, bar: Null) > val ds = Seq(Foo(42, null)).toDS() > {code} > h5. Exception > {code} > scala.MatchError: scala.Null (of class > scala.reflect.internal.Types$ClassNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:713) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:391) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:390) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:390) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:148) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:148) > at > org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:136) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72) > at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) > at > org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233) > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33) > ... 42 elided > {code} > h5. Background Info > To handle our data in a type-safe fashion, we have generated AVRO schemas and > corresponding scala case classes for our domain data. As some fields only > contain null values, this results in fields with scala.Null as a type. 
Moving > our pipeline to DataSets/structured streaming, case classes with Null types > begin to give problems, even though NullType is known to Spark SQL. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
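Until `schemaFor` handles `scala.Null`, a practical workaround (my suggestion, not from the ticket) is to avoid `Null`-typed fields in the case class altogether:

```scala
// Sketch of a workaround, assuming a SparkSession `spark` is in scope
// (e.g. inside spark-shell) so its implicits provide toDS().
import spark.implicits._

// Instead of `case class Foo(foo: Int, bar: Null)`, model the always-null
// field with a supported, nullable type:
case class Foo(foo: Int, bar: Option[String])

val ds = Seq(Foo(42, None)).toDS()  // encodes without the MatchError
```

This sidesteps the reflection gap at the cost of losing the AVRO-generated `Null` type; the generated case classes would need a post-processing step to rewrite such fields.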
[jira] [Commented] (SPARK-23025) DataSet with scala.Null causes Exception
[ https://issues.apache.org/jira/browse/SPARK-23025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320148#comment-16320148 ] Apache Spark commented on SPARK-23025: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/20219 > DataSet with scala.Null causes Exception > > > Key: SPARK-23025 > URL: https://issues.apache.org/jira/browse/SPARK-23025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel Davis > > When creating a DataSet over a case class containing a field of type > scala.Null, there is an exception thrown. As far as I can see, spark sql > would support a Schema(NullType, true), but it fails inside the {{schemaFor}} > function with a {{MatchError}}. > I would expect spark to return a DataSet with a NullType for that field. > h5. Minimal Exampe > {code} > case class Foo(foo: Int, bar: Null) > val ds = Seq(Foo(42, null)).toDS() > {code} > h5. Exception > {code} > scala.MatchError: scala.Null (of class > scala.reflect.internal.Types$ClassNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:713) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:391) > at > 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:390) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:390) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:148) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:148) > at > org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:136) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72) > at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) > at > org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233) > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33) > ... 42 elided > {code} > h5. Background Info > To handle our data in a type-safe fashion, we have generated AVRO schemas and > corresponding scala case classes for our domain data. 
As some fields only > contain null values, this results in fields with scala.Null as a type. Moving > our pipeline to DataSets/structured streaming, case classes with Null types > begin to give problems, even though NullType is known to Spark SQL.
[jira] [Assigned] (SPARK-23025) DataSet with scala.Null causes Exception
[ https://issues.apache.org/jira/browse/SPARK-23025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23025: Assignee: (was: Apache Spark) > DataSet with scala.Null causes Exception > > > Key: SPARK-23025 > URL: https://issues.apache.org/jira/browse/SPARK-23025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1 >Reporter: Daniel Davis > > When creating a DataSet over a case class containing a field of type > scala.Null, there is an exception thrown. As far as I can see, spark sql > would support a Schema(NullType, true), but it fails inside the {{schemaFor}} > function with a {{MatchError}}. > I would expect spark to return a DataSet with a NullType for that field. > h5. Minimal Exampe > {code} > case class Foo(foo: Int, bar: Null) > val ds = Seq(Foo(42, null)).toDS() > {code} > h5. Exception > {code} > scala.MatchError: scala.Null (of class > scala.reflect.internal.Types$ClassNoArgsTypeRef) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:713) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:704) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:703) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:391) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$9.apply(ScalaReflection.scala:390) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:381) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.immutable.List.map(List.scala:285) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:390) > at > org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:148) > at > scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56) > at > org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:809) > at > org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39) > at > org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:148) > at > org.apache.spark.sql.catalyst.ScalaReflection$.deserializerFor(ScalaReflection.scala:136) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:72) > at org.apache.spark.sql.Encoders$.product(Encoders.scala:275) > at > org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233) > at > org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33) > ... 42 elided > {code} > h5. Background Info > To handle our data in a type-safe fashion, we have generated AVRO schemas and > corresponding scala case classes for our domain data. As some fields only > contain null values, this results in fields with scala.Null as a type. 
Moving > our pipeline to DataSets/structured streaming, case classes with Null types > begin to give problems, even though NullType is known to Spark SQL.
[jira] [Updated] (SPARK-23027) optimizer a simple query using a non-existent data is too slow
[ https://issues.apache.org/jira/browse/SPARK-23027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangminfeng updated SPARK-23027: Summary: optimizer a simple query using a non-existent data is too slow (was: optimizer a simple query) > optimizer a simple query using a non-existent data is too slow > -- > > Key: SPARK-23027 > URL: https://issues.apache.org/jira/browse/SPARK-23027 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 2.0.1 >Reporter: wangminfeng > > When I use Spark SQL for ad-hoc queries, I have data partitioned by > event_day, event_hour, and event_minute; the data is large (about 3 TB per day, > with three months retained). But when a query targets a non-existent day, > obtaining the optimized plan is far too slow: using > "sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan", the plan > did not come back within five minutes. The query is simple enough, like: > SELECT > event_day > FROM db.table t1 > WHERE (t1.event_day='20170104' and t1.event_hour='23' and > t1.event_minute='55') > LIMIT 1
[jira] [Created] (SPARK-23027) optimizer a simple query
wangminfeng created SPARK-23027: --- Summary: optimizer a simple query Key: SPARK-23027 URL: https://issues.apache.org/jira/browse/SPARK-23027 Project: Spark Issue Type: Bug Components: Optimizer Affects Versions: 2.0.1 Reporter: wangminfeng When I use Spark SQL for ad-hoc queries, I have data partitioned by event_day, event_hour, and event_minute; the data is large (about 3 TB per day, with three months retained). But when a query targets a non-existent day, obtaining the optimized plan is far too slow: using "sparkSession.sessionState.executePlan(logicalPlan).optimizedPlan", the plan did not come back within five minutes. The query is simple enough, like: SELECT event_day FROM db.table t1 WHERE (t1.event_day='20170104' and t1.event_hour='23' and t1.event_minute='55') LIMIT 1
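The reporter's measurement can be reproduced from a Spark shell roughly as follows; this is a sketch, where `db.table` and the partition columns are the reporter's, and a real run needs a comparably partitioned table:

```scala
// Time only the planning phase, as the reporter did; no job is executed.
val df = spark.sql("""
  SELECT event_day
  FROM db.table t1
  WHERE t1.event_day = '20170104' AND t1.event_hour = '23' AND t1.event_minute = '55'
  LIMIT 1""")

val start = System.nanoTime()
val plan = df.queryExecution.optimizedPlan  // reported to take minutes for a missing partition
println(s"planning took ${(System.nanoTime() - start) / 1e9} s")
```

Isolating planning this way helps confirm whether the time is spent in the optimizer (e.g. partition listing against the metastore) rather than in execution.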
[jira] [Commented] (SPARK-23000) Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3
[ https://issues.apache.org/jira/browse/SPARK-23000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320110#comment-16320110 ] Apache Spark commented on SPARK-23000: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20218 > Flaky test suite DataSourceWithHiveMetastoreCatalogSuite in Spark 2.3 > - > > Key: SPARK-23000 > URL: https://issues.apache.org/jira/browse/SPARK-23000 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Blocker > Fix For: 2.3.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-2.3-test-sbt-hadoop-2.6/ > The test suite DataSourceWithHiveMetastoreCatalogSuite of Branch 2.3 always > failed in Hadoop 2.6.
[jira] [Comment Edited] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
[ https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320100#comment-16320100 ] Matthew Walton edited comment on SPARK-21179 at 1/10/18 11:29 AM: -- I ended up getting a resolution from Simba: "I apologize for the delay, after thorough investigation we have confirmed that this is indeed a shell issue. We will be opening a JIRA with them. The reason it took a bit longer is because we were trying to collect more information to prove that this is a shell issue. The shell should be using a JDBC api to determine the correct string identifier which is not called so default " " is used... " There is a workaround that can be used. Shell provides a JdbcDialect class that can be used to set the identifier to wanted one. For Hive you can use this before or after establishing a connection (but before calling show()): import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect} import org.apache.spark.sql.types._ val HiveDialect = new JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2") override def quoteIdentifier(colName: String): String = { s"$colName" } } JdbcDialects.registerDialect(HiveDialect) was (Author: mwalton_mstr): I ended up getting a resolution from Simba: "I apologize for the delay, after through investigation we have confirmed that this is indeed a shell issue. We will be opening a JIRA with them. The reason it took a bit longer is because we were trying to collect more information to prove that this is a shell issue. The shell should be using a JDBC api to determine the correct string identifier which is not called so default " " is used... " There is a workaround that can be used. Shell provides a JdbcDialect class that can be used to set the identifier to wanted one. 
For Hive you can use this before or after establishing a connection (but before calling show()): import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect} import org.apache.spark.sql.types._ val HiveDialect = new JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2") override def quoteIdentifier(colName: String): String = { s"$colName" } } JdbcDialects.registerDialect(HiveDialect) > Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused > by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. > - > > Key: SPARK-21179 > URL: https://issues.apache.org/jira/browse/SPARK-21179 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > HDP version 2.5.0.1-60 > Hive version: 1.2.1 > Spark version 2.0.0.2.5.0.1-60 > JDBC: Download the latest Hortonworks JDBC driver >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. > Unfortunately, when I try to query data that resides in an INT column I get > the following error: > 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. 
> Steps to reproduce: > 1) On Hive create a simple table with an INT column and insert some data (I > used SQuirreL Client with the Hortonworks JDBC driver): > create table wh2.hivespark (country_id int, country_name string) > insert into wh2.hivespark values (1, 'USA') > 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files > ./spark-shell --jars > /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar > 4) In Spark shell load the data from Hive using the JDBC driver > val hivespark = spark.read.format("jdbc").options(Map("url" -> > "jdbc:hive2://localhost:1000
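For readability, the Simba workaround quoted in the comment above, lightly tidied (same dialect logic; assumes a Spark shell with the Hive JDBC driver on the classpath):

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Register a dialect so Spark stops wrapping Hive column names in the
// default double-quote identifier, which the Simba driver cannot parse.
val HiveDialect = new JdbcDialect {
  override def canHandle(url: String): Boolean =
    url.startsWith("jdbc:hive2") || url.contains("hive2")
  override def quoteIdentifier(colName: String): String = colName  // no quoting
}
JdbcDialects.registerDialect(HiveDialect)
```

Registering the dialect before calling `show()` (before or after opening the connection) is what resolved the INT conversion error for the commenters above.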
[jira] [Commented] (SPARK-21183) Unable to return Google BigQuery INTEGER data type into Spark via google BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value t
[ https://issues.apache.org/jira/browse/SPARK-21183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320106#comment-16320106 ] Matthew Walton commented on SPARK-21183: I ended up getting an answer and work around from Simba here. Their comments below I apologize for the delay, after thorough investigation we have confirmed that this is indeed a shell issue. We will be opening a JIRA with them. The reason it took a bit longer is because we were trying to collect more information to prove that this is a shell issue. The shell should be using a JDBC api to determine the correct string identifier which is not called so default " " is used instead of back-tick which is a identifier for BigQuery. There is a workaround that can be used. Shell provides a JdbcDialect class that can be used to set the identifier to wanted one. For Hive and BigQuery you can use this before or after establishing a connection (but before calling show()): import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect} val GoogleDialect = new JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:bigquery") || url.contains("bigquery") override def quoteIdentifier(colName: String): String = { s"`$colName`"}} JdbcDialects.registerDialect(GoogleDialect); > Unable to return Google BigQuery INTEGER data type into Spark via google > BigQuery JDBC driver: java.sql.SQLDataException: [Simba][JDBC](10140) Error > converting value to long. > -- > > Key: SPARK-21183 > URL: https://issues.apache.org/jira/browse/SPARK-21183 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > Spark version 2.1.1 > JDBC: Download the latest google BigQuery JDBC Driver from Google >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark using a JDBC connection to Google > BigQuery. 
Unfortunately, when I try to query data that resides in an INTEGER > column I get the following error: > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > long. > Steps to reproduce: > 1) On Google BigQuery console create a simple table with an INT column and > insert some data > 2) Copy the Google BigQuery JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the Google BigQuery JDBC driver jar files > ./spark-shell --jars > /home/ec2-user/jdbc/gbq/GoogleBigQueryJDBC42.jar,/home/ec2-user/jdbc/gbq/google-api-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-api-services-bigquery-v2-rev320-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-http-client-jackson2-1.22.0.jar,/home/ec2-user/jdbc/gbq/google-oauth-client-1.22.0.jar,/home/ec2-user/jdbc/gbq/jackson-core-2.1.3.jar > 4) In Spark shell load the data from Google BigQuery using the JDBC driver > val gbq = spark.read.format("jdbc").options(Map("url" -> > "jdbc:bigquery://https://www.googleapis.com/bigquery/v2;ProjectId=your-project-name-here;OAuthType=0;OAuthPvtKeyPath=/usr/lib/spark/YourProjectPrivateKey.json;OAuthServiceAcctEmail=YourEmail@gmail.com;AllowLargeResults=1;LargeResultDataset=_bqodbc_temp_tables;LargeResultTable=_matthew;Timeout=600","dbtable" > -> > "test.lu_test_integer")).option("driver","com.simba.googlebigquery.jdbc42.Driver").option("user","").option("password","").load() > 5) In Spark shell try to display the data > gbq.show() > At this point you should see the error: > scala> gbq.show() > 17/06/22 19:34:57 WARN TaskSetManager: Lost task 0.0 in stage 3.0 (TID 6, > ip-172-31-37-165.ec2.internal, executor 3): java.sql.SQLDataException: > [Simba][JDBC](10140) Error converting value to long. 
> at com.simba.exceptions.ExceptionConverter.toSQLException(Unknown > Source) > at com.simba.utilities.conversion.TypeConverter.toLong(Unknown Source) > at com.simba.jdbc.common.SForwardResultSet.getLong(Unknown Source) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apa
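The workaround above hinges on what `quoteIdentifier` emits when Spark generates SQL against the source. A minimal Python sketch of that difference (the helper functions here are hypothetical stand-ins for `JdbcDialect.quoteIdentifier`, which lives on the JVM side; they are not Spark code):

```python
# Sketch of how a JDBC dialect's identifier quoting changes the generated SQL.
# default_quote / bigquery_quote are illustrative stand-ins, not Spark APIs.

def default_quote(col: str) -> str:
    """Spark's default dialect wraps identifiers in double quotes."""
    return '"' + col + '"'

def bigquery_quote(col: str) -> str:
    """BigQuery (and Hive) use back-ticks to quote identifiers."""
    return '`' + col + '`'

def build_select(columns, table, quote):
    """Build a SELECT using the given identifier-quoting function."""
    return "SELECT {} FROM {}".format(", ".join(quote(c) for c in columns), table)

# With the default dialect, BigQuery sees "id" as a string literal rather
# than a column reference, so the driver later fails converting it to long.
print(build_select(["id"], "test.lu_test_integer", default_quote))
# With the registered dialect, the identifier is quoted the way BigQuery expects.
print(build_select(["id"], "test.lu_test_integer", bigquery_quote))
```

Registering the `GoogleDialect` shown in the comment simply swaps the first behavior for the second at query-generation time.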
[jira] [Comment Edited] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
[ https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320100#comment-16320100 ] Matthew Walton edited comment on SPARK-21179 at 1/10/18 11:27 AM: -- I ended up getting a resolution from Simba: "I apologize for the delay; after thorough investigation we have confirmed that this is indeed a shell issue. We will be opening a JIRA with them. The reason it took a bit longer is because we were trying to collect more information to prove that this is a shell issue. The shell should be using a JDBC API to determine the correct string identifier, which is not called, so default " " is used... " There is a workaround that can be used. Shell provides a JdbcDialect class that can be used to set the identifier to the wanted one. For Hive you can use this before or after establishing a connection (but before calling show()): import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect} import org.apache.spark.sql.types._ val HiveDialect = new JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2") override def quoteIdentifier(colName: String): String = { s"$colName" } } JdbcDialects.registerDialect(HiveDialect)
> Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused > by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. > - > > Key: SPARK-21179 > URL: https://issues.apache.org/jira/browse/SPARK-21179 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > HDP version 2.5.0.1-60 > Hive version: 1.2.1 > Spark version 2.0.0.2.5.0.1-60 > JDBC: Download the latest Hortonworks JDBC driver >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. > Unfortunately, when I try to query data that resides in an INT column I get > the following error: > 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. 
> Steps to reproduce: > 1) On Hive create a simple table with an INT column and insert some data (I > used SQuirreL Client with the Hortonworks JDBC driver): > create table wh2.hivespark (country_id int, country_name string) > insert into wh2.hivespark values (1, 'USA') > 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files > ./spark-shell --jars > /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar > 4) In Spark shell load the data from Hive using the JDBC driver > val hivespark = spark.read.format("jdbc").options(Map("url" -> > "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;pa
[jira] [Commented] (SPARK-21179) Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int.
[ https://issues.apache.org/jira/browse/SPARK-21179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320100#comment-16320100 ] Matthew Walton commented on SPARK-21179: I ended up getting a resolution from Simba: "I apologize for the delay; after thorough investigation we have confirmed that this is indeed a shell issue. We will be opening a JIRA with them. The reason it took a bit longer is because we were trying to collect more information to prove that this is a shell issue. The shell should be using a JDBC API to determine the correct string identifier, which is not called, so default " " is used... " There is a workaround that can be used. Shell provides a JdbcDialect class that can be used to set the identifier to the wanted one. For Hive you can use this before or after establishing a connection (but before calling show()): import org.apache.spark.sql.jdbc.{JdbcDialects, JdbcType, JdbcDialect} import org.apache.spark.sql.types._ val HiveDialect = new JdbcDialect { override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2") || url.contains("hive2") override def quoteIdentifier(colName: String): String = { s"$colName" } } > Unable to return Hive INT data type into Spark via Hive JDBC driver: Caused > by: java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. > - > > Key: SPARK-21179 > URL: https://issues.apache.org/jira/browse/SPARK-21179 > Project: Spark > Issue Type: Bug > Components: Spark Shell, SQL >Affects Versions: 1.6.0, 2.0.0, 2.1.1 > Environment: OS: Linux > HDP version 2.5.0.1-60 > Hive version: 1.2.1 > Spark version 2.0.0.2.5.0.1-60 > JDBC: Download the latest Hortonworks JDBC driver >Reporter: Matthew Walton > > I'm trying to fetch back data in Spark SQL using a JDBC connection to Hive. 
> Unfortunately, when I try to query data that resides in an INT column I get > the following error: > 17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to > int. > Steps to reproduce: > 1) On Hive create a simple table with an INT column and insert some data (I > used SQuirreL Client with the Hortonworks JDBC driver): > create table wh2.hivespark (country_id int, country_name string) > insert into wh2.hivespark values (1, 'USA') > 2) Copy the Hortonworks Hive JDBC driver to the machine where you will run > Spark Shell > 3) Start Spark shell loading the Hortonworks Hive JDBC driver jar files > ./spark-shell --jars > /home/spark/jdbc/hortonworkshive/HiveJDBC41.jar,/home/spark/jdbc/hortonworkshive/TCLIServiceClient.jar,/home/spark/jdbc/hortonworkshive/commons-codec-1.3.jar,/home/spark/jdbc/hortonworkshive/commons-logging-1.1.1.jar,/home/spark/jdbc/hortonworkshive/hive_metastore.jar,/home/spark/jdbc/hortonworkshive/hive_service.jar,/home/spark/jdbc/hortonworkshive/httpclient-4.1.3.jar,/home/spark/jdbc/hortonworkshive/httpcore-4.1.3.jar,/home/spark/jdbc/hortonworkshive/libfb303-0.9.0.jar,/home/spark/jdbc/hortonworkshive/libthrift-0.9.0.jar,/home/spark/jdbc/hortonworkshive/log4j-1.2.14.jar,/home/spark/jdbc/hortonworkshive/ql.jar,/home/spark/jdbc/hortonworkshive/slf4j-api-1.5.11.jar,/home/spark/jdbc/hortonworkshive/slf4j-log4j12-1.5.11.jar,/home/spark/jdbc/hortonworkshive/zookeeper-3.4.6.jar > 4) In Spark shell load the data from Hive using the JDBC driver > val hivespark = spark.read.format("jdbc").options(Map("url" -> > "jdbc:hive2://localhost:1/wh2;AuthMech=3;UseNativeQuery=1;user=hfs;password=hdfs","dbtable" > -> > "wh2.hivespark")).option("driver","com.simba.hive.jdbc41.HS2Driver").option("user","hdfs").option("password","hdfs").load() > 5) In Spark shell try to display the data > hivespark.show() > At this point you should see the error: > scala> hivespark.show() > 
17/06/22 12:14:37 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.sql.SQLDataException: [Simba][JDBC](10140) Error converting value to int. > at > com.simba.hiveserver2.exceptions.ExceptionConverter.toSQLException(Unknown > Source) > at > com.simba.hiveserver2.utilities.conversion.TypeConverter.toInt(Unknown Source) > at com.simba.hiveserver2.jdbc.common.SForwardResultSet.getInt(Unknown > Source) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:437) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:535) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowItera
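The error in the trace above surfaces inside the driver's type converter: with the default double-quote identifier, Hive treats "country_id" as a string literal and returns that literal for every row, which the driver then cannot convert to an int. A toy Python model of that failure mode (this is an illustration of the mechanism, not the Simba driver's actual code):

```python
# Toy model of why mis-quoted identifiers produce the 10140 conversion error.
# to_int is a hypothetical stand-in for the driver's TypeConverter.toInt.

def to_int(value):
    """Convert a fetched value to int, failing the way the driver reports."""
    try:
        return int(value)
    except (TypeError, ValueError):
        raise ValueError("[Simba][JDBC](10140) Error converting value to int.")

# Correctly quoted identifier: real column values come back.
good_rows = [1, 2]
assert [to_int(v) for v in good_rows] == [1, 2]

# Mis-quoted identifier: every "row" is the literal column name string.
bad_rows = ["country_id", "country_id"]
try:
    [to_int(v) for v in bad_rows]
except ValueError as exc:
    print(exc)  # the conversion error the reporter saw
```

The registered HiveDialect avoids this by emitting an identifier Hive actually resolves as a column, so real INT values reach the converter.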
[jira] [Commented] (SPARK-17998) Reading Parquet files coalesces parts into too few in-memory partitions
[ https://issues.apache.org/jira/browse/SPARK-17998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320041#comment-16320041 ] sam commented on SPARK-17998: - [~srowen] Thanks, no idea where I got that from, cursed weakly typed silently failing config builders!! Will try `spark.files.maxPartitionBytes` > Reading Parquet files coalesces parts into too few in-memory partitions > --- > > Key: SPARK-17998 > URL: https://issues.apache.org/jira/browse/SPARK-17998 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.0, 2.0.1 > Environment: Spark Standalone Cluster (not "local mode") > Windows 10 and Windows 7 > Python 3.x >Reporter: Shea Parkes > > Reading a parquet ~file into a DataFrame is resulting in far too few > in-memory partitions. In prior versions of Spark, the resulting DataFrame > would have a number of partitions often equal to the number of parts in the > parquet folder. > Here's a minimal reproducible sample: > {quote} > df_first = session.range(start=1, end=1, numPartitions=13) > assert df_first.rdd.getNumPartitions() == 13 > assert session._sc.defaultParallelism == 6 > path_scrap = r"c:\scratch\scrap.parquet" > df_first.write.parquet(path_scrap) > df_second = session.read.parquet(path_scrap) > print(df_second.rdd.getNumPartitions()) > {quote} > The above shows only 7 partitions in the DataFrame that was created by > reading the Parquet back into memory for me. Why is it no longer just the > number of part files in the Parquet folder? (Which is 13 in the example > above.) > I'm filing this as a bug because it has gotten so bad that we can't work with > the underlying RDD without first repartitioning the DataFrame, which is > costly and wasteful. I really doubt this was the intended effect of moving > to Spark 2.0. > I've tried to research where the number of in-memory partitions is > determined, but my Scala skills have proven in-adequate. 
I'd be happy to dig > further if someone could point me in the right direction... -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
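For context on where the in-memory partition count comes from: since 2.0, Spark packs file splits into read partitions using `spark.sql.files.maxPartitionBytes` (note the `sql` segment, which the name quoted above lacks) together with `spark.sql.files.openCostInBytes` and the default parallelism. A simplified Python sketch of that packing logic (modeled on Spark's Scala file-scan code; it omits large-file splitting, and the constants and file sizes are illustrative):

```python
# Simplified model of how Spark 2.x packs part-files into read partitions.
# Based on the FilePartition packing logic; illustrative, not Spark source.

def num_read_partitions(file_sizes,
                        max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                        open_cost=4 * 1024 * 1024,              # spark.sql.files.openCostInBytes
                        default_parallelism=6):
    # Each file is charged an "open cost" on top of its real size.
    total = sum(size + open_cost for size in file_sizes)
    bytes_per_core = total // default_parallelism
    max_split = min(max_partition_bytes, max(open_cost, bytes_per_core))

    partitions, current = 0, 0
    for size in sorted(file_sizes, reverse=True):  # biggest files first
        if current > 0 and current + size > max_split:
            partitions += 1   # close the current partition
            current = 0
        current += size + open_cost
    return partitions + (1 if current > 0 else 0)

# Thirteen tiny part-files do NOT yield thirteen partitions: the open cost
# dominates, so several files are packed into each read partition.
print(num_read_partitions([8 * 1024] * 13))
```

Raising `default_parallelism` or lowering `max_partition_bytes` pushes the count back toward one partition per file, which matches the "repartition after read" workaround the reporter describes.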
[jira] [Commented] (SPARK-23026) Add RegisterUDF to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320022#comment-16320022 ] Apache Spark commented on SPARK-23026: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20217 > Add RegisterUDF to PySpark > -- > > Key: SPARK-23026 > URL: https://issues.apache.org/jira/browse/SPARK-23026 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add a new API for registering row-at-a-time or scalar vectorized UDFs. The > registered UDFs can be used in the SQL statement. > {noformat} > >>> from pyspark.sql.types import IntegerType > >>> from pyspark.sql.functions import udf > >>> slen = udf(lambda s: len(s), IntegerType()) > >>> _ = spark.udf.registerUDF("slen", slen) > >>> spark.sql("SELECT slen('test')").collect() > [Row(slen(test)=4)] > >>> import random > >>> from pyspark.sql.functions import udf > >>> from pyspark.sql.types import IntegerType > >>> random_udf = udf(lambda: random.randint(0, 100), > >>> IntegerType()).asNondeterministic() > >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf) > >>> spark.sql("SELECT random_udf()").collect() > [Row(random_udf()=82)] > >>> spark.range(1).select(newRandom_udf()).collect() > [Row(random_udf()=62)] > >>> from pyspark.sql.functions import pandas_udf, PandasUDFType > >>> @pandas_udf("integer", PandasUDFType.SCALAR) > ... def add_one(x): > ... return x + 1 > ... > >>> _ = spark.udf.registerUDF("add_one", add_one) > >>> spark.sql("SELECT add_one(id) FROM range(10)").collect() > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
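The distinction the new API draws, row-at-a-time UDFs versus scalar vectorized (pandas) UDFs, can be sketched without Spark or pandas. The two runner functions below are a hypothetical model of the execution difference, not Spark's executor code:

```python
# Toy contrast between row-at-a-time and scalar "vectorized" UDF execution.
# run_row_udf / run_scalar_vectorized_udf are illustrative models only.

def run_row_udf(f, column):
    """Row-at-a-time: one Python call per value, as with udf(lambda s: len(s), ...)."""
    return [f(v) for v in column]

def run_scalar_vectorized_udf(f, column):
    """Vectorized: one call per batch; f maps a whole batch to a batch,
    as a pandas_udf receives and returns a pandas Series."""
    return f(column)

values = ["spark", "udf", "test"]
# Same results either way; the vectorized form amortizes call overhead.
print(run_row_udf(len, values))
print(run_scalar_vectorized_udf(lambda batch: [v + 1 for v in batch], [1, 2, 3]))
```

Either flavor, once registered via the proposed `registerUDF`, becomes callable from SQL text exactly as the `slen` and `add_one` examples above show.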
[jira] [Assigned] (SPARK-23026) Add RegisterUDF to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23026: Assignee: Xiao Li (was: Apache Spark) > Add RegisterUDF to PySpark > -- > > Key: SPARK-23026 > URL: https://issues.apache.org/jira/browse/SPARK-23026 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Xiao Li > > Add a new API for registering row-at-a-time or scalar vectorized UDFs. The > registered UDFs can be used in the SQL statement. > {noformat} > >>> from pyspark.sql.types import IntegerType > >>> from pyspark.sql.functions import udf > >>> slen = udf(lambda s: len(s), IntegerType()) > >>> _ = spark.udf.registerUDF("slen", slen) > >>> spark.sql("SELECT slen('test')").collect() > [Row(slen(test)=4)] > >>> import random > >>> from pyspark.sql.functions import udf > >>> from pyspark.sql.types import IntegerType > >>> random_udf = udf(lambda: random.randint(0, 100), > >>> IntegerType()).asNondeterministic() > >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf) > >>> spark.sql("SELECT random_udf()").collect() > [Row(random_udf()=82)] > >>> spark.range(1).select(newRandom_udf()).collect() > [Row(random_udf()=62)] > >>> from pyspark.sql.functions import pandas_udf, PandasUDFType > >>> @pandas_udf("integer", PandasUDFType.SCALAR) > ... def add_one(x): > ... return x + 1 > ... > >>> _ = spark.udf.registerUDF("add_one", add_one) > >>> spark.sql("SELECT add_one(id) FROM range(10)").collect() > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23026) Add RegisterUDF to PySpark
[ https://issues.apache.org/jira/browse/SPARK-23026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23026: Assignee: Apache Spark (was: Xiao Li) > Add RegisterUDF to PySpark > -- > > Key: SPARK-23026 > URL: https://issues.apache.org/jira/browse/SPARK-23026 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Xiao Li >Assignee: Apache Spark > > Add a new API for registering row-at-a-time or scalar vectorized UDFs. The > registered UDFs can be used in the SQL statement. > {noformat} > >>> from pyspark.sql.types import IntegerType > >>> from pyspark.sql.functions import udf > >>> slen = udf(lambda s: len(s), IntegerType()) > >>> _ = spark.udf.registerUDF("slen", slen) > >>> spark.sql("SELECT slen('test')").collect() > [Row(slen(test)=4)] > >>> import random > >>> from pyspark.sql.functions import udf > >>> from pyspark.sql.types import IntegerType > >>> random_udf = udf(lambda: random.randint(0, 100), > >>> IntegerType()).asNondeterministic() > >>> newRandom_udf = spark.catalog.registerUDF("random_udf", random_udf) > >>> spark.sql("SELECT random_udf()").collect() > [Row(random_udf()=82)] > >>> spark.range(1).select(newRandom_udf()).collect() > [Row(random_udf()=62)] > >>> from pyspark.sql.functions import pandas_udf, PandasUDFType > >>> @pandas_udf("integer", PandasUDFType.SCALAR) > ... def add_one(x): > ... return x + 1 > ... > >>> _ = spark.udf.registerUDF("add_one", add_one) > >>> spark.sql("SELECT add_one(id) FROM range(10)").collect() > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org