[jira] [Assigned] (SPARK-19706) add Column.contains in pyspark
[ https://issues.apache.org/jira/browse/SPARK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19706:
------------------------------------

    Assignee: Apache Spark  (was: Wenchen Fan)

> add Column.contains in pyspark
> ------------------------------
>
> Key: SPARK-19706
> URL: https://issues.apache.org/jira/browse/SPARK-19706
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Wenchen Fan
> Assignee: Apache Spark

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19706) add Column.contains in pyspark
[ https://issues.apache.org/jira/browse/SPARK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19706:
------------------------------------

    Assignee: Wenchen Fan  (was: Apache Spark)

> add Column.contains in pyspark
> ------------------------------
>
> Key: SPARK-19706
> URL: https://issues.apache.org/jira/browse/SPARK-19706
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
[jira] [Commented] (SPARK-19706) add Column.contains in pyspark
[ https://issues.apache.org/jira/browse/SPARK-19706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880062#comment-15880062 ]

Apache Spark commented on SPARK-19706:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17036

> add Column.contains in pyspark
> ------------------------------
>
> Key: SPARK-19706
> URL: https://issues.apache.org/jira/browse/SPARK-19706
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 2.2.0
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
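The feature under discussion adds an explicit `contains` method on PySpark's `Column`. A minimal pure-Python sketch of why it has to be a method rather than Python's `in` operator: `__contains__` must return a bool, while `contains` must return a lazy `Column` expression usable in filters. The `Column` class below is an illustrative stand-in, not real pyspark internals.

```python
# Stand-in Column class illustrating the semantics of Column.contains
# (SPARK-19706). Real PySpark delegates to the JVM Column via Py4J.

class Column:
    def __init__(self, expr):
        self.expr = expr

    def contains(self, other):
        # Builds a new expression Column instead of evaluating eagerly,
        # so it can be used lazily, e.g. df.filter(df.name.contains("ab")).
        return Column("contains(%s, %r)" % (self.expr, other))

    def __repr__(self):
        return "Column<%s>" % self.expr

print(Column("name").contains("ab"))  # Column<contains(name, 'ab')>
```

Python's `x in col` cannot serve here because the interpreter coerces the result of `__contains__` to a plain boolean, losing the expression tree.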
[jira] [Commented] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880058#comment-15880058 ]

Apache Spark commented on SPARK-19705:
--------------------------------------

User 'tanejagagan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17035

> Preferred location supporting HDFS Cache for FileScanRDD
> --------------------------------------------------------
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache when calculating
> preferredLocations, FileScanRDD does not take the HDFS cache into account.
> The enhancement can be easily implemented for large files where a
> FilePartition contains only a single HDFS file.
> It will also result in a significant performance improvement for cached HDFS
> partitions.
[jira] [Assigned] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19705:
------------------------------------

    Assignee:     (was: Apache Spark)

> Preferred location supporting HDFS Cache for FileScanRDD
> --------------------------------------------------------
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache when calculating
> preferredLocations, FileScanRDD does not take the HDFS cache into account.
> The enhancement can be easily implemented for large files where a
> FilePartition contains only a single HDFS file.
> It will also result in a significant performance improvement for cached HDFS
> partitions.
[jira] [Assigned] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD
[ https://issues.apache.org/jira/browse/SPARK-19705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19705:
------------------------------------

    Assignee: Apache Spark

> Preferred location supporting HDFS Cache for FileScanRDD
> --------------------------------------------------------
>
> Key: SPARK-19705
> URL: https://issues.apache.org/jira/browse/SPARK-19705
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
> Assignee: Apache Spark
>
> Although NewHadoopRDD and HadoopRDD consider the HDFS cache when calculating
> preferredLocations, FileScanRDD does not take the HDFS cache into account.
> The enhancement can be easily implemented for large files where a
> FilePartition contains only a single HDFS file.
> It will also result in a significant performance improvement for cached HDFS
> partitions.
[jira] [Created] (SPARK-19706) add Column.contains in pyspark
Wenchen Fan created SPARK-19706:
-----------------------------------

    Summary: add Column.contains in pyspark
    Key: SPARK-19706
    URL: https://issues.apache.org/jira/browse/SPARK-19706
    Project: Spark
    Issue Type: Improvement
    Components: PySpark
    Affects Versions: 2.2.0
    Reporter: Wenchen Fan
    Assignee: Wenchen Fan
[jira] [Created] (SPARK-19705) Preferred location supporting HDFS Cache for FileScanRDD
gagan taneja created SPARK-19705:
------------------------------------

    Summary: Preferred location supporting HDFS Cache for FileScanRDD
    Key: SPARK-19705
    URL: https://issues.apache.org/jira/browse/SPARK-19705
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 2.1.0
    Reporter: gagan taneja

Although NewHadoopRDD and HadoopRDD consider the HDFS cache when calculating preferredLocations, FileScanRDD does not take the HDFS cache into account.
The enhancement can be easily implemented for large files where a FilePartition contains only a single HDFS file.
It will also result in a significant performance improvement for cached HDFS partitions.
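The idea in the ticket above is to fold HDFS cache awareness into the preferred-location calculation. A minimal pure-Python sketch of that scheduling preference, with the `(host, is_cached)` pairs standing in for HDFS `BlockLocation` metadata (the function and data shapes are illustrative, not Spark internals):

```python
# Sketch of cache-aware preferred locations for a single-file FilePartition:
# hosts holding a cached replica of the file's blocks rank ahead of hosts
# that merely hold an on-disk replica.

def preferred_locations(block_locations):
    """block_locations: list of (host, is_cached) pairs for one HDFS file."""
    cached = [host for host, is_cached in block_locations if is_cached]
    # Fall back to the ordinary replica hosts when nothing is cached.
    return cached if cached else [host for host, _ in block_locations]

locs = [("host1", False), ("host2", True), ("host3", False)]
print(preferred_locations(locs))  # ['host2']
```

This is why the ticket restricts the easy case to partitions containing a single HDFS file: with one file, the cached-replica hosts of its blocks translate directly into the partition's preferred locations.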
[jira] [Created] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol
zhengruifeng created SPARK-19704:
------------------------------------

    Summary: AFTSurvivalRegression should support numeric censorCol
    Key: SPARK-19704
    URL: https://issues.apache.org/jira/browse/SPARK-19704
    Project: Spark
    Issue Type: Improvement
    Components: ML
    Affects Versions: 2.2.0
    Reporter: zhengruifeng
    Priority: Minor

AFTSurvivalRegression should support numeric censorCol
[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19704:
------------------------------------

    Assignee:     (was: Apache Spark)

> AFTSurvivalRegression should support numeric censorCol
> ------------------------------------------------------
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: zhengruifeng
> Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol
[jira] [Assigned] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19704:
------------------------------------

    Assignee: Apache Spark

> AFTSurvivalRegression should support numeric censorCol
> ------------------------------------------------------
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: zhengruifeng
> Assignee: Apache Spark
> Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol
[jira] [Commented] (SPARK-19704) AFTSurvivalRegression should support numeric censorCol
[ https://issues.apache.org/jira/browse/SPARK-19704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879997#comment-15879997 ]

Apache Spark commented on SPARK-19704:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/17034

> AFTSurvivalRegression should support numeric censorCol
> ------------------------------------------------------
>
> Key: SPARK-19704
> URL: https://issues.apache.org/jira/browse/SPARK-19704
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: zhengruifeng
> Priority: Minor
>
> AFTSurvivalRegression should support numeric censorCol
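Supporting a numeric `censorCol` amounts to casting whatever numeric type the user supplies and validating that every value is exactly 0 or 1 (censored vs. uncensored, as AFT survival regression requires). A pure-Python sketch of that check, with illustrative names that are not Spark ML internals:

```python
# Sketch of numeric censor-column handling: accept any numeric input type,
# cast to float, and reject anything other than 0 or 1.

def validate_censor(values):
    out = []
    for v in values:
        f = float(v)  # accepts int/long/float/double-like inputs
        if f not in (0.0, 1.0):
            raise ValueError("censor value must be 0 or 1, got %r" % v)
        out.append(f)
    return out

print(validate_censor([0, 1, 1.0]))  # [0.0, 1.0, 1.0]
```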
[jira] [Assigned] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in Json formats
[ https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-19695:
-----------------------------------

    Assignee: Takeshi Yamamuro

> Throw an exception if a `columnNameOfCorruptRecord` field violates
> requirements in Json formats
> ------------------------------------------------------------------
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Takeshi Yamamuro
> Assignee: Takeshi Yamamuro
> Priority: Minor
> Fix For: 2.2.0
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and
> aligns the JSON behaviour with the CSV one.
[jira] [Resolved] (SPARK-19695) Throw an exception if a `columnNameOfCorruptRecord` field violates requirements in Json formats
[ https://issues.apache.org/jira/browse/SPARK-19695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19695.
---------------------------------

    Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17023
[https://github.com/apache/spark/pull/17023]

> Throw an exception if a `columnNameOfCorruptRecord` field violates
> requirements in Json formats
> ------------------------------------------------------------------
>
> Key: SPARK-19695
> URL: https://issues.apache.org/jira/browse/SPARK-19695
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: Takeshi Yamamuro
> Priority: Minor
> Fix For: 2.2.0
>
> This ticket comes from https://github.com/apache/spark/pull/16928, and
> aligns the JSON behaviour with the CSV one.
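The requirement being enforced above is that the field named by `columnNameOfCorruptRecord`, if present in the user-supplied schema, must be a string field, since the parser writes the raw malformed record into it. A pure-Python sketch of that validation (the schema shape and function name are illustrative stand-ins, not Spark's actual check):

```python
# Sketch of validating the corrupt-record field in a user-supplied schema:
# if the named field exists, it must be string-typed; otherwise raise
# instead of failing later in a confusing way.

def check_corrupt_record_field(schema, corrupt_col):
    """schema: list of (name, dtype, nullable) tuples."""
    for name, dtype, nullable in schema:
        if name == corrupt_col:
            if dtype != "string":
                raise TypeError(
                    "The field for corrupt records must be string type")
            return
    # Field absent from the schema: nothing to validate.

schema = [("a", "int", True), ("_corrupt_record", "string", True)]
check_corrupt_record_field(schema, "_corrupt_record")  # passes silently
```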
[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory
[ https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879893#comment-15879893 ]

Saisai Shao commented on SPARK-19688:
-------------------------------------

I see. So what issue did you encounter when you restarted the application manually, or did you only observe the abnormal credential configuration? From my understanding, this credential configuration is overwritten when you restart the application, so it should be fine.

> Spark on Yarn Credentials File set to different application directory
> ---------------------------------------------------------------------
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
> Issue Type: Bug
> Components: DStreams, YARN
> Affects Versions: 1.6.3
> Reporter: Devaraj Jonnadula
> Priority: Minor
>
> The spark.yarn.credentials.file property is set to a different application Id
> instead of the actual Application Id.
[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory
[ https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879888#comment-15879888 ]

Devaraj Jonnadula commented on SPARK-19688:
-------------------------------------------

[~jerryshao] I did not check for YARN's reattempt. I am seeing this behavior for manual restarts.

> Spark on Yarn Credentials File set to different application directory
> ---------------------------------------------------------------------
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
> Issue Type: Bug
> Components: DStreams, YARN
> Affects Versions: 1.6.3
> Reporter: Devaraj Jonnadula
> Priority: Minor
>
> The spark.yarn.credentials.file property is set to a different application Id
> instead of the actual Application Id.
[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory
[ https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879883#comment-15879883 ]

Saisai Shao commented on SPARK-19688:
-------------------------------------

[~j.devaraj], when you say the Spark application is restarted, do you mean YARN's reattempt mechanism, or a manual restart of the application?

> Spark on Yarn Credentials File set to different application directory
> ---------------------------------------------------------------------
>
> Key: SPARK-19688
> URL: https://issues.apache.org/jira/browse/SPARK-19688
> Project: Spark
> Issue Type: Bug
> Components: DStreams, YARN
> Affects Versions: 1.6.3
> Reporter: Devaraj Jonnadula
> Priority: Minor
>
> The spark.yarn.credentials.file property is set to a different application Id
> instead of the actual Application Id.
[jira] [Reopened] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true
[ https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luciano Resende reopened SPARK-5159:
------------------------------------

Reopened due to the comments above.

> Thrift server does not respect hive.server2.enable.doAs=true
> ------------------------------------------------------------
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.2.0
> Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
> I'm currently testing the Spark SQL Thrift server on a Kerberos-secured
> cluster in YARN mode. Currently any user can access any table regardless of
> HDFS permissions, as all data is read as the hive user. In HiveServer2 the
> property hive.server2.enable.doAs=true causes all access to be done as the
> submitting user. We should do the same.
[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879823#comment-15879823 ]

Deenbandhu Agarwal commented on SPARK-19644:
--------------------------------------------

I am using Scala 2.11.

> Memory leak in Spark Streaming
> ------------------------------
>
> Key: SPARK-19644
> URL: https://issues.apache.org/jira/browse/SPARK-19644
> Project: Spark
> Issue Type: Bug
> Components: DStreams
> Affects Versions: 2.0.2
> Environment: 3 AWS EC2 c3.xlarge instances
> Number of cores: 3
> Number of executors: 3
> Memory per executor: 2GB
> Reporter: Deenbandhu Agarwal
> Priority: Critical
> Labels: memory_leak, performance
> Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png
>
> I am using streaming in production for some aggregation, fetching data from
> Cassandra and saving data back to Cassandra.
> I see a gradual increase in old-generation heap capacity, from 1161216 bytes
> to 1397760 bytes, over a period of six hours.
> After 50 hours of processing, the number of instances of the class
> scala.collection.immutable.$colon$colon increased to 12,811,793, which is a
> huge number.
> I think this is a clear case of a memory leak.
[jira] [Resolved] (SPARK-16122) Spark History Server REST API missing an environment endpoint per application
[ https://issues.apache.org/jira/browse/SPARK-16122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin resolved SPARK-16122.
------------------------------------

    Resolution: Fixed
    Assignee: Genmao Yu
    Fix Version/s: 2.2.0

> Spark History Server REST API missing an environment endpoint per application
> -----------------------------------------------------------------------------
>
> Key: SPARK-16122
> URL: https://issues.apache.org/jira/browse/SPARK-16122
> Project: Spark
> Issue Type: New Feature
> Components: Documentation, Web UI
> Affects Versions: 1.6.1
> Reporter: Neelesh Srinivas Salian
> Assignee: Genmao Yu
> Priority: Minor
> Labels: Docs, WebUI
> Fix For: 2.2.0
>
> The Web UI for the Spark History Server has an Environment tab that lets you
> view the environment for a job, with runtime information, Spark properties,
> etc.
> How about adding an endpoint to the REST API that exposes this environment
> information for an application?
> /applications/[app-id]/environment
> Docs were added too, so that a subsequent documentation change can get it
> included in the API docs.
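The endpoint path named in the ticket slots in alongside the other per-application REST resources. A small sketch of how a client might construct the URL; the base URL and helper name are assumptions, and only string construction is shown (no HTTP call):

```python
# Sketch of building the history-server URL for the environment endpoint
# added by SPARK-16122: /applications/[app-id]/environment.

def environment_endpoint(base, app_id):
    """base: history server API root, e.g. http://host:18080/api/v1"""
    return "%s/applications/%s/environment" % (base.rstrip("/"), app_id)

print(environment_endpoint("http://history:18080/api/v1", "app-20170223-0001"))
# http://history:18080/api/v1/applications/app-20170223-0001/environment
```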
[jira] [Updated] (SPARK-19490) Hive partition columns are case-sensitive
[ https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

cen yuhai updated SPARK-19490:
------------------------------

    Description:

The real partition columns are lower case (year, month, day):

{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
	at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
	at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
	at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
	at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150)
	at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472)
	at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
	at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235)
	at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124)
	at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42)
	at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
	at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85)
	at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213)
	at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261)
	at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117)
	at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
{code}

These SQL statements reproduce the bug:

CREATE TABLE partition_test (key Int) partitioned by (date string)
SELECT * FROM partition_test where DATE = '20170101'

    was:

The real partition columns are lower case (year, month, day):

{code}
Caused by: java.lang.RuntimeException: Expected only partition pruning predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202)
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985)
	at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976)
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95)
	at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976)
	at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161)
	at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151)
	at
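The failure above happens because the predicate references the partition columns in upper case (YEAR, MONTH, DAY) while the metastore knows them in lower case. A pure-Python sketch of the fix direction, resolving predicate column names case-insensitively against the table's real partition columns before pruning (names and data shapes here are illustrative, not Spark's catalyst internals):

```python
# Sketch of case-insensitive resolution of partition-predicate columns
# (SPARK-19490): map each referenced name onto the table's actual
# (lower-case) partition column instead of failing the pruning check.

def normalize_predicate_columns(pred_cols, partition_cols):
    by_lower = {c.lower(): c for c in partition_cols}
    resolved = []
    for col in pred_cols:
        actual = by_lower.get(col.lower())
        if actual is None:
            raise ValueError("not a partition column: %s" % col)
        resolved.append(actual)
    return resolved

print(normalize_predicate_columns(["YEAR", "Month"], ["year", "month", "day"]))
# ['year', 'month']
```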
[jira] [Updated] (SPARK-19490) Hive partition columns are case-sensitive
[ https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-19490: -- Description: The real partitions columns are lower case (year, month, day) {code} Caused by: java.lang.RuntimeException: Expected only partition pruning predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150) at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472) at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235) at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124) 
at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85) at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213) at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261) at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117) at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) {code} Use this sql can reproduce this bug: CREATE TABLE partition_test (key Int) partitioned by (date string) SELECT * FROM partition_test where DATE = '20170101' was: The real partitions columns are lower case (year, month, day) {code} Caused by: java.lang.RuntimeException: Expected only partition pruning predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985) at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151) at
[jira] [Updated] (SPARK-19490) Hive partition columns are case-sensitive
[ https://issues.apache.org/jira/browse/SPARK-19490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] cen yuhai updated SPARK-19490: -- Description: The real partitions columns are lower case (year, month, day) {code} Caused by: java.lang.RuntimeException: Expected only partition pruning predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:150) at org.apache.spark.util.Utils$.withDummyCallSite(Utils.scala:2472) at org.apache.spark.sql.hive.execution.HiveTableScanExec.doExecute(HiveTableScanExec.scala:149) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:235) at org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:124) 
at org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:42) at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141) at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:368) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:85) at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.doEstimationIfNecessary(ExchangeCoordinator.scala:213) at org.apache.spark.sql.execution.exchange.ExchangeCoordinator.postShuffleRDD(ExchangeCoordinator.scala:261) at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:117) at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:112) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) {code} use this sql can reproduce this bug: CREATE TABLE partition_test (key Int) partitioned by (date string) SELECT * FROM partition_test where DATE = '20170101' was: The real partitions columns are lower case (year, month, day) {code} Caused by: java.lang.RuntimeException: Expected only partition pruning predicates: (concat(YEAR#22, MONTH#23, DAY#24) = 20170202) at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:985) at 
org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:95) at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:976) at org.apache.spark.sql.hive.MetastoreRelation.getHiveQlPartitions(MetastoreRelation.scala:161) at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$10.apply(HiveTableScanExec.scala:151) at
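The failure above comes down to case-sensitive name matching: the predicate references YEAR/MONTH/DAY while the metastore stores year/month/day. A stdlib-only sketch of the mismatch and a case-insensitive resolution (the column names are taken from the report; the resolution code is illustrative, not Spark's implementation):

```python
# Partition columns as stored in the Hive metastore vs. as written in the query.
partition_cols = ["year", "month", "day"]
predicate_cols = ["YEAR", "MONTH", "DAY"]

# Exact (case-sensitive) matching fails, so the predicate is not recognized
# as a partition-pruning predicate and the RuntimeException above is thrown.
assert not set(predicate_cols) <= set(partition_cols)

# Resolving names case-insensitively maps each predicate column back to the
# real partition column.
lookup = {c.lower(): c for c in partition_cols}
resolved = [lookup[c.lower()] for c in predicate_cols]
assert resolved == ["year", "month", "day"]
```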
[jira] [Assigned] (SPARK-15615) Support for creating a dataframe from JSON in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-15615: --- Assignee: PJ Fanning > Support for creating a dataframe from JSON in Dataset[String] > -- > > Key: SPARK-15615 > URL: https://issues.apache.org/jira/browse/SPARK-15615 > Project: Spark > Issue Type: Bug >Reporter: PJ Fanning >Assignee: PJ Fanning > Fix For: 2.2.0 > > > We should deprecate DataFrameReader.scala json(rdd: RDD[String]) and support > json(ds: Dataset[String]) instead -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15615) Support for creating a dataframe from JSON in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-15615. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16895 [https://github.com/apache/spark/pull/16895] > Support for creating a dataframe from JSON in Dataset[String] > -- > > Key: SPARK-15615 > URL: https://issues.apache.org/jira/browse/SPARK-15615 > Project: Spark > Issue Type: Bug >Reporter: PJ Fanning > Fix For: 2.2.0 > > > We should deprecate DataFrameReader.scala json(rdd: RDD[String]) and support > json(ds: Dataset[String]) instead -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
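The change replaces the `RDD[String]` input with `Dataset[String]`; in both cases each element is one JSON document that becomes one row. A stdlib-only sketch of that per-element parsing, as a model of what `json(ds: Dataset[String])` consumes (not the Spark implementation):

```python
import json

# Each element of the Dataset[String] is one JSON document.
lines = ['{"name": "a", "age": 1}', '{"name": "b", "age": 2}']

# spark.read.json(ds) conceptually parses each string into a row.
rows = [json.loads(line) for line in lines]
assert rows[0]["name"] == "a" and rows[1]["age"] == 2
```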
[jira] [Resolved] (SPARK-19658) Set NumPartitions of RepartitionByExpression In Analyzer
[ https://issues.apache.org/jira/browse/SPARK-19658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19658. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16988 [https://github.com/apache/spark/pull/16988] > Set NumPartitions of RepartitionByExpression In Analyzer > > > Key: SPARK-19658 > URL: https://issues.apache.org/jira/browse/SPARK-19658 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.2.0 > > > Currently, if {{NumPartitions}} is not set, we will set it using > `spark.sql.shuffle.partitions` in Planner. However, this is not following > general resolution process. We should do it in Analyzer and then Optimizer > can use the value for optimization. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
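The resolution the ticket describes — fill in `NumPartitions` from `spark.sql.shuffle.partitions` when the user did not set it, and do it during analysis so the optimizer sees a concrete value — can be sketched as (function and default value are illustrative):

```python
def resolve_num_partitions(num_partitions, conf):
    """Hedged sketch of the analyzer-side resolution described above."""
    if num_partitions is not None:
        return num_partitions
    # Fall back to the session default, i.e. spark.sql.shuffle.partitions.
    return int(conf.get("spark.sql.shuffle.partitions", 200))

assert resolve_num_partitions(8, {}) == 8
assert resolve_num_partitions(None, {"spark.sql.shuffle.partitions": "50"}) == 50
```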
[jira] [Commented] (SPARK-14480) Remove meaningless StringIteratorReader for CSV data source for better performance
[ https://issues.apache.org/jira/browse/SPARK-14480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879629#comment-15879629 ] Hyukjin Kwon commented on SPARK-14480: -- This seems not blocked by any of those [~pes2009k]. I sent a PR for multiple line support https://github.com/apache/spark/pull/16976 > Remove meaningless StringIteratorReader for CSV data source for better > performance > -- > > Key: SPARK-14480 > URL: https://issues.apache.org/jira/browse/SPARK-14480 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon > Fix For: 2.1.0 > > > Currently, CSV data source reads and parses CSV data bytes by bytes (not line > by line). > In {{CSVParser.scala}}, there is an {{Reader}} wrapping {{Iterator}}. I think > is made like this for better performance. However, it looks there are two > problems. > Firstly, it was actually not faster than processing line by line with > {{Iterator}} due to additional logics to wrap {{Iterator}} to {{Reader}}. > Secondly, this brought a bit of complexity because it needs additional logics > to allow every line to be read bytes by bytes. So, it was pretty difficult to > figure out issues about parsing, (eg. SPARK-14103). Actually almost all codes > in {{CSVParser}} might not be needed. > I made a rough patch and tested this. The test results for the first problem > are below: > h4. Results > - Original codes with {{Reader}} wrapping {{Iterator}} > ||End-to-end (ns)||Parse Time (ns)|| > | 14116265034 | 2008277960 | > - New codes with {{Iterator}} > ||End-to-end (ns)||Parse Time (ns)|| > | 13451699644 | 1549050564 | > In more details, > h4. Method > - TCP-H lineitem table is being tested. > - The results are collected only by 100. > - End-to-end tests and parsing time tests are performed 10 times and averages > are calculated for each. > h4. Environment > - Machine: MacBook Pro Retina > - CPU: 4 > - Memory: 8GB > h4. 
Dataset > - [TPC-H|http://www.tpc.org/tpch/] Lineitem Table created with factor 1 > ([generate data|https://github.com/rxin/TPC-H-Hive/tree/master/dbgen)]) > - Size : 724.66 MB > h4. Test Codes > - Function to measure time > {code} > def time[A](f: => A) = { > val s = System.nanoTime > val ret = f > println("time: "+(System.nanoTime-s)/1e6+"ms") > ret > } > {code} > - End-to-end test > {code} > val path = "lineitem.tbl" > val df = sqlContext > .read > .format("csv") > .option("header", "false") > .option("delimiter", "|") > .load(path) > time(df.take(100)) > {code} > - Parsing time test for original (in {{BulkCsvParser}}) > {code} > ... > // `reader` is a wrapper for an Iterator. > private val reader = new StringIteratorReader(iter) > parser.beginParsing(reader) > ... > time(parser.parseNext()) > ... > {code} > - Parsing time test for new (in {{BulkCsvParser}}) > {code} > ... > time(parser.parseLine(iter.next())) > ... > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
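For readers who want to reproduce this kind of measurement outside Spark, a Python analogue of the Scala `time` helper above (the name `timed` is ours, not from the patch):

```python
import time

def timed(f):
    """Run f(), print elapsed wall-clock time in ms, and return its result."""
    start = time.perf_counter_ns()
    result = f()
    print("time: " + str((time.perf_counter_ns() - start) / 1e6) + "ms")
    return result

assert timed(lambda: 1 + 1) == 2
```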
[jira] [Commented] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions
[ https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879614#comment-15879614 ] Apache Spark commented on SPARK-19460: -- User 'wangmiao1981' has created a pull request for this issue: https://github.com/apache/spark/pull/17032 > Update dataset used in R documentation, examples to reduce warning noise and > confusions > --- > > Key: SPARK-19460 > URL: https://issues.apache.org/jira/browse/SPARK-19460 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Running build we have a bunch of warnings from using the `iris` dataset, for > example. > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > These are the results of having `.` in the column name. For reference, see > SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't > support that there then we should strongly consider using other dataset > without `.`, eg. 
`cars` > And we should update this in API doc (roxygen2 doc string), vignettes, > programming guide, R code example. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
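The warnings quote the exact substitution SparkR performs: `.` in a column name becomes `_`. A one-line sketch of that mapping over the `iris` columns:

```python
# SparkR rewrites '.' to '_' in column names, per the warnings quoted above.
cols = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]
sanitized = [c.replace(".", "_") for c in cols]
assert sanitized == ["Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"]
```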
[jira] [Assigned] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions
[ https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19460: Assignee: Apache Spark > Update dataset used in R documentation, examples to reduce warning noise and > confusions > --- > > Key: SPARK-19460 > URL: https://issues.apache.org/jira/browse/SPARK-19460 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung >Assignee: Apache Spark > > Running build we have a bunch of warnings from using the `iris` dataset, for > example. > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > These are the results of having `.` in the column name. For reference, see > SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't > support that there then we should strongly consider using other dataset > without `.`, eg. `cars` > And we should update this in API doc (roxygen2 doc string), vignettes, > programming guide, R code example. 
[jira] [Assigned] (SPARK-19460) Update dataset used in R documentation, examples to reduce warning noise and confusions
[ https://issues.apache.org/jira/browse/SPARK-19460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19460: Assignee: (was: Apache Spark) > Update dataset used in R documentation, examples to reduce warning noise and > confusions > --- > > Key: SPARK-19460 > URL: https://issues.apache.org/jira/browse/SPARK-19460 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.1.0 >Reporter: Felix Cheung > > Running build we have a bunch of warnings from using the `iris` dataset, for > example. > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > Warning in FUN(X[[4L]], ...) : > Use Petal_Width instead of Petal.Width as column name > Warning in FUN(X[[1L]], ...) : > Use Sepal_Length instead of Sepal.Length as column name > Warning in FUN(X[[2L]], ...) : > Use Sepal_Width instead of Sepal.Width as column name > Warning in FUN(X[[3L]], ...) : > Use Petal_Length instead of Petal.Length as column name > These are the results of having `.` in the column name. For reference, see > SPARK-12191, SPARK-11976. Since it involves changing SQL, if we couldn't > support that there then we should strongly consider using other dataset > without `.`, eg. `cars` > And we should update this in API doc (roxygen2 doc string), vignettes, > programming guide, R code example. 
[jira] [Comment Edited] (SPARK-18452) Support History Server UI to use SPNEGO
[ https://issues.apache.org/jira/browse/SPARK-18452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15875400#comment-15875400 ] Shi Wang edited comment on SPARK-18452 at 2/23/17 12:42 AM: Both Spark Thrift Server UI and History server UI could be configured to use SPNEGO, using "spark.ui.filters" and "spark.${filtername}.params". was (Author: wancy): Both Spark Thrift Server UI and History server UI could be configured to use SPNEGO, using "spark.ui.filters" and "spark.${filtername}.params". > Support History Server UI to use SPNEGO > --- > > Key: SPARK-18452 > URL: https://issues.apache.org/jira/browse/SPARK-18452 > Project: Spark > Issue Type: Task >Affects Versions: 2.0.2 >Reporter: Shi Wang > > Currently almost all the hadoop component UI support SPNEGO, HADOOP, HBASE, > OOIZE. > SPARK UI should also support SPNEGO for security concern. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
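The comment names the two properties involved. A hedged sketch of what such a configuration might look like — the filter class below is Hadoop's `AuthenticationFilter` and the parameter string is an assumption about a typical Kerberos deployment, neither is taken from the ticket:

```python
# Hypothetical SPNEGO setup via the two properties named in the comment.
filter_class = "org.apache.hadoop.security.authentication.server.AuthenticationFilter"
conf = {
    "spark.ui.filters": filter_class,
    # "spark.${filtername}.params" carries the filter's init parameters.
    "spark." + filter_class + ".params":
        "type=kerberos,kerberos.principal=HTTP/_HOST@EXAMPLE.COM,"
        "kerberos.keytab=/etc/security/keytabs/spnego.keytab",
}
assert conf["spark.ui.filters"] == filter_class
```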
[jira] [Assigned] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19702: Assignee: Apache Spark > Add Suppress/Revive support to the Mesos Spark Dispatcher > - > > Key: SPARK-19702 > URL: https://issues.apache.org/jira/browse/SPARK-19702 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Michael Gummelt >Assignee: Apache Spark > > Due to the problem described here: > https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos > frameworks concurrently can result in starvation. For example, running 10 > dispatchers could result in 5 of them getting all the offers, even if they > have no jobs to launch. We must implement explicit SUPPRESS and REVIVE calls > in the Spark Dispatcher to solve this problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879602#comment-15879602 ] Apache Spark commented on SPARK-19702: -- User 'mgummelt' has created a pull request for this issue: https://github.com/apache/spark/pull/17031 > Add Suppress/Revive support to the Mesos Spark Dispatcher > - > > Key: SPARK-19702 > URL: https://issues.apache.org/jira/browse/SPARK-19702 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Michael Gummelt > > Due to the problem described here: > https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos > frameworks concurrently can result in starvation. For example, running 10 > dispatchers could result in 5 of them getting all the offers, even if they > have no jobs to launch. We must implement explicit SUPPRESS and REVIVE calls > in the Spark Dispatcher to solve this problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19702: Assignee: (was: Apache Spark) > Add Suppress/Revive support to the Mesos Spark Dispatcher > - > > Key: SPARK-19702 > URL: https://issues.apache.org/jira/browse/SPARK-19702 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Michael Gummelt > > Due to the problem described here: > https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos > frameworks concurrently can result in starvation. For example, running 10 > dispatchers could result in 5 of them getting all the offers, even if they > have no jobs to launch. We must implement explicit SUPPRESS and REVIVE calls > in the Spark Dispatcher to solve this problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory
[ https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879564#comment-15879564 ] Saisai Shao commented on SPARK-19688: - I see, so we should exclude this configuration in checkpoint and make it re-configured after restarted. > Spark on Yarn Credentials File set to different application directory > - > > Key: SPARK-19688 > URL: https://issues.apache.org/jira/browse/SPARK-19688 > Project: Spark > Issue Type: Bug > Components: DStreams, YARN >Affects Versions: 1.6.3 >Reporter: Devaraj Jonnadula >Priority: Minor > > spark.yarn.credentials.file property is set to different application Id > instead of actual Application Id -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
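The fix direction in the comment — exclude per-application properties from the checkpoint so they are re-resolved on restart — can be sketched as (the function and the exact membership of the excluded set are illustrative):

```python
# Properties tied to one application run; stale checkpointed values must not
# survive a restart.
RECONFIGURE_ON_RESTART = {"spark.yarn.credentials.file"}

def restore_conf(checkpointed, fresh):
    """Merge a checkpointed config, dropping keys that must be re-resolved."""
    conf = {k: v for k, v in checkpointed.items() if k not in RECONFIGURE_ON_RESTART}
    conf.update({k: v for k, v in fresh.items() if k in RECONFIGURE_ON_RESTART})
    return conf

old = {"spark.app.name": "job", "spark.yarn.credentials.file": "old-app-dir"}
new = {"spark.yarn.credentials.file": "new-app-dir"}
assert restore_conf(old, new)["spark.yarn.credentials.file"] == "new-app-dir"
```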
[jira] [Updated] (SPARK-19688) Spark on Yarn Credentials File set to different application directory
[ https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-19688: Component/s: DStreams > Spark on Yarn Credentials File set to different application directory > - > > Key: SPARK-19688 > URL: https://issues.apache.org/jira/browse/SPARK-19688 > Project: Spark > Issue Type: Bug > Components: DStreams, YARN >Affects Versions: 1.6.3 >Reporter: Devaraj Jonnadula >Priority: Minor > > spark.yarn.credentials.file property is set to different application Id > instead of actual Application Id -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18454) Changes to improve Nearest Neighbor Search for LSH
[ https://issues.apache.org/jira/browse/SPARK-18454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879539#comment-15879539 ] Mingjie Tang commented on SPARK-18454: -- [~yunn] I left some comments on the document. Building an index over the input data would be very useful if we do not shuffle the input table. > Changes to improve Nearest Neighbor Search for LSH > -- > > Key: SPARK-18454 > URL: https://issues.apache.org/jira/browse/SPARK-18454 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yun Ni > > We all agree to do the following improvement to Multi-Probe NN Search: > (1) Use approxQuantile to get the {{hashDistance}} threshold instead of doing > full sort on the whole dataset > Currently we are still discussing the following: > (1) What {{hashDistance}} (or Probing Sequence) we should use for {{MinHash}} > (2) What are the issues and how we should change the current Nearest Neighbor > implementation -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19652) REST API does not perform user auth for individual apps
[ https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19652. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.2.0 2.1.1 2.0.3 > REST API does not perform user auth for individual apps > --- > > Key: SPARK-19652 > URL: https://issues.apache.org/jira/browse/SPARK-19652 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.0.3, 2.1.1, 2.2.0 > > > (This goes back further than 2.0.0, btw.) > The REST API currently only performs authorization at the root of the UI; > this works for live UIs, but not for the history server, where the root > allows everybody to read data. That means that currently any user can see any > application in the SHS through the REST API, when auth is enabled. > Instead, the REST API should behave like the regular UI and perform > authentication at the app level too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19702) Add Suppress/Revive support to the Mesos Spark Dispatcher
Michael Gummelt created SPARK-19702: --- Summary: Add Suppress/Revive support to the Mesos Spark Dispatcher Key: SPARK-19702 URL: https://issues.apache.org/jira/browse/SPARK-19702 Project: Spark Issue Type: New Feature Components: Mesos Affects Versions: 2.1.0 Reporter: Michael Gummelt Due to the problem described here: https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos frameworks concurrently can result in starvation. For example, running 10 dispatchers could result in 5 of them getting all the offers, even if they have no jobs to launch. We must implement explicit SUPPRESS and REVIVE calls in the Spark Dispatcher to solve this problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19703) Add Suppress/Revive support to the Mesos Spark Driver
Michael Gummelt created SPARK-19703: --- Summary: Add Suppress/Revive support to the Mesos Spark Driver Key: SPARK-19703 URL: https://issues.apache.org/jira/browse/SPARK-19703 Project: Spark Issue Type: New Feature Components: Mesos Affects Versions: 2.1.0 Reporter: Michael Gummelt Due to the problem described here: https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos frameworks concurrently can result in starvation. For example, running 10 jobs could result in 5 of them getting all the offers, even after they've launched all their executors. This leads to starvation of the other jobs. We must implement explicit SUPPRESS and REVIVE calls in the Spark Dispatcher to solve this problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19701) the `in` operator in pyspark is broken
[ https://issues.apache.org/jira/browse/SPARK-19701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19701: Description: {code} >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md") >>> linesWithSpark = textFile.filter("Spark" in textFile.value) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, in __nonzero__ raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', " ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions. {code} > the `in` operator in pyspark is broken > -- > > Key: SPARK-19701 > URL: https://issues.apache.org/jira/browse/SPARK-19701 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Wenchen Fan > > {code} > >>> textFile = spark.read.text("/Users/cloud/dev/spark/README.md") > >>> linesWithSpark = textFile.filter("Spark" in textFile.value) > Traceback (most recent call last): > File "<stdin>", line 1, in <module> > File "/Users/cloud/product/spark/python/pyspark/sql/column.py", line 426, > in __nonzero__ > raise ValueError("Cannot convert column into bool: please use '&' for > 'and', '|' for 'or', " > ValueError: Cannot convert column into bool: please use '&' for 'and', '|' > for 'or', '~' for 'not' when building DataFrame boolean expressions. > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
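The root cause is that Python's `in` operator always coerces the result of `__contains__` to a boolean, so even a `__contains__` that returns a Column expression ends up going through `__nonzero__`/`__bool__`. A stdlib-only model of that behavior (`FakeColumn` is ours, not pyspark's class):

```python
class FakeColumn:
    """Toy stand-in for pyspark.sql.Column to model the failure above."""
    def __contains__(self, item):
        # A real Column would build a new column expression here, not a bool.
        return FakeColumn()
    def __bool__(self):  # __nonzero__ on Python 2
        raise ValueError("Cannot convert column into bool")

# `in` forces the __contains__ result through __bool__, which raises --
# the same ValueError the traceback above shows.
try:
    "Spark" in FakeColumn()
    raised = False
except ValueError:
    raised = True
assert raised
```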
[jira] [Created] (SPARK-19701) the `in` operator in pyspark is broken
Wenchen Fan created SPARK-19701: --- Summary: the `in` operator in pyspark is broken Key: SPARK-19701 URL: https://issues.apache.org/jira/browse/SPARK-19701 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19459) ORC tables cannot be read when they contain char/varchar columns
[ https://issues.apache.org/jira/browse/SPARK-19459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879441#comment-15879441 ] Apache Spark commented on SPARK-19459: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17030 > ORC tables cannot be read when they contain char/varchar columns > > > Key: SPARK-19459 > URL: https://issues.apache.org/jira/browse/SPARK-19459 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > Fix For: 2.2.0 > > > Reading from an ORC table which contains char/varchar columns can fail if the > table has been created using Spark. This is caused by the fact that spark > internally replaces char and varchar columns with a string column, this > causes the ORC reader to use the wrong reader, and that eventually causes a > ClassCastException. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16617) Upgrade to Avro 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-16617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879431#comment-15879431 ] Michael Heuer commented on SPARK-16617: --- Any thoughts as to what the Fix Version/s for this should be? >From what I can see Apache Spark git HEAD has already bumped to parquet >version 1.8.2, and this may force the issue, since parquet 1.8.2 calls the new >method Schema.getLogicalType not present in 1.7.x versions of avro. > Upgrade to Avro 1.8.x > - > > Key: SPARK-16617 > URL: https://issues.apache.org/jira/browse/SPARK-16617 > Project: Spark > Issue Type: Improvement >Reporter: Ben McCann > > Avro 1.8 makes Avro objects serializable so that you can easily have an RDD > containing Avro objects. > See https://issues.apache.org/jira/browse/AVRO-1502 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879428#comment-15879428 ] Nicolas Drizard commented on SPARK-12664: - I upvote for it as an important feature, is anyone currently working on it? [~yanboliang]? Thanks! > Expose raw prediction scores in MultilayerPerceptronClassificationModel > --- > > Key: SPARK-12664 > URL: https://issues.apache.org/jira/browse/SPARK-12664 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Robert Dodier >Assignee: Yanbo Liang > > In > org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, > there isn't any way to get raw prediction scores; only an integer output > (from 0 to #classes - 1) is available via the `predict` method. > `mplModel.predict` is called within the class to get the raw score, but > `mlpModel` is private so that isn't available to outside callers. > The raw score is useful when the user wants to interpret the classifier > output as a probability. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19644) Memory leak in Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-19644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879397#comment-15879397 ] Shixiong Zhu commented on SPARK-19644: -- [~deenbandhu] Do you use Scala 2.10 or Scala 2.11? > Memory leak in Spark Streaming > -- > > Key: SPARK-19644 > URL: https://issues.apache.org/jira/browse/SPARK-19644 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.0.2 > Environment: 3 AWS EC2 c3.xLarge > Number of cores - 3 > Number of executors 3 > Memory to each executor 2GB >Reporter: Deenbandhu Agarwal >Priority: Critical > Labels: memory_leak, performance > Attachments: Dominator_tree.png, heapdump.png, Path2GCRoot.png > > > I am using streaming on the production for some aggregation and fetching data > from cassandra and saving data back to cassandra. > I see a gradual increase in old generation heap capacity from 1161216 Bytes > to 1397760 Bytes over a period of six hours. > After 50 hours of processing instances of class > scala.collection.immutable.$colon$colon increased to 12,811,793 which is a > huge number. > I think this is a clear case of memory leak -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19554) YARN backend should use history server URL for tracking when UI is disabled
[ https://issues.apache.org/jira/browse/SPARK-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19554. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.2.0 > YARN backend should use history server URL for tracking when UI is disabled > --- > > Key: SPARK-19554 > URL: https://issues.apache.org/jira/browse/SPARK-19554 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.2.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin >Priority: Minor > Fix For: 2.2.0 > > > Currently, if the app has disabled its UI, Spark does not set a tracking URL > in YARN. The UI is still available, even if with a lag, in the history > server, if it's configured. We should use that as the tracking URL in these > cases, instead of letting YARN show its default page for applications without > a UI. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile
[ https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879387#comment-15879387 ] Timothy Hunter commented on SPARK-19573: I do not have too strong an opinion, as long as: 1. we are consistent within Spark, or 2. we follow the numerical standard (IEEE-754). I am not sure what the standard is for SQL, though. > Make NaN/null handling consistent in approxQuantile > --- > > Key: SPARK-19573 > URL: https://issues.apache.org/jira/browse/SPARK-19573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: zhengruifeng > > As discussed in https://github.com/apache/spark/pull/16776, this jira is used > to track the following issue: > The multi-column version of approxQuantile drops rows containing *any* > NaN/null, so its results are not consistent with the output of the single-column version. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
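The inconsistency tracked in SPARK-19573 can be illustrated without Spark. Below is a minimal plain-Python sketch (hypothetical sample data; this is not Spark's implementation) contrasting per-column NaN filtering, as the single-column path behaves, with drop-any-row filtering, as the multi-column path behaves:

```python
import math

# Hypothetical two-column dataset with NaNs in different rows.
rows = [(1.0, 10.0), (2.0, math.nan), (math.nan, 30.0), (4.0, 40.0)]

def median(xs):
    """Exact median, standing in for an approximate quantile at p=0.5."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

# Single-column behavior: each column drops its own NaNs independently.
col_a = [a for a, _ in rows if not math.isnan(a)]
col_b = [b for _, b in rows if not math.isnan(b)]
per_column = (median(col_a), median(col_b))

# Multi-column behavior described in the ticket: drop rows with *any* NaN.
clean = [(a, b) for a, b in rows if not (math.isnan(a) or math.isnan(b))]
per_row = (median([a for a, _ in clean]), median([b for _, b in clean]))

print(per_column)  # (2.0, 30.0) - medians over 3 surviving values per column
print(per_row)     # (2.5, 25.0) - medians over only the 2 fully clean rows
```

The two strategies disagree on the same data, which is the consistency problem the ticket asks to resolve.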
[jira] [Commented] (SPARK-19652) REST API does not perform user auth for individual apps
[ https://issues.apache.org/jira/browse/SPARK-19652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879283#comment-15879283 ] Apache Spark commented on SPARK-19652: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17029 > REST API does not perform user auth for individual apps > --- > > Key: SPARK-19652 > URL: https://issues.apache.org/jira/browse/SPARK-19652 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.0.0, 2.1.0 >Reporter: Marcelo Vanzin > > (This goes back further than 2.0.0, btw.) > The REST API currently only performs authorization at the root of the UI; > this works for live UIs, but not for the history server, where the root > allows everybody to read data. That means that currently any user can see any > application in the SHS through the REST API, when auth is enabled. > Instead, the REST API should behave like the regular UI and perform > authentication at the app level too. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19648) Unable to access column containing '.' for approxQuantile function on DataFrame
[ https://issues.apache.org/jira/browse/SPARK-19648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Compitello updated SPARK-19648: Affects Version/s: 2.1.0 > Unable to access column containing '.' for approxQuantile function on > DataFrame > --- > > Key: SPARK-19648 > URL: https://issues.apache.org/jira/browse/SPARK-19648 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.2, 2.1.0 > Environment: Running spark in an ipython prompt on Mac OSX. >Reporter: John Compitello > > It seems that the approxQuantile method does not offer any way to access a > column with a period in its name. I am aware of the backtick solution, but > it does not work in this scenario. > For example, let's say I have a column named 'va.x'. Passing approxQuantile > this string without backticks results in the following error: > 'Cannot resolve column name '`va.x`' given input columns: ' > Note that backticks seem to have been automatically inserted, but the column > name still cannot be found. > If I do include backticks, I get a different error. An > IllegalArgumentException is thrown as follows: > "IllegalArgumentException: 'Field "`va.x`" does not exist." -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18278) Support native submission of spark jobs to a kubernetes cluster
[ https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879205#comment-15879205 ] Matt Cheah commented on SPARK-18278: [~hkothari] I created SPARK-19700 to track the pluggable scheduler API design. > Support native submission of spark jobs to a kubernetes cluster > --- > > Key: SPARK-18278 > URL: https://issues.apache.org/jira/browse/SPARK-18278 > Project: Spark > Issue Type: Umbrella > Components: Build, Deploy, Documentation, Scheduler, Spark Core >Reporter: Erik Erlandson > Attachments: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf > > > A new Apache Spark sub-project that enables native support for submitting > Spark applications to a kubernetes cluster. The submitted application runs > in a driver executing on a kubernetes pod, and executor lifecycles are also > managed as pods. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19700) Design an API for pluggable scheduler implementations
Matt Cheah created SPARK-19700: -- Summary: Design an API for pluggable scheduler implementations Key: SPARK-19700 URL: https://issues.apache.org/jira/browse/SPARK-19700 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Matt Cheah One point that was brought up in discussing SPARK-18278 was that schedulers cannot easily be added to Spark without forking the whole project. The main reason is that much of the scheduler's behavior fundamentally depends on the CoarseGrainedSchedulerBackend class, which is not part of the public API of Spark and is in fact quite a complex module. As resource management and allocation continue to evolve, Spark will need to be integrated with more cluster managers, but maintaining support for all possible allocators in the Spark project would be untenable. Furthermore, it would be impossible for Spark to support proprietary frameworks that are developed by specific users for their own particular use cases. Therefore, this ticket proposes making scheduler implementations fully pluggable. The idea is that Spark will provide a Java/Scala interface that is to be implemented by a scheduler that is backed by the cluster manager of interest. The user can compile their scheduler's code into a JAR that is placed on the driver's classpath. Finally, as is the case in the current world, the scheduler implementation is selected and dynamically loaded depending on the user's provided master URL. Determining the correct API is the most challenging problem. The current CoarseGrainedSchedulerBackend handles many responsibilities, some of which will be common across all cluster managers, and some of which will be specific to a particular cluster manager. For example, the particular mechanism for creating the executor processes will differ between YARN and Mesos, but, once these executors have started running, the means to submit tasks to them over the Netty RPC is identical across the board. 
We must also consider a plugin model and interface for submitting the application, because different cluster managers support different configuration options, and thus the driver must be bootstrapped accordingly. For example, in YARN mode the application and Hadoop configuration must be packaged and shipped to the distributed cache prior to launching the job. A prototype of a Kubernetes implementation starts a Kubernetes pod that runs the driver in cluster mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
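As a rough illustration of the proposal in SPARK-19700, the sketch below models selecting a scheduler backend from the master URL's scheme via a plugin registry. All names, the interface shape, and the registry mechanism are invented for illustration in plain Python; this is not Spark's actual API:

```python
# Hypothetical sketch: a pluggable scheduler-backend registry keyed by the
# scheme of the user-provided master URL. Illustrative names only.
from abc import ABC, abstractmethod
from urllib.parse import urlparse

class SchedulerBackend(ABC):
    """Common surface a cluster-manager plugin would implement."""
    @abstractmethod
    def start(self) -> str: ...
    @abstractmethod
    def request_executors(self, count: int) -> int: ...

class LocalBackend(SchedulerBackend):
    """Trivial stand-in for a real cluster-manager implementation."""
    def start(self) -> str:
        return "local backend started"
    def request_executors(self, count: int) -> int:
        return count  # pretend all requested executors were granted

_REGISTRY = {}

def register_backend(scheme, factory):
    """A plugin JAR on the classpath would register itself like this."""
    _REGISTRY[scheme] = factory

def backend_for(master_url):
    """Dynamically choose the implementation from the master URL scheme."""
    scheme = urlparse(master_url).scheme or master_url
    try:
        return _REGISTRY[scheme]()
    except KeyError:
        raise ValueError(f"no scheduler plugin registered for {master_url!r}")

register_backend("local", LocalBackend)

backend = backend_for("local://host:7077")
print(backend.start())               # local backend started
print(backend.request_executors(3))  # 3
```

The design point this models is that the core only depends on the abstract interface; each cluster manager's specifics (process creation, bootstrapping) live behind its own registered implementation.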
[jira] [Resolved] (SPARK-19666) Exception when calling createDataFrame with typed RDD
[ https://issues.apache.org/jira/browse/SPARK-19666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19666. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17013 [https://github.com/apache/spark/pull/17013] > Exception when calling createDataFrame with typed RDD > - > > Key: SPARK-19666 > URL: https://issues.apache.org/jira/browse/SPARK-19666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Colin Breame > Fix For: 2.2.0 > > > The following code: > {code} > var tmp = sc.parallelize(Seq(new __Message())) > val spark = SparkSession.builder().getOrCreate() > var df = spark.createDataFrame(tmp, classOf[__Message]) > {code} > Produces this error message. > {code} > Exception in thread "main" java.lang.NullPointerException > at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
[jira] [Assigned] (SPARK-19666) Exception when calling createDataFrame with typed RDD
[ https://issues.apache.org/jira/browse/SPARK-19666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-19666: --- Assignee: Hyukjin Kwon > Exception when calling createDataFrame with typed RDD > - > > Key: SPARK-19666 > URL: https://issues.apache.org/jira/browse/SPARK-19666 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Colin Breame >Assignee: Hyukjin Kwon > Fix For: 2.2.0 > > > The following code: > {code} > var tmp = sc.parallelize(Seq(new __Message())) > val spark = SparkSession.builder().getOrCreate() > var df = spark.createDataFrame(tmp, classOf[__Message]) > {code} > Produces this error message. > {code} > Exception in thread "main" java.lang.NullPointerException > at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at > org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127) > at > org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at >
[jira] [Commented] (SPARK-19688) Spark on Yarn Credentials File set to different application directory
[ https://issues.apache.org/jira/browse/SPARK-19688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879175#comment-15879175 ] Devaraj Jonnadula commented on SPARK-19688: --- When the Spark application is restarted, spark.yarn.credentials.file is set to hdfs://node/user/*/.sparkStaging/application_someotherApplicationId/credentials-d8c33609-72f9-4770-9e50-aab848424e62 This is a streaming application with checkpointing enabled. > Spark on Yarn Credentials File set to different application directory > - > > Key: SPARK-19688 > URL: https://issues.apache.org/jira/browse/SPARK-19688 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.6.3 >Reporter: Devaraj Jonnadula >Priority: Minor > > The spark.yarn.credentials.file property is set to a different application ID > instead of the actual application ID -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19699) createOrReplaceTable does not always replace an existing table of the same name
Barry Becker created SPARK-19699: Summary: createOrReplaceTable does not always replace an existing table of the same name Key: SPARK-19699 URL: https://issues.apache.org/jira/browse/SPARK-19699 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.1.0 Reporter: Barry Becker Priority: Minor There are cases when dataframe.createOrReplaceTempView does not replace an existing table with the same name. Please also refer to my [related stack-overflow post|http://stackoverflow.com/questions/42371690/in-spark-2-1-how-come-the-dataframe-createoreplacetemptable-does-not-replace-an]. To reproduce, do {code} df.collect() df.createOrReplaceTempView("foo1") df.sqlContext.cacheTable("foo1") {code} with one dataframe, and then do exactly the same thing with a different dataframe. Then look in the storage tab in the Spark UI and see multiple entries for "foo1" in the "RDD Name" column. Maybe I am misunderstanding, but this causes two apparent problems: 1) how do you know which table will be retrieved with sqlContext.table("foo1")? 2) the duplicate entries represent a memory leak. I have tried calling dropTempTable(existingName) first, but then have occasionally seen a FAILFAST error when trying to use the table. It's as if dropTempTable is not synchronous, but maybe I am doing something wrong. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19616) weightCol and aggregationDepth should be improved for some SparkR APIs
[ https://issues.apache.org/jira/browse/SPARK-19616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung resolved SPARK-19616. -- Resolution: Fixed Assignee: Miao Wang Fix Version/s: 2.2.0 Target Version/s: 2.2.0 > weightCol and aggregationDepth should be improved for some SparkR APIs > --- > > Key: SPARK-19616 > URL: https://issues.apache.org/jira/browse/SPARK-19616 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.1.0, 2.2.0 >Reporter: Miao Wang >Assignee: Miao Wang >Priority: Minor > Fix For: 2.2.0 > > > When doing SPARK-19456, we found that "" should be considered a NULL column > name and should not be set. aggregationDepth should be exposed as an expert > parameter. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879068#comment-15879068 ] Jisoo Kim commented on SPARK-19698: --- I don't think it's only limited to Mesos coarse-grained executor. https://github.com/metamx/spark/pull/25 this might be a solution, and we're doing more investigating/testing. > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. 
> What this means is that if a task is manipulating some state somewhere (for > example: an upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shut down the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879068#comment-15879068 ] Jisoo Kim edited comment on SPARK-19698 at 2/22/17 7:43 PM: I don't think it's only limited to Mesos coarse-grained executor. https://github.com/metamx/spark/pull/25 might be a solution, and we're doing more investigating/testing. was (Author: jisookim0...@gmail.com): I don't think it's only limited to Mesos coarse-grained executor. https://github.com/metamx/spark/pull/25 this might be a solution, and we're doing more investigating/testing. > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. 
> What this means is that if a task is manipulating some state somewhere (for > example: a upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shutdown the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878959#comment-15878959 ] Charles Allen edited comment on SPARK-19698 at 2/22/17 6:46 PM: I *think* this is due to the driver not having the concept of a "critical section" for code being executed, meaning that you can't declare a portion of the code being run as "I'm in a non-atomic or critical command region, please let me finish" was (Author: drcrallen): I *think* this is due to the driver not having the concept of a "critical section" for code being executed, meaning that you can't declare a portion of the code being run as "I'm in a non-idempotent command region, please let me finish" > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. 
This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. > What this means is that if a task is manipulating some state somewhere (for > example: a upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shutdown the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878959#comment-15878959 ] Charles Allen commented on SPARK-19698: --- I *think* this is due to the driver not having the concept of a "critical section" for code being executed, meaning that you can't declare a portion of the code being run as "I'm in a non-idempotent command region, please let me finish" > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. 
> What this means is that if a task is manipulating some state somewhere (for > example: a upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shutdown the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Allen updated SPARK-19698: -- Summary: Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes (was: Race condition in stale attempt task completion vs current attempt task completion)
[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878927#comment-15878927 ] Charles Allen commented on SPARK-19698: --- [~jisookim0...@gmail.com] has been investigating this on our side.
[jira] [Created] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion
Charles Allen created SPARK-19698: - Summary: Race condition in stale attempt task completion vs current attempt task completion Key: SPARK-19698 URL: https://issues.apache.org/jira/browse/SPARK-19698 Project: Spark Issue Type: Bug Components: Mesos, Spark Core Affects Versions: 2.0.0 Reporter: Charles Allen
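The report pins the gap on the DAGScheduler tracking completion per task rather than per attempt. The sketch below is hypothetical bookkeeping, not the actual DAGScheduler code, showing how recording the attempt alongside the completion lets a scheduler ignore a success arriving from a stale attempt:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stage bookkeeping: completions are only accepted from the
// current stage attempt, so a stale attempt finishing late cannot mark a
// partition done while the current attempt is still mid-write.
public class StageBookkeeping {
    private final Map<Integer, Integer> completedBy = new HashMap<>(); // partition -> attempt
    private int currentAttempt = 0;

    void newAttempt() { currentAttempt++; }

    // Returns true if the completion came from the current attempt and was recorded.
    boolean onTaskSuccess(int partition, int attempt) {
        if (attempt != currentAttempt) return false;   // stale attempt: ignore
        completedBy.put(partition, attempt);
        return true;
    }

    public static void main(String[] args) {
        StageBookkeeping stage = new StageBookkeeping();
        stage.newAttempt();                            // attempt 1 launches
        stage.newAttempt();                            // a task failure triggers attempt 2
        System.out.println(stage.onTaskSuccess(0, 1)); // success from stale attempt 1
        System.out.println(stage.onTaskSuccess(0, 2)); // success from current attempt 2
    }
}
```

Tracking only the partition (as the linked DAGScheduler line does) corresponds to dropping the `attempt != currentAttempt` check, which is exactly the race the issue describes.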
[jira] [Commented] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
[ https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878884#comment-15878884 ] Michael Heuer commented on SPARK-19697: --- Sorry about all the description edits. Thank you for linking this duplicate issue to a parent issue. > NoSuchMethodError: org.apache.avro.Schema.getLogicalType() > -- > > Key: SPARK-19697 > URL: https://issues.apache.org/jira/browse/SPARK-19697 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core > Affects Versions: 2.1.0 > Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_60 > Reporter: Michael Heuer > > In a downstream project (https://github.com/bigdatagenomics/adam), adding a dependency on parquet-avro version 1.8.2 results in a NoSuchMethodError at runtime on various Spark versions, including 2.1.0. > pom.xml: > {code:xml} > > 1.8 > 1.8.1 > 2.11.8 > 2.11 > 2.1.0 > 1.8.2 > > > > > org.apache.parquet > parquet-avro > ${parquet.version} > > {code} > Example using spark-submit (called via adam-submit below): > {code} > $ ./bin/adam-submit vcf2adam \ > adam-core/src/test/resources/small.vcf \ > small.adam > ... 
> java.lang.NoSuchMethodError: > org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType; > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178) > at > org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) > at > org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152) > at > org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) > at > org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) > at > org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115) > at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283) > at > org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > The issue can be reproduced from this pull request > https://github.com/bigdatagenomics/adam/pull/1360 > and is reported as Jenkins CI test failures, e.g. > https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810 > d...@spark.apache.org mailing list archive thread > http://apache-spark-developers-list.1001551.n3.nabble.com/Re-VOTE-Release-Apache-Parquet-1-8-2-RC1-tp20711p20720.html
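A NoSuchMethodError like the one above almost always means an older copy of the class sits earlier on the runtime classpath than the version the code was compiled against. A small self-contained diagnostic (a hypothetical helper, not part of Spark or ADAM) can report where a class was loaded from and whether the expected method exists:

```java
import java.security.CodeSource;

// Diagnostic sketch: report which jar a class was loaded from and whether a
// given no-arg method is present. The class/method names below match the
// report (org.apache.avro.Schema#getLogicalType); substitute any others.
public class WhereFrom {
    static void locate(String className, String methodName) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            System.out.println(className + " loaded from "
                + (src == null ? "the bootstrap classpath" : src.getLocation()));
            try {
                c.getMethod(methodName);
                System.out.println(methodName + "() is present");
            } catch (NoSuchMethodException e) {
                System.out.println(methodName + "() is MISSING -- an older version is on the classpath");
            }
        } catch (ClassNotFoundException e) {
            System.out.println(className + " is not on the classpath");
        }
    }

    public static void main(String[] args) {
        locate("org.apache.avro.Schema", "getLogicalType");
    }
}
```

Run inside the Spark application (e.g. from the spark-submit job itself) to see whether Spark's bundled Avro, which predates getLogicalType, is shadowing the 1.8.x jar the build requested; `mvn dependency:tree` shows the compile-time view of the same conflict.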
[jira] [Resolved] (SPARK-17280) Flaky test: org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite and JavaDirectKafkaStreamSuite.testKafkaStream
[ https://issues.apache.org/jira/browse/SPARK-17280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Armin Braun resolved SPARK-17280. - Resolution: Fixed Closing this; I can't find any recent examples of this on Jenkins and haven't experienced it locally lately either. I also tried reproducing it by running 1k+ loops of all the Kafka0.10_2.11 tests with 3 forks in parallel, without issues. > Flaky test: org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite and > JavaDirectKafkaStreamSuite.testKafkaStream > > > Key: SPARK-17280 > URL: https://issues.apache.org/jira/browse/SPARK-17280 > Project: Spark > Issue Type: Bug > Components: DStreams, Tests > Reporter: Yin Huai > > https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.2/1793 > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/1793/ > {code} > org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream > Error Message > assertion failed: Partition [topic1, 0] metadata not propagated after timeout > Stacktrace > java.util.concurrent.TimeoutException: assertion failed: Partition [topic1, 0] metadata not propagated after timeout > at > org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.createTopicAndSendData(JavaDirectKafkaStreamSuite.java:176) > at > org.apache.spark.streaming.kafka010.JavaDirectKafkaStreamSuite.testKafkaStream(JavaDirectKafkaStreamSuite.java:74) > {code} > {code} > org.apache.spark.streaming.kafka010.JavaKafkaRDDSuite.testKafkaRDD > Error Message > Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-java-test-consumer--363965267-1472280538438 topic2 0 0 after > polling for 512 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > 
org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > Stacktrace > org.apache.spark.SparkException: > Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most > recent failure: Lost task 1.0 in stage 0.0 (TID 1, localhost): > java.lang.AssertionError: assertion failed: Failed to get records for > spark-executor-java-test-consumer--363965267-1472280538438 topic2 0 0 after > polling for 512 > at scala.Predef$.assert(Predef.scala:170) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:74) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) > at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at 
org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1910) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
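The second failure's "after polling for 512" comes from a bounded poll loop: the consumer retries until records for the requested offset arrive or an overall timeout (here 512 ms) expires. The sketch below is a generic illustration of that pattern, not the CachedKafkaConsumer implementation:

```java
import java.util.function.Supplier;

// Generic bounded-poll sketch: retry a source until it yields a result or a
// deadline passes, then fail with an assertion like the one in the report.
public class BoundedPoll {
    static <T> T pollUntil(Supplier<T> poll, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            T result = poll.get();
            if (result != null) return result;      // data arrived
            try {
                Thread.sleep(10);                    // back off briefly before re-polling
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        throw new AssertionError("Failed to get records after polling for " + timeoutMs);
    }

    public static void main(String[] args) {
        // A source that succeeds on the third poll, standing in for a slow broker.
        int[] calls = {0};
        String result = pollUntil(() -> ++calls[0] >= 3 ? "records" : null, 1000);
        System.out.println(result);
    }
}
```

Flaky-environment fixes for tests with this shape usually raise the poll deadline (e.g. the spark.streaming.kafka.consumer.poll.ms setting in the Kafka 0.10 integration) rather than change the loop itself.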
[jira] [Resolved] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
[ https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19697. --- Resolution: Duplicate Yes, as you note, it's a version mismatch; Spark doesn't use Avro 1.8. > NoSuchMethodError: org.apache.avro.Schema.getLogicalType() > -- > > Key: SPARK-19697 > URL: https://issues.apache.org/jira/browse/SPARK-19697 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core > Affects Versions: 2.1.0 > Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_60 > Reporter: Michael Heuer
[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
[ https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Heuer updated SPARK-19697: -- Description: (minor edits to the description quoted in full above)
[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
[ https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Heuer updated SPARK-19697: -- Description: In a downstream project (https://github.com/bigdatagenomics/adam), adding a dependency on parquet-avro version 1.8.2 results in NoSuchMethodExceptions at runtime on various Spark versions, including 2.1.0. pom.xml: {code:xml} 1.8 1.8.1 2.11.8 2.11 2.1.0 1.8.2 org.apache.parquet parquet-avro ${parquet.version} {code} Example using `spark-submit` (called via `adam-submit` below): {code} $ ./bin/adam-submit vcf2adam \ adam-core/src/test/resources/small.vcf \ small.adam ... java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType; at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178) at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152) at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115) at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117) at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283) at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} The issue can be reproduced from this pull request https://github.com/bigdatagenomics/adam/pull/1360 and is reported as Jenkins CI test failures https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810 was: In a downstream project (https://github.com/bigdatagenomics/adam), adding a dependency on {{parquet-avro}} version 1.8.2 results in {{NoSuchMethodException}}s at runtime on various Spark versions, including 2.1.0. pom.xml: {code:xml} 1.8 1.8.1 2.11.8 2.11 2.1.0 1.8.2 org.apache.parquet parquet-avro ${parquet.version} {code} Example using `spark-submit` (called via `adam-submit` below) {code} $ ./bin/adam-submit vcf2adam \ adam-core/src/test/resources/small.vcf \ small.adam ... 
java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType; at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178) at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152) at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) at
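Editor's note: {{Schema.getLogicalType()}} first appeared in Avro 1.8.0, so this {{NoSuchMethodError}} typically means an older Avro jar (e.g. a 1.7.x pulled in transitively by the Spark/Hadoop distribution) wins on the runtime classpath over the 1.8.x that parquet-avro 1.8.2 compiles against. A hedged workaround sketch, not taken from this ticket (the 1.8.1 version matches the avro.version property shown in the reporter's pom, but treat the coordinates as illustrative), is to pin Avro explicitly in the downstream pom:

```xml
<!-- Hypothetical sketch: force a single Avro version so parquet-avro 1.8.2
     never sees a pre-1.8 org.apache.avro.Schema at runtime.
     Version number illustrative; align it with your parquet-avro build. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.8.1</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

Pinning alone may not help when Spark's own distribution jars shade or precede the application's; in that case shading Avro in the application jar, or experimenting with Spark's `spark.driver.userClassPathFirst` / `spark.executor.userClassPathFirst` settings, are the usual (version-dependent) alternatives.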
[jira] [Updated] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
[ https://issues.apache.org/jira/browse/SPARK-19697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Heuer updated SPARK-19697: -- Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_60 (was: {{ $ spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_60 Branch Compiled by user jenkins on 2016-12-16T02:04:48Z Revision Url Type --help for more information. }} ) > NoSuchMethodError: org.apache.avro.Schema.getLogicalType() > -- > > Key: SPARK-19697 > URL: https://issues.apache.org/jira/browse/SPARK-19697 > Project: Spark > Issue Type: Bug > Components: Build, Spark Core >Affects Versions: 2.1.0 > Environment: Apache Spark 2.1.0, Scala version 2.11.8, Java > HotSpot(TM) 64-Bit Server VM, 1.8.0_60 >Reporter: Michael Heuer > > In a downstream project (https://github.com/bigdatagenomics/adam), adding a > dependency on `parquet-avro` version 1.8.2 results in > `NoSuchMethodException`s at runtime on various Spark versions, including > 2.1.0. > pom.xml: > {{ > > > > org.apache.parquet > parquet-avro > ${parquet.version} > > > org.apache.parquet > > parquet-scala_2.10 > ${parquet.version} > > > org.scala-lang > scala-library > > > > }} > Example using `spark-submit` (called via `adam-submit` below) > {{ > $ ./bin/adam-submit vcf2adam \ > adam-core/src/test/resources/small.vcf \ > small.adam > ... 
> java.lang.NoSuchMethodError: > org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType; > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178) > at > org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) > at > org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152) > at > org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) > at > org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) > at > org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) > at > org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115) > at > org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311) > at > org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283) > at > org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119) > at > org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102) > at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > }} > The issue can be reproduced from this pull request > https://github.com/bigdatagenomics/adam/pull/1360 > and is reported as Jenkins CI test failures > https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19697) NoSuchMethodError: org.apache.avro.Schema.getLogicalType()
Michael Heuer created SPARK-19697: - Summary: NoSuchMethodError: org.apache.avro.Schema.getLogicalType() Key: SPARK-19697 URL: https://issues.apache.org/jira/browse/SPARK-19697 Project: Spark Issue Type: Bug Components: Build, Spark Core Affects Versions: 2.1.0 Environment: {{ $ spark-submit --version Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.1.0 /_/ Using Scala version 2.11.8, Java HotSpot(TM) 64-Bit Server VM, 1.8.0_60 Branch Compiled by user jenkins on 2016-12-16T02:04:48Z Revision Url Type --help for more information. }} Reporter: Michael Heuer In a downstream project (https://github.com/bigdatagenomics/adam), adding a dependency on `parquet-avro` version 1.8.2 results in `NoSuchMethodException`s at runtime on various Spark versions, including 2.1.0. pom.xml: {{ org.apache.parquet parquet-avro ${parquet.version} org.apache.parquet parquet-scala_2.10 ${parquet.version} org.scala-lang scala-library }} Example using `spark-submit` (called via `adam-submit` below) {{ $ ./bin/adam-submit vcf2adam \ adam-core/src/test/resources/small.vcf \ small.adam ... 
java.lang.NoSuchMethodError: org.apache.avro.Schema.getLogicalType()Lorg/apache/avro/LogicalType; at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:178) at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:152) at org.apache.parquet.avro.AvroSchemaConverter.convertUnion(AvroSchemaConverter.java:214) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:171) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:130) at org.apache.parquet.avro.AvroSchemaConverter.convertField(AvroSchemaConverter.java:227) at org.apache.parquet.avro.AvroSchemaConverter.convertFields(AvroSchemaConverter.java:124) at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:115) at org.apache.parquet.avro.AvroWriteSupport.init(AvroWriteSupport.java:117) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:311) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:283) at org.apache.spark.rdd.InstrumentedOutputFormat.getRecordWriter(InstrumentedOutputFormat.scala:35) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119) at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
org.apache.spark.scheduler.Task.run(Task.scala:99) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) }} The issue can be reproduced from this pull request https://github.com/bigdatagenomics/adam/pull/1360 and is reported as Jenkins CI test failures https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1810 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19405) Add support to KinesisUtils for cross-account Kinesis reads via STS
[ https://issues.apache.org/jira/browse/SPARK-19405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-19405. - Resolution: Fixed Assignee: Adam Budde Fix Version/s: 2.2.0 Resolved with: https://github.com/apache/spark/pull/16744 > Add support to KinesisUtils for cross-account Kinesis reads via STS > --- > > Key: SPARK-19405 > URL: https://issues.apache.org/jira/browse/SPARK-19405 > Project: Spark > Issue Type: Improvement > Components: DStreams >Reporter: Adam Budde >Assignee: Adam Budde >Priority: Minor > Fix For: 2.2.0 > > > h1. Summary > Enable KinesisReceiver to utilize STSAssumeRoleSessionCredentialsProvider > when setting up the Kinesis Client Library in order to enable secure > cross-account Kinesis stream reads managed by AWS Simple Token Service (STS) > h1. Details > Spark's KinesisReceiver implementation utilizes the Kinesis Client Library in > order to allow users to write Spark Streaming jobs that operate on Kinesis > data. The KCL uses a few AWS services under the hood in order to provide > checkpointed, load-balanced processing of the underlying data in a Kinesis > stream. Running the KCL requires permissions to be set up for the following > AWS resources. > * AWS Kinesis for reading stream data > * AWS DynamoDB for storing KCL shared state in tables > * AWS CloudWatch for logging KCL metrics > The KinesisUtils.createStream() API allows users to authenticate to these > services either by specifying an explicit AWS access key/secret key > credential pair or by using the default credential provider chain. This > supports authorizing to the three AWS services using either an AWS keypair > (either provided explicitly or parsed from environment variables, etc.): > !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/KeypairOnly.png! 
> Or the IAM instance profile (when running on EC2): > !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/InstanceProfileOnly.png! > AWS users often need to access resources across separate accounts. This could > be done in order to consume data produced by another organization or from a > service running in another account for resource isolation purposes. AWS > Simple Token Service (STS) provides a secure way to authorize cross-account > resource access by using temporary sessions to assume an IAM role in the > AWS account with the resources being accessed. > The [IAM > documentation|http://docs.aws.amazon.com/IAM/latest/UserGuide/tutorial_cross-account-with-roles.html] > covers the specifics of how cross account IAM role assumption works in much > greater detail, but if an actor in account A wanted to read from a Kinesis > stream in account B the general steps required would look something like this: > * An IAM role is added to account B with read permissions for the Kinesis > stream > ** Trust policy is configured to allow account A to assume the role > * Actor in account A uses its own long-lived credentials to tell STS to > assume the role in account B > * STS returns temporary credentials with permission to read from the stream > in account B > Applied to KinesisReceiver and the KCL, we could use a keypair as our > long-lived credentials to authenticate to STS and assume an external role > with the necessary KCL permissions: > !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSKeypair.png! > Or the instance profile as long-lived credentials: > !https://raw.githubusercontent.com/budde/budde_asf_jira_images/master/spark/kinesis_sts_support/STSInstanceProfile.png!
> The STSAssumeRoleSessionCredentialsProvider implementation of the > AWSCredentialsProviderChain interface from the AWS SDK abstracts all of the > management of the temporary session credentials away from the user. > STSAssumeRoleSessionCredentialsProvider simply needs the ARN of the AWS role > to be assumed, a session name for STS labeling purposes, an optional session > external ID and long-lived credentials to use for authenticating with the STS > service itself. > Supporting cross-account Kinesis access via STS requires supplying the > following additional configuration parameters: > * ARN of IAM role to assume in external account > * A name to apply to the STS session > * (optional) An IAM external ID to validate the assumed role against > The STSAssumeRoleSessionCredentialsProvider implementation of the > AWSCredentialsProvider interface takes these parameters as input and > abstracts away all of the lifecycle management for the temporary session > credentials. Ideally, users could simply supply an AWSCredentialsProvider > instance as an argument
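Editor's note: the three extra parameters the proposal lists (role ARN, session name, optional external ID) map naturally onto a small config type. A minimal, hypothetical sketch in plain Scala, not the actual KinesisUtils or AWS SDK API, with all names invented for illustration, of carrying and sanity-checking those values before handing them to something like STSAssumeRoleSessionCredentialsProvider:

```scala
// Hypothetical sketch: the extra STS parameters described above, with a basic
// ARN shape check. Not the real Spark/KinesisUtils API; names are invented.
case class StsRoleConfig(
    roleArn: String,                     // ARN of the IAM role to assume in the external account
    sessionName: String,                 // label STS attaches to the temporary session
    externalId: Option[String] = None) { // optional external ID to validate the assumed role against

  // IAM role ARNs look like arn:aws:iam::<12-digit-account>:role/<name>
  def isValidRoleArn: Boolean =
    roleArn.matches("""arn:aws:iam::\d{12}:role/.+""")
}

object StsRoleConfig {
  // Fail fast with a readable message instead of a late AWS auth error.
  def validated(c: StsRoleConfig): Either[String, StsRoleConfig] =
    if (c.isValidRoleArn) Right(c)
    else Left(s"not an IAM role ARN: ${c.roleArn}")
}
```

Validating up front mirrors the proposal's goal of keeping the session-lifecycle complexity inside the credentials provider while the caller only supplies these three values.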
[jira] [Comment Edited] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878623#comment-15878623 ] Cody Koeninger edited comment on SPARK-19680 at 2/22/17 4:25 PM: - The issue here is likely that you have lost data (because of retention expiration) between the time the batch was defined on the driver, and the time the executor attempted to process the batch. Having executor consumers obey auto offset reset would result in silent data loss, which is a bad thing. There's a more detailed description of the semantic issues around this for kafka in KAFKA-3370 and for structured streaming kafka in SPARK-17937 If you've got really aggressive retention settings and are having trouble getting a stream started, look at specifying earliest + some margin on startup as a workaround. If you're having this trouble after a stream has been running for a while, you need more retention or smaller batches. was (Author: c...@koeninger.org): The issue here is likely that you have lost data (because of retention expiration) between the time the batch was defined on the driver, and the time the executor attempted to process the batch. Having executor consumers obey auto offset reset would result in silent data loss, which is a bad thing. There's a more detailed description of the semantic issues around this for kafka in KAFKA-3370 and for structured streaming kafka in SPARK-17937 > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene > > I'm using spark streaming with kafka to actually create a toplist. I want to > read all the messages in kafka. 
So I set >"auto.offset.reset" -> "earliest" > Nevertheless when I start the job on our spark cluster it is not working I > get: > Error: > {code:title=error.log|borderStyle=solid} > Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, > most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, > executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {SearchEvents-2=161803385} > {code} > This is somehow wrong because I did set the auto.offset.reset property > Setup: > Kafka Parameter: > {code:title=Config.Scala|borderStyle=solid} > def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, > Object] = { > Map( > "bootstrap.servers" -> > properties.getProperty("kafka.bootstrap.servers"), > "group.id" -> properties.getProperty("kafka.consumer.group"), > "auto.offset.reset" -> "earliest", > "spark.streaming.kafka.consumer.cache.enabled" -> "false", > "enable.auto.commit" -> "false", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer") > } > {code} > Job: > {code:title=Job.Scala|borderStyle=solid} > def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, > Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: > Broadcast[KafkaSink[TopList]]): Unit = { > getFilteredStream(stream.map(_.value()), windowDuration, > slideDuration).foreachRDD(rdd => { > val topList = new TopList > topList.setCreated(new Date()) > topList.setTopListEntryList(rdd.take(TopListLength).toList) > CurrentLogger.info("TopList length: " + > topList.getTopListEntryList.size().toString) > kafkaSink.value.send(SendToTopicName, topList) > CurrentLogger.info("Last Run: " + System.currentTimeMillis()) > }) > } > def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, > slideDuration: Int): DStream[TopListEntry] = { > val Mapper = 
MapperObject.readerFor[SearchEventDTO] > result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s)) > .filter(s => s != null && s.getSearchRequest != null && > s.getSearchRequest.getSearchParameters != null && s.getVertical == > Vertical.BAP && > s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName)) > .map(row => { > val name = > row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase() > (name, new TopListEntry(name, 1, row.getResultCount)) > }) > .reduceByKeyAndWindow( > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + > b.getMeanSearchHits), > (a: TopListEntry, b: TopListEntry) => new
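The "earliest + some margin" startup workaround mentioned in the comment above can be sketched as pure offset arithmetic. The class and method names below are made up for illustration; in a real job the earliest offsets would come from `KafkaConsumer#beginningOffsets` before the stream is started:

```java
import java.util.HashMap;
import java.util.Map;

class StartOffsets {
    // Given the broker's earliest available offset per partition (in a real
    // job this would come from KafkaConsumer#beginningOffsets) and a safety
    // margin, choose starting offsets far enough past the retention boundary
    // that records are not expired between the driver defining the batch and
    // the executor processing it.
    static Map<Integer, Long> withMargin(Map<Integer, Long> earliest, long margin) {
        Map<Integer, Long> start = new HashMap<>();
        for (Map.Entry<Integer, Long> e : earliest.entrySet()) {
            start.put(e.getKey(), e.getValue() + margin);
        }
        return start;
    }
}
```

The resulting map would then be passed as the explicit starting offsets when creating the direct stream, instead of relying on auto.offset.reset on the executors.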
[jira] [Commented] (SPARK-19680) Offsets out of range with no configured reset policy for partitions
[ https://issues.apache.org/jira/browse/SPARK-19680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878623#comment-15878623 ] Cody Koeninger commented on SPARK-19680: The issue here is likely that you have lost data (because of retention expiration) between the time the batch was defined on the driver, and the time the executor attempted to process the batch. Having executor consumers obey auto offset reset would result in silent data loss, which is a bad thing. There's a more detailed description of the semantic issues around this for kafka in KAFKA-3370 and for structured streaming kafka in SPARK-17937 > Offsets out of range with no configured reset policy for partitions > --- > > Key: SPARK-19680 > URL: https://issues.apache.org/jira/browse/SPARK-19680 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Schakmann Rene > > I'm using spark streaming with kafka to acutally create a toplist. I want to > read all the messages in kafka. 
So I set >"auto.offset.reset" -> "earliest" > Nevertheless when I start the job on our spark cluster it is not working I > get: > Error: > {code:title=error.log|borderStyle=solid} > Job aborted due to stage failure: Task 2 in stage 111.0 failed 4 times, > most recent failure: Lost task 2.3 in stage 111.0 (TID 1270, 194.232.55.23, > executor 2): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {SearchEvents-2=161803385} > {code} > This is somehow wrong because I did set the auto.offset.reset property > Setup: > Kafka Parameter: > {code:title=Config.Scala|borderStyle=solid} > def getDefaultKafkaReceiverParameter(properties: Properties):Map[String, > Object] = { > Map( > "bootstrap.servers" -> > properties.getProperty("kafka.bootstrap.servers"), > "group.id" -> properties.getProperty("kafka.consumer.group"), > "auto.offset.reset" -> "earliest", > "spark.streaming.kafka.consumer.cache.enabled" -> "false", > "enable.auto.commit" -> "false", > "key.deserializer" -> classOf[StringDeserializer], > "value.deserializer" -> "at.willhaben.sid.DTOByteDeserializer") > } > {code} > Job: > {code:title=Job.Scala|borderStyle=solid} > def processSearchKeyWords(stream: InputDStream[ConsumerRecord[String, > Array[Byte]]], windowDuration: Int, slideDuration: Int, kafkaSink: > Broadcast[KafkaSink[TopList]]): Unit = { > getFilteredStream(stream.map(_.value()), windowDuration, > slideDuration).foreachRDD(rdd => { > val topList = new TopList > topList.setCreated(new Date()) > topList.setTopListEntryList(rdd.take(TopListLength).toList) > CurrentLogger.info("TopList length: " + > topList.getTopListEntryList.size().toString) > kafkaSink.value.send(SendToTopicName, topList) > CurrentLogger.info("Last Run: " + System.currentTimeMillis()) > }) > } > def getFilteredStream(result: DStream[Array[Byte]], windowDuration: Int, > slideDuration: Int): DStream[TopListEntry] = { > val Mapper = 
MapperObject.readerFor[SearchEventDTO] > result.repartition(100).map(s => Mapper.readValue[SearchEventDTO](s)) > .filter(s => s != null && s.getSearchRequest != null && > s.getSearchRequest.getSearchParameters != null && s.getVertical == > Vertical.BAP && > s.getSearchRequest.getSearchParameters.containsKey(EspParameterEnum.KEYWORD.getName)) > .map(row => { > val name = > row.getSearchRequest.getSearchParameters.get(EspParameterEnum.KEYWORD.getName).getEspSearchParameterDTO.getValue.toLowerCase() > (name, new TopListEntry(name, 1, row.getResultCount)) > }) > .reduceByKeyAndWindow( > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount + b.getSearchCount, a.getMeanSearchHits + > b.getMeanSearchHits), > (a: TopListEntry, b: TopListEntry) => new TopListEntry(a.getKeyword, > a.getSearchCount - b.getSearchCount, a.getMeanSearchHits - > b.getMeanSearchHits), > Minutes(windowDuration), > Seconds(slideDuration)) > .filter((x: (String, TopListEntry)) => x._2.getSearchCount > 200L) > .map(row => (row._2.getSearchCount, row._2)) > .transform(rdd => rdd.sortByKey(ascending = false)) > .map(row => new TopListEntry(row._2.getKeyword, row._2.getSearchCount, > row._2.getMeanSearchHits / row._2.getSearchCount)) > } > def main(properties: Properties): Unit = { > val sparkSession = SparkUtil.getDefaultSparkSession(properties, TaskName) > val kafkaSink = > sparkSession.sparkContext.broadcast(KafkaSinkUtil.apply[TopList](SparkUtil.getDefaultSparkProperties(properties))) > val kafkaParams: Map[String, Object] = >
[jira] [Commented] (SPARK-19687) Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, kindly please help us with any examples.
[ https://issues.apache.org/jira/browse/SPARK-19687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878613#comment-15878613 ] Praveen Tallapudi commented on SPARK-19687: --- I tried, but I got a delivery failure. Sorry to trouble you. I really need to know whether a solution exists or not, that's why. :-) > Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, > kindly please help us with any examples. > - > > Key: SPARK-19687 > URL: https://issues.apache.org/jira/browse/SPARK-19687 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Praveen Tallapudi > > Dear Team, > I am a little new to Scala development and trying to find a solution for the > issue below. Please forgive me if this is not the correct place to post this > question. > I am trying to insert data from a data frame into a postgres table. > > Dataframe Schema: > root > |-- ID: string (nullable = true) > |-- evtInfo: struct (nullable = true) > ||-- @date: string (nullable = true) > ||-- @time: string (nullable = true) > ||-- @timeID: string (nullable = true) > ||-- TranCode: string (nullable = true) > ||-- custName: string (nullable = true) > ||-- evtInfo: array (nullable = true) > |||-- element: string (containsNull = true) > ||-- Type: string (nullable = true) > ||-- opID: string (nullable = true) > ||-- tracNbr: string (nullable = true) > > > DataBase Table Schema: > CREATE TABLE public.test > ( >id bigint NOT NULL, >evtInfo jsonb NOT NULL, >evt_val bigint NOT NULL > ) > > When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, > "public.test", dbPropForDFtoSave) to save the data, I am seeing the below > error. 
> > Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC > type for > struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array,evtType:string,operID:string,trackingNbr:string> > > Can you please suggest the best approach to save the data frame into the > postgres JSONB table? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
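One common workaround (an assumption on my part, not something confirmed in this thread) is to bypass DataFrameWriter for the struct column: serialize it to a JSON string (e.g. with to_json) and insert through plain JDBC with an explicit ?::jsonb cast so Postgres coerces the string. The SQL construction can be sketched as pure string logic; the table and column names mirror the reporter's schema:

```java
import java.util.List;
import java.util.Set;

class JsonbInsert {
    // Build an INSERT statement whose parameters for jsonb columns carry an
    // explicit ?::jsonb cast, so a plain JSON string bound via JDBC is
    // accepted by a jsonb column.
    static String insertSql(String table, List<String> cols, Set<String> jsonbCols) {
        StringBuilder names = new StringBuilder();
        StringBuilder params = new StringBuilder();
        for (int i = 0; i < cols.size(); i++) {
            if (i > 0) { names.append(", "); params.append(", "); }
            names.append(cols.get(i));
            params.append(jsonbCols.contains(cols.get(i)) ? "?::jsonb" : "?");
        }
        return "INSERT INTO " + table + " (" + names + ") VALUES (" + params + ")";
    }
}
```

Each row would then be bound to a PreparedStatement inside foreachPartition. Alternatively, adding stringtype=unspecified to the JDBC URL lets the PostgreSQL driver send strings untyped so the server casts them to jsonb itself (also an assumption, not verified in this thread).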
[jira] [Commented] (SPARK-16625) Oracle JDBC table creation fails with ORA-00902: invalid datatype
[ https://issues.apache.org/jira/browse/SPARK-16625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878604#comment-15878604 ] Sean Owen commented on SPARK-16625: --- Does anyone have a view on whether this is OK to back-port to 2.0.x or 1.6.x? > Oracle JDBC table creation fails with ORA-00902: invalid datatype > - > > Key: SPARK-16625 > URL: https://issues.apache.org/jira/browse/SPARK-16625 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2 >Reporter: Daniel Darabos >Assignee: Yuming Wang > Fix For: 2.1.0 > > > Unfortunately I know very little about databases, but I figure this is a bug. > I have a DataFrame with the following schema: > {noformat} > StructType(StructField(dst,StringType,true), StructField(id,LongType,true), > StructField(src,StringType,true)) > {noformat} > I am trying to write it to an Oracle database like this: > {code:java} > String url = "jdbc:oracle:thin:root/rootroot@:1521:db"; > java.util.Properties p = new java.util.Properties(); > p.setProperty("driver", "oracle.jdbc.OracleDriver"); > df.write().mode("overwrite").jdbc(url, "my_table", p); > {code} > And I get: > {noformat} > Exception in thread "main" java.sql.SQLSyntaxErrorException: ORA-00902: > invalid datatype > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:461) > at oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:402) > at oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:1108) > at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:541) > at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:264) > at oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:598) > at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:213) > at oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:26) > at > oracle.jdbc.driver.T4CStatement.executeForRows(T4CStatement.java:1241) > at > oracle.jdbc.driver.OracleStatement.doExecuteWithTimeout(OracleStatement.java:1558) > at > 
oracle.jdbc.driver.OracleStatement.executeUpdateInternal(OracleStatement.java:2498) > at > oracle.jdbc.driver.OracleStatement.executeUpdate(OracleStatement.java:2431) > at > oracle.jdbc.driver.OracleStatementWrapper.executeUpdate(OracleStatementWrapper.java:975) > at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:302) > {noformat} > The Oracle server I am running against is the one I get on Amazon RDS for > engine type {{oracle-se}}. The same code (with the right driver) against the > RDS instance with engine type {{MySQL}} works. > The error message is the same as in > https://issues.apache.org/jira/browse/SPARK-12941. Could it be that {{Long}} > is also translated into the wrong data type? Thanks. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
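For reference, the fix for this ticket teaches OracleDialect to emit Oracle-legal types instead of the generic JDBC ones (Oracle has no BIGINT, so LongType must become NUMBER(19)). A simplified sketch of that mapping, with types as plain strings rather than Spark's JdbcType; the exact set of mappings below is an approximation of the committed change, not a copy of it:

```java
class OracleTypeMap {
    // Oracle rejects the generic BIGINT that Spark's default JDBC mapping
    // produces for LongType (ORA-00902), so the dialect substitutes NUMBER
    // types. Returns null for types left to the generic JDBC mapping.
    static String oracleTypeFor(String catalystType) {
        switch (catalystType) {
            case "BooleanType": return "NUMBER(1)";
            case "IntegerType": return "NUMBER(10)";
            case "LongType":    return "NUMBER(19)";
            case "StringType":  return "VARCHAR2(255)";
            default:            return null;
        }
    }
}
```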
[jira] [Commented] (SPARK-19687) Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, kindly please help us with any examples.
[ https://issues.apache.org/jira/browse/SPARK-19687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878565#comment-15878565 ] Praveen Tallapudi commented on SPARK-19687: --- i tried sending it to u...@spark.apache.org , it is failed and got a failure notice. :-) > Does SPARK supports for Postgres JSONB data type to store JSON data, if yes, > kindly please help us with any examples. > - > > Key: SPARK-19687 > URL: https://issues.apache.org/jira/browse/SPARK-19687 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.1.1 >Reporter: Praveen Tallapudi > > Dear Team, > I am little new to Scala development and trying to find the solution for the > below. Please forgive me if this is not the correct place to post this > question. > I am trying to insert data from a data frame into postgres table. > > Dataframe Schema: > root > |-- ID: string (nullable = true) > |-- evtInfo: struct (nullable = true) > ||-- @date: string (nullable = true) > ||-- @time: string (nullable = true) > ||-- @timeID: string (nullable = true) > ||-- TranCode: string (nullable = true) > ||-- custName: string (nullable = true) > ||-- evtInfo: array (nullable = true) > |||-- element: string (containsNull = true) > ||-- Type: string (nullable = true) > ||-- opID: string (nullable = true) > ||-- tracNbr: string (nullable = true) > > > DataBase Table Schema: > CREATE TABLE public.test > ( >id bigint NOT NULL, >evtInfo jsonb NOT NULL, >evt_val bigint NOT NULL > ) > > When I use dataFrame_toSave.write.mode(SaveMode.Append).jdbc(dbUrl, > "public.test", dbPropForDFtoSave) to save the data, I am seeing the below > error. 
> > Exception in thread "main" java.lang.IllegalArgumentException: Can't get JDBC > type for > struct<@dateEvt:string,@timeEvt:string,@timeID:string,CICSTranCode:string,custName:string,evtInfo:array,evtType:string,operID:string,trackingNbr:string> > > Can you please suggest the best approach to save the data frame into the > postgres JSONB table? -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
[ https://issues.apache.org/jira/browse/SPARK-19392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878598#comment-15878598 ] Sean Owen commented on SPARK-19392: --- It's because SPARK-16625 was back-ported to the 1.6.x release in CDH, I guess because of a customer problem. It's at https://github.com/apache/spark/blob/39e2bad6a866d27c3ca594d15e574a1da3ee84cc/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L33 I could easily back-port SPARK-16625 to 2.0.x and 1.6.x upstream. I don't think there was a particular reason it wasn't, so I'll ask if there are any objections and then do so. > Throw an exception "NoSuchElementException: key not found: scale" in > OracleDialect > -- > > Key: SPARK-19392 > URL: https://issues.apache.org/jira/browse/SPARK-19392 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In OracleDialect, if you use Numeric types in `DataFrameWriter` with Oracle > jdbc, this throws an exception below; > {code} > java.util.NoSuchElementException: key not found: scale at > scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) at > scala.collection.MapLike$class.apply(MapLike.scala:141) > {code} > This ticket comes from > https://www.mail-archive.com/user@spark.apache.org/msg61280.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
[ https://issues.apache.org/jira/browse/SPARK-19392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878570#comment-15878570 ] Takeshi Yamamuro commented on SPARK-19392: -- The entry at line 33 of `OracleDialect` does not exist (see: https://github.com/apache/spark/blob/v1.6.0/sql/core/src/main/scala/org/apache/spark/sql/jdbc/OracleDialect.scala#L33) in the community-released v1.6.0. So it'd be better to ask the Cloudera folks? Thanks! > Throw an exception "NoSuchElementException: key not found: scale" in > OracleDialect > -- > > Key: SPARK-19392 > URL: https://issues.apache.org/jira/browse/SPARK-19392 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In OracleDialect, if you use Numeric types in `DataFrameWriter` with Oracle > jdbc, this throws an exception below; > {code} > java.util.NoSuchElementException: key not found: scale at > scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) at > scala.collection.MapLike$class.apply(MapLike.scala:141) > {code} > This ticket comes from > https://www.mail-archive.com/user@spark.apache.org/msg61280.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19392) Throw an exception "NoSuchElementException: key not found: scale" in OracleDialect
[ https://issues.apache.org/jira/browse/SPARK-19392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878520#comment-15878520 ] Hokam Singh Chauhan commented on SPARK-19392: - I am also getting a similar issue with Spark 1.6.0 in a CDH 5.8.3 environment. 17/02/22 18:26:11 INFO com.AppReceiver: Query in AppReceiver : select * from tutorials 17/02/22 18:26:14 ERROR com.AppDriver: Failed to start the driver for JDBCOracleApp java.util.NoSuchElementException: key not found: scale at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:58) at scala.collection.MapLike$class.apply(MapLike.scala:141) at scala.collection.AbstractMap.apply(Map.scala:58) at org.apache.spark.sql.types.Metadata.get(Metadata.scala:108) at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51) at org.apache.spark.sql.jdbc.OracleDialect$.getCatalystType(OracleDialect.scala:33) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:140) at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.(JDBCRelation.scala:91) at org.apache.spark.sql.execution.datasources.jdbc.DefaultSource.createRelation(DefaultSource.scala:57) at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:158) Can anyone please help resolve this in the existing version? 
> Throw an exception "NoSuchElementException: key not found: scale" in > OracleDialect > -- > > Key: SPARK-19392 > URL: https://issues.apache.org/jira/browse/SPARK-19392 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In OracleDialect, if you use Numeric types in `DataFrameWriter` with Oracle > jdbc, this throws an exception below; > {code} > java.util.NoSuchElementException: key not found: scale at > scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) at > scala.collection.MapLike$class.apply(MapLike.scala:141) > {code} > This ticket comes from > https://www.mail-archive.com/user@spark.apache.org/msg61280.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
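The crash in the stack trace above is a plain missing-key lookup: Metadata.getLong("scale") calls the underlying map's apply, which throws when Oracle NUMBER metadata carries no scale entry. A tiny Java sketch of the failure mode and a defensive default (Spark's Metadata is Scala; this only mirrors the lookup behavior, and defaulting to 0 is an assumption about a safe fallback, not the actual fix):

```java
import java.util.Map;

class ScaleLookup {
    // Scala's Map#apply throws java.util.NoSuchElementException for a
    // missing key, which is exactly the crash in this ticket. Looking the
    // key up with a default instead avoids the exception.
    static long scaleOf(Map<String, Long> metadata) {
        return metadata.getOrDefault("scale", 0L);
    }
}
```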
[jira] [Comment Edited] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
[ https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878433#comment-15878433 ] Praveen Tallapudi edited comment on SPARK-7869 at 2/22/17 3:28 PM: --- Hi Nipun, I am using Spark. Is there a way to insert the Jsonb data into postgres. We have a new project in design phase. We are thinking of using Apache Spark + Postgres DB. But we are facing issues while inserting JSONB data type. Is there a support for Postgres-JSONB from spark? Can you please help us ? I have posted this question in the issues but no response. We really need help, can you please let us know if there is a way of inserting?? was (Author: praveen.tallapudi): Hi Nipun, I am using Spark. Is there a way to insert the Jsonb data into postgres. We have a new project in design phase. We are thinking of using Apache Spark + Postgres DB. But we are facing issues while inserting JSONB data type. Is there a support for Postgres-JSONB from spark? Can you please help us ? I have posted this question in the issues but no response. Can you please help?? We really need help, can you please help?? > Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns > -- > > Key: SPARK-7869 > URL: https://issues.apache.org/jira/browse/SPARK-7869 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0, 1.3.1 > Environment: Spark 1.3.1 >Reporter: Brad Willard >Assignee: Alexey Grishchenko >Priority: Minor > Fix For: 1.6.0 > > > Most of our tables load into dataframes just fine with postgres. However we > have a number of tables leveraging the JSONB datatype. Spark will error and > refuse to load this table. While asking for Spark to support JSONB might be a > tall order in the short term, it would be great if Spark would at least load > the table ignoring the columns it can't load or have it be an option. 
> {code} > pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json") > Py4JJavaError: An error occurred while calling o41.load. > : java.sql.SQLException: Unsupported type > at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78) > at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112) > at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:133) > at > org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121) > at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219) > at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697) > at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19692) Comparison on BinaryType has incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878443#comment-15878443 ] Sean Owen edited comment on SPARK-19692 at 2/22/17 3:16 PM: Bytes are signed in the JVM, and thus in Scala and Java. It's always been this way everywhere and isn't specific to Spark. 0x8C, as a byte, is a way of writing -116, not a positive value. 0x8C is a positive integer literal, but when cast to a byte, it's a negative 2s-complement value. was (Author: srowen): Bytes are signed in the JVM, and thus in Scala and Java. It's always been this way everywhere and isn't specific to Spark. 0x8C is a way of writing -116, not a positive value. > Comparison on BinaryType has incorrect results > -- > > Key: SPARK-19692 > URL: https://issues.apache.org/jira/browse/SPARK-19692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Smith > > I believe there is an issue with comparisons on binary fields: > {code} > val sc = SparkSession.builder.appName("test").getOrCreate() > val schema = StructType(Seq(StructField("ip", BinaryType))) > val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => > InetAddress.getByName(s).getAddress) > val df = sc.createDataFrame( > sc.sparkContext.parallelize(ips, 1).map { ip => > Row(ip) > }, schema > ) > val query = df > .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress) > .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress) > logger.info(query.explain(true)) > val results = query.collect() > results.length mustEqual 1 > {code} > returns no results. 
> i believe the problem is that the comparison is coercing the bytes to signed > integers in the call to compareTo here in TypeUtils: > {code} > def compareBinary(x: Array[Byte], y: Array[Byte]): Int = { > for (i <- 0 until x.length; if i < y.length) { > val res = x(i).compareTo(y(i)) > if (res != 0) return res > } > x.length - y.length > } > {code} > with some hacky testing i was able to get the desired results with: {code} > val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code} > thanks! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
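Sean's point is easy to check on the JVM: 0x8C stored in a byte is -116, and masking with 0xff recovers the unsigned 140 the reporter expected. A sketch of the unsigned lexicographic comparison the reporter proposes:

```java
class UnsignedBytes {
    // Lexicographic comparison treating each byte as unsigned 0..255,
    // i.e. the (x & 0xff) - (y & 0xff) variant from the ticket. Byte is
    // signed in Java/Scala, so (byte) 0x8C is -116 and a raw compareTo
    // orders it before 0x00.
    static int compareUnsigned(byte[] x, byte[] y) {
        for (int i = 0; i < x.length && i < y.length; i++) {
            int res = (x[i] & 0xff) - (y[i] & 0xff);
            if (res != 0) return res;
        }
        return x.length - y.length;
    }
}
```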
[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878443#comment-15878443 ] Sean Owen commented on SPARK-19692: --- Bytes are signed in the JVM, and thus in Scala and Java. It's always been this way everywhere and isn't specific to Spark. 0x8C is a way of writing -116, not a positive value. > Comparison on BinaryType has incorrect results > -- > > Key: SPARK-19692 > URL: https://issues.apache.org/jira/browse/SPARK-19692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Smith > > I believe there is an issue with comparisons on binary fields: > {code} > val sc = SparkSession.builder.appName("test").getOrCreate() > val schema = StructType(Seq(StructField("ip", BinaryType))) > val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => > InetAddress.getByName(s).getAddress) > val df = sc.createDataFrame( > sc.sparkContext.parallelize(ips, 1).map { ip => > Row(ip) > }, schema > ) > val query = df > .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress) > .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress) > logger.info(query.explain(true)) > val results = query.collect() > results.length mustEqual 1 > {code} > returns no results. > i believe the problem is that the comparison is coercing the bytes to signed > integers in the call to compareTo here in TypeUtils: > {code} > def compareBinary(x: Array[Byte], y: Array[Byte]): Int = { > for (i <- 0 until x.length; if i < y.length) { > val res = x(i).compareTo(y(i)) > if (res != 0) return res > } > x.length - y.length > } > {code} > with some hacky testing i was able to get the desired results with: {code} > val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code} > thanks! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7869) Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns
[ https://issues.apache.org/jira/browse/SPARK-7869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878433#comment-15878433 ] Praveen Tallapudi commented on SPARK-7869: -- Hi Nipun, I am using Spark. Is there a way to insert the Jsonb data into postgres. We have a new project in design phase. We are thinking of using Apache Spark + Postgres DB. But we are facing issues while inserting JSONB data type. Is there a support for Postgres-JSONB from spark? Can you please help us ? I have posted this question in the issues but no response. Can you please help?? We really need help, can you please help?? > Spark Data Frame Fails to Load Postgres Tables with JSONB DataType Columns > -- > > Key: SPARK-7869 > URL: https://issues.apache.org/jira/browse/SPARK-7869 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.3.0, 1.3.1 > Environment: Spark 1.3.1 >Reporter: Brad Willard >Assignee: Alexey Grishchenko >Priority: Minor > Fix For: 1.6.0 > > > Most of our tables load into dataframes just fine with postgres. However we > have a number of tables leveraging the JSONB datatype. Spark will error and > refuse to load this table. While asking for Spark to support JSONB might be a > tall order in the short term, it would be great if Spark would at least load > the table ignoring the columns it can't load or have it be an option. > {code} > pdf = sql_context.load(source="jdbc", url=url, dbtable="table_of_json") > Py4JJavaError: An error occurred while calling o41.load. 
> : java.sql.SQLException: Unsupported type > at org.apache.spark.sql.jdbc.JDBCRDD$.getCatalystType(JDBCRDD.scala:78) > at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:112) > at org.apache.spark.sql.jdbc.JDBCRelation.(JDBCRelation.scala:133) > at > org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:121) > at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:219) > at org.apache.spark.sql.SQLContext.load(SQLContext.scala:697) > at org.apache.spark.sql.SQLContext.load(SQLContext.scala:685) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) > at py4j.Gateway.invoke(Gateway.java:259) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:207) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19692) Comparison on BinaryType has incorrect results
[ https://issues.apache.org/jira/browse/SPARK-19692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878431#comment-15878431 ] Don Smith commented on SPARK-19692: an even more trivial example: {code} val sc = SparkSession.builder.appName("test").getOrCreate() val schema = StructType(Seq(StructField("byte", BinaryType))) val byte = Seq(Array(0x8C.toByte)) val df = sc.createDataFrame( sc.sparkContext.parallelize(byte, 1).map { ip => SQLRow(ip) }, schema ) logger.info(df.show) val query = df .where(df("byte") >= Array(0x00.toByte)) .where(df("byte") <= Array(0xFF.toByte)) logger.info(query.explain(true)) val results = query.collect() results.length mustEqual 1 {code} i'm having trouble believing this is the expected behavior, and if it is, is it defined somewhere? > Comparison on BinaryType has incorrect results > -- > > Key: SPARK-19692 > URL: https://issues.apache.org/jira/browse/SPARK-19692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Don Smith > > I believe there is an issue with comparisons on binary fields: > {code} > val sc = SparkSession.builder.appName("test").getOrCreate() > val schema = StructType(Seq(StructField("ip", BinaryType))) > val ips = Seq("1.1.1.1", "2.2.2.2", "200.10.6.7").map(s => > InetAddress.getByName(s).getAddress) > val df = sc.createDataFrame( > sc.sparkContext.parallelize(ips, 1).map { ip => > Row(ip) > }, schema > ) > val query = df > .where(df("ip") >= InetAddress.getByName("200.10.0.0").getAddress) > .where(df("ip") <= InetAddress.getByName("200.10.255.255").getAddress) > logger.info(query.explain(true)) > val results = query.collect() > results.length mustEqual 1 > {code} > returns no results. 
> i believe the problem is that the comparison is coercing the bytes to signed > integers in the call to compareTo here in TypeUtils: > {code} > def compareBinary(x: Array[Byte], y: Array[Byte]): Int = { > for (i <- 0 until x.length; if i < y.length) { > val res = x(i).compareTo(y(i)) > if (res != 0) return res > } > x.length - y.length > } > {code} > with some hacky testing i was able to get the desired results with: {code} > val res = (x(i).toByte & 0xff) - (y(i).toByte & 0xff) {code} > thanks! -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19659) Fetch big blocks to disk when shuffle-read
[ https://issues.apache.org/jira/browse/SPARK-19659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878389#comment-15878389 ] jin xing commented on SPARK-19659: -- [~irashid] Thanks a lot for your comments. I will file a design PDF later this week and your concerns will be included. > Fetch big blocks to disk when shuffle-read > -- > > Key: SPARK-19659 > URL: https://issues.apache.org/jira/browse/SPARK-19659 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.1.0 >Reporter: jin xing > > Currently the whole block is fetched into memory (off-heap by default) when > shuffle-read. A block is defined by (shuffleId, mapId, reduceId). Thus it can > be large in skew situations. If OOM happens during shuffle read, the job will > be killed and users will be notified to "Consider boosting > spark.yarn.executor.memoryOverhead". Adjusting the parameter and allocating more > memory can resolve the OOM. However, the approach is not perfectly suitable > for production environments, especially for data warehouses. > Using Spark SQL as the data engine in a warehouse, users hope to have a unified > parameter (e.g. memory) with less resource wasted (resource allocated but not > used). > It's not always easy to predict skew situations; when they happen, it makes sense > to fetch remote blocks to disk for shuffle-read, rather than > kill the job because of OOM. This approach was mentioned during the discussion > in SPARK-3019, by [~sandyr] and [~mridulm80] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
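The idea proposed in this ticket, buffering small blocks in memory but spilling oversized ones to disk during fetch, can be sketched standalone. This is a hedged illustration only: the threshold, function names, and file handling below are hypothetical and do not reflect Spark's actual shuffle code.

```python
import tempfile


def fetch_block(read_chunks, max_in_memory: int):
    """Fetch a block from an iterator of byte chunks.

    Blocks up to max_in_memory bytes are buffered in memory; anything larger
    is streamed to a temporary file, so a skewed (very large) block cannot
    OOM the reader. Returns ("memory", data) or ("disk", file_path).

    read_chunks must be an iterator (not a list), so that after the spill
    decision we can keep consuming the remaining chunks.
    """
    buf = bytearray()
    for chunk in read_chunks:
        buf.extend(chunk)
        if len(buf) > max_in_memory:
            # Spill: flush what we have buffered, then stream the rest
            # of the block straight to disk.
            f = tempfile.NamedTemporaryFile(delete=False)
            f.write(buf)
            for rest in read_chunks:
                f.write(rest)
            f.close()
            return ("disk", f.name)
    return ("memory", bytes(buf))
```

A real implementation would also need to clean up spill files and expose the on-disk block through the same iterator interface as in-memory blocks, which is the kind of detail a design document for this ticket would cover.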
[jira] [Resolved] (SPARK-19679) Destroy broadcasted object without blocking
[ https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19679. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17016 [https://github.com/apache/spark/pull/17016] > Destroy broadcasted object without blocking > --- > > Key: SPARK-19679 > URL: https://issues.apache.org/jira/browse/SPARK-19679 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Priority: Trivial > Fix For: 2.2.0 > > > Destroy broadcasted object without blocking in ML -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19679) Destroy broadcasted object without blocking
[ https://issues.apache.org/jira/browse/SPARK-19679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19679: -- Assignee: zhengruifeng > Destroy broadcasted object without blocking > --- > > Key: SPARK-19679 > URL: https://issues.apache.org/jira/browse/SPARK-19679 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.2.0 > > > Destroy broadcasted object without blocking in ML -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath reassigned SPARK-19694: -- Assignee: zhengruifeng > Add missing 'setTopicDistributionCol' for LDAModel > -- > > Key: SPARK-19694 > URL: https://issues.apache.org/jira/browse/SPARK-19694 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Trivial > Fix For: 2.2.0 > > > {{LDAModel}} can not set the output column now. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19694) Add missing 'setTopicDistributionCol' for LDAModel
[ https://issues.apache.org/jira/browse/SPARK-19694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nick Pentreath resolved SPARK-19694. Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17021 [https://github.com/apache/spark/pull/17021] > Add missing 'setTopicDistributionCol' for LDAModel > -- > > Key: SPARK-19694 > URL: https://issues.apache.org/jira/browse/SPARK-19694 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: zhengruifeng >Priority: Trivial > Fix For: 2.2.0 > > > {{LDAModel}} can not set the output column now. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-19691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19691: Assignee: Apache Spark > Calculating percentile of decimal column fails with ClassCastException > -- > > Key: SPARK-19691 > URL: https://issues.apache.org/jira/browse/SPARK-19691 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > Running > {code} > spark.range(10).selectExpr("cast (id as decimal) as > x").selectExpr("percentile(x, 0.5)").collect() > {code} > results in a ClassCastException: > {code} > java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be > cast to java.lang.Number > at > org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141) > at > org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:78) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) > at > 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:113) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
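The failure above happens because {{Percentile.update}} assumes its input is a {{java.lang.Number}}, which Spark's internal {{Decimal}} is not. As a rough standalone analogue (not Spark's actual fix), a percentile with linear interpolation can coerce inputs to a common numeric type up front, so unusual numeric types such as {{decimal.Decimal}} are handled instead of failing at update time:

```python
from decimal import Decimal


def percentile(values, p):
    """Linear-interpolation percentile for a single percentage p in [0, 1],
    roughly what a percentile aggregate computes.

    Inputs are coerced to float before aggregation -- the analogue of
    converting Decimal to a plain numeric type instead of casting it to
    Number and crashing.
    """
    xs = sorted(float(v) for v in values)  # coercion happens here
    if not xs:
        return None
    pos = p * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    frac = pos - lo
    return xs[lo] * (1 - frac) + xs[hi] * frac


# Matches the query in the report: the median of 0..9 is 4.5,
# whether the inputs are ints or Decimals.
```

For example, percentile([Decimal(i) for i in range(10)], 0.5) returns 4.5, which is the result the failing query would produce once the cast is fixed.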
[jira] [Commented] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-19691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878270#comment-15878270 ] Apache Spark commented on SPARK-19691: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17028 > Calculating percentile of decimal column fails with ClassCastException > -- > > Key: SPARK-19691 > URL: https://issues.apache.org/jira/browse/SPARK-19691 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen > > Running > {code} > spark.range(10).selectExpr("cast (id as decimal) as > x").selectExpr("percentile(x, 0.5)").collect() > {code} > results in a ClassCastException: > {code} > java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be > cast to java.lang.Number > at > org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141) > at > org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:78) > at > 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:113) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19691) Calculating percentile of decimal column fails with ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-19691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19691: Assignee: (was: Apache Spark) > Calculating percentile of decimal column fails with ClassCastException > -- > > Key: SPARK-19691 > URL: https://issues.apache.org/jira/browse/SPARK-19691 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Josh Rosen > > Running > {code} > spark.range(10).selectExpr("cast (id as decimal) as > x").selectExpr("percentile(x, 0.5)").collect() > {code} > results in a ClassCastException: > {code} > java.lang.ClassCastException: org.apache.spark.sql.types.Decimal cannot be > cast to java.lang.Number > at > org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:141) > at > org.apache.spark.sql.catalyst.expressions.aggregate.Percentile.update(Percentile.scala:58) > at > org.apache.spark.sql.catalyst.expressions.aggregate.TypedImperativeAggregate.update(interfaces.scala:514) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$anonfun$applyOrElse$1.apply(AggregationIterator.scala:171) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:187) > at > org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateProcessRow$1.apply(AggregationIterator.scala:181) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:151) > at > org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:78) > at > org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:109) > at > 
org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:101) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:113) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job
[ https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19650: Assignee: Apache Spark (was: Sameer Agarwal) > Metastore-only operations shouldn't trigger a spark job > --- > > Key: SPARK-19650 > URL: https://issues.apache.org/jira/browse/SPARK-19650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sameer Agarwal >Assignee: Apache Spark > > We currently trigger a spark job even for simple metastore operations ({{SHOW > TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}} etc.). Even though these > otherwise get executed on a driver, it prevents a user from doing these > operations on a driver-only cluster. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job
[ https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15878260#comment-15878260 ] Apache Spark commented on SPARK-19650: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17027 > Metastore-only operations shouldn't trigger a spark job > --- > > Key: SPARK-19650 > URL: https://issues.apache.org/jira/browse/SPARK-19650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > > We currently trigger a spark job even for simple metastore operations ({{SHOW > TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}} etc.). Even though these > otherwise get executed on a driver, it prevents a user from doing these > operations on a driver-only cluster. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19650) Metastore-only operations shouldn't trigger a spark job
[ https://issues.apache.org/jira/browse/SPARK-19650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19650: Assignee: Sameer Agarwal (was: Apache Spark) > Metastore-only operations shouldn't trigger a spark job > --- > > Key: SPARK-19650 > URL: https://issues.apache.org/jira/browse/SPARK-19650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Sameer Agarwal >Assignee: Sameer Agarwal > > We currently trigger a spark job even for simple metastore operations ({{SHOW > TABLES}}, {{SHOW DATABASES}}, {{CREATE TABLE}} etc.). Even though these > otherwise get executed on a driver, it prevents a user from doing these > operations on a driver-only cluster. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org