[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906122#comment-15906122 ]

yuhao yang commented on SPARK-14503:
------------------------------------

Thanks for reporting that. I just found a misplaced distinct in the iterations of updates. Do you want to send a PR with some unit tests? [~zero323]

> spark.ml Scala API for FPGrowth
> -------------------------------
>
>                 Key: SPARK-14503
>                 URL: https://issues.apache.org/jira/browse/SPARK-14503
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Joseph K. Bradley
>            Assignee: yuhao yang
>             Fix For: 2.2.0
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based API, with details for this class. The doc could also look ahead to the other fpm classes, especially if their API decisions will affect FPGrowth.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19914) Spark Scala - Calling persist after reading a parquet file makes certain spark.sql queries return empty results
[ https://issues.apache.org/jira/browse/SPARK-19914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906121#comment-15906121 ]

Takeshi Yamamuro commented on SPARK-19914:
------------------------------------------

I couldn't reproduce this; am I missing anything?

{code}
scala> Seq(("1", 2)).toDF("id", "value").write.parquet("/Users/maropu/Desktop/data")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data")
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+

scala> import org.apache.spark.storage.StorageLevel
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data").persist(StorageLevel.DISK_ONLY)
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+
{code}

> Spark Scala - Calling persist after reading a parquet file makes certain
> spark.sql queries return empty results
> ------------------------------------------------------------------------
>
>                 Key: SPARK-19914
>                 URL: https://issues.apache.org/jira/browse/SPARK-19914
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output, SQL
>    Affects Versions: 2.0.0, 2.1.0
>            Reporter: Yifeng Li
>
> Hi there,
> It seems like calling .persist() after spark.read.parquet will make spark.sql statements return empty results if the query is written in a certain way.
> I have the following code:
> val df = spark.read.parquet("C:\\...")
> df.createOrReplaceTempView("t1")
> spark.sql("select * from t1 a where a.id = '123456789'").show(10)
> Everything works fine here.
> Now, if I do:
> val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
> df.createOrReplaceTempView("t1")
> spark.sql("select * from t1 a where a.id = '123456789'").show(10)
> I will get empty results.
> Selecting individual columns works with persist, e.g.:
> val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
> df.createOrReplaceTempView("t1")
> spark.sql("select a.id from t1 a where a.id = '123456789'").show(10)
[jira] [Comment Edited] (SPARK-19914) Spark Scala - Calling persist after reading a parquet file makes certain spark.sql queries return empty results
[ https://issues.apache.org/jira/browse/SPARK-19914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906121#comment-15906121 ]

Takeshi Yamamuro edited comment on SPARK-19914 at 3/11/17 7:21 AM:
-------------------------------------------------------------------

I couldn't reproduce this; am I missing anything? (I checked in v2.1.0 and v2.0.2)

{code}
scala> Seq(("1", 2)).toDF("id", "value").write.parquet("/Users/maropu/Desktop/data")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data")
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+

scala> import org.apache.spark.storage.StorageLevel
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data").persist(StorageLevel.DISK_ONLY)
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+
{code}

> Spark Scala - Calling persist after reading a parquet file makes certain
> spark.sql queries return empty results
> ------------------------------------------------------------------------
[jira] [Assigned] (SPARK-19919) Defer input path validation into DataSource in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19919:
------------------------------------

    Assignee: Apache Spark

> Defer input path validation into DataSource in CSV datasource
> -------------------------------------------------------------
[jira] [Assigned] (SPARK-19919) Defer input path validation into DataSource in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19919:
------------------------------------

    Assignee: (was: Apache Spark)

> Defer input path validation into DataSource in CSV datasource
> -------------------------------------------------------------
[jira] [Commented] (SPARK-19919) Defer input path validation into DataSource in CSV datasource
[ https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906115#comment-15906115 ]

Apache Spark commented on SPARK-19919:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17256

> Defer input path validation into DataSource in CSV datasource
> -------------------------------------------------------------
[jira] [Created] (SPARK-19919) Defer input path validation into DataSource in CSV datasource
Hyukjin Kwon created SPARK-19919:
---------------------------------

             Summary: Defer input path validation into DataSource in CSV datasource
                 Key: SPARK-19919
                 URL: https://issues.apache.org/jira/browse/SPARK-19919
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Hyukjin Kwon
            Priority: Trivial

Currently, when other datasources fail to infer the schema, they return {{None}}, and this is validated in {{DataSource}} as below:

{code}
scala> spark.read.json("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
{code}

{code}
scala> spark.read.orc("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
{code}

{code}
scala> spark.read.parquet("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
{code}

However, CSV checks this within the datasource implementation and throws a different exception message:

{code}
scala> spark.read.csv("emptydir")
java.lang.IllegalArgumentException: requirement failed: Cannot infer schema from an empty set of files
{code}

We could remove this duplicated check and validate this in one place, in the same way and with the same message.
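The single-point validation this ticket proposes can be sketched outside Spark as an Option-returning inference step checked in one shared place. This is a simplified illustration, not Spark's actual internals: the names {{Schema}}, {{inferSchema}}, and {{validateSchema}} are hypothetical stand-ins.

```scala
// Toy stand-in for a datasource's inferred schema.
case class Schema(fields: Seq[String])

// Each format's inference returns None when nothing can be inferred
// (e.g. an empty directory) instead of throwing its own exception.
def inferSchema(files: Seq[String]): Option[Schema] =
  if (files.isEmpty) None else Some(Schema(Seq("col0")))

// Single validation point, analogous to the check in DataSource:
// one error message shared by all formats.
def validateSchema(format: String, inferred: Option[Schema]): Schema =
  inferred.getOrElse(throw new IllegalArgumentException(
    s"Unable to infer schema for $format. It must be specified manually."))

// Non-empty input: a schema comes back.
val ok = validateSchema("CSV", inferSchema(Seq("part-00000.csv")))

// Empty input: every format now fails with the same message.
val err =
  try { validateSchema("CSV", inferSchema(Seq.empty)); "no error" }
  catch { case e: IllegalArgumentException => e.getMessage }
```

With this shape, a format-specific check like CSV's "Cannot infer schema from an empty set of files" becomes redundant and can be dropped.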
[jira] [Assigned] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat
[ https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19918:
------------------------------------

    Assignee: Apache Spark

> Use TextFileFormat in implementation of JsonFileFormat
> ------------------------------------------------------
[jira] [Assigned] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat
[ https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19918:
------------------------------------

    Assignee: (was: Apache Spark)

> Use TextFileFormat in implementation of JsonFileFormat
> ------------------------------------------------------
[jira] [Commented] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat
[ https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906098#comment-15906098 ]

Apache Spark commented on SPARK-19918:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17255

> Use TextFileFormat in implementation of JsonFileFormat
> ------------------------------------------------------
[jira] [Created] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat
Hyukjin Kwon created SPARK-19918:
---------------------------------

             Summary: Use TextFileFormat in implementation of JsonFileFormat
                 Key: SPARK-19918
                 URL: https://issues.apache.org/jira/browse/SPARK-19918
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Hyukjin Kwon

There are advantages to using Dataset for the initial loading when inferring the schema. Please refer to SPARK-18362.

It seems the JSON one was supposed to be fixed together but was missed, according to https://github.com/apache/spark/pull/15813:

{quote}
A similar problem also affects the JSON file format and this patch originally fixed that as well, but I've decided to split that change into a separate patch so as not to conflict with changes in another JSON PR.
{quote}

Also, this affects some functionality because it does not use {{FileScanRDD}}. This problem is described in SPARK-19885 (but that was CSV's case).
[jira] [Resolved] (SPARK-19901) Clean up the clunky method signature of acquireMemory
[ https://issues.apache.org/jira/browse/SPARK-19901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-19901.
-------------------------------
    Resolution: Not A Problem

> Clean up the clunky method signature of acquireMemory
> -----------------------------------------------------
>
>                 Key: SPARK-19901
>                 URL: https://issues.apache.org/jira/browse/SPARK-19901
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: coneyliu
>            Priority: Minor
>
> Clean up the clunky method signature of acquireMemory
[jira] [Assigned] (SPARK-19723) create table for data source tables should work with a non-existent location
[ https://issues.apache.org/jira/browse/SPARK-19723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-19723:
-----------------------------------

    Assignee: Song Jun

> create table for data source tables should work with a non-existent location
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-19723
>                 URL: https://issues.apache.org/jira/browse/SPARK-19723
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Song Jun
>            Assignee: Song Jun
>             Fix For: 2.2.0
>
> This JIRA is follow-up work after SPARK-19583.
> As discussed in that [PR|https://github.com/apache/spark/pull/16938], the following DDL for a datasource table with a non-existent location should work:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
> {code}
> Currently it throws an exception that the path does not exist.
[jira] [Resolved] (SPARK-19723) create table for data source tables should work with a non-existent location
[ https://issues.apache.org/jira/browse/SPARK-19723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-19723.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 2.2.0

Issue resolved by pull request 17055
[https://github.com/apache/spark/pull/17055]

> create table for data source tables should work with a non-existent location
> ----------------------------------------------------------------------------
[jira] [Assigned] (SPARK-19917) qualified partition location stored in catalog
[ https://issues.apache.org/jira/browse/SPARK-19917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19917:
------------------------------------

    Assignee: Apache Spark

> qualified partition location stored in catalog
> ----------------------------------------------
[jira] [Assigned] (SPARK-19917) qualified partition location stored in catalog
[ https://issues.apache.org/jira/browse/SPARK-19917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19917:
------------------------------------

    Assignee: (was: Apache Spark)

> qualified partition location stored in catalog
> ----------------------------------------------
[jira] [Commented] (SPARK-19917) qualified partition location stored in catalog
[ https://issues.apache.org/jira/browse/SPARK-19917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906072#comment-15906072 ]

Apache Spark commented on SPARK-19917:
--------------------------------------

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17254

> qualified partition location stored in catalog
> ----------------------------------------------
[jira] [Created] (SPARK-19917) qualified partition location stored in catalog
Song Jun created SPARK-19917:
-----------------------------

             Summary: qualified partition location stored in catalog
                 Key: SPARK-19917
                 URL: https://issues.apache.org/jira/browse/SPARK-19917
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Song Jun

A partition path should be qualified before it is stored in the catalog. There are several scenarios:

1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x'
   qualified: file:/path/x
2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x'
   qualified: file:/tablelocation/x
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
   qualified: file:/path/x
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
   qualified: file:/tablelocation/x

Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde tables has the expected qualified path. We should make the other scenarios consistent with it.
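The qualification rules in the four scenarios above can be illustrated with a small helper. This is a hypothetical sketch using java.net.URI, not Spark's implementation (which qualifies paths against the Hadoop filesystem); the name qualifyPartitionPath is made up for illustration:

```scala
import java.net.URI

// Qualify a partition location the way the scenarios above describe:
// locations that already carry a scheme pass through, absolute paths get
// the default file: scheme, and relative paths resolve against the table
// location.
def qualifyPartitionPath(tableLocation: String, location: String): String = {
  val uri = new URI(location)
  if (uri.getScheme != null) {
    location                                          // e.g. hdfs://host/path/x
  } else if (location.startsWith("/")) {
    "file:" + location                                // scenarios 1 and 3
  } else {
    tableLocation.stripSuffix("/") + "/" + location   // scenarios 2 and 4
  }
}

val q1 = qualifyPartitionPath("file:/tablelocation", "/path/x")  // file:/path/x
val q2 = qualifyPartitionPath("file:/tablelocation", "x")        // file:/tablelocation/x
```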
[jira] [Assigned] (SPARK-19916) simplify bad file handling
[ https://issues.apache.org/jira/browse/SPARK-19916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19916:
------------------------------------

    Assignee: Apache Spark (was: Wenchen Fan)

> simplify bad file handling
> --------------------------
[jira] [Assigned] (SPARK-19916) simplify bad file handling
[ https://issues.apache.org/jira/browse/SPARK-19916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19916:
------------------------------------

    Assignee: Wenchen Fan (was: Apache Spark)

> simplify bad file handling
> --------------------------
[jira] [Commented] (SPARK-19916) simplify bad file handling
[ https://issues.apache.org/jira/browse/SPARK-19916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906048#comment-15906048 ]

Apache Spark commented on SPARK-19916:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17253

> simplify bad file handling
> --------------------------
[jira] [Created] (SPARK-19916) simplify bad file handling
Wenchen Fan created SPARK-19916:
--------------------------------

             Summary: simplify bad file handling
                 Key: SPARK-19916
                 URL: https://issues.apache.org/jira/browse/SPARK-19916
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Wenchen Fan
            Assignee: Wenchen Fan
[jira] [Comment Edited] (SPARK-6634) Allow replacing columns in Transformers
[ https://issues.apache.org/jira/browse/SPARK-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904445#comment-15904445 ]

Tree Field edited comment on SPARK-6634 at 3/11/17 2:59 AM:
------------------------------------------------------------

I want this feature too, because I often override UnaryTransformer myself to enable it. It seems it is only prevented in the transformSchema method. Now, unlike before v1.4, the DataFrame withColumn method used in UnaryTransformer allows replacing the input column. Are there any other reasons this is not allowed in a transformer, especially in UnaryTransformer?

> Allow replacing columns in Transformers
> ---------------------------------------
>
>                 Key: SPARK-6634
>                 URL: https://issues.apache.org/jira/browse/SPARK-6634
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>
> Currently, Transformers do not allow input and output columns to share the same name. (In fact, this is not allowed but also not even checked.)
> Short-term proposal: Disallow input and output columns with the same name, and add a check in transformSchema.
> Long-term proposal: Allow input and output columns with the same name, where the behavior is that output columns replace input columns with the same name.
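The short-term proposal (check the name collision in transformSchema) can be sketched with a toy schema model. The helper name checkTransformSchema and the Seq[String] schema are illustrative stand-ins, not the spark.ml API:

```scala
// Toy transformSchema: verify the input column exists and reject an
// output column whose name collides with an existing column.
def checkTransformSchema(schema: Seq[String],
                         inputCol: String,
                         outputCol: String): Seq[String] = {
  require(schema.contains(inputCol), s"Input column $inputCol does not exist")
  require(!schema.contains(outputCol),
    s"Output column $outputCol already exists")
  schema :+ outputCol
}

// Distinct output column: accepted, the schema grows.
val grown = checkTransformSchema(Seq("text", "label"), "text", "tokens")

// Output column equal to the input column: rejected under the short-term
// proposal (the long-term proposal would replace the column instead).
val clash =
  try { checkTransformSchema(Seq("text", "label"), "text", "text"); "no error" }
  catch { case e: IllegalArgumentException => e.getMessage }
```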
[jira] [Assigned] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product
[ https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-19915:
------------------------------------

    Assignee: Apache Spark

> Improve join reorder: simplify cost evaluation, postpone column pruning,
> exclude cartesian product
> ------------------------------------------------------------------------
>
>                 Key: SPARK-19915
>                 URL: https://issues.apache.org/jira/browse/SPARK-19915
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Zhenhua Wang
>            Assignee: Apache Spark
>
> 1. Usually cardinality is more important than size, so we can simplify cost evaluation by using only cardinality. Note that this also frees us from caring about column pruning during reordering; otherwise, projects would influence the output size of intermediate joins.
> 2. Doing column pruning during reordering is troublesome. Given the first change, we can do it right after reordering, and the logic for adding projects on intermediate joins can be removed. This makes the code simpler and more reliable.
> 3. Exclude cartesian products from the "memo". This significantly reduces the search space and memory overhead of the memo; otherwise every combination of items would exist in it. We can find the unjoinable items after reordering is finished and put them at the end.
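Points 1 and 3 can be sketched together with a toy reorder search: cost is evaluated from cardinality alone, and orders that would require a cartesian product are excluded from the search. Everything here (Rel, the min-based cardinality estimate, left-deep enumeration) is a made-up illustration, not Spark's actual CostBasedJoinReorder:

```scala
case class Rel(name: String, rows: Long)

// Crude intermediate-cardinality estimate for an equi-join: the smaller
// input's row count. A real optimizer would use column statistics.
def joinRows(a: Long, b: Long): Long = math.min(a, b)

// Enumerate left-deep join orders; cost = sum of intermediate
// cardinalities. Orders where some step has no join condition with what
// is already joined (a cartesian product) are skipped entirely.
def bestOrder(rels: Seq[Rel],
              joinable: Set[Set[String]]): (Seq[String], Long) =
  rels.permutations.flatMap { perm =>
    var joined = Set(perm.head.name)
    var rows = perm.head.rows
    var cost = 0L
    var ok = true
    for (r <- perm.tail) {
      if (!joined.exists(n => joinable(Set(n, r.name)))) ok = false
      rows = joinRows(rows, r.rows)
      cost += rows
      joined += r.name
    }
    if (ok) Some((perm.map(_.name), cost)) else None
  }.minBy(_._2)

// A joins B, and B joins C; A-C has no join condition, so orders that
// would join A with C directly never enter the search.
val (order, cost) = bestOrder(
  Seq(Rel("A", 1000000L), Rel("B", 100L), Rel("C", 10L)),
  Set(Set("A", "B"), Set("B", "C")))
```

Starting from the small B-C join is cheapest here, which is exactly the kind of choice a cardinality-only cost function is enough to make.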
[jira] [Assigned] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product
[ https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19915: Assignee: (was: Apache Spark) > Improve join reorder: simplify cost evaluation, postpone column pruning, > exclude cartesian product > -- > > Key: SPARK-19915 > URL: https://issues.apache.org/jira/browse/SPARK-19915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > > 1. Usually cardinality is more important than size, so we can simplify cost > evaluation by using only cardinality. Note that this also enables us to not > care about column pruning during reordering; otherwise, projects would > influence the output size of intermediate joins. > 2. Doing column pruning during reordering is troublesome. Given the first > change, we can do it right after reordering, and then the logic for adding > projects on intermediate joins can be removed. This makes the code simpler > and more reliable. > 3. Exclude cartesian products from the "memo". This significantly reduces the > search space and memory overhead of the memo. Otherwise every combination of > items will exist in the memo. We can find the unjoinable items after > reordering is finished and put them at the end.
[jira] [Updated] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product
[ https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-19915: - Description: 1. Usually cardinality is more important than size, so we can simplify cost evaluation by using only cardinality. Note that this also enables us to not care about column pruning during reordering; otherwise, projects would influence the output size of intermediate joins. 2. Doing column pruning during reordering is troublesome. Given the first change, we can do it right after reordering, and then the logic for adding projects on intermediate joins can be removed. This makes the code simpler and more reliable. 3. Exclude cartesian products from the "memo". This significantly reduces the search space and memory overhead of the memo. Otherwise every combination of items will exist in the memo. We can find the unjoinable items after reordering is finished and put them at the end. was: Doing column pruning during reordering is troublesome. We can do it right after reordering, and then the logic for adding projects on intermediate joins can be removed. This makes the code simpler and more reliable. Usually cardinality is more important than size, so we can simplify cost evaluation by using only cardinality. Note that this enables us to not care about column pruning during reordering (the first point). Otherwise, projects would influence the output size of intermediate joins. Exclude cartesian products from the "memo". This significantly reduces the search space and memory overhead of the memo. Otherwise every combination of items will exist in the memo. We can find the unjoinable items after reordering is finished and put them at the end. > Improve join reorder: simplify cost evaluation, postpone column pruning, > exclude cartesian product > -- > > Key: SPARK-19915 > URL: https://issues.apache.org/jira/browse/SPARK-19915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > > 1. Usually cardinality is more important than size, so we can simplify cost > evaluation by using only cardinality. Note that this also enables us to not > care about column pruning during reordering; otherwise, projects would > influence the output size of intermediate joins. > 2. Doing column pruning during reordering is troublesome. Given the first > change, we can do it right after reordering, and then the logic for adding > projects on intermediate joins can be removed. This makes the code simpler > and more reliable. > 3. Exclude cartesian products from the "memo". This significantly reduces the > search space and memory overhead of the memo. Otherwise every combination of > items will exist in the memo. We can find the unjoinable items after > reordering is finished and put them at the end.
[jira] [Commented] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product
[ https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906018#comment-15906018 ] Apache Spark commented on SPARK-19915: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/17240 > Improve join reorder: simplify cost evaluation, postpone column pruning, > exclude cartesian product > -- > > Key: SPARK-19915 > URL: https://issues.apache.org/jira/browse/SPARK-19915 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Zhenhua Wang > > 1. Usually cardinality is more important than size, so we can simplify cost > evaluation by using only cardinality. Note that this also enables us to not > care about column pruning during reordering; otherwise, projects would > influence the output size of intermediate joins. > 2. Doing column pruning during reordering is troublesome. Given the first > change, we can do it right after reordering, and then the logic for adding > projects on intermediate joins can be removed. This makes the code simpler > and more reliable. > 3. Exclude cartesian products from the "memo". This significantly reduces the > search space and memory overhead of the memo. Otherwise every combination of > items will exist in the memo. We can find the unjoinable items after > reordering is finished and put them at the end.
[jira] [Created] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product
Zhenhua Wang created SPARK-19915: Summary: Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product Key: SPARK-19915 URL: https://issues.apache.org/jira/browse/SPARK-19915 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.2.0 Reporter: Zhenhua Wang Doing column pruning during reordering is troublesome. We can do it right after reordering, and then the logic for adding projects on intermediate joins can be removed. This makes the code simpler and more reliable. Usually cardinality is more important than size, so we can simplify cost evaluation by using only cardinality. Note that this enables us to not care about column pruning during reordering (the first point). Otherwise, projects would influence the output size of intermediate joins. Exclude cartesian products from the "memo". This significantly reduces the search space and memory overhead of the memo. Otherwise every combination of items will exist in the memo. We can find the unjoinable items after reordering is finished and put them at the end.
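The three changes can be illustrated together with a toy reorderer that costs plans by cardinality alone and keeps cartesian products out of the search, appending unjoinable items at the end. This is a hedged Python sketch over made-up inputs (per-relation cardinality estimates and a set of join edges), not Spark's CostBasedJoinReorder:

```python
def reorder_joins(cards, edges):
    """Greedy join-order sketch using only cardinality as the cost.

    cards: dict relation -> estimated cardinality.
    edges: set of frozenset pairs that share a join predicate; pairs outside
    `edges` would be cartesian products, so they are never considered while
    searching. Relations with no join edge at all are appended at the end,
    mirroring point 3 of the description.
    """
    in_graph = {r for e in edges for r in e}
    unjoinable = sorted((r for r in cards if r not in in_graph), key=cards.get)
    remaining = {r: c for r, c in cards.items() if r in in_graph}
    if not remaining:
        return unjoinable
    current = min(remaining, key=remaining.get)  # start from the smallest relation
    order = [current]
    joined = {current}
    remaining.pop(current)
    while remaining:
        candidates = [r for r in remaining
                      if any(frozenset((r, j)) in edges for j in joined)]
        if not candidates:  # disconnected join components: take smallest first
            candidates = list(remaining)
        best = min(candidates, key=remaining.get)
        order.append(best)
        joined.add(best)
        remaining.pop(best)
    return order + unjoinable
```

A real reorderer does dynamic programming over a memo of plan fragments; the greedy loop here only illustrates the cardinality-only cost and the cartesian-product exclusion.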
[jira] [Commented] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905991#comment-15905991 ] LvDongrong commented on SPARK-19863: I saw your comment on that issue (SPARK-19185), and I agree with you. Our problem is different: our Kafka cluster cannot support so many connections, which are established by the cached consumers, because the number of our topics and partitions is large. So I think it is necessary not to use the cached consumer in some cases. > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming application > > > Key: SPARK-19863 > URL: https://issues.apache.org/jira/browse/SPARK-19863 > Project: Spark > Issue Type: Bug > Components: DStreams, Input/Output >Affects Versions: 2.1.0 >Reporter: LvDongrong > > Whether or not to use CachedKafkaConsumer needs to be configurable when you > use DirectKafkaInputDStream to connect to Kafka in a Spark Streaming > application. In Spark 2.x, the Kafka consumer was replaced by > CachedKafkaConsumer (cached consumers keep connections to the Kafka cluster > open), and there is no way to change this. In fact, KafkaRDD (used by > DirectKafkaInputDStream to connect to Kafka) provides the parameter > useConsumerCache to choose whether to use the CachedKafkaConsumer, but > DirectKafkaInputDStream sets the parameter to true.
[jira] [Created] (SPARK-19914) Spark Scala - Calling persist after reading a parquet file makes certain spark.sql queries return empty results
Yifeng Li created SPARK-19914: - Summary: Spark Scala - Calling persist after reading a parquet file makes certain spark.sql queries return empty results Key: SPARK-19914 URL: https://issues.apache.org/jira/browse/SPARK-19914 Project: Spark Issue Type: Bug Components: Input/Output, SQL Affects Versions: 2.1.0, 2.0.0 Reporter: Yifeng Li Hi There, It seems like calling .persist() after spark.read.parquet will make spark.sql statements return empty results if the query is written in a certain way. I have the following code here: val df = spark.read.parquet("C:\\...") df.createOrReplaceTempView("t1") spark.sql("select * from t1 a where a.id = '123456789'").show(10) Everything works fine here. Now, if I do: val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY) df.createOrReplaceTempView("t1") spark.sql("select * from t1 a where a.id = '123456789'").show(10) I will get empty results. selecting individual columns works with persist, e.g.: val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY) df.createOrReplaceTempView("t1") spark.sql("select a.id from t1 a where a.id = '123456789'").show(10) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19611: Fix Version/s: 2.1.1 > Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files > --- > > Key: SPARK-19611 > URL: https://issues.apache.org/jira/browse/SPARK-19611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde >Assignee: Adam Budde > Fix For: 2.1.1, 2.2.0 > > > This issue replaces > [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR > #16797|https://github.com/apache/spark/pull/16797] > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file statuses no longer need to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for saving a case-sensitive copy of the schema in the metastore table > properties, which HiveExternalCatalog will read in as the table's schema if > it is present. If it is not present, it will fall back to the > case-insensitive metastore schema. > Unfortunately, this silently breaks queries over tables where the underlying > data fields are case-sensitive but a case-sensitive schema wasn't written to > the table properties by Spark. This situation will occur for any Hive table > that wasn't created by Spark or that was created prior to Spark 2.1.0. 
If a > user attempts to run a query over such a table containing a case-sensitive > field name in the query projection or in the query filter, the query will > return 0 results in every case. > The change we are proposing is to bring back the schema inference that was > used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the > table properties. > - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive > schema can be read from the table properties. Attempt to save the inferred > schema in the table properties to avoid future inference. > - INFER_ONLY: Infer the schema if no case-sensitive schema can be read, but > don't attempt to save it. > - NEVER_INFER: Fall back to using the case-insensitive schema returned by the > Hive Metastore. Useful if the user knows that none of the underlying data is > case-sensitive. > See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] > for more context around this issue and the proposed solution.
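The three proposed modes describe a small decision procedure. Below is a hedged Python sketch of that logic only, with hypothetical names (the `caseSensitiveSchema` property key and the `infer`/`save` callbacks are invented for illustration) rather than the actual HiveExternalCatalog code:

```python
def resolve_schema(mode, table_props, metastore_schema, infer, save):
    """Pick a table schema following the proposed inference modes.

    table_props: dict of table properties; a case-sensitive schema may be
    stored under a key (hypothetically "caseSensitiveSchema" here).
    infer(): reads the data files and returns an inferred schema.
    save(schema): writes the inferred schema back to the table properties.
    """
    cs_schema = table_props.get("caseSensitiveSchema")
    if cs_schema is not None:
        return cs_schema             # a stored case-sensitive schema always wins
    if mode == "NEVER_INFER":
        return metastore_schema      # case-insensitive metastore fallback
    inferred = infer()               # INFER_ONLY and INFER_AND_SAVE both infer
    if mode == "INFER_AND_SAVE":
        try:
            save(inferred)           # best effort: avoid re-inference next time
        except OSError:
            pass
    return inferred
```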
[jira] [Resolved] (SPARK-19893) should not run DataFrame set oprations with map type
[ https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19893. - Resolution: Fixed Fix Version/s: 2.2.0 2.1.1 2.0.3 > should not run DataFrame set oprations with map type > > > Key: SPARK-19893 > URL: https://issues.apache.org/jira/browse/SPARK-19893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.3, 2.1.1, 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach
[ https://issues.apache.org/jira/browse/SPARK-19913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19913: Assignee: (was: Apache Spark) > Log warning rather than throw AnalysisException when output is partitioned > although format is memory, console or foreach > > > Key: SPARK-19913 > URL: https://issues.apache.org/jira/browse/SPARK-19913 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Priority: Minor > > When batches are executed with memory, console or foreach format, > `assertNotPartitioned` will check whether output is not partitioned and throw > AnalysisException in case it is. > But I wonder it's better to log warning rather than throw the exception > because partitioning does not affect output for those formats but also does > not bring any negative impacts. > Also, this assertion is not applied when the format is `console`. I think in > this case too, we should assert that . > By fixing them, we can easily switch the format to memory or console for > debug purposes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach
[ https://issues.apache.org/jira/browse/SPARK-19913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19913: Assignee: Apache Spark > Log warning rather than throw AnalysisException when output is partitioned > although format is memory, console or foreach > > > Key: SPARK-19913 > URL: https://issues.apache.org/jira/browse/SPARK-19913 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > When batches are executed with memory, console or foreach format, > `assertNotPartitioned` will check whether output is not partitioned and throw > AnalysisException in case it is. > But I wonder it's better to log warning rather than throw the exception > because partitioning does not affect output for those formats but also does > not bring any negative impacts. > Also, this assertion is not applied when the format is `console`. I think in > this case too, we should assert that . > By fixing them, we can easily switch the format to memory or console for > debug purposes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach
[ https://issues.apache.org/jira/browse/SPARK-19913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905904#comment-15905904 ] Apache Spark commented on SPARK-19913: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/17252 > Log warning rather than throw AnalysisException when output is partitioned > although format is memory, console or foreach > > > Key: SPARK-19913 > URL: https://issues.apache.org/jira/browse/SPARK-19913 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Priority: Minor > > When batches are executed with memory, console or foreach format, > `assertNotPartitioned` will check whether output is not partitioned and throw > AnalysisException in case it is. > But I wonder it's better to log warning rather than throw the exception > because partitioning does not affect output for those formats but also does > not bring any negative impacts. > Also, this assertion is not applied when the format is `console`. I think in > this case too, we should assert that . > By fixing them, we can easily switch the format to memory or console for > debug purposes. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach
Kousuke Saruta created SPARK-19913: -- Summary: Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach Key: SPARK-19913 URL: https://issues.apache.org/jira/browse/SPARK-19913 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Kousuke Saruta Priority: Minor When batches are executed with the memory, console or foreach format, `assertNotPartitioned` checks whether the output is partitioned and throws AnalysisException in case it is. But I wonder whether it's better to log a warning rather than throw the exception, because partitioning does not affect the output for those formats and also does not bring any negative impact. Also, this assertion is not applied when the format is `console`; I think we should apply the check in that case too. By fixing these, we can easily switch the format to memory or console for debugging purposes.
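The proposal is essentially: for the three debug-oriented sinks, downgrade the partitioning assertion to a warning. A toy Python sketch of that decision — the function and its return-a-message shape are invented for illustration, not the Structured Streaming code:

```python
DEBUG_SINKS = {"memory", "console", "foreach"}

def check_partitioning(sink_format, partition_cols):
    """Proposed behavior for `assertNotPartitioned`-style checks.

    Current behavior raises AnalysisException for partitioned output on the
    memory/foreach sinks (and skips console); the proposal is to warn
    uniformly for all three instead, since partitioning has no effect there.
    Returns the warning message, or None when no warning applies.
    """
    if sink_format in DEBUG_SINKS and partition_cols:
        return (f"partitioning by {partition_cols} has no effect "
                f"for the '{sink_format}' sink")
    return None
```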
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905886#comment-15905886 ] Shixiong Zhu commented on SPARK-18057: -- > Based on previous kafka client upgrades I wouldn't expect them to be binary > compatible, so it's likely to cause someone problems if they were also making > use of kafka client libraries in their spark job. Still may be the path of > least resistance. I can confirm that the APIs used by the Kafka sink are source compatible, since I didn't change any core source code (test code had to be changed because the server APIs changed). Since these APIs are Java APIs, I'm pretty sure they are binary compatible. So even if we upgrade the Kafka client version, users can still downgrade the Kafka client version if they want, and just repackage their code with that Kafka client. It's a bit annoying that "--packages" probably won't work, but it's acceptable. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html
[jira] [Updated] (SPARK-19912) String literals are not escaped while performing Hive metastore level partition pruning
[ https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19912: --- Summary: String literals are not escaped while performing Hive metastore level partition pruning (was: String literals are not escaped while performing partition pruning at Hive metastore level) > String literals are not escaped while performing Hive metastore level > partition pruning > --- > > Key: SPARK-19912 > URL: https://issues.apache.org/jira/browse/SPARK-19912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Cheng Lian > Labels: correctness > > {{Shim_v0_13.convertFilters()}} doesn't escape string literals while > generating Hive style partition predicates. > The following SQL-injection-like test case illustrates this issue: > {code} > test("SPARK-19912") { > withTable("spark_19912") { > Seq( > (1, "p1", "q1"), > (2, "p1\" and q=\"q1", "q2") > ).toDF("a", "p", "q").write.partitionBy("p", > "q").saveAsTable("spark_19912") > checkAnswer( > spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"), > Row(2) > ) > } > } > {code} > The above test case fails like this: > {noformat} > [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds) > [info] Results do not match for query: > [info] Timezone: > sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]] > [info] Timezone Env: > [info] > [info] == Parsed Logical Plan == > [info] 'Project [unresolvedalias('a, None)] > [info] +- Filter (p#27 = p1" and q = "q1) > [info] +- SubqueryAlias spark_19912 > [info] +- Relation[a#26,p#27,q#28] parquet > [info] > [info] == Analyzed 
Logical Plan == > [info] a: int > [info] Project [a#26] > [info] +- Filter (p#27 = p1" and q = "q1) > [info] +- SubqueryAlias spark_19912 > [info] +- Relation[a#26,p#27,q#28] parquet > [info] > [info] == Optimized Logical Plan == > [info] Project [a#26] > [info] +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1)) > [info] +- Relation[a#26,p#27,q#28] parquet > [info] > [info] == Physical Plan == > [info] *Project [a#26] > [info] +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: > true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: > 0, PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], > PushedFilters: [], ReadSchema: struct > [info] == Results == > [info] > [info] == Results == > [info] !== Correct Answer - 1 == == Spark Answer - 0 == > [info]struct<> struct<> > [info] ![2] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables
[ https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19905. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 17247 [https://github.com/apache/spark/pull/17247] > Dataset.inputFiles is broken for Hive SerDe tables > -- > > Key: SPARK-19905 > URL: https://issues.apache.org/jira/browse/SPARK-19905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.2.0 > > > The following snippet reproduces this issue: > {code} > spark.range(10).createOrReplaceTempView("t") > spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t") > spark.table("u").inputFiles.foreach(println) > {code} > In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like > {noformat} > file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u > {noformat} > on my laptop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level
[ https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19912: --- Description: {{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating Hive style partition predicates. The following SQL-injection-like test case illustrates this issue: {code} test("SPARK-19912") { withTable("spark_19912") { Seq( (1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2") ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("spark_19912") checkAnswer( spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"), Row(2) ) } } {code} The above test case fails like this: {noformat} [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds) [info] Results do not match for query: [info] Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]] [info] Timezone Env: [info] [info] == Parsed Logical Plan == [info] 'Project [unresolvedalias('a, None)] [info] +- Filter (p#27 = p1" and q = "q1) [info] +- SubqueryAlias spark_19912 [info] +- Relation[a#26,p#27,q#28] parquet [info] [info] == Analyzed Logical Plan == [info] a: int [info] Project [a#26] [info] +- Filter (p#27 = p1" and q = "q1) [info] +- SubqueryAlias spark_19912 [info] +- Relation[a#26,p#27,q#28] parquet [info] [info] == Optimized Logical Plan == [info] Project [a#26] [info] +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1)) [info] +- Relation[a#26,p#27,q#28] parquet [info] [info] == Physical Plan == [info] *Project [a#26] [info] +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, 
PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], PushedFilters: [], ReadSchema: struct [info] == Results == [info] [info] == Results == [info] !== Correct Answer - 1 == == Spark Answer - 0 == [info]struct<> struct<> [info] ![2] {noformat} was: {{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating Hive style partition predicates. The following SQL-injection-like test case illustrates this issue: {code} test("foo") { withTable("foo") { Seq( (1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2") ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo") checkAnswer( spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"), Row(2) ) } } {code} > String literals are not escaped while performing partition pruning at Hive > metastore level > -- > > Key: SPARK-19912 > URL: https://issues.apache.org/jira/browse/SPARK-19912 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 >Reporter: Cheng Lian > Labels: correctness > > {{Shim_v0_13.convertFilters()}} doesn't escape string literals while > generating Hive style partition predicates. 
> The following SQL-injection-like test case illustrates this issue: > {code} > test("SPARK-19912") { > withTable("spark_19912") { > Seq( > (1, "p1", "q1"), > (2, "p1\" and q=\"q1", "q2") > ).toDF("a", "p", "q").write.partitionBy("p", > "q").saveAsTable("spark_19912") > checkAnswer( > spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"), > Row(2) > ) > } > } > {code} > The above test case fails like this: > {noformat} > [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds) > [info] Results do not match for query: > [info] Timezone: > sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]] > [info] Timezone Env: > [info] > [info] == Parsed Logical Plan == > [info] 'Project [unresolvedalias('a, None)] > [info] +- Filter (p#27 = p1" and q = "q1) > [info] +- SubqueryAlias spark_19912 > [info] +- Relation[a#26,p#27,q#28] parquet > [info] > [info] == Analyzed Logical Plan == > [info] a: int >
[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables
[ https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19887: --- Labels: correctness (was: ) > __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in > partitioned persisted tables > - > > Key: SPARK-19887 > URL: https://issues.apache.org/jira/browse/SPARK-19887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Cheng Lian > Labels: correctness > > The following Spark shell snippet under Spark 2.1 reproduces this issue: > {code} > val data = Seq( > ("p1", 1, 1), > ("p2", 2, 2), > (null, 3, 3) > ) > // Correct case: Saving partitioned data to file system. > val path = "/tmp/partitioned" > data. > toDF("a", "b", "c"). > write. > mode("overwrite"). > partitionBy("a", "b"). > parquet(path) > spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false) > // +---+---+---+ > // |c |a |b | > // +---+---+---+ > // |2 |p2 |2 | > // |1 |p1 |1 | > // +---+---+---+ > // Incorrect case: Saving partitioned data as persisted table. > data. > toDF("a", "b", "c"). > write. > mode("overwrite"). > partitionBy("a", "b"). > saveAsTable("test_null") > spark.table("test_null").filter($"a".isNotNull).show(truncate = false) > // +---+--+---+ > // |c |a |b | > // +---+--+---+ > // |3 |__HIVE_DEFAULT_PARTITION__|3 | <-- This line should not be here > // |1 |p1|1 | > // |2 |p2|2 | > // +---+--+---+ > {code} > Hive-style partitioned tables use the magic string > {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in > partition directory names. However, in the case persisted partitioned table, > this magic string is not interpreted as {{NULL}} but a regular string. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
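The expected behavior can be shown standalone: when decoding a partition directory name, the magic string should map back to NULL rather than survive as a literal value. A Python sketch with a hypothetical parser, not Spark's actual partition-path decoding:

```python
HIVE_DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__"

def parse_partition_spec(path):
    """Decode an 'a=p1/b=1' style partition path into {column: value}.

    The Hive magic string is interpreted as a NULL partition value (None),
    which is the behavior the issue says persisted partitioned tables lack.
    """
    spec = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # skip non-partition path segments
        name, value = segment.split("=", 1)
        spec[name] = None if value == HIVE_DEFAULT_PARTITION else value
    return spec
```

With this decoding, a filter like `a IS NOT NULL` would correctly drop the row stored under the magic-string directory.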
[jira] [Created] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level
Cheng Lian created SPARK-19912: -- Summary: String literals are not escaped while performing partition pruning at Hive metastore level Key: SPARK-19912 URL: https://issues.apache.org/jira/browse/SPARK-19912 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.1, 2.2.0 Reporter: Cheng Lian {{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating Hive style partition predicates. The following SQL-injection-like test case illustrates this issue: {code} test("foo") { withTable("foo") { Seq( (1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2") ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo") checkAnswer( spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"), Row(2) ) } } {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
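The underlying problem is generic: when a user-supplied string is embedded into a generated filter expression, quote characters inside the value must be escaped, otherwise the value can terminate the literal early, exactly as in SQL injection. A hedged sketch of the escaping step in Python (`quote_string_literal` is an illustrative helper, not the actual {{Shim_v0_13.convertFilters()}} code):

```python
def quote_string_literal(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and quotes first."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def partition_predicate(column: str, value: str) -> str:
    """Build a Hive-style partition-pruning predicate for column = value."""
    return f"{column} = {quote_string_literal(value)}"

# Without the escaping above, the value below would close the string
# literal early and smuggle an extra `and q = "q1"` clause into the
# predicate pushed down to the metastore.
pred = partition_predicate("p", 'p1" and q = "q1')
```

With escaping in place, the adversarial partition value from the test case stays a single opaque literal instead of rewriting the filter.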
[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905840#comment-15905840 ] Maciej Szymkiewicz commented on SPARK-14503: I think we should keep only unique predictions in {{transform}}, otherwise it is possible to get results like this: {code} scala> val data = spark.read.text("data/mllib/sample_fpgrowth.txt").select(split($"value", "\\s+").alias("features")) data: org.apache.spark.sql.DataFrame = [features: array] scala> fpm.transform(Seq(Array("t", "s")).toDF("features")).show(1, false) ++-+ |features|prediction | ++-+ |[t, s] |[y, x, z, x, y, x, z]| ++-+ {code} > spark.ml Scala API for FPGrowth > --- > > Key: SPARK-14503 > URL: https://issues.apache.org/jira/browse/SPARK-14503 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Assignee: yuhao yang > Fix For: 2.2.0 > > > This task is the first port of spark.mllib.fpm functionality to spark.ml > (Scala). > This will require a brief design doc to confirm a reasonable DataFrame-based > API, with details for this class. The doc could also look ahead to the other > fpm classes, especially if their API decisions will affect FPGrowth. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
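The fix being discussed amounts to deduplicating the concatenated rule consequents while keeping their first-seen order. A minimal sketch of that deduplication in Python (in the actual Scala code this would correspond to applying `distinct` at the right point when aggregating consequents):

```python
def unique_predictions(consequents):
    """Flatten rule consequents, keeping each item once in first-seen order."""
    seen = set()
    result = []
    for items in consequents:
        for item in items:
            if item not in seen:
                seen.add(item)
                result.append(item)
    return result

# Several association rules matching the basket [t, s] may contribute
# overlapping consequents; the prediction should contain each item once,
# not [y, x, z, x, y, x, z] as in the output above.
prediction = unique_predictions([["y", "x", "z"], ["x", "y"], ["x", "z"]])
```

Keeping first-seen order (rather than sorting) preserves whatever ordering the rule traversal produces, which matters if rules are visited in confidence order.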
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905834#comment-15905834 ] Maciej Szymkiewicz commented on SPARK-19899: Thanks [~yuhaoyan]. > FPGrowth input column naming > > > Key: SPARK-19899 > URL: https://issues.apache.org/jira/browse/SPARK-19899 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > The current implementation extends {{HasFeaturesCol}}. Personally I find it > rather unfortunate. Up to this moment we have used consistent conventions - if we > mix in {{HasFeaturesCol}}, the {{featuresCol}} should be {{VectorUDT}}. > Using the same {{Param}} for an {{array}} (and possibly for > {{array}} once {{PrefixSpan}} is ported to {{ml}}) will be > confusing for the users. > I would like to suggest adding a new {{trait}} (let's say > {{HasTransactionsCol}}) to clearly indicate that the input type differs from > the other {{Estimators}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables
[ https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19887: --- Affects Version/s: 2.2.0 > __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in > partitioned persisted tables > - > > Key: SPARK-19887 > URL: https://issues.apache.org/jira/browse/SPARK-19887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0, 2.2.0 >Reporter: Cheng Lian > > The following Spark shell snippet under Spark 2.1 reproduces this issue: > {code} > val data = Seq( > ("p1", 1, 1), > ("p2", 2, 2), > (null, 3, 3) > ) > // Correct case: Saving partitioned data to file system. > val path = "/tmp/partitioned" > data. > toDF("a", "b", "c"). > write. > mode("overwrite"). > partitionBy("a", "b"). > parquet(path) > spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false) > // +---+---+---+ > // |c |a |b | > // +---+---+---+ > // |2 |p2 |2 | > // |1 |p1 |1 | > // +---+---+---+ > // Incorrect case: Saving partitioned data as persisted table. > data. > toDF("a", "b", "c"). > write. > mode("overwrite"). > partitionBy("a", "b"). > saveAsTable("test_null") > spark.table("test_null").filter($"a".isNotNull).show(truncate = false) > // +---+--+---+ > // |c |a |b | > // +---+--+---+ > // |3 |__HIVE_DEFAULT_PARTITION__|3 | <-- This line should not be here > // |1 |p1|1 | > // |2 |p2|2 | > // +---+--+---+ > {code} > Hive-style partitioned tables use the magic string > {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in > partition directory names. However, in the case of persisted partitioned tables, > this magic string is not interpreted as {{NULL}} but as a regular string. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905774#comment-15905774 ] Cody Koeninger commented on SPARK-18057: Based on previous Kafka client upgrades, I wouldn't expect them to be binary compatible, so it's likely to cause problems for anyone also making use of Kafka client libraries in their Spark job. It may still be the path of least resistance, though. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19910) `stack` should not reject NULL values due to type mismatch
[ https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19910: Assignee: (was: Apache Spark) > `stack` should not reject NULL values due to type mismatch > -- > > Key: SPARK-19910 > URL: https://issues.apache.org/jira/browse/SPARK-19910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Dongjoon Hyun > > Since the `stack` function generates a table with nullable columns, it should > allow mixed null values. > {code} > scala> sql("select stack(3, 1, 2, 3)").printSchema > root > |-- col0: integer (nullable = true) > scala> sql("select stack(3, 1, 2, null)").printSchema > org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' > due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); > line 1 pos 7; > 'Project [unresolvedalias(stack(3, 1, 2, null), None)] > +- OneRowRelation$ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19910) `stack` should not reject NULL values due to type mismatch
[ https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19910: Assignee: Apache Spark > `stack` should not reject NULL values due to type mismatch > -- > > Key: SPARK-19910 > URL: https://issues.apache.org/jira/browse/SPARK-19910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark > > Since the `stack` function generates a table with nullable columns, it should > allow mixed null values. > {code} > scala> sql("select stack(3, 1, 2, 3)").printSchema > root > |-- col0: integer (nullable = true) > scala> sql("select stack(3, 1, 2, null)").printSchema > org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' > due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); > line 1 pos 7; > 'Project [unresolvedalias(stack(3, 1, 2, null), None)] > +- OneRowRelation$ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19910) `stack` should not reject NULL values due to type mismatch
[ https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905766#comment-15905766 ] Apache Spark commented on SPARK-19910: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/17251 > `stack` should not reject NULL values due to type mismatch > -- > > Key: SPARK-19910 > URL: https://issues.apache.org/jira/browse/SPARK-19910 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Dongjoon Hyun > > Since the `stack` function generates a table with nullable columns, it should > allow mixed null values. > {code} > scala> sql("select stack(3, 1, 2, 3)").printSchema > root > |-- col0: integer (nullable = true) > scala> sql("select stack(3, 1, 2, null)").printSchema > org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' > due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); > line 1 pos 7; > 'Project [unresolvedalias(stack(3, 1, 2, null), None)] > +- OneRowRelation$ > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
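A common way to make a generator like `stack` accept mixed {{NULL}} arguments is to treat {{NullType}} as compatible with every other type when unifying the types of the values feeding one output column, instead of demanding exact equality. A hedged sketch of that resolution rule in Python (the type names are simplified stand-ins for Spark's Catalyst types, not the actual analyzer code):

```python
NULL_TYPE = "NullType"

def unify(t1: str, t2: str):
    """Return the common column type, letting NullType defer to any type."""
    if t1 == NULL_TYPE:
        return t2
    if t2 == NULL_TYPE or t1 == t2:
        return t1
    return None  # a genuine mismatch, e.g. IntegerType vs StringType

def stack_column_type(arg_types):
    """Unify the types of all arguments feeding one output column of stack."""
    result = NULL_TYPE
    for t in arg_types:
        result = unify(result, t)
        if result is None:
            raise TypeError("data type mismatch: " + ", ".join(arg_types))
    return result

# stack(3, 1, 2, null): the NULL argument unifies with IntegerType
# instead of failing analysis, because the column is nullable anyway.
col_type = stack_column_type(["IntegerType", "IntegerType", "NullType"])
```

Under this rule, `stack(3, 1, 2, null)` would resolve to a nullable integer column, while `stack(2, 1, "a")` would still be rejected as a real mismatch.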
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905755#comment-15905755 ] Michael Armbrust commented on SPARK-18057: -- It seems like we can upgrade the existing Kafka10 artifacts without causing any compatibility issues (since 0.10.2.0 is compatible with 0.10.0.0+), so I don't think there is any need to make new artifacts or do any refactoring. I think we can just upgrade? > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905747#comment-15905747 ] Cody Koeninger commented on SPARK-18057: I think the bigger question is once there's a kafka version you want to upgrade to, are you going to just forcibly upgrade, make another set of separate artifacts, or refactor common code so that it can use a different / provided kafka version. Ditto for the DStream, unless you're just abandoning it. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19911) Add builder interface for Kinesis DStreams
[ https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19911: Assignee: Apache Spark > Add builder interface for Kinesis DStreams > -- > > Key: SPARK-19911 > URL: https://issues.apache.org/jira/browse/SPARK-19911 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Adam Budde >Assignee: Apache Spark >Priority: Minor > > The ```KinesisUtils.createStream()``` interface for creating Kinesis-based > DStreams is quite brittle and requires adding a combinatorial number of > overrides whenever another optional configuration parameter is added. This > makes incorporating a lot of additional features supported by the Kinesis > Client Library such as per-service authorization unfeasible. This interface > should be replaced by a builder pattern class > (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater > extensibility. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19911) Add builder interface for Kinesis DStreams
[ https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19911: Assignee: (was: Apache Spark) > Add builder interface for Kinesis DStreams > -- > > Key: SPARK-19911 > URL: https://issues.apache.org/jira/browse/SPARK-19911 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Adam Budde >Priority: Minor > > The ```KinesisUtils.createStream()``` interface for creating Kinesis-based > DStreams is quite brittle and requires adding a combinatorial number of > overrides whenever another optional configuration parameter is added. This > makes incorporating a lot of additional features supported by the Kinesis > Client Library such as per-service authorization unfeasible. This interface > should be replaced by a builder pattern class > (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater > extensibility. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19911) Add builder interface for Kinesis DStreams
[ https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905741#comment-15905741 ] Apache Spark commented on SPARK-19911: -- User 'budde' has created a pull request for this issue: https://github.com/apache/spark/pull/17250 > Add builder interface for Kinesis DStreams > -- > > Key: SPARK-19911 > URL: https://issues.apache.org/jira/browse/SPARK-19911 > Project: Spark > Issue Type: New Feature > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Adam Budde >Priority: Minor > > The ```KinesisUtils.createStream()``` interface for creating Kinesis-based > DStreams is quite brittle and requires adding a combinatorial number of > overrides whenever another optional configuration parameter is added. This > makes incorporating a lot of additional features supported by the Kinesis > Client Library such as per-service authorization unfeasible. This interface > should be replaced by a builder pattern class > (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater > extensibility. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17979) Remove deprecated support for config SPARK_YARN_USER_ENV
[ https://issues.apache.org/jira/browse/SPARK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17979. Resolution: Fixed Assignee: Yong Tang Fix Version/s: 2.2.0 > Remove deprecated support for config SPARK_YARN_USER_ENV > - > > Key: SPARK-17979 > URL: https://issues.apache.org/jira/browse/SPARK-17979 > Project: Spark > Issue Type: Improvement >Reporter: Kishor Patil >Assignee: Yong Tang >Priority: Trivial > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-14453. Resolution: Fixed Assignee: Yong Tang Fix Version/s: 2.2.0 > Remove SPARK_JAVA_OPTS environment variable > --- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Task > Components: Spark Core, YARN >Reporter: Saisai Shao >Assignee: Yong Tang >Priority: Minor > Fix For: 2.2.0 > > > SPARK_JAVA_OPTS was deprecated since 1.0, with the release of major version > (2.0), I think it would be better to remove the support of this env variable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly
[ https://issues.apache.org/jira/browse/SPARK-19888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905730#comment-15905730 ] Cody Koeninger commented on SPARK-19888: That stacktrace also shows a concurrent modification exception, yes? See SPARK-19185 for that. See e.g. SPARK-19680 for background on why an offset-out-of-range error may occur on the executor when it doesn't on the driver. Although if you're using reset policy latest, this is kind of surprising unless you have really short retention. > Seeing offsets not resetting even when reset policy is configured explicitly > > > Key: SPARK-19888 > URL: https://issues.apache.org/jira/browse/SPARK-19888 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Justin Miller > > I was told to post this in a Spark ticket from KAFKA-4396: > I've been seeing a curious error with kafka 0.10 (spark 2.11), these may be > two separate errors, I'm not sure. What's puzzling is that I'm setting > auto.offset.reset to latest and it's still throwing an > OffsetOutOfRangeException, behavior that's contrary to the code. Please help! 
> :) > {code} > val kafkaParams = Map[String, Object]( > "group.id" -> consumerGroup, > "bootstrap.servers" -> bootstrapServers, > "key.deserializer" -> classOf[ByteArrayDeserializer], > "value.deserializer" -> classOf[MessageRowDeserializer], > "auto.offset.reset" -> "latest", > "enable.auto.commit" -> (false: java.lang.Boolean), > "max.poll.records" -> persisterConfig.maxPollRecords, > "request.timeout.ms" -> persisterConfig.requestTimeoutMs, > "session.timeout.ms" -> persisterConfig.sessionTimeoutMs, > "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs, > "connections.max.idle.ms"-> persisterConfig.connectionsMaxIdleMs > ) > {code} > {code} > 16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory > on xyz (size: 146.3 KB, free: 8.4 GB) > 16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID > 38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {topic=231884473} > at > org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588) > at > org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 in stage 154.0 (TID > 39388) in 12043 ms on xyz (1/16) > 16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID > 39375) in 13444 ms on xyz (2/16) > 16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID > 38843, xyz):
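The executor-side failure above happens when a requested fetch offset has fallen outside the broker's retained [earliest, latest] range (for example because retention expired between driver planning and executor fetch) and the consumer doing the fetch has no applicable reset policy. The resolution logic can be sketched as follows (Python; `resolve_offset` is a hypothetical illustration, not the Kafka client API):

```python
def resolve_offset(requested, earliest, latest, reset_policy):
    """Resolve a requested fetch offset against the broker's retained range."""
    if earliest <= requested <= latest:
        return requested  # still retained, fetch as planned
    if reset_policy == "earliest":
        return earliest   # jump back to the oldest retained offset
    if reset_policy == "latest":
        return latest     # skip ahead past the expired data
    # auto.offset.reset=none (or an unset policy) surfaces the error,
    # which is what the OffsetOutOfRangeException above corresponds to.
    raise RuntimeError("Offsets out of range with no configured reset policy")

# Retention expired: offset 100 was planned, but the broker now retains
# only [150, 500]. With reset policy "latest" the consumer skips ahead.
off = resolve_offset(100, 150, 500, "latest")
```

Note that skipping ahead silently drops the expired records, which is why Spark's Kafka integration deliberately does not apply the driver's reset policy on executors, where exact planned offsets are expected.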
[jira] [Created] (SPARK-19911) Add builder interface for Kinesis DStreams
Adam Budde created SPARK-19911: -- Summary: Add builder interface for Kinesis DStreams Key: SPARK-19911 URL: https://issues.apache.org/jira/browse/SPARK-19911 Project: Spark Issue Type: New Feature Components: DStreams Affects Versions: 2.1.0 Reporter: Adam Budde Priority: Minor The ```KinesisUtils.createStream()``` interface for creating Kinesis-based DStreams is quite brittle and requires adding a combinatorial number of overrides whenever another optional configuration parameter is added. This makes incorporating a lot of additional features supported by the Kinesis Client Library such as per-service authorization unfeasible. This interface should be replaced by a builder pattern class (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater extensibility. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
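The builder pattern proposed above avoids the combinatorial explosion of overloads by letting each optional parameter be set independently through a chainable setter, with defaults held in one place. A minimal sketch of what such an interface could look like (Python; names like `KinesisStreamBuilder` are illustrative, not the API eventually added to Spark):

```python
class KinesisStreamBuilder:
    """Chainable configuration object replacing N-argument overloads."""

    def __init__(self, stream_name: str):
        # Defaults live here, not in an ever-growing list of overloads.
        self._conf = {"stream_name": stream_name,
                      "region": "us-east-1",
                      "initial_position": "LATEST"}

    def region(self, region: str) -> "KinesisStreamBuilder":
        self._conf["region"] = region
        return self  # returning self is what makes the calls chainable

    def initial_position(self, pos: str) -> "KinesisStreamBuilder":
        self._conf["initial_position"] = pos
        return self

    def build(self) -> dict:
        # A real builder would construct the DStream here; this sketch
        # just returns the accumulated configuration.
        return dict(self._conf)

conf = (KinesisStreamBuilder("events")
        .region("us-west-2")
        .initial_position("TRIM_HORIZON")
        .build())
```

Adding a new optional parameter then means adding one setter, instead of doubling the number of `createStream()` overloads.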
[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0
[ https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905716#comment-15905716 ] Shixiong Zhu edited comment on SPARK-18057 at 3/10/17 9:21 PM: --- I did some investigation yesterday, and found one issue in 0.10.2.0: https://issues.apache.org/jira/browse/KAFKA-4879 : KafkaConsumer.position may hang forever when deleting a topic Our current tests will just hang forever due to KAFKA-4879. This prevents us from upgrading to 0.10.2.0. I also went through the Kafka tickets between 0.10.0.1 and 0.10.2.0. Let me try to summarize the current situation: The benefits of upgrading Kafka client to 0.10.2.0: - Forward compatibility - Reading topics from a timestamp - The following bug fixes: Issues that we already have workarounds: https://issues.apache.org/jira/browse/KAFKA-4375 : Kafka consumer may swallow some interrupts meant for the calling thread https://issues.apache.org/jira/browse/KAFKA-4387 : KafkaConsumer will enter an infinite loop if the polling thread is interrupted, and either commitSync or committed is called https://issues.apache.org/jira/browse/KAFKA-4536 : Kafka clients throw NullPointerException on poll when delete the relative topic Issues related to Kafka record compression https://issues.apache.org/jira/browse/KAFKA-3937 : Kafka Clients Leak Native Memory For Longer Than Needed With Compressed Messages https://issues.apache.org/jira/browse/KAFKA-4549 : KafkaLZ4OutputStream does not write EndMark if flush() is not called before close() Others: https://issues.apache.org/jira/browse/KAFKA-2948 : Kafka producer does not cope well with topic deletions For 0.10.1.x, KAFKA-4547 prevents us from upgrading to 0.10.1.x. At last, IMO, "Reading topics from a timestamp" is pretty useful and is the most important reason that we should upgrade Kafka. 
However, since the Spark 2.2 code freeze is coming, we won't get enough time to deliver this feature to the user, it's fine to just wait for them fixing KAFKA-4879 in the next Kafka release. I don't think the next Kafka release will be later than Spark 2.3. Then we should be able to upgrade Kafka before Spark 2.3. was (Author: zsxwing): I did some investigation yesterday, and found one issue in 0.10.2.0: https://issues.apache.org/jira/browse/KAFKA-4879 : KafkaConsumer.position may hang forever when deleting a topic Our current tests will just hang forever due to KAFKA-4879. This prevents us from upgrading 0.10.2.0. I also went through the Kafka tickets between 0.10.0.1 and 0.10.2.0. Let me try to summary the current situation: The benefits of upgrading Kafka client to 0.10.2.0: - Forward compatibility - Reading topics from a timestamp - The following bug fixes: Issues that we already have workarounds: https://issues.apache.org/jira/browse/KAFKA-4375 : Kafka consumer may swallow some interrupts meant for the calling thread https://issues.apache.org/jira/browse/KAFKA-4387 : KafkaConsumer will enter an infinite loop if the polling thread is interrupted, and either commitSync or committed is called https://issues.apache.org/jira/browse/KAFKA-4536 : Kafka clients throw NullPointerException on poll when delete the relative topic Issues related to Kafka record compression https://issues.apache.org/jira/browse/KAFKA-3937 : Kafka Clients Leak Native Memory For Longer Than Needed With Compressed Messages https://issues.apache.org/jira/browse/KAFKA-4549 : KafkaLZ4OutputStream does not write EndMark if flush() is not called before close() Others: https://issues.apache.org/jira/browse/KAFKA-2948 : Kafka producer does not cope well with topic deletions For 0.10.1.x, KAFKA-4547 prevents us from upgrading to 0.10.1.x. At last, IMO, "Reading topics from a timestamp" is pretty useful and is the most important reason that we should upgrade Kafka. 
However, since the Spark 2.2 code freeze is coming, we won't get enough time to deliver this feature to the user, it's fine to just wait for them fixing KAFKA-4879 in the next Kafka release. I don't think the next Kafka release will be later than Spark 2.3. > Update structured streaming kafka from 10.0.1 to 10.2.0 > --- > > Key: SPARK-18057 > URL: https://issues.apache.org/jira/browse/SPARK-18057 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Reporter: Cody Koeninger > > There are a couple of relevant KIPs here, > https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
[ https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905705#comment-15905705 ] Apache Spark commented on SPARK-19611: -- User 'budde' has created a pull request for this issue: https://github.com/apache/spark/pull/17249 > Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files > --- > > Key: SPARK-19611 > URL: https://issues.apache.org/jira/browse/SPARK-19611 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Adam Budde >Assignee: Adam Budde > Fix For: 2.2.0 > > > This issue replaces > [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR > #16797|https://github.com/apache/spark/pull/16797] > [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the > schema inference from the HiveMetastoreCatalog class when converting a > MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in > favor of simply using the schema returned by the metastore. This results in > an optimization, as the underlying file statuses no longer need to be resolved > until after the partition pruning step, reducing the number of files to be > touched significantly in some cases. The downside is that the data schema > used may no longer match the underlying file schema for case-sensitive > formats such as Parquet. > [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support > for saving a case-sensitive copy of the schema in the metastore table > properties, which HiveExternalCatalog will read in as the table's schema if > it is present. If it is not present, it will fall back to the > case-insensitive metastore schema. > Unfortunately, this silently breaks queries over tables where the underlying > data fields are case-sensitive but a case-sensitive schema wasn't written to > the table properties by Spark. 
This situation will occur for any Hive table > that wasn't created by Spark or that was created prior to Spark 2.1.0. If a > user attempts to run a query over such a table containing a case-sensitive > field name in the query projection or in the query filter, the query will > return 0 results in every case. > The change we are proposing is to bring back the schema inference that was > used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the > table properties. > - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive > schema can be read from the table properties. Attempt to save the inferred > schema in the table properties to avoid future inference. > - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but > don't attempt to save it. > - NEVER_INFER: Fall back to using the case-insensitive schema returned by the > Hive Metastore. Useful if the user knows that none of the underlying data is > case-sensitive. > See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] > for more detail on this issue and the proposed solution. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
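The fallback among the three proposed inference modes can be sketched as follows. This is an illustrative Python sketch, not Spark's actual implementation; the table-property key and the helper callables are hypothetical stand-ins.

```python
INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = "INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER"

def resolve_table_schema(table_props, metastore_schema, infer_from_files,
                         save_to_props, mode=INFER_AND_SAVE):
    """Pick the schema for a Hive table, preferring the case-sensitive
    schema Spark >= 2.1 stores in the table properties."""
    saved = table_props.get("spark.sql.schema")  # hypothetical property key
    if saved is not None:
        return saved                   # case-sensitive schema already stored
    if mode == NEVER_INFER:
        return metastore_schema        # lower-cased metastore schema
    inferred = infer_from_files()      # e.g. read Parquet footers
    if mode == INFER_AND_SAVE:
        save_to_props(inferred)        # cache it to skip future inference
    return inferred
```

The key property of INFER_AND_SAVE is that inference happens at most once per table; after the first query, the saved schema short-circuits the file scan.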
[jira] [Created] (SPARK-19910) `stack` should not reject NULL values due to type mismatch
Dongjoon Hyun created SPARK-19910: - Summary: `stack` should not reject NULL values due to type mismatch Key: SPARK-19910 URL: https://issues.apache.org/jira/browse/SPARK-19910 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.1.0, 2.0.1, 2.0.0 Reporter: Dongjoon Hyun Since `stack` function generates a table with nullable columns, it should allow mixed null values. {code} scala> sql("select stack(3, 1, 2, 3)").printSchema root |-- col0: integer (nullable = true) scala> sql("select stack(3, 1, 2, null)").printSchema org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); line 1 pos 7; 'Project [unresolvedalias(stack(3, 1, 2, null), None)] +- OneRowRelation$ {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
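The fix amounts to ignoring NullType when checking that the arguments feeding one `stack` output column share a type, since the generated columns are nullable anyway. A small Python sketch of that relaxed check (type names are illustrative, not Spark's internal ones):

```python
def stack_column_args_compatible(dtypes):
    """True if all non-null argument types for one stack() output column
    agree; "null" (NullType) arguments are always acceptable because the
    generated columns are nullable."""
    non_null = [t for t in dtypes if t != "null"]
    # all() over an empty list is True, so an all-NULL column is also fine
    return all(t == non_null[0] for t in non_null)
```

Under this check, `stack(3, 1, 2, null)` type-checks because the only constraint is among the non-null arguments, while `stack(3, 1, 'a', 2)` is still rejected.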
[jira] [Updated] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly
[ https://issues.apache.org/jira/browse/SPARK-19888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-19888: - Component/s: (was: Spark Core) DStreams > Seeing offsets not resetting even when reset policy is configured explicitly > > > Key: SPARK-19888 > URL: https://issues.apache.org/jira/browse/SPARK-19888 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 2.1.0 >Reporter: Justin Miller > > I was told to post this in a Spark ticket from KAFKA-4396: > I've been seeing a curious error with kafka 0.10 (spark 2.11), these may be > two separate errors, I'm not sure. What's puzzling is that I'm setting > auto.offset.reset to latest and it's still throwing an > OffsetOutOfRangeException, behavior that's contrary to the code. Please help! > :) > {code} > val kafkaParams = Map[String, Object]( > "group.id" -> consumerGroup, > "bootstrap.servers" -> bootstrapServers, > "key.deserializer" -> classOf[ByteArrayDeserializer], > "value.deserializer" -> classOf[MessageRowDeserializer], > "auto.offset.reset" -> "latest", > "enable.auto.commit" -> (false: java.lang.Boolean), > "max.poll.records" -> persisterConfig.maxPollRecords, > "request.timeout.ms" -> persisterConfig.requestTimeoutMs, > "session.timeout.ms" -> persisterConfig.sessionTimeoutMs, > "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs, > "connections.max.idle.ms"-> persisterConfig.connectionsMaxIdleMs > ) > {code} > {code} > 16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory > on xyz (size: 146.3 KB, free: 8.4 GB) > 16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID > 38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: > Offsets out of range with no configured reset policy for partitions: > {topic=231884473} > at > org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588) > at > 
org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354) > at > org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99) > at > org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227) > at > org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193) > at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > 16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 
in stage 154.0 (TID > 39388) in 12043 ms on xyz (1/16) > 16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID > 39375) in 13444 ms on xyz (2/16) > 16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID > 38843, xyz): java.util.ConcurrentModificationException: KafkaConsumer is not > safe for multi-threaded access > at > org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431) > at > org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929) >
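Part of the confusion here is that the reset policy in the driver-side params does not necessarily apply on the executors: the spark-streaming-kafka-0-10 integration rewrites the consumer config for executors so they read exactly the offset ranges the driver scheduled. A rough Python sketch of that rewriting, based on my reading of `KafkaUtils.fixKafkaParams`; treat the details as an approximation:

```python
def fix_kafka_params(params):
    """Executor-side copy of the Kafka params. Executors must fetch exactly
    the offsets the driver handed them, so the reset policy is pinned to
    "none" and auto-commit is disabled."""
    fixed = dict(params)
    fixed["auto.offset.reset"] = "none"   # driver decides offsets, not executors
    fixed["enable.auto.commit"] = False
    fixed["group.id"] = "spark-executor-" + params["group.id"]
    return fixed
```

So an executor can still surface OffsetOutOfRangeException when a scheduled offset has been deleted by retention, even though the user set auto.offset.reset=latest on the driver.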
[jira] [Commented] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS
[ https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905641#comment-15905641 ] Apache Spark commented on SPARK-19909: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/17248 > Batches will fail in case that temporary checkpoint dir is on local file > system while metadata dir is on HDFS > - > > Key: SPARK-19909 > URL: https://issues.apache.org/jira/browse/SPARK-19909 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Priority: Minor > > When we try to run Structured Streaming in local mode but use HDFS for the > storage, batches will fail because of an error like the following. > {code} > val handle = stream.writeStream.format("console").start() > 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata > StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to > /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=kou, access=WRITE, > inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x > {code} > This is because the temporary checkpoint directory is created on the local file > system, while the metadata path, which is derived from the checkpoint directory, > is resolved on HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS
[ https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19909: Assignee: (was: Apache Spark) > Batches will fail in case that temporary checkpoint dir is on local file > system while metadata dir is on HDFS > - > > Key: SPARK-19909 > URL: https://issues.apache.org/jira/browse/SPARK-19909 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Priority: Minor > > When we try to run Structured Streaming in local mode but use HDFS for the > storage, batches will fail because of an error like the following. > {code} > val handle = stream.writeStream.format("console").start() > 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata > StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to > /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=kou, access=WRITE, > inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x > {code} > This is because the temporary checkpoint directory is created on the local file > system, while the metadata path, which is derived from the checkpoint directory, > is resolved on HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS
[ https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19909: Assignee: Apache Spark > Batches will fail in case that temporary checkpoint dir is on local file > system while metadata dir is on HDFS > - > > Key: SPARK-19909 > URL: https://issues.apache.org/jira/browse/SPARK-19909 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > When we try to run Structured Streaming in local mode but use HDFS for the > storage, batches will fail because of an error like the following. > {code} > val handle = stream.writeStream.format("console").start() > 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata > StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to > /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata > org.apache.hadoop.security.AccessControlException: Permission denied: > user=kou, access=WRITE, > inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x > {code} > This is because the temporary checkpoint directory is created on the local file > system, while the metadata path, which is derived from the checkpoint directory, > is resolved on HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS
Kousuke Saruta created SPARK-19909: -- Summary: Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS Key: SPARK-19909 URL: https://issues.apache.org/jira/browse/SPARK-19909 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Kousuke Saruta Priority: Minor When we try to run Structured Streaming in local mode but use HDFS for the storage, batches will fail because of an error like the following. {code} val handle = stream.writeStream.format("console").start() 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata org.apache.hadoop.security.AccessControlException: Permission denied: user=kou, access=WRITE, inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x {code} This is because the temporary checkpoint directory is created on the local file system, while the metadata path, which is derived from the checkpoint directory, is resolved on HDFS. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
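The mechanics can be sketched like this: the scheme-less temporary checkpoint path gets qualified against the default filesystem (HDFS here), so the metadata path lands on HDFS even though the directory itself was created through local file APIs. A hedged Python sketch of the path resolution only; the default-FS URI and temp path are illustrative, and this is not Spark's code:

```python
from urllib.parse import urlparse

def qualify(path, default_fs="hdfs://namenode:8020"):
    """Prepend the default filesystem when the path has no scheme,
    mirroring how Hadoop qualifies unqualified paths."""
    if urlparse(path).scheme:
        return path
    return default_fs.rstrip("/") + path

checkpoint = "/var/folders/T/temporary-abc"    # created via local file APIs
metadata = qualify(checkpoint + "/metadata")   # resolved against HDFS instead
```

With HDFS as the default filesystem, `metadata` resolves to an HDFS path under a directory that only exists locally, which is why the write fails with a permission error.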
[jira] [Assigned] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables
[ https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19905: Assignee: Cheng Lian (was: Apache Spark) > Dataset.inputFiles is broken for Hive SerDe tables > -- > > Key: SPARK-19905 > URL: https://issues.apache.org/jira/browse/SPARK-19905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following snippet reproduces this issue: > {code} > spark.range(10).createOrReplaceTempView("t") > spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t") > spark.table("u").inputFiles.foreach(println) > {code} > In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like > {noformat} > file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u > {noformat} > on my laptop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables
[ https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19905: Assignee: Apache Spark (was: Cheng Lian) > Dataset.inputFiles is broken for Hive SerDe tables > -- > > Key: SPARK-19905 > URL: https://issues.apache.org/jira/browse/SPARK-19905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Apache Spark > > The following snippet reproduces this issue: > {code} > spark.range(10).createOrReplaceTempView("t") > spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t") > spark.table("u").inputFiles.foreach(println) > {code} > In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like > {noformat} > file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u > {noformat} > on my laptop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables
[ https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905608#comment-15905608 ] Apache Spark commented on SPARK-19905: -- User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/17247 > Dataset.inputFiles is broken for Hive SerDe tables > -- > > Key: SPARK-19905 > URL: https://issues.apache.org/jira/browse/SPARK-19905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > The following snippet reproduces this issue: > {code} > spark.range(10).createOrReplaceTempView("t") > spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t") > spark.table("u").inputFiles.foreach(println) > {code} > In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like > {noformat} > file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u > {noformat} > on my laptop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19863) Whether or not use CachedKafkaConsumer need to be configured, when you use DirectKafkaInputDStream to connect the kafka in a Spark Streaming application
[ https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905603#comment-15905603 ] Cody Koeninger commented on SPARK-19863: Isn't this basically a duplicate of SPARK-19185 with the same implications? > Whether or not use CachedKafkaConsumer need to be configured, when you use > DirectKafkaInputDStream to connect the kafka in a Spark Streaming application > > > Key: SPARK-19863 > URL: https://issues.apache.org/jira/browse/SPARK-19863 > Project: Spark > Issue Type: Bug > Components: DStreams, Input/Output >Affects Versions: 2.1.0 >Reporter: LvDongrong > > Whether or not to use CachedKafkaConsumer needs to be configurable when you use > DirectKafkaInputDStream to connect to Kafka in a Spark Streaming > application. In Spark 2.x, the Kafka consumer was replaced by > CachedKafkaConsumer (cached KafkaConsumers keep connections to the Kafka > cluster open), and there is no way to change this behavior. In fact, the KafkaRDD (used by > DirectKafkaInputDStream to connect to Kafka) provides the parameter > useConsumerCache to choose whether to use the CachedKafkaConsumer, but > DirectKafkaInputDStream sets the parameter to true. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.
Zhan Zhang created SPARK-19908: -- Summary: Direct buffer memory OOM should not cause stage retries. Key: SPARK-19908 URL: https://issues.apache.org/jira/browse/SPARK-19908 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 2.1.0 Reporter: Zhan Zhang Priority: Minor Currently if there is java.lang.OutOfMemoryError: Direct buffer memory, the exception will be changed to FetchFailedException, causing stage retries. org.apache.spark.shuffle.FetchFailedException: Direct buffer memory at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434) at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) at org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731) at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692) at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854) at org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887) at org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.lang.OutOfMemoryError: Direct buffer memory at java.nio.Bits.reserveMemory(Bits.java:658) at java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645) at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228) at io.netty.buffer.PoolArena.allocate(PoolArena.java:212) at io.netty.buffer.PoolArena.allocate(PoolArena.java:132) at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271) at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155) at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146) at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107) at io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at
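The requested behavior is roughly: when wrapping fetch errors, let fatal JVM errors propagate instead of converting them into a FetchFailedException, which the scheduler treats as a retryable shuffle failure. A Python sketch of that classification, using MemoryError as a stand-in for java.lang.OutOfMemoryError; the class and function names are illustrative:

```python
class FetchFailedException(Exception):
    """Stand-in for org.apache.spark.shuffle.FetchFailedException."""

def wrap_fetch_error(exc):
    """Re-raise fatal errors as-is; wrap everything else so the scheduler
    retries the stage only for genuine fetch failures."""
    if isinstance(exc, MemoryError):   # fatal: don't mask as a fetch failure
        raise exc
    raise FetchFailedException(str(exc)) from exc
```

With this split, a direct-buffer OOM would fail the task fatally rather than triggering repeated stage retries that are likely to hit the same memory limit.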
[jira] [Updated] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website
[ https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cody Koeninger updated SPARK-19904: --- Description: see http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html > SPIP Add Spark Project Improvement Proposal doc to website > -- > > Key: SPARK-19904 > URL: https://issues.apache.org/jira/browse/SPARK-19904 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.1.0 >Reporter: Cody Koeninger > Labels: SPIP > > see > http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-19907) Spark Submit Does not pick up the HBase Jars
[ https://issues.apache.org/jira/browse/SPARK-19907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-19907. - > Spark Submit Does not pick up the HBase Jars > > > Key: SPARK-19907 > URL: https://issues.apache.org/jira/browse/SPARK-19907 > Project: Spark > Issue Type: Task > Components: DStreams, Spark Submit, YARN >Affects Versions: 2.0.0 > Environment: Linux, cloudera-jdk-1.7 >Reporter: Ramchandhar Rapolu >Priority: Blocker > > Using properties file: > /opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf > Adding default property: > spark.serializer=org.apache.spark.serializer.KryoSerializer > Adding default property: > spark.yarn.jars=local:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/jars/*:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/hbase/jars/* > Adding default property: spark.eventLog.enabled=true > Adding default property: spark.hadoop.mapreduce.application.classpath= > Adding default property: spark.shuffle.service.enabled=true > Adding default property: > spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native > Adding default property: > spark.yarn.historyServer.address=http://instance-26765.bigstep.io:18089 > Adding default property: spark.ui.killEnabled=true > Adding default property: > spark.sql.hive.metastore.jars=${env:HADOOP_COMMON_HOME}/../hive/lib/*:${env:HADOOP_COMMON_HOME}/client/* > Adding default property: spark.dynamicAllocation.schedulerBacklogTimeout=1 > Adding default property: > spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native > Adding default property: spark.yarn.config.gatewayPath=/opt/cloudera/parcels > Adding default property: > spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../.. 
> Adding default property: spark.submit.deployMode=client > Adding default property: spark.shuffle.service.port=7337 > Adding default property: spark.master=yarn > Adding default property: spark.authenticate=false > Adding default property: > spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native > Adding default property: > spark.eventLog.dir=hdfs://instance-26765.bigstep.io:8020/user/spark/spark2ApplicationHistory > Adding default property: spark.dynamicAllocation.enabled=true > Adding default property: spark.sql.catalogImplementation=hive > Adding default property: spark.hadoop.yarn.application.classpath= > Adding default property: spark.dynamicAllocation.minExecutors=0 > Adding default property: spark.dynamicAllocation.executorIdleTimeout=60 > Adding default property: spark.sql.hive.metastore.version=1.1.0 > Parsed arguments: > master yarn > deployMode client > executorMemory 2g > executorCores 1 > totalExecutorCores null > propertiesFile > /opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf > driverMemory4g > driverCores null > driverExtraClassPathnull > driverExtraLibraryPath > /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native > driverExtraJavaOptions null > supervise false > queue null > numExecutorsnull > files null > pyFiles null > archivesnull > mainClass com.golfbreaks.spark.streaming.Test > primaryResource > file:/opt/golfbreaks/spark-jars/streaming-1.0-jar-with-dependencies.jar > namecom.golfbreaks.spark.streaming.Test > childArgs [] > jars >
[jira] [Resolved] (SPARK-19907) Spark Submit Does not pick up the HBase Jars
[ https://issues.apache.org/jira/browse/SPARK-19907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19907. --- Resolution: Invalid Target Version/s: (was: 2.0.0) A huge dump of your config and logs certainly isn't a suitable JIRA. Spark doesn't include HBase jars, no. This is invalid for several reasons. Please read http://spark.apache.org/contributing.html > Spark Submit Does not pick up the HBase Jars > > > Key: SPARK-19907 > URL: https://issues.apache.org/jira/browse/SPARK-19907 > Project: Spark > Issue Type: Task > Components: DStreams, Spark Submit, YARN >Affects Versions: 2.0.0 > Environment: Linux, cloudera-jdk-1.7 >Reporter: Ramchandhar Rapolu >Priority: Blocker
[jira] [Assigned] (SPARK-19906) Add Documentation for Kafka Write paths
[ https://issues.apache.org/jira/browse/SPARK-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19906: Assignee: Apache Spark > Add Documentation for Kafka Write paths > --- > > Key: SPARK-19906 > URL: https://issues.apache.org/jira/browse/SPARK-19906 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie >Assignee: Apache Spark > > We need documentation that describes how to write streaming and batch queries > to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19907) Spark Submit Does not pick up the HBase Jars
Ramchandhar Rapolu created SPARK-19907: -- Summary: Spark Submit Does not pick up the HBase Jars Key: SPARK-19907 URL: https://issues.apache.org/jira/browse/SPARK-19907 Project: Spark Issue Type: Task Components: DStreams, Spark Submit, YARN Affects Versions: 2.0.0 Environment: Linux, cloudera-jdk-1.7 Reporter: Ramchandhar Rapolu Priority: Blocker
[jira] [Assigned] (SPARK-19906) Add Documentation for Kafka Write paths
[ https://issues.apache.org/jira/browse/SPARK-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19906: Assignee: (was: Apache Spark) > Add Documentation for Kafka Write paths > --- > > Key: SPARK-19906 > URL: https://issues.apache.org/jira/browse/SPARK-19906 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie > > We need documentation that describes how to write streaming and batch queries > to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19906) Add Documentation for Kafka Write paths
[ https://issues.apache.org/jira/browse/SPARK-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905580#comment-15905580 ] Apache Spark commented on SPARK-19906: -- User 'tcondie' has created a pull request for this issue: https://github.com/apache/spark/pull/17246 > Add Documentation for Kafka Write paths > --- > > Key: SPARK-19906 > URL: https://issues.apache.org/jira/browse/SPARK-19906 > Project: Spark > Issue Type: Documentation > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Tyson Condie > > We need documentation that describes how to write streaming and batch queries > to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19906) Add Documentation for Kafka Write paths
Tyson Condie created SPARK-19906: Summary: Add Documentation for Kafka Write paths Key: SPARK-19906 URL: https://issues.apache.org/jira/browse/SPARK-19906 Project: Spark Issue Type: Documentation Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Tyson Condie We need documentation that describes how to write streaming and batch queries to Kafka. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
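For reference, the write paths this documentation task covers can be sketched roughly as below. This is a hedged illustration only: the broker address {{host1:9092}}, the topic names, and the checkpoint path are placeholder assumptions, not values from this ticket.
{code}
// Rough sketch of the batch and streaming Kafka write paths the new docs cover.
// "host1:9092", "topic1", "topic2" and the checkpoint path are placeholders.
import spark.implicits._

// Batch write: the DataFrame needs a "value" column (and optionally "key").
Seq(("k1", "v1"), ("k2", "v2")).toDF("key", "value")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "topic1")
  .save()

// Streaming write: same sink options, plus a mandatory checkpoint location.
val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("topic", "topic2")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()
{code}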
[jira] [Assigned] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
[ https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai reassigned SPARK-19620: Assignee: Carson Wang > Incorrect exchange coordinator Id in physical plan > -- > > Key: SPARK-19620 > URL: https://issues.apache.org/jira/browse/SPARK-19620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Minor > Fix For: 2.2.0 > > > When adaptive execution is enabled, an exchange coordinator is used in the > Exchange operators. For Join, the same exchange coordinator is used for its > two Exchanges. But the physical plan shows two different coordinator Ids, > which is confusing. > Here is an example: > {code} > == Physical Plan == > *Project [key1#3L, value2#12L] > +- *SortMergeJoin [key1#3L], [key2#11L], Inner >:- *Sort [key1#3L ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), > coordinator[target post-shuffle partition size: 67108864] >: +- *Project [(id#0L % 500) AS key1#3L] >:+- *Filter isnotnull((id#0L % 500)) >: +- *Range (0, 1000, step=1, splits=Some(10)) >+- *Sort [key2#11L ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), > coordinator[target post-shuffle partition size: 67108864] > +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L] > +- *Filter isnotnull((id#8L % 500)) >+- *Range (0, 1000, step=1, splits=Some(10)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19620) Incorrect exchange coordinator Id in physical plan
[ https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-19620. -- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16952 [https://github.com/apache/spark/pull/16952] > Incorrect exchange coordinator Id in physical plan > -- > > Key: SPARK-19620 > URL: https://issues.apache.org/jira/browse/SPARK-19620 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Carson Wang >Priority: Minor > Fix For: 2.2.0 > > > When adaptive execution is enabled, an exchange coordinator is used in the > Exchange operators. For Join, the same exchange coordinator is used for its > two Exchanges. But the physical plan shows two different coordinator Ids, > which is confusing. > Here is an example: > {code} > == Physical Plan == > *Project [key1#3L, value2#12L] > +- *SortMergeJoin [key1#3L], [key2#11L], Inner >:- *Sort [key1#3L ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), > coordinator[target post-shuffle partition size: 67108864] >: +- *Project [(id#0L % 500) AS key1#3L] >:+- *Filter isnotnull((id#0L % 500)) >: +- *Range (0, 1000, step=1, splits=Some(10)) >+- *Sort [key2#11L ASC NULLS FIRST], false, 0 > +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), > coordinator[target post-shuffle partition size: 67108864] > +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L] > +- *Filter isnotnull((id#8L % 500)) >+- *Range (0, 1000, step=1, splits=Some(10)) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
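A rough spark-shell sketch of how such a plan can be produced, for anyone reproducing this. The configuration keys below are the Spark 2.x adaptive-execution settings; the exact plan output and the coordinator ids will vary from run to run.
{code}
// Rough sketch reproducing a plan like the one in this ticket; requires
// adaptive execution, which is off by default in Spark 2.x.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "10")
import spark.implicits._

val left  = spark.range(0, 1000, 1, 10).selectExpr("id % 500 AS key1")
val right = spark.range(0, 1000, 1, 10).selectExpr("id % 500 AS key2", "id AS value2")

// Both join sides shuffle through an Exchange; with adaptive execution the two
// Exchanges share one coordinator, which explain() should report consistently.
left.join(right, $"key1" === $"key2").select($"key1", $"value2").explain()
{code}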
[jira] [Resolved] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV
[ https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19885. - Resolution: Fixed Fix Version/s: 2.2.0 > The config ignoreCorruptFiles doesn't work for CSV > -- > > Key: SPARK-19885 > URL: https://issues.apache.org/jira/browse/SPARK-19885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu > Fix For: 2.2.0 > > > CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL > "ignoreCorruptFiles" doesn't work. > {code} > java.io.EOFException: Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85) > at java.io.InputStream.read(InputStream.java:101) > at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180) > at > org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216) > at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174) > at > org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248) > at > org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48) > at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266) > at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157) > at 
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336) > at > scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214) > at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112) > at > org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112) > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980) > at > org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802) > at scala.Option.foreach(Option.scala:257) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981) > at
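A rough spark-shell sketch of the setting involved; {{/tmp/csvdata}} is a hypothetical directory assumed to contain one truncated {{.gz}} file alongside valid CSV files, not a path from this ticket.
{code}
// Rough sketch of the reported gap; /tmp/csvdata is a hypothetical directory
// holding one truncated .gz file among valid CSV files.
import org.apache.spark.sql.types._

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// With an explicit schema the read goes through FileScanRDD, so the corrupt
// file is skipped as the config promises.
val schema = StructType(Seq(StructField("a", StringType), StructField("b", IntegerType)))
spark.read.schema(schema).csv("/tmp/csvdata").show()

// Schema inference is the failing path: CSVFileFormat.inferSchema reads the
// files outside FileScanRDD, so the EOFException above still surfaces.
spark.read.option("inferSchema", "true").csv("/tmp/csvdata").show()
{code}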
[jira] [Created] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables
Cheng Lian created SPARK-19905: -- Summary: Dataset.inputFiles is broken for Hive SerDe tables Key: SPARK-19905 URL: https://issues.apache.org/jira/browse/SPARK-19905 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Cheng Lian Assignee: Cheng Lian The following snippet reproduces this issue: {code} spark.range(10).createOrReplaceTempView("t") spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t") spark.table("u").inputFiles.foreach(println) {code} In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like {noformat} file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u {noformat} on my laptop. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV
[ https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905564#comment-15905564 ] Wenchen Fan commented on SPARK-19885: - Oh, so this issue is already fixed by SPARK-18362 in Spark 2.2. Since it's not a critical issue and SPARK-18362 is an optimization, we should not backport it. Let's just resolve this ticket. > The config ignoreCorruptFiles doesn't work for CSV > -- > > Key: SPARK-19885 > URL: https://issues.apache.org/jira/browse/SPARK-19885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu > Fix For: 2.2.0 > > > CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL > "ignoreCorruptFiles" doesn't work.
[jira] [Updated] (SPARK-19893) should not run DataFrame set operations with map type
[ https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-19893: Summary: should not run DataFrame set operations with map type (was: Cannot run intersect/except/distinct with map type) > should not run DataFrame set operations with map type > > > Key: SPARK-19893 > URL: https://issues.apache.org/jira/browse/SPARK-19893 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.0, 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
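A rough spark-shell sketch of the operations in question; this is an illustration of the problem class, not code from the ticket. Spark does not define equality for the map type, so set operations over a map-typed column should be rejected by the analyzer rather than silently producing results.
{code}
// Rough sketch: a DataFrame with a single map-typed column.
import spark.implicits._

val df = Seq(Map(1 -> "a"), Map(2 -> "b")).toDF("m")

// Each of these depends on comparing map values and is what the fix disallows:
df.distinct()
df.intersect(df)
df.except(df)
{code}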
[jira] [Commented] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable
[ https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905518#comment-15905518 ] Apache Spark commented on SPARK-14453: -- User 'yongtang' has created a pull request for this issue: https://github.com/apache/spark/pull/17212 > Remove SPARK_JAVA_OPTS environment variable > --- > > Key: SPARK-14453 > URL: https://issues.apache.org/jira/browse/SPARK-14453 > Project: Spark > Issue Type: Task > Components: Spark Core, YARN >Reporter: Saisai Shao >Priority: Minor > > SPARK_JAVA_OPTS was deprecated since 1.0, with the release of major version > (2.0), I think it would be better to remove the support of this env variable. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables
[ https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19887: --- Description: The following Spark shell snippet under Spark 2.1 reproduces this issue: {code} val data = Seq( ("p1", 1, 1), ("p2", 2, 2), (null, 3, 3) ) // Correct case: Saving partitioned data to file system. val path = "/tmp/partitioned" data. toDF("a", "b", "c"). write. mode("overwrite"). partitionBy("a", "b"). parquet(path) spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false) // +---+---+---+ // |c |a |b | // +---+---+---+ // |2 |p2 |2 | // |1 |p1 |1 | // +---+---+---+ // Incorrect case: Saving partitioned data as persisted table. data. toDF("a", "b", "c"). write. mode("overwrite"). partitionBy("a", "b"). saveAsTable("test_null") spark.table("test_null").filter($"a".isNotNull).show(truncate = false) // +---+--+---+ // |c |a |b | // +---+--+---+ // |3 |__HIVE_DEFAULT_PARTITION__|3 | <-- This line should not be here // |1 |p1|1 | // |2 |p2|2 | // +---+--+---+ {code} Hive-style partitioned tables use the magic string {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in partition directory names. However, in the case persisted partitioned table, this magic string is not interpreted as {{NULL}} but a regular string. was: The following Spark shell snippet under Spark 2.1 reproduces this issue: {code} val data = Seq( ("p1", 1, 1), ("p2", 2, 2), (null, 3, 3) ) // Correct case: Saving partitioned data to file system. val path = "/tmp/partitioned" data. toDF("a", "b", "c"). write. mode("overwrite"). partitionBy("a", "b"). parquet(path) spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false) // +---+---+---+ // |c |a |b | // +---+---+---+ // |2 |p2 |2 | // |1 |p1 |1 | // +---+---+---+ // Incorrect case: Saving partitioned data as persisted table. data. toDF("a", "b", "c"). write. mode("overwrite"). partitionBy("a", "b"). 
saveAsTable("test_null") spark.table("test_null").filter($"a".isNotNull).show(truncate = false) // +---+--+---+ // |c |a |b | // +---+--+---+ // |3 |__HIVE_DEFAULT_PARTITION__|3 | <-- This line should not be here // |1 |p1|1 | // |2 |p2|2 | // +---+--+---+ {code} Hive-style partitioned tables use the magic string {{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in partition directory names. However, in the case persisted partitioned table, this magic string is not interpreted as {{NULL}} but a regular string. > __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in > partitioned persisted tables > - > > Key: SPARK-19887 > URL: https://issues.apache.org/jira/browse/SPARK-19887 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Cheng Lian > > The following Spark shell snippet under Spark 2.1 reproduces this issue: > {code} > val data = Seq( > ("p1", 1, 1), > ("p2", 2, 2), > (null, 3, 3) > ) > // Correct case: Saving partitioned data to file system. > val path = "/tmp/partitioned" > data. > toDF("a", "b", "c"). > write. > mode("overwrite"). > partitionBy("a", "b"). > parquet(path) > spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false) > // +---+---+---+ > // |c |a |b | > // +---+---+---+ > // |2 |p2 |2 | > // |1 |p1 |1 | > // +---+---+---+ > // Incorrect case: Saving partitioned data as persisted table. > data. > toDF("a", "b", "c"). > write. > mode("overwrite"). > partitionBy("a", "b"). > saveAsTable("test_null") > spark.table("test_null").filter($"a".isNotNull).show(truncate = false) > // +---+--+---+ > // |c |a |b | > // +---+--+---+ > // |3 |__HIVE_DEFAULT_PARTITION__|3 | <-- This line should not be here > // |1 |p1|1 | > // |2 |p2|2 | > // +---+--+---+ > {code} > Hive-style partitioned tables use the magic string > {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in > partition directory names. 
However, in the case of a persisted partitioned table, > this magic string is not interpreted as {{NULL}} but as a regular string. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe,
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905492#comment-15905492 ] yuhao yang commented on SPARK-19899: also cc [~podongfeng], since I recall he mentioned using SetInputCol in the original PR. > FPGrowth input column naming > > > Key: SPARK-19899 > URL: https://issues.apache.org/jira/browse/SPARK-19899 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > Current implementation extends {{HasFeaturesCol}}. Personally I find it > rather unfortunate. Up to this moment we used consistent conventions - if we > mix in {{HasFeaturesCol}}, the {{featuresCol}} should be {{VectorUDT}}. > Using the same {{Param}} for an {{array<string>}} (and possibly for > {{array<array<string>>}} once {{PrefixSpan}} is ported to {{ml}}) will be > confusing for the users. > I would like to suggest adding a new {{trait}} (let's say > {{HasTransactionsCol}}) to clearly indicate that the input type differs for > the other {{Estimators}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905484#comment-15905484 ] yuhao yang commented on SPARK-19899: Thanks for the reply. We can wait for some time to see if people like the idea. > FPGrowth input column naming > > > Key: SPARK-19899 > URL: https://issues.apache.org/jira/browse/SPARK-19899 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: Maciej Szymkiewicz > > Current implementation extends {{HasFeaturesCol}}. Personally I find it > rather unfortunate. Up to this moment we used consistent conventions - if we > mix in {{HasFeaturesCol}}, the {{featuresCol}} should be {{VectorUDT}}. > Using the same {{Param}} for an {{array<string>}} (and possibly for > {{array<array<string>>}} once {{PrefixSpan}} is ported to {{ml}}) will be > confusing for the users. > I would like to suggest adding a new {{trait}} (let's say > {{HasTransactionsCol}}) to clearly indicate that the input type differs for > the other {{Estimators}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
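A rough sketch of what the suggested trait could look like; the name {{HasTransactionsCol}} comes from the proposal above, and everything else here is illustrative only, not code from Spark.
{code}
// Rough sketch of the proposed HasTransactionsCol trait; illustrative only.
import org.apache.spark.ml.param.{Param, Params}

trait HasTransactionsCol extends Params {

  // Input column expected to contain array<string> transactions,
  // as opposed to the VectorUDT convention attached to featuresCol.
  final val transactionsCol: Param[String] =
    new Param[String](this, "transactionsCol", "transactions column name")

  final def getTransactionsCol: String = $(transactionsCol)
}
{code}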
[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables
[ https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-19887:
-------------------------------
    Description:

The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to the file system.
val path = "/tmp/partitioned"
data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as a persisted table.
data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--------------------------+---+
// |c  |a                         |b  |
// +---+--------------------------+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1                        |1  |
// |2  |p2                        |2  |
// +---+--------------------------+---+
{code}

Hive-style partitioned tables use the magic string {{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in partition directory names. However, in the case of persisted partitioned tables, this magic string is not interpreted as {{NULL}} but as a regular string.

  was: The same snippet as above, ending with: "Hive-style partitioned table uses magic string {{"__HIVE_DEFAULT_PARTITION__"}} to indicate {{NULL}} partition values in partition directory names. However, in the case persisted partitioned table, this magic string is not interpreted as {{NULL}} but a regular string."

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables
> -----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-19887
>                 URL: https://issues.apache.org/jira/browse/SPARK-19887
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Cheng Lian
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905480#comment-15905480 ] Maciej Szymkiewicz commented on SPARK-19899:

This is just an idea, but I would start with:
- {{featuresCol}} - {{VectorUDT}}
- {{transactionsCol}} - {{array<_>}} - for frequent (unordered) pattern mining.
- {{sequencesCol}} - {{array<array<_>>}} - for sequential pattern mining.
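For concreteness, one of the proposed traits might be sketched as follows. This is an illustration based only on the naming suggested in this thread, not code from the Spark repository; the doc string and default column name are assumptions:

```scala
import org.apache.spark.ml.param.{Param, Params}

// Hypothetical trait mirroring the proposed HasTransactionsCol; the
// name and default value come from this discussion, not from Spark.
private[ml] trait HasTransactionsCol extends Params {

  final val transactionsCol: Param[String] = new Param[String](
    this, "transactionsCol",
    "input column name, expected to hold transactions as an array of items (array<_>)")

  setDefault(transactionsCol -> "transactions")

  /** @group getParam */
  final def getTransactionsCol: String = $(transactionsCol)
}
```

An analogous {{HasSequencesCol}} trait could then cover the {{array<array<_>>}} input once {{PrefixSpan}} is ported.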
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905451#comment-15905451 ] yuhao yang commented on SPARK-19899:

{quote}
if we mix-in HasFeaturesCol the featuresCol should be VectorUDT.
{quote}

Guess I misunderstood and thought you wanted to support vectors for FPGrowth. Using a SparseVector to represent records does not seem unreasonable to me, and supporting it would be easy and straightforward. But surely we don't need to support that until there is a clear requirement. May I know how you would like to name the new trait for {{array<array<_>>}}, since users will need to invoke {{setCol("...")}} during fitting? Then we can see whether it's more intuitive.
[jira] [Created] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website
Cody Koeninger created SPARK-19904:
--------------------------------------

             Summary: SPIP Add Spark Project Improvement Proposal doc to website
                 Key: SPARK-19904
                 URL: https://issues.apache.org/jira/browse/SPARK-19904
             Project: Spark
          Issue Type: Bug
          Components: Documentation
    Affects Versions: 2.1.0
            Reporter: Cody Koeninger
[jira] [Resolved] (SPARK-19786) Facilitate loop optimizations in a JIT compiler regarding range()
[ https://issues.apache.org/jira/browse/SPARK-19786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-19786.
---------------------------------------
    Resolution: Fixed
      Assignee: Kazuaki Ishizaki
 Fix Version/s: 2.2.0

> Facilitate loop optimizations in a JIT compiler regarding range()
> -----------------------------------------------------------------
>
>                 Key: SPARK-19786
>                 URL: https://issues.apache.org/jira/browse/SPARK-19786
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Kazuaki Ishizaki
>            Assignee: Kazuaki Ishizaki
>             Fix For: 2.2.0
>
> [This article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html] suggests that better generated code can improve performance by facilitating compiler optimizations.
> This JIRA changes the generated code for {{range()}} to facilitate loop optimizations in a JIT compiler, for better performance.
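As a rough intuition for what "facilitating loop optimizations" means here (this is an illustrative sketch, not the actual code generated by the patch): a JIT compiler such as HotSpot can unroll and strength-reduce a simple counted loop over local variables much more readily than a loop that re-reads shared state or makes virtual calls on every iteration.

```scala
// Illustrative only: the shape of loop that a JIT optimizes well -
// a local induction variable and loop-invariant bounds, no virtual
// calls inside the loop body.
def sumRange(start: Long, end: Long): Long = {
  var acc = 0L
  var i = start        // local induction variable
  while (i < end) {    // bound is a local, so the check is cheap
    acc += i
    i += 1L
  }
  acc
}
```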
[jira] [Assigned] (SPARK-19850) Support aliased expressions in function parameters
[ https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19850:
------------------------------------
    Assignee: Herman van Hovell (was: Apache Spark)

> Support aliased expressions in function parameters
> --------------------------------------------------
>
>                 Key: SPARK-19850
>                 URL: https://issues.apache.org/jira/browse/SPARK-19850
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Herman van Hovell
>            Assignee: Herman van Hovell
>
> The SQL parser currently does not allow a user to pass an aliased expression as a function parameter. This can be useful when we want to create a struct. For example, {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a struct with columns {{c}} and {{d}} instead of {{col1}} and {{col2}}.
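A spark-shell sketch of the behavior this improvement targets. The sample view name and columns are made up for illustration, and the {{named_struct}} workaround is mentioned as an assumption about the existing behavior rather than something taken from the ticket:

```scala
// Assumes a view tbl_x with numeric columns a and b.
spark.range(1).selectExpr("id AS a", "id AS b").createOrReplaceTempView("tbl_x")

// Rejected by the parser before this change:
//   SELECT struct(a + 1 AS c, b + 4 AS d) FROM tbl_x

// Without alias support, the struct fields get generated names
// (col1, col2, per the ticket description):
spark.sql("SELECT struct(a + 1, b + 4) AS s FROM tbl_x").printSchema()

// named_struct is one existing way to name the fields explicitly:
spark.sql("SELECT named_struct('c', a + 1, 'd', b + 4) AS s FROM tbl_x").printSchema()
```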
[jira] [Commented] (SPARK-19899) FPGrowth input column naming
[ https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905405#comment-15905405 ] Maciej Szymkiewicz commented on SPARK-19899:

In my opinion a trait for each input category ({{Vector}}, {{array<_>}}, {{array<array<_>>}}) is the way to go. The development overhead is low (these things are small and easy to test), it is unlikely we'll need much more any time soon, and this gives us a way to communicate the expected input.

I am strongly against using {{Vector}} - it is counterintuitive, requires a lot of additional effort, and without any supported way of mapping from a vector back to features (I don't count {{Column}} metadata) it will significantly degrade the user experience. Moreover, it won't be useful for {{PrefixSpan}} at all.

I believe we should acknowledge that pattern mining techniques are significantly different from the common {{ml}} algorithms and not hesitate to reflect that in the API.
[jira] [Assigned] (SPARK-19850) Support aliased expressions in function parameters
[ https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19850:
------------------------------------
    Assignee: Apache Spark (was: Herman van Hovell)
[jira] [Commented] (SPARK-19850) Support aliased expressions in function parameters
[ https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905404#comment-15905404 ] Apache Spark commented on SPARK-19850:

User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/17245