[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-03-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906122#comment-15906122
 ] 

yuhao yang commented on SPARK-14503:


Thanks for reporting that. I just found there's a misplaced distinct in the 
update iterations. Do you want to send a PR with some unit tests? 
[~zero323] 
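
(For anyone picking this up: a minimal spark-shell sketch of the kind of check such a 
unit test could start from, assuming the new org.apache.spark.ml.fpm.FPGrowth API and 
toy data; it only shows the shape of a duplicate check, not the exact case reported 
above.)
{code}
scala> import org.apache.spark.ml.fpm.FPGrowth
scala> // Toy transactions; "items" is a column of string arrays.
scala> val data = Seq(Seq("a", "b"), Seq("a", "b", "c"), Seq("a")).toDF("items")
scala> val model = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6).fit(data)
scala> // With the fix, neither output should contain duplicate rows.
scala> assert(model.freqItemsets.count() == model.freqItemsets.distinct().count())
scala> assert(model.associationRules.count() == model.associationRules.distinct().count())
{code}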

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.2.0
>
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.






[jira] [Commented] (SPARK-19914) Spark Scala - Calling persist after reading a parquet file makes certain spark.sql queries return empty results

2017-03-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906121#comment-15906121
 ] 

Takeshi Yamamuro commented on SPARK-19914:
--

I couldn't reproduce this; am I missing anything?
{code}
scala> Seq(("1", 2)).toDF("id", 
"value").write.parquet("/Users/maropu/Desktop/data")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data")
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+

scala> import org.apache.spark.storage.StorageLevel
scala> val df = 
spark.read.parquet("/Users/maropu/Desktop/data").persist(StorageLevel.DISK_ONLY)
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+
{code}

> Spark Scala - Calling persist after reading a parquet file makes certain 
> spark.sql queries return empty results
> ---
>
> Key: SPARK-19914
> URL: https://issues.apache.org/jira/browse/SPARK-19914
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Yifeng Li
>
> Hi There,
> It seems like calling .persist() after spark.read.parquet will make spark.sql 
> statements return empty results if the query is written in a certain way.
> I have the following code here:
> val df = spark.read.parquet("C:\\...")
> df.createOrReplaceTempView("t1")
> spark.sql("select * from t1 a where a.id = '123456789'").show(10)
> Everything works fine here.
> Now, if I do:
> val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
> df.createOrReplaceTempView("t1")
> spark.sql("select * from t1 a where a.id = '123456789'").show(10)
> I will get empty results.
> selecting individual columns works with persist, e.g.:
> val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
> df.createOrReplaceTempView("t1")
> spark.sql("select a.id from t1 a where a.id = '123456789'").show(10)






[jira] [Comment Edited] (SPARK-19914) Spark Scala - Calling persist after reading a parquet file makes certain spark.sql queries return empty results

2017-03-10 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906121#comment-15906121
 ] 

Takeshi Yamamuro edited comment on SPARK-19914 at 3/11/17 7:21 AM:
---

I couldn't reproduce this; am I missing anything? (I checked in v2.1.0 and v2.0.2)
{code}
scala> Seq(("1", 2)).toDF("id", 
"value").write.parquet("/Users/maropu/Desktop/data")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data")
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+

scala> import org.apache.spark.storage.StorageLevel
scala> val df = 
spark.read.parquet("/Users/maropu/Desktop/data").persist(StorageLevel.DISK_ONLY)
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+
{code}


was (Author: maropu):
I couldn't reproduce this; am I missing anything?
{code}
scala> Seq(("1", 2)).toDF("id", 
"value").write.parquet("/Users/maropu/Desktop/data")
scala> val df = spark.read.parquet("/Users/maropu/Desktop/data")
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+

scala> import org.apache.spark.storage.StorageLevel
scala> val df = 
spark.read.parquet("/Users/maropu/Desktop/data").persist(StorageLevel.DISK_ONLY)
scala> df.createOrReplaceTempView("t")
scala> sql("SELECT * FROM t a WHERE a.id = '1'").show(10)
+---+-----+
| id|value|
+---+-----+
|  1|    2|
+---+-----+
{code}

> Spark Scala - Calling persist after reading a parquet file makes certain 
> spark.sql queries return empty results
> ---
>
> Key: SPARK-19914
> URL: https://issues.apache.org/jira/browse/SPARK-19914
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Yifeng Li
>
> Hi There,
> It seems like calling .persist() after spark.read.parquet will make spark.sql 
> statements return empty results if the query is written in a certain way.
> I have the following code here:
> val df = spark.read.parquet("C:\\...")
> df.createOrReplaceTempView("t1")
> spark.sql("select * from t1 a where a.id = '123456789'").show(10)
> Everything works fine here.
> Now, if I do:
> val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
> df.createOrReplaceTempView("t1")
> spark.sql("select * from t1 a where a.id = '123456789'").show(10)
> I will get empty results.
> selecting individual columns works with persist, e.g.:
> val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
> df.createOrReplaceTempView("t1")
> spark.sql("select a.id from t1 a where a.id = '123456789'").show(10)






[jira] [Assigned] (SPARK-19919) Defer input path validation into DataSource in CSV datasource

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19919:


Assignee: Apache Spark

> Defer input path validation into DataSource in CSV datasource
> -
>
> Key: SPARK-19919
> URL: https://issues.apache.org/jira/browse/SPARK-19919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently, if other datasources fail to infer the schema, they return {{None}}, 
> which is then validated in {{DataSource}} as below:
> {code}
> scala> spark.read.json("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.orc("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.parquet("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
> {code}
> However, CSV checks this within the datasource implementation and throws a 
> different exception message, as below:
> {code}
> scala> spark.read.csv("emptydir")
> java.lang.IllegalArgumentException: requirement failed: Cannot infer schema 
> from an empty set of files
> {code}
> We could remove this duplicated check and validate this in one place in the 
> same way with the same message.






[jira] [Assigned] (SPARK-19919) Defer input path validation into DataSource in CSV datasource

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19919:


Assignee: (was: Apache Spark)

> Defer input path validation into DataSource in CSV datasource
> -
>
> Key: SPARK-19919
> URL: https://issues.apache.org/jira/browse/SPARK-19919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, if other datasources fail to infer the schema, they return {{None}}, 
> which is then validated in {{DataSource}} as below:
> {code}
> scala> spark.read.json("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.orc("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.parquet("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
> {code}
> However, CSV checks this within the datasource implementation and throws a 
> different exception message, as below:
> {code}
> scala> spark.read.csv("emptydir")
> java.lang.IllegalArgumentException: requirement failed: Cannot infer schema 
> from an empty set of files
> {code}
> We could remove this duplicated check and validate this in one place in the 
> same way with the same message.






[jira] [Commented] (SPARK-19919) Defer input path validation into DataSource in CSV datasource

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906115#comment-15906115
 ] 

Apache Spark commented on SPARK-19919:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17256

> Defer input path validation into DataSource in CSV datasource
> -
>
> Key: SPARK-19919
> URL: https://issues.apache.org/jira/browse/SPARK-19919
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Trivial
>
> Currently, if other datasources fail to infer the schema, they return {{None}}, 
> which is then validated in {{DataSource}} as below:
> {code}
> scala> spark.read.json("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.orc("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It 
> must be specified manually.;
> {code}
> {code}
> scala> spark.read.parquet("emptydir")
> org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. 
> It must be specified manually.;
> {code}
> However, CSV checks this within the datasource implementation and throws a 
> different exception message, as below:
> {code}
> scala> spark.read.csv("emptydir")
> java.lang.IllegalArgumentException: requirement failed: Cannot infer schema 
> from an empty set of files
> {code}
> We could remove this duplicated check and validate this in one place in the 
> same way with the same message.






[jira] [Created] (SPARK-19919) Defer input path validation into DataSource in CSV datasource

2017-03-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-19919:


 Summary: Defer input path validation into DataSource in CSV 
datasource
 Key: SPARK-19919
 URL: https://issues.apache.org/jira/browse/SPARK-19919
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Trivial


Currently, if other datasources fail to infer the schema, they return {{None}}, 
which is then validated in {{DataSource}} as below:

{code}
scala> spark.read.json("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It 
must be specified manually.;
{code}

{code}
scala> spark.read.orc("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must 
be specified manually.;
{code}

{code}
scala> spark.read.parquet("emptydir")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It 
must be specified manually.;
{code}

However, CSV checks this within the datasource implementation and throws a 
different exception message, as below:

{code}
scala> spark.read.csv("emptydir")
java.lang.IllegalArgumentException: requirement failed: Cannot infer schema 
from an empty set of files
{code}

We could remove this duplicated check and validate this in one place in the 
same way with the same message.
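
(Not part of the proposed change; just for context, a small spark-shell sketch of the 
manual-schema route the unified error message points users to. Supplying a schema 
skips inference entirely; the directory name matches the example above.)
{code}
scala> import org.apache.spark.sql.types._
scala> val schema = StructType(Seq(StructField("id", StringType), StructField("value", IntegerType)))
scala> // With an explicit schema there is nothing to infer, so the reader does not
scala> // need to fail on an empty set of files at this point.
scala> val df = spark.read.schema(schema).csv("emptydir")
{code}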






[jira] [Assigned] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19918:


Assignee: Apache Spark

> Use TextFileFormat in implementation of JsonFileFormat
> --
>
> Key: SPARK-19918
> URL: https://issues.apache.org/jira/browse/SPARK-19918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>
> If we use Dataset for the initial loading when inferring the schema, there are 
> advantages; please refer to SPARK-18362.
> It seems the JSON one was supposed to be fixed together but was missed, 
> according to https://github.com/apache/spark/pull/15813:
> {quote}
> A similar problem also affects the JSON file format and this patch originally 
> fixed that as well, but I've decided to split that change into a separate 
> patch so as not to conflict with changes in another JSON PR.
> {quote}
> Also, this affects some functionality because it does not use 
> {{FileScanRDD}}. This problem is described in SPARK-19885 (for the CSV case).






[jira] [Assigned] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19918:


Assignee: (was: Apache Spark)

> Use TextFileFormat in implementation of JsonFileFormat
> --
>
> Key: SPARK-19918
> URL: https://issues.apache.org/jira/browse/SPARK-19918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> If we use Dataset for the initial loading when inferring the schema, there are 
> advantages; please refer to SPARK-18362.
> It seems the JSON one was supposed to be fixed together but was missed, 
> according to https://github.com/apache/spark/pull/15813:
> {quote}
> A similar problem also affects the JSON file format and this patch originally 
> fixed that as well, but I've decided to split that change into a separate 
> patch so as not to conflict with changes in another JSON PR.
> {quote}
> Also, this affects some functionality because it does not use 
> {{FileScanRDD}}. This problem is described in SPARK-19885 (for the CSV case).






[jira] [Commented] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906098#comment-15906098
 ] 

Apache Spark commented on SPARK-19918:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/17255

> Use TextFileFormat in implementation of JsonFileFormat
> --
>
> Key: SPARK-19918
> URL: https://issues.apache.org/jira/browse/SPARK-19918
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>
> If we use Dataset for the initial loading when inferring the schema, there are 
> advantages; please refer to SPARK-18362.
> It seems the JSON one was supposed to be fixed together but was missed, 
> according to https://github.com/apache/spark/pull/15813:
> {quote}
> A similar problem also affects the JSON file format and this patch originally 
> fixed that as well, but I've decided to split that change into a separate 
> patch so as not to conflict with changes in another JSON PR.
> {quote}
> Also, this affects some functionality because it does not use 
> {{FileScanRDD}}. This problem is described in SPARK-19885 (for the CSV case).






[jira] [Created] (SPARK-19918) Use TextFileFormat in implementation of JsonFileFormat

2017-03-10 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-19918:


 Summary: Use TextFileFormat in implementation of JsonFileFormat
 Key: SPARK-19918
 URL: https://issues.apache.org/jira/browse/SPARK-19918
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon


If we use Dataset for the initial loading when inferring the schema, there are 
advantages; please refer to SPARK-18362.

It seems the JSON one was supposed to be fixed together but was missed, according 
to https://github.com/apache/spark/pull/15813:

{quote}
A similar problem also affects the JSON file format and this patch originally 
fixed that as well, but I've decided to split that change into a separate patch 
so as not to conflict with changes in another JSON PR.
{quote}

Also, this affects some functionality because it does not use 
{{FileScanRDD}}. This problem is described in SPARK-19885 (for the CSV case).
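
(For context only: a user-level sketch of the Dataset-based loading referred to above, 
assuming the Dataset[String] overload of DataFrameReader.json targeted for 2.2; the 
path is hypothetical and the actual change is internal to JsonFileFormat.)
{code}
scala> // Load the raw lines as a Dataset[String] first...
scala> val lines = spark.read.textFile("/tmp/people.json")
scala> // ...then let the JSON reader infer the schema from that Dataset.
scala> val people = spark.read.json(lines)
{code}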






[jira] [Resolved] (SPARK-19901) Clean up the clunky method signature of acquireMemory

2017-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19901.
---
Resolution: Not A Problem

> Clean up the clunky method signature of acquireMemory
> -
>
> Key: SPARK-19901
> URL: https://issues.apache.org/jira/browse/SPARK-19901
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0
>Reporter: coneyliu
>Priority: Minor
>
> Clean up the clunky method signature of acquireMemory






[jira] [Assigned] (SPARK-19723) create table for data source tables should work with a non-existent location

2017-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19723:
---

Assignee: Song Jun

> create table for data source tables should work with a non-existent location
> -
>
> Key: SPARK-19723
> URL: https://issues.apache.org/jira/browse/SPARK-19723
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Song Jun
> Fix For: 2.2.0
>
>
> This JIRA is follow-up work after SPARK-19583.
> As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the 
> following DDL for a data source table with a non-existent location should work:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
> {code}
> Currently it throws an exception saying the path does not exist.
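
(A concrete instance of the DDL above, with hypothetical table, column, and path names; 
once fixed, this should succeed even though the directory does not exist yet.)
{code}
scala> sql("CREATE TABLE t(a INT, b INT) USING parquet PARTITIONED BY (b) LOCATION '/tmp/not_created_yet'")
{code}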






[jira] [Resolved] (SPARK-19723) create table for data source tables should work with a non-existent location

2017-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19723.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17055
[https://github.com/apache/spark/pull/17055]

> create table for data source tables should work with a non-existent location
> -
>
> Key: SPARK-19723
> URL: https://issues.apache.org/jira/browse/SPARK-19723
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
> Fix For: 2.2.0
>
>
> This JIRA is follow-up work after SPARK-19583.
> As we discussed in that [PR|https://github.com/apache/spark/pull/16938], the 
> following DDL for a data source table with a non-existent location should work:
> {code}
> CREATE TABLE ... (PARTITIONED BY ...) LOCATION path
> {code}
> Currently it throws an exception saying the path does not exist.






[jira] [Assigned] (SPARK-19917) qualified partition location stored in catalog

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19917:


Assignee: Apache Spark

> qualified partition location stored in catalog
> --
>
> Key: SPARK-19917
> URL: https://issues.apache.org/jira/browse/SPARK-19917
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>Assignee: Apache Spark
>
> The partition path should be qualified before it is stored in the catalog. 
> There are several cases:
> 1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' 
>qualified: file:/path/x
> 2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' 
>  qualified: file:/tablelocation/x
> 3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
>qualified: file:/path/x
> 4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
>  qualified: file:/tablelocation/x
> Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde 
> tables has the expected qualified path; we should make the other cases 
> consistent with it.






[jira] [Assigned] (SPARK-19917) qualified partition location stored in catalog

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19917:


Assignee: (was: Apache Spark)

> qualified partition location stored in catalog
> --
>
> Key: SPARK-19917
> URL: https://issues.apache.org/jira/browse/SPARK-19917
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> The partition path should be qualified before it is stored in the catalog. 
> There are several cases:
> 1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' 
>qualified: file:/path/x
> 2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' 
>  qualified: file:/tablelocation/x
> 3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
>qualified: file:/path/x
> 4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
>  qualified: file:/tablelocation/x
> Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde 
> tables has the expected qualified path; we should make the other cases 
> consistent with it.






[jira] [Commented] (SPARK-19917) qualified partition location stored in catalog

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906072#comment-15906072
 ] 

Apache Spark commented on SPARK-19917:
--

User 'windpiger' has created a pull request for this issue:
https://github.com/apache/spark/pull/17254

> qualified partition location stored in catalog
> --
>
> Key: SPARK-19917
> URL: https://issues.apache.org/jira/browse/SPARK-19917
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Song Jun
>
> The partition path should be qualified before it is stored in the catalog. 
> There are several cases:
> 1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' 
>qualified: file:/path/x
> 2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' 
>  qualified: file:/tablelocation/x
> 3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
>qualified: file:/path/x
> 4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
>  qualified: file:/tablelocation/x
> Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde 
> tables has the expected qualified path; we should make the other cases 
> consistent with it.






[jira] [Created] (SPARK-19917) qualified partition location stored in catalog

2017-03-10 Thread Song Jun (JIRA)
Song Jun created SPARK-19917:


 Summary: qualified partition location stored in catalog
 Key: SPARK-19917
 URL: https://issues.apache.org/jira/browse/SPARK-19917
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Song Jun


The partition path should be qualified before it is stored in the catalog.
There are several cases:
1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x' 
   qualified: file:/path/x
2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x' 
 qualified: file:/tablelocation/x
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
   qualified: file:/path/x
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
 qualified: file:/tablelocation/x

Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for Hive serde tables 
has the expected qualified path; we should make the other cases consistent with 
it.
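
A small spark-shell sketch (hypothetical table and paths, and assuming DESC ... PARTITION 
support) of how the stored location can be inspected while verifying case 3 above:
{code}
scala> sql("CREATE TABLE t(a INT, b INT) USING parquet PARTITIONED BY (b)")
scala> sql("ALTER TABLE t ADD PARTITION (b=1) LOCATION '/path/x'")
scala> // The partition location shown below is what should be stored fully
scala> // qualified in the catalog, e.g. file:/path/x.
scala> sql("DESC EXTENDED t PARTITION (b=1)").show(false)
{code}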








[jira] [Assigned] (SPARK-19916) simplify bad file handling

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19916:


Assignee: Apache Spark  (was: Wenchen Fan)

> simplify bad file handling
> --
>
> Key: SPARK-19916
> URL: https://issues.apache.org/jira/browse/SPARK-19916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-19916) simplify bad file handling

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19916:


Assignee: Wenchen Fan  (was: Apache Spark)

> simplify bad file handling
> --
>
> Key: SPARK-19916
> URL: https://issues.apache.org/jira/browse/SPARK-19916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-19916) simplify bad file handling

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906048#comment-15906048
 ] 

Apache Spark commented on SPARK-19916:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/17253

> simplify bad file handling
> --
>
> Key: SPARK-19916
> URL: https://issues.apache.org/jira/browse/SPARK-19916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>







[jira] [Created] (SPARK-19916) simplify bad file handling

2017-03-10 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-19916:
---

 Summary: simplify bad file handling
 Key: SPARK-19916
 URL: https://issues.apache.org/jira/browse/SPARK-19916
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Comment Edited] (SPARK-6634) Allow replacing columns in Transformers

2017-03-10 Thread Tree Field (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15904445#comment-15904445
 ] 

Tree Field edited comment on SPARK-6634 at 3/11/17 2:59 AM:


I want this feature too, because I often override UnaryTransformer myself to enable 
this.

It seems it's only prevented in the transformSchema method.
Now, unlike before v1.4, the DataFrame withColumn method used in 
UnaryTransformer allows replacing the input column.

Are there any other reasons this is not allowed in Transformers, especially in 
UnaryTransformer?
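
For illustration only (my own minimal sketch, not from this ticket): the check lives in 
UnaryTransformer.transformSchema, so a custom transformer can relax it by overriding 
that method. The class and column handling below are made up.
{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType, StructType}

// A toy transformer that upper-cases a string column and is allowed to write
// back into its own input column by overriding transformSchema.
class UpperCaser(override val uid: String)
    extends UnaryTransformer[String, String, UpperCaser] {

  def this() = this(Identifiable.randomUID("upperCaser"))

  override protected def createTransformFunc: String => String = _.toUpperCase

  override protected def outputDataType: DataType = StringType

  // Skip the "output column already exists" check when inputCol == outputCol;
  // withColumn then simply replaces the column.
  override def transformSchema(schema: StructType): StructType = {
    if ($(inputCol) == $(outputCol)) schema else super.transformSchema(schema)
  }
}
{code}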






was (Author: greattreeinfield):
I want this feature too, because I often override UnaryTransformer myself to enable 
this.

It seems it's only prevented in the transformSchema method.
Now, unlike before v1.4, the DataFrame withColumn method used in 
UnaryTransformer allows replacing the input column.

Any other reasons this is not allowed in Transformers, especially in 
UnaryTransformer?





> Allow replacing columns in Transformers
> ---
>
> Key: SPARK-6634
> URL: https://issues.apache.org/jira/browse/SPARK-6634
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Currently, Transformers do not allow input and output columns to share the 
> same name.  (In fact, this is not allowed but also not even checked.)
> Short-term proposal: Disallow input and output columns with the same name, 
> and add a check in transformSchema.
> Long-term proposal: Allow input & output columns with the same name, and 
> where the behavior is that the output columns replace input columns with the 
> same name.






[jira] [Assigned] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19915:


Assignee: Apache Spark

> Improve join reorder: simplify cost evaluation, postpone column pruning, 
> exclude cartesian product
> --
>
> Key: SPARK-19915
> URL: https://issues.apache.org/jira/browse/SPARK-19915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>
> 1. Usually cardinality is more important than size, so we can simplify cost 
> evaluation by using only cardinality. Note that this also frees us from caring 
> about column pruning during reordering; otherwise, projects would influence the 
> output size of intermediate joins.
> 2. Doing column pruning during reordering is troublesome. Given the first 
> change, we can do it right after reordering, and the logic for adding projects 
> on intermediate joins can be removed. This makes the code simpler and more 
> reliable.
> 3. Exclude cartesian products from the "memo". This significantly reduces the 
> search space and memory overhead of the memo; otherwise every combination of 
> items would exist in it. We can find the unjoinable items after reordering is 
> finished and put them at the end.






[jira] [Assigned] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19915:


Assignee: (was: Apache Spark)

> Improve join reorder: simplify cost evaluation, postpone column pruning, 
> exclude cartesian product
> --
>
> Key: SPARK-19915
> URL: https://issues.apache.org/jira/browse/SPARK-19915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> 1. Usually cardinality is more important than size, so we can simplify cost 
> evaluation by using only cardinality. Note that this also frees us from caring 
> about column pruning during reordering; otherwise, projects would influence the 
> output size of intermediate joins.
> 2. Doing column pruning during reordering is troublesome. Given the first 
> change, we can do it right after reordering, and the logic for adding projects 
> on intermediate joins can be removed. This makes the code simpler and more 
> reliable.
> 3. Exclude cartesian products from the "memo". This significantly reduces the 
> search space and memory overhead of the memo; otherwise every combination of 
> items would exist in it. We can find the unjoinable items after reordering is 
> finished and put them at the end.






[jira] [Updated] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product

2017-03-10 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-19915:
-
Description: 
1. Usually cardinality is more important than size, so we can simplify cost 
evaluation by using only cardinality. Note that this also frees us from caring 
about column pruning during reordering; otherwise, projects would influence the 
output size of intermediate joins.
2. Doing column pruning during reordering is troublesome. Given the first 
change, we can do it right after reordering, and the logic for adding projects 
on intermediate joins can be removed. This makes the code simpler and more 
reliable.
3. Exclude cartesian products from the "memo". This significantly reduces the 
search space and memory overhead of the memo; otherwise every combination of 
items would exist in it. We can find the unjoinable items after reordering is 
finished and put them at the end.

  was:
Doing column pruning during reordering is troublesome. We can do it right after 
reordering, and the logic for adding projects on intermediate joins can be 
removed. This makes the code simpler and more reliable.
Usually cardinality is more important than size, so we can simplify cost 
evaluation by using only cardinality. Note that this frees us from caring about 
column pruning during reordering (the first point). Otherwise, projects would 
influence the output size of intermediate joins.
Exclude cartesian products from the "memo". This significantly reduces the search 
space and memory overhead of the memo; otherwise every combination of items would 
exist in it. We can find the unjoinable items after reordering is finished and 
put them at the end.


> Improve join reorder: simplify cost evaluation, postpone column pruning, 
> exclude cartesian product
> --
>
> Key: SPARK-19915
> URL: https://issues.apache.org/jira/browse/SPARK-19915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> 1. Usually cardinality is more important than size, so we can simplify cost 
> evaluation by using only cardinality. Note that this also frees us from caring 
> about column pruning during reordering; otherwise, projects would influence the 
> output size of intermediate joins.
> 2. Doing column pruning during reordering is troublesome. Given the first 
> change, we can do it right after reordering, and the logic for adding projects 
> on intermediate joins can be removed. This makes the code simpler and more 
> reliable.
> 3. Exclude cartesian products from the "memo". This significantly reduces the 
> search space and memory overhead of the memo; otherwise every combination of 
> items would exist in it. We can find the unjoinable items after reordering is 
> finished and put them at the end.






[jira] [Commented] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15906018#comment-15906018
 ] 

Apache Spark commented on SPARK-19915:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/17240

> Improve join reorder: simplify cost evaluation, postpone column pruning, 
> exclude cartesian product
> --
>
> Key: SPARK-19915
> URL: https://issues.apache.org/jira/browse/SPARK-19915
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Zhenhua Wang
>
> 1. Usually cardinality is more important than size, so we can simplify cost 
> evaluation by using only cardinality. Note that this also frees us from caring 
> about column pruning during reordering; otherwise, projects would influence the 
> output size of intermediate joins.
> 2. Doing column pruning during reordering is troublesome. Given the first 
> change, we can do it right after reordering, and the logic for adding projects 
> on intermediate joins can be removed. This makes the code simpler and more 
> reliable.
> 3. Exclude cartesian products from the "memo". This significantly reduces the 
> search space and memory overhead of the memo; otherwise every combination of 
> items would exist in it. We can find the unjoinable items after reordering is 
> finished and put them at the end.






[jira] [Created] (SPARK-19915) Improve join reorder: simplify cost evaluation, postpone column pruning, exclude cartesian product

2017-03-10 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-19915:


 Summary: Improve join reorder: simplify cost evaluation, postpone 
column pruning, exclude cartesian product
 Key: SPARK-19915
 URL: https://issues.apache.org/jira/browse/SPARK-19915
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.2.0
Reporter: Zhenhua Wang


Doing column pruning during reordering is troublesome. We can do it right after 
reordering, and the logic for adding projects on intermediate joins can be 
removed. This makes the code simpler and more reliable.
Usually cardinality is more important than size, so we can simplify cost 
evaluation by using only cardinality. Note that this frees us from caring about 
column pruning during reordering (the first point). Otherwise, projects would 
influence the output size of intermediate joins.
Exclude cartesian products from the "memo". This significantly reduces the search 
space and memory overhead of the memo; otherwise every combination of items would 
exist in it. We can find the unjoinable items after reordering is finished and 
put them at the end.
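
(Unrelated to the internals above, but for anyone who wants to exercise this path from 
spark-shell, a sketch of the flags and statistics that feed it; the config names are my 
assumption, please verify them against the SQL config documentation.)
{code}
scala> // Cost-based join reordering sits behind the CBO flags (names assumed):
scala> spark.conf.set("spark.sql.cbo.enabled", "true")
scala> spark.conf.set("spark.sql.cbo.joinReorder.enabled", "true")
scala> // Cardinality-based costs need table/column statistics, e.g. for a table t:
scala> sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS a, b")
{code}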






[jira] [Commented] (SPARK-19863) Whether or not to use CachedKafkaConsumer needs to be configurable when you use DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application

2017-03-10 Thread LvDongrong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905991#comment-15905991
 ] 

LvDongrong commented on SPARK-19863:


I see your comment on that issue (SPARK-19185), and I agree with you. Our problem is 
different: our Kafka cluster cannot support so many connections, which are established 
by the cached consumers, because the number of our topics and partitions is large. So 
I think it is necessary not to use cached consumers in some cases.

> Whether or not to use CachedKafkaConsumer needs to be configurable when you use 
> DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application
> 
>
> Key: SPARK-19863
> URL: https://issues.apache.org/jira/browse/SPARK-19863
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>
> Whether or not to use CachedKafkaConsumer needs to be configurable when you use 
> DirectKafkaInputDStream to connect to Kafka in a Spark Streaming 
> application. In Spark 2.x, the Kafka consumer was replaced by 
> CachedKafkaConsumer (the KafkaConsumers keep established connections to the 
> Kafka cluster), and there is no way to change this. In fact, the KafkaRDD (used 
> by DirectKafkaInputDStream to connect to Kafka) provides the parameter 
> useConsumerCache to choose whether to use the CachedKafkaConsumer, but 
> DirectKafkaInputDStream sets the parameter to true.
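
(For readers outside this thread, a minimal sketch of the public API in question, using 
standard spark-streaming-kafka-0-10 calls with made-up broker/topic/group values. 
Nothing in this API exposes KafkaRDD's useConsumerCache flag, which is the point of the 
ticket: consumers end up cached per topic-partition on the executors.)
{code}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group")

// The DirectKafkaInputDStream created here always passes useConsumerCache = true
// to its KafkaRDDs; there is no public knob to turn the cache off.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("topicA"), kafkaParams))
{code}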






[jira] [Created] (SPARK-19914) Spark Scala - Calling persist after reading a parquet file makes certain spark.sql queries return empty results

2017-03-10 Thread Yifeng Li (JIRA)
Yifeng Li created SPARK-19914:
-

 Summary: Spark Scala - Calling persist after reading a parquet 
file makes certain spark.sql queries return empty results
 Key: SPARK-19914
 URL: https://issues.apache.org/jira/browse/SPARK-19914
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Affects Versions: 2.1.0, 2.0.0
Reporter: Yifeng Li


Hi There,

It seems like calling .persist() after spark.read.parquet will make spark.sql 
statements return empty results if the query is written in a certain way.

I have the following code here:

val df = spark.read.parquet("C:\\...")
df.createOrReplaceTempView("t1")
spark.sql("select * from t1 a where a.id = '123456789'").show(10)

Everything works fine here.

Now, if I do:
val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
df.createOrReplaceTempView("t1")
spark.sql("select * from t1 a where a.id = '123456789'").show(10)

I will get empty results.

selecting individual columns works with persist, e.g.:
val df = spark.read.parquet("C:\\...").persist(StorageLevel.DISK_ONLY)
df.createOrReplaceTempView("t1")
spark.sql("select a.id from t1 a where a.id = '123456789'").show(10)








[jira] [Updated] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19611:

Fix Version/s: 2.1.1

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
> Fix For: 2.1.1, 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This results in 
> an optimization as the underlying file statuses no longer need to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] 
> for more discussion around this issue and the proposed solution.
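
(For reference, a one-line sketch of how a user would pick between these modes once the 
fix lands; the config key is my assumption based on the PR discussion, so double-check 
it against the merged change.)
{code}
scala> // Accepted values per the description: INFER_AND_SAVE, INFER_ONLY, NEVER_INFER.
scala> spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")
{code}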






[jira] [Resolved] (SPARK-19893) should not run DataFrame set operations with map type

2017-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19893.
-
   Resolution: Fixed
Fix Version/s: 2.2.0
   2.1.1
   2.0.3

> should not run DataFrame set operations with map type
> 
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.3, 2.1.1, 2.2.0
>
>
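
(The ticket carries no description here, so purely as an illustration of the title, a 
toy example of my own of the kind of query that should now be rejected up front rather 
than being run:)
{code}
scala> val df = spark.range(2).selectExpr("map(id, id) AS m")
scala> // Map columns have no well-defined equality for set semantics, so set
scala> // operations such as except/intersect/distinct over them should fail analysis.
scala> df.except(df)
{code}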







[jira] [Assigned] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19913:


Assignee: (was: Apache Spark)

> Log warning rather than throw AnalysisException when output is partitioned 
> although format is memory, console or foreach
> 
>
> Key: SPARK-19913
> URL: https://issues.apache.org/jira/browse/SPARK-19913
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When batches are executed with the memory, console or foreach format, 
> `assertNotPartitioned` checks that the output is not partitioned and throws an 
> AnalysisException if it is.
> But I wonder whether it's better to log a warning rather than throw the 
> exception, because partitioning does not affect the output for those formats 
> and does not bring any negative impact.
> Also, this assertion is not applied when the format is `console`; I think that 
> case should be covered in the same way.
> By fixing these, we can easily switch the format to memory or console for 
> debugging purposes.






[jira] [Assigned] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19913:


Assignee: Apache Spark

> Log warning rather than throw AnalysisException when output is partitioned 
> although format is memory, console or foreach
> 
>
> Key: SPARK-19913
> URL: https://issues.apache.org/jira/browse/SPARK-19913
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> When batches are executed with the memory, console or foreach format, 
> `assertNotPartitioned` checks that the output is not partitioned and throws an 
> AnalysisException if it is.
> But I wonder whether it's better to log a warning rather than throw the 
> exception, because partitioning does not affect the output for those formats 
> and does not bring any negative impact.
> Also, this assertion is not applied when the format is `console`; I think that 
> case should be covered in the same way.
> By fixing these, we can easily switch the format to memory or console for 
> debugging purposes.






[jira] [Commented] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905904#comment-15905904
 ] 

Apache Spark commented on SPARK-19913:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/17252

> Log warning rather than throw AnalysisException when output is partitioned 
> although format is memory, console or foreach
> 
>
> Key: SPARK-19913
> URL: https://issues.apache.org/jira/browse/SPARK-19913
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When batches are executed with the memory, console or foreach format, 
> `assertNotPartitioned` checks that the output is not partitioned and throws an 
> AnalysisException if it is.
> But I wonder whether it's better to log a warning rather than throw the 
> exception, because partitioning does not affect the output for those formats 
> and does not bring any negative impact.
> Also, this assertion is not applied when the format is `console`; I think that 
> case should be covered in the same way.
> By fixing these, we can easily switch the format to memory or console for 
> debugging purposes.






[jira] [Created] (SPARK-19913) Log warning rather than throw AnalysisException when output is partitioned although format is memory, console or foreach

2017-03-10 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-19913:
--

 Summary: Log warning rather than throw AnalysisException when 
output is partitioned although format is memory, console or foreach
 Key: SPARK-19913
 URL: https://issues.apache.org/jira/browse/SPARK-19913
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Kousuke Saruta
Priority: Minor


When batches are executed with the memory, console or foreach format, 
`assertNotPartitioned` checks that the output is not partitioned and throws an 
AnalysisException if it is.

But I wonder whether it's better to log a warning rather than throw the 
exception, because partitioning does not affect the output for those formats and 
does not bring any negative impact.

Also, this assertion is not applied when the format is `console`; I think that 
case should be covered in the same way.

By fixing these, we can easily switch the format to memory or console for 
debugging purposes.
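
(A tiny sketch of the user-level situation, with made-up source options: partitioning a 
memory-sink debug query, which currently trips `assertNotPartitioned` and, under this 
proposal, would only log a warning.)
{code}
scala> val lines = spark.readStream.format("socket").option("host", "localhost").option("port", "9999").load()
scala> // Partitioning has no effect for the memory/console/foreach sinks; today this
scala> // fails with AnalysisException, the proposal is to warn and continue instead.
scala> val query = lines.writeStream
     |   .format("memory")
     |   .queryName("debug_table")
     |   .partitionBy("value")
     |   .outputMode("append")
     |   .start()
{code}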






[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905886#comment-15905886
 ] 

Shixiong Zhu commented on SPARK-18057:
--

>  Based on previous kafka client upgrades I wouldn't expect them to be binary 
> compatible, so it's likely to cause someone problems if they were also making 
> use of kafka client libraries in their spark job. Still may be the path of 
> least resistance.

I can confirm the APIs used by the Kafka sink are source compatible, since I 
didn't change any core source code (test code had to be changed because the 
server APIs changed). Since these APIs are Java APIs, I'm pretty sure they are 
binary compatible. So even if we upgrade the Kafka client version, users can 
still downgrade it if they want and just repackage their code with the Kafka 
client. It's a bit annoying that "--packages" probably won't work, but it's 
acceptable.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19912) String literals are not escaped while performing Hive metastore level partition pruning

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19912:
---
Summary: String literals are not escaped while performing Hive metastore 
level partition pruning  (was: String literals are not escaped while performing 
partition pruning at Hive metastore level)

> String literals are not escaped while performing Hive metastore level 
> partition pruning
> ---
>
> Key: SPARK-19912
> URL: https://issues.apache.org/jira/browse/SPARK-19912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> {{Shim_v0_13.convertFilters()}} doesn't escape string literals while 
> generating Hive style partition predicates.
> The following SQL-injection-like test case illustrates this issue:
> {code}
>   test("SPARK-19912") {
> withTable("spark_19912") {
>   Seq(
> (1, "p1", "q1"),
> (2, "p1\" and q=\"q1", "q2")
>   ).toDF("a", "p", "q").write.partitionBy("p", 
> "q").saveAsTable("spark_19912")
>   checkAnswer(
> spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
> Row(2)
>   )
> }
>   }
> {code}
> The above test case fails like this:
> {noformat}
> [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [unresolvedalias('a, None)]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   a: int
> [info]   Project [a#26]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Optimized Logical Plan ==
> [info]   Project [a#26]
> [info]   +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1))
> [info]  +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Physical Plan ==
> [info]   *Project [a#26]
> [info]   +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: 
> true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 
> 0, PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], 
> PushedFilters: [], ReadSchema: struct
> [info]   == Results ==
> [info]
> [info]   == Results ==
> [info]   !== Correct Answer - 1 ==   == Spark Answer - 0 ==
> [info]struct<>   struct<>
> [info]   ![2]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables

2017-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19905.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 17247
[https://github.com/apache/spark/pull/17247]

> Dataset.inputFiles is broken for Hive SerDe tables
> --
>
> Key: SPARK-19905
> URL: https://issues.apache.org/jira/browse/SPARK-19905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.2.0
>
>
> The following snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t")
> spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t")
> spark.table("u").inputFiles.foreach(println)
> {code}
> In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like
> {noformat}
> file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u
> {noformat}
> on my laptop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19912:
---
Description: 
{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("SPARK-19912") {
withTable("spark_19912") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", 
"q").saveAsTable("spark_19912")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}
The above test case fails like this:
{noformat}
[info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
[info]   Results do not match for query:
[info]   Timezone: 
sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
[info]   Timezone Env:
[info]
[info]   == Parsed Logical Plan ==
[info]   'Project [unresolvedalias('a, None)]
[info]   +- Filter (p#27 = p1" and q = "q1)
[info]  +- SubqueryAlias spark_19912
[info] +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Analyzed Logical Plan ==
[info]   a: int
[info]   Project [a#26]
[info]   +- Filter (p#27 = p1" and q = "q1)
[info]  +- SubqueryAlias spark_19912
[info] +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Optimized Logical Plan ==
[info]   Project [a#26]
[info]   +- Filter (isnotnull(p#27) && (p#27 = p1" and q = "q1))
[info]  +- Relation[a#26,p#27,q#28] parquet
[info]
[info]   == Physical Plan ==
[info]   *Project [a#26]
[info]   +- *FileScan parquet default.spark_19912[a#26,p#27,q#28] Batched: 
true, Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, 
PartitionFilters: [isnotnull(p#27), (p#27 = p1" and q = "q1)], PushedFilters: 
[], ReadSchema: struct
[info]   == Results ==
[info]
[info]   == Results ==
[info]   !== Correct Answer - 1 ==   == Spark Answer - 0 ==
[info]struct<>   struct<>
[info]   ![2]
{noformat}

  was:
{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("foo") {
withTable("foo") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}


> String literals are not escaped while performing partition pruning at Hive 
> metastore level
> --
>
> Key: SPARK-19912
> URL: https://issues.apache.org/jira/browse/SPARK-19912
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> {{Shim_v0_13.convertFilters()}} doesn't escape string literals while 
> generating Hive style partition predicates.
> The following SQL-injection-like test case illustrates this issue:
> {code}
>   test("SPARK-19912") {
> withTable("spark_19912") {
>   Seq(
> (1, "p1", "q1"),
> (2, "p1\" and q=\"q1", "q2")
>   ).toDF("a", "p", "q").write.partitionBy("p", 
> "q").saveAsTable("spark_19912")
>   checkAnswer(
> spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
> Row(2)
>   )
> }
>   }
> {code}
> The above test case fails like this:
> {noformat}
> [info] - spark_19912 *** FAILED *** (13 seconds, 74 milliseconds)
> [info]   Results do not match for query:
> [info]   Timezone: 
> sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-2880,dstSavings=360,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-2880,dstSavings=360,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=720,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=720,endTimeMode=0]]
> [info]   Timezone Env:
> [info]
> [info]   == Parsed Logical Plan ==
> [info]   'Project [unresolvedalias('a, None)]
> [info]   +- Filter (p#27 = p1" and q = "q1)
> [info]  +- SubqueryAlias spark_19912
> [info] +- Relation[a#26,p#27,q#28] parquet
> [info]
> [info]   == Analyzed Logical Plan ==
> [info]   a: int
> 

[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Labels: correctness  (was: )

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>  Labels: correctness
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.
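
A minimal sketch of the missing interpretation step (the helper below is only 
illustrative, not the actual code path):
{code}
// Partition directory names encode NULL as the Hive default-partition marker;
// when partition values are read back for a persisted table, that marker
// should be decoded to NULL rather than kept as a regular string.
val HiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"

def decodePartitionValue(raw: String): Option[String] =
  if (raw == HiveDefaultPartition) None else Some(raw)

// decodePartitionValue("__HIVE_DEFAULT_PARTITION__")  // None, i.e. NULL
// decodePartitionValue("p1")                          // Some("p1")
{code}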



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19912) String literals are not escaped while performing partition pruning at Hive metastore level

2017-03-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19912:
--

 Summary: String literals are not escaped while performing 
partition pruning at Hive metastore level
 Key: SPARK-19912
 URL: https://issues.apache.org/jira/browse/SPARK-19912
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.1, 2.2.0
Reporter: Cheng Lian


{{Shim_v0_13.convertFilters()}} doesn't escape string literals while generating 
Hive style partition predicates.

The following SQL-injection-like test case illustrates this issue:
{code}
  test("foo") {
withTable("foo") {
  Seq(
(1, "p1", "q1"),
(2, "p1\" and q=\"q1", "q2")
  ).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("foo")

  checkAnswer(
spark.table("foo").filter($"p" === "p1\" and q = \"q1").select($"a"),
Row(2)
  )
}
  }
{code}
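
For illustration, one way to address this is to quote the literal before 
splicing it into the generated predicate; a hedged sketch (the helper below is 
illustrative, not the actual change):
{code}
// Escape backslashes and double quotes before embedding a string literal into
// the Hive-style partition predicate, so a partition value such as
//   p1" and q="q1
// stays a single literal instead of injecting an extra condition.
def quoteStringLiteral(value: String): String =
  "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

// quoteStringLiteral("""p1" and q="q1""")  // "p1\" and q=\"q1"
{code}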



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14503) spark.ml Scala API for FPGrowth

2017-03-10 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905840#comment-15905840
 ] 

Maciej Szymkiewicz commented on SPARK-14503:


I think we should keep only unique predictions in {{transform}}; otherwise it is 
possible to get results like this:

{code}
scala> val data = 
spark.read.text("data/mllib/sample_fpgrowth.txt").select(split($"value", 
"\\s+").alias("features")) 
data: org.apache.spark.sql.DataFrame = [features: array]

scala> fpm.transform(Seq(Array("t", "s")).toDF("features")).show(1, false)
+--------+---------------------+
|features|prediction           |
+--------+---------------------+
|[t, s]  |[y, x, z, x, y, x, z]|
+--------+---------------------+

{code}
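
In other words, the consequents gathered from the matching rules should be 
de-duplicated before being returned, e.g.:
{code}
// The prediction above contains repeated items; applying distinct to the
// collected consequents would yield the expected [y, x, z].
Seq("y", "x", "z", "x", "y", "x", "z").distinct   // List(y, x, z)
{code}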

> spark.ml Scala API for FPGrowth
> ---
>
> Key: SPARK-14503
> URL: https://issues.apache.org/jira/browse/SPARK-14503
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
> Fix For: 2.2.0
>
>
> This task is the first port of spark.mllib.fpm functionality to spark.ml 
> (Scala).
> This will require a brief design doc to confirm a reasonable DataFrame-based 
> API, with details for this class.  The doc could also look ahead to the other 
> fpm classes, especially if their API decisions will affect FPGrowth.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-10 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905834#comment-15905834
 ] 

Maciej Szymkiewicz commented on SPARK-19899:


Thanks [~yuhaoyan].

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> The current implementation extends {{HasFeaturesCol}}. Personally I find this 
> rather unfortunate. Up to this moment we have used consistent conventions - if 
> we mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array}} (and possibly for 
> {{array}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.
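
A rough sketch of what such a trait could look like (the doc string and Param 
plumbing below are assumptions, not an agreed design):
{code}
import org.apache.spark.ml.param.{Param, Params}

private[ml] trait HasTransactionsCol extends Params {

  // Input column expected to hold an array of items (one transaction per row).
  final val transactionsCol: Param[String] =
    new Param[String](this, "transactionsCol", "column holding item transactions")

  final def getTransactionsCol: String = $(transactionsCol)
}
{code}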



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Affects Version/s: 2.2.0

> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905774#comment-15905774
 ] 

Cody Koeninger commented on SPARK-18057:


Based on previous kafka client upgrades I wouldn't expect them to be binary 
compatible, so it's likely to cause someone problems if they were also making 
use of kafka client libraries in their spark job.  Still may be the path of 
least resistance.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19910) `stack` should not reject NULL values due to type mismatch

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19910:


Assignee: (was: Apache Spark)

> `stack` should not reject NULL values due to type mismatch
> --
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Dongjoon Hyun
>
> Since the `stack` function generates a table with nullable columns, it should 
> allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' 
> due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); 
> line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19910) `stack` should not reject NULL values due to type mismatch

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19910:


Assignee: Apache Spark

> `stack` should not reject NULL values due to type mismatch
> --
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>
> Since the `stack` function generates a table with nullable columns, it should 
> allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' 
> due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); 
> line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19910) `stack` should not reject NULL values due to type mismatch

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905766#comment-15905766
 ] 

Apache Spark commented on SPARK-19910:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/17251

> `stack` should not reject NULL values due to type mismatch
> --
>
> Key: SPARK-19910
> URL: https://issues.apache.org/jira/browse/SPARK-19910
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Dongjoon Hyun
>
> Since the `stack` function generates a table with nullable columns, it should 
> allow mixed null values.
> {code}
> scala> sql("select stack(3, 1, 2, 3)").printSchema
> root
>  |-- col0: integer (nullable = true)
> scala> sql("select stack(3, 1, 2, null)").printSchema
> org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' 
> due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); 
> line 1 pos 7;
> 'Project [unresolvedalias(stack(3, 1, 2, null), None)]
> +- OneRowRelation$
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905755#comment-15905755
 ] 

Michael Armbrust commented on SPARK-18057:
--

It seems like we can upgrade the existing Kafka10 artifacts without causing any 
compatibility issues (since 0.10.2.0 is compatible with 0.10.0.0+), so I don't 
think there is any need to make new artifacts or do any refactoring.  I think 
we can just upgrade?

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905747#comment-15905747
 ] 

Cody Koeninger commented on SPARK-18057:


I think the bigger question is once there's a kafka version you want to upgrade 
to, are you going to just forcibly upgrade, make another set of separate 
artifacts, or refactor common code so that it can use a different / provided 
kafka version.  Ditto for the DStream, unless you're just abandoning it.

> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19911) Add builder interface for Kinesis DStreams

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19911:


Assignee: Apache Spark

> Add builder interface for Kinesis DStreams
> --
>
> Key: SPARK-19911
> URL: https://issues.apache.org/jira/browse/SPARK-19911
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Apache Spark
>Priority: Minor
>
> The ```KinesisUtils.createStream()``` interface for creating Kinesis-based 
> DStreams is quite brittle and requires adding a combinatorial number of 
> overrides whenever another optional configuration parameter is added. This 
> makes incorporating a lot of additional features supported by the Kinesis 
> Client Library such as per-service authorization unfeasible. This interface 
> should be replaced by a builder pattern class 
> (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater 
> extensibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19911) Add builder interface for Kinesis DStreams

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19911:


Assignee: (was: Apache Spark)

> Add builder interface for Kinesis DStreams
> --
>
> Key: SPARK-19911
> URL: https://issues.apache.org/jira/browse/SPARK-19911
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Priority: Minor
>
> The ```KinesisUtils.createStream()``` interface for creating Kinesis-based 
> DStreams is quite brittle and requires adding a combinatorial number of 
> overrides whenever another optional configuration parameter is added. This 
> makes incorporating a lot of additional features supported by the Kinesis 
> Client Library such as per-service authorization unfeasible. This interface 
> should be replaced by a builder pattern class 
> (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater 
> extensibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19911) Add builder interface for Kinesis DStreams

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905741#comment-15905741
 ] 

Apache Spark commented on SPARK-19911:
--

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/17250

> Add builder interface for Kinesis DStreams
> --
>
> Key: SPARK-19911
> URL: https://issues.apache.org/jira/browse/SPARK-19911
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Priority: Minor
>
> The ```KinesisUtils.createStream()``` interface for creating Kinesis-based 
> DStreams is quite brittle and requires adding a combinatorial number of 
> overrides whenever another optional configuration parameter is added. This 
> makes incorporating a lot of additional features supported by the Kinesis 
> Client Library such as per-service authorization unfeasible. This interface 
> should be replaced by a builder pattern class 
> (https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater 
> extensibility.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17979) Remove deprecated support for config SPARK_YARN_USER_ENV

2017-03-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17979.

   Resolution: Fixed
 Assignee: Yong Tang
Fix Version/s: 2.2.0

> Remove deprecated support for config SPARK_YARN_USER_ENV 
> -
>
> Key: SPARK-17979
> URL: https://issues.apache.org/jira/browse/SPARK-17979
> Project: Spark
>  Issue Type: Improvement
>Reporter: Kishor Patil
>Assignee: Yong Tang
>Priority: Trivial
> Fix For: 2.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable

2017-03-10 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-14453.

   Resolution: Fixed
 Assignee: Yong Tang
Fix Version/s: 2.2.0

> Remove SPARK_JAVA_OPTS environment variable
> ---
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Assignee: Yong Tang
>Priority: Minor
> Fix For: 2.2.0
>
>
> SPARK_JAVA_OPTS was deprecated since 1.0, with the release of major version 
> (2.0), I think it would be better to remove the support of this env variable.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly

2017-03-10 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905730#comment-15905730
 ] 

Cody Koeninger commented on SPARK-19888:


That stacktrace also shows a concurrent modification exception, yes? See 
SPARK-19185 for that.

See e.g. SPARK-19680 for background on why offset out of range may occur on the 
executor when it doesn't on the driver. Although if you're using reset to 
latest, this is kind of surprising unless you have really short retention.

> Seeing offsets not resetting even when reset policy is configured explicitly
> 
>
> Key: SPARK-19888
> URL: https://issues.apache.org/jira/browse/SPARK-19888
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Justin Miller
>
> I was told to post this in a Spark ticket from KAFKA-4396:
> I've been seeing a curious error with kafka 0.10 (spark 2.11), these may be 
> two separate errors, I'm not sure. What's puzzling is that I'm setting 
> auto.offset.reset to latest and it's still throwing an 
> OffsetOutOfRangeException, behavior that's contrary to the code. Please help! 
> :)
> {code}
> val kafkaParams = Map[String, Object](
>   "group.id" -> consumerGroup,
>   "bootstrap.servers" -> bootstrapServers,
>   "key.deserializer" -> classOf[ByteArrayDeserializer],
>   "value.deserializer" -> classOf[MessageRowDeserializer],
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean),
>   "max.poll.records" -> persisterConfig.maxPollRecords,
>   "request.timeout.ms" -> persisterConfig.requestTimeoutMs,
>   "session.timeout.ms" -> persisterConfig.sessionTimeoutMs,
>   "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs,
>   "connections.max.idle.ms"-> persisterConfig.connectionsMaxIdleMs
> )
> {code}
> {code}
> 16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory 
> on xyz (size: 146.3 KB, free: 8.4 GB)
> 16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID 
> 38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {topic=231884473}
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 in stage 154.0 (TID 
> 39388) in 12043 ms on xyz (1/16)
> 16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID 
> 39375) in 13444 ms on xyz (2/16)
> 16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID 
> 38843, xyz): 

[jira] [Created] (SPARK-19911) Add builder interface for Kinesis DStreams

2017-03-10 Thread Adam Budde (JIRA)
Adam Budde created SPARK-19911:
--

 Summary: Add builder interface for Kinesis DStreams
 Key: SPARK-19911
 URL: https://issues.apache.org/jira/browse/SPARK-19911
 Project: Spark
  Issue Type: New Feature
  Components: DStreams
Affects Versions: 2.1.0
Reporter: Adam Budde
Priority: Minor


The ```KinesisUtils.createStream()``` interface for creating Kinesis-based 
DStreams is quite brittle and requires adding a combinatorial number of 
overrides whenever another optional configuration parameter is added. This 
makes incorporating a lot of additional features supported by the Kinesis 
Client Library such as per-service authorization unfeasible. This interface 
should be replaced by a builder pattern class 
(https://en.wikipedia.org/wiki/Builder_pattern) to allow for greater 
extensibility.
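
A hypothetical sketch of what builder-style usage could look like (class and 
setter names are illustrative, not a final API; ssc and Seconds come from the 
usual streaming imports):
{code}
// Optional settings are supplied individually, so new options can be added
// later without breaking existing callers or multiplying overloads.
val kinesisStream = KinesisInputDStream.builder
  .streamingContext(ssc)
  .streamName("my-kinesis-stream")
  .regionName("us-east-1")
  .checkpointAppName("my-app")
  .checkpointInterval(Seconds(10))
  .build()
{code}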



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905716#comment-15905716
 ] 

Shixiong Zhu edited comment on SPARK-18057 at 3/10/17 9:21 PM:
---

I did some investigation yesterday, and found one issue in 0.10.2.0:
https://issues.apache.org/jira/browse/KAFKA-4879 : KafkaConsumer.position may 
hang forever when deleting a topic

Our current tests will just hang forever due to KAFKA-4879. This prevents us 
from upgrading 0.10.2.0.

I also went through the Kafka tickets between 0.10.0.1 and 0.10.2.0. Let me try 
to summarize the current situation:

The benefits of upgrading Kafka client to 0.10.2.0:
- Forward compatibility
- Reading topics from a timestamp
- The following bug fixes:

Issues that we already have workarounds:
https://issues.apache.org/jira/browse/KAFKA-4375 : Kafka consumer may swallow 
some interrupts meant for the calling thread
https://issues.apache.org/jira/browse/KAFKA-4387 : KafkaConsumer will enter an 
infinite loop if the polling thread is interrupted, and either commitSync or 
committed is called
https://issues.apache.org/jira/browse/KAFKA-4536 : Kafka clients throw 
NullPointerException on poll when delete the relative topic

Issues related to Kafka record compression
https://issues.apache.org/jira/browse/KAFKA-3937 : Kafka Clients Leak Native 
Memory For Longer Than Needed With Compressed Messages
https://issues.apache.org/jira/browse/KAFKA-4549 : KafkaLZ4OutputStream does 
not write EndMark if flush() is not called before close()

Others:
https://issues.apache.org/jira/browse/KAFKA-2948 : Kafka producer does not cope 
well with topic deletions

For 0.10.1.x, KAFKA-4547 prevents us from upgrading to 0.10.1.x.

Lastly, IMO, "Reading topics from a timestamp" is pretty useful and is the 
most important reason to upgrade Kafka. However, since the Spark 2.2 code 
freeze is coming, we won't have enough time to deliver this feature to users, 
so it's fine to just wait for KAFKA-4879 to be fixed in the next Kafka 
release. I don't think the next Kafka release will be later than Spark 2.3, so 
we should be able to upgrade Kafka before Spark 2.3.



was (Author: zsxwing):
I did some investigation yesterday, and found one issue in 0.10.2.0:
https://issues.apache.org/jira/browse/KAFKA-4879 : KafkaConsumer.position may 
hang forever when deleting a topic

Our current tests will just hang forever due to KAFKA-4879. This prevents us 
from upgrading 0.10.2.0.

I also went through the Kafka tickets between 0.10.0.1 and 0.10.2.0. Let me try 
to summary the current situation:

The benefits of upgrading Kafka client to 0.10.2.0:
- Forward compatibility
- Reading topics from a timestamp
- The following bug fixes:

Issues that we already have workarounds:
https://issues.apache.org/jira/browse/KAFKA-4375 : Kafka consumer may swallow 
some interrupts meant for the calling thread
https://issues.apache.org/jira/browse/KAFKA-4387 : KafkaConsumer will enter an 
infinite loop if the polling thread is interrupted, and either commitSync or 
committed is called
https://issues.apache.org/jira/browse/KAFKA-4536 : Kafka clients throw 
NullPointerException on poll when delete the relative topic

Issues related to Kafka record compression
https://issues.apache.org/jira/browse/KAFKA-3937 : Kafka Clients Leak Native 
Memory For Longer Than Needed With Compressed Messages
https://issues.apache.org/jira/browse/KAFKA-4549 : KafkaLZ4OutputStream does 
not write EndMark if flush() is not called before close()

Others:
https://issues.apache.org/jira/browse/KAFKA-2948 : Kafka producer does not cope 
well with topic deletions

For 0.10.1.x, KAFKA-4547 prevents us from upgrading to 0.10.1.x.

At last, IMO, "Reading topics from a timestamp" is pretty useful and is the 
most important reason that we should upgrade Kafka. However, since the Spark 
2.2 code freeze is coming, we won't get enough time to deliver this feature to 
the user, it's fine to just wait for them fixing KAFKA-4879 in the next Kafka 
release. I don't think the next Kafka release will be later than Spark 2.3.


> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905716#comment-15905716
 ] 

Shixiong Zhu edited comment on SPARK-18057 at 3/10/17 9:21 PM:
---

I did some investigation yesterday, and found one issue in 0.10.2.0:
https://issues.apache.org/jira/browse/KAFKA-4879 : KafkaConsumer.position may 
hang forever when deleting a topic

Our current tests will just hang forever due to KAFKA-4879. This prevents us 
from upgrading 0.10.2.0.

I also went through the Kafka tickets between 0.10.0.1 and 0.10.2.0. Let me try 
to summarize the current situation:

The benefits of upgrading Kafka client to 0.10.2.0:
- Forward compatibility
- Reading topics from a timestamp
- The following bug fixes:

Issues that we already have workarounds:
https://issues.apache.org/jira/browse/KAFKA-4375 : Kafka consumer may swallow 
some interrupts meant for the calling thread
https://issues.apache.org/jira/browse/KAFKA-4387 : KafkaConsumer will enter an 
infinite loop if the polling thread is interrupted, and either commitSync or 
committed is called
https://issues.apache.org/jira/browse/KAFKA-4536 : Kafka clients throw 
NullPointerException on poll when delete the relative topic

Issues related to Kafka record compression
https://issues.apache.org/jira/browse/KAFKA-3937 : Kafka Clients Leak Native 
Memory For Longer Than Needed With Compressed Messages
https://issues.apache.org/jira/browse/KAFKA-4549 : KafkaLZ4OutputStream does 
not write EndMark if flush() is not called before close()

Others:
https://issues.apache.org/jira/browse/KAFKA-2948 : Kafka producer does not cope 
well with topic deletions

For 0.10.1.x, KAFKA-4547 prevents us from upgrading to 0.10.1.x.

Lastly, IMO, "Reading topics from a timestamp" is pretty useful and is the 
most important reason to upgrade Kafka. However, since the Spark 2.2 code 
freeze is coming, we won't have enough time to deliver this feature to users, 
so it's fine to just wait for KAFKA-4879 to be fixed in the next Kafka 
release. I don't think the next Kafka release will be later than Spark 2.3.



was (Author: zsxwing):
I did some investigation yesterday, and found one issue in 0.10.2.0:
https://issues.apache.org/jira/browse/KAFKA-4879 : KafkaConsumer.position may 
hang forever when deleting a topic

Our current tests will just hang forever due to KAFKA-4879. This prevents us 
from upgrading 0.10.2.0.

I also went through the Kafka tickets between 0.10.0.1 and 0.10.2.0. Let me try 
to summary the current situation:

The benefits of upgrading Kafka client to 0.10.2.0:
- Forward compatibility
- Reading topics from a timestamp
- The following bug fixes:

Issues that we already have workarounds:
https://issues.apache.org/jira/browse/KAFKA-4375 : Kafka consumer may swallow 
some interrupts meant for the calling thread
https://issues.apache.org/jira/browse/KAFKA-4387 : KafkaConsumer will enter an 
infinite loop if the polling thread is interrupted, and either commitSync or 
committed is called
https://issues.apache.org/jira/browse/KAFKA-4536 : Kafka clients throw 
NullPointerException on poll when delete the relative topic

Issues related to Kafka record compression
https://issues.apache.org/jira/browse/KAFKA-3937 : Kafka Clients Leak Native 
Memory For Longer Than Needed With Compressed Messages
https://issues.apache.org/jira/browse/KAFKA-4549 : KafkaLZ4OutputStream does 
not write EndMark if flush() is not called before close()

Others:
https://issues.apache.org/jira/browse/KAFKA-2948 : Kafka producer does not cope 
well with topic deletions

For 0.10.1.*, KAFKA-4547 prevents us from upgrading to 0.10.1.*.

At last, IMO, "Reading topics from a timestamp" is pretty useful and is the 
most important reason that we should upgrade Kafka. However, since the Spark 
2.2 code freeze is coming, we won't get enough time to deliver this feature to 
the user, it's fine to just wait for them fixing KAFKA-4879 in the next Kafka 
release. I don't think the next Kafka release will be later than Spark 2.3.


> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18057) Update structured streaming kafka from 10.0.1 to 10.2.0

2017-03-10 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905716#comment-15905716
 ] 

Shixiong Zhu commented on SPARK-18057:
--

I did some investigation yesterday, and found one issue in 0.10.2.0:
https://issues.apache.org/jira/browse/KAFKA-4879 : KafkaConsumer.position may 
hang forever when deleting a topic

Our current tests will just hang forever due to KAFKA-4879. This prevents us 
from upgrading 0.10.2.0.

I also went through the Kafka tickets between 0.10.0.1 and 0.10.2.0. Let me try 
to summarize the current situation:

The benefits of upgrading Kafka client to 0.10.2.0:
- Forward compatibility
- Reading topics from a timestamp
- The following bug fixes:

Issues that we already have workarounds:
https://issues.apache.org/jira/browse/KAFKA-4375 : Kafka consumer may swallow 
some interrupts meant for the calling thread
https://issues.apache.org/jira/browse/KAFKA-4387 : KafkaConsumer will enter an 
infinite loop if the polling thread is interrupted, and either commitSync or 
committed is called
https://issues.apache.org/jira/browse/KAFKA-4536 : Kafka clients throw 
NullPointerException on poll when delete the relative topic

Issues related to Kafka record compression
https://issues.apache.org/jira/browse/KAFKA-3937 : Kafka Clients Leak Native 
Memory For Longer Than Needed With Compressed Messages
https://issues.apache.org/jira/browse/KAFKA-4549 : KafkaLZ4OutputStream does 
not write EndMark if flush() is not called before close()

Others:
https://issues.apache.org/jira/browse/KAFKA-2948 : Kafka producer does not cope 
well with topic deletions

For 0.10.1.*, KAFKA-4547 prevents us from upgrading to 0.10.1.*.

Lastly, IMO, "Reading topics from a timestamp" is pretty useful and is the 
most important reason to upgrade Kafka. However, since the Spark 2.2 code 
freeze is coming, we won't have enough time to deliver this feature to users, 
so it's fine to just wait for KAFKA-4879 to be fixed in the next Kafka 
release. I don't think the next Kafka release will be later than Spark 2.3.


> Update structured streaming kafka from 10.0.1 to 10.2.0
> ---
>
> Key: SPARK-18057
> URL: https://issues.apache.org/jira/browse/SPARK-18057
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Reporter: Cody Koeninger
>
> There are a couple of relevant KIPs here, 
> https://archive.apache.org/dist/kafka/0.10.1.0/RELEASE_NOTES.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19611) Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905705#comment-15905705
 ] 

Apache Spark commented on SPARK-19611:
--

User 'budde' has created a pull request for this issue:
https://github.com/apache/spark/pull/17249

> Spark 2.1.0 breaks some Hive tables backed by case-sensitive data files
> ---
>
> Key: SPARK-19611
> URL: https://issues.apache.org/jira/browse/SPARK-19611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Adam Budde
>Assignee: Adam Budde
> Fix For: 2.2.0
>
>
> This issue replaces 
> [SPARK-19455|https://issues.apache.org/jira/browse/SPARK-19455] and [PR 
> #16797|https://github.com/apache/spark/pull/16797]
> [SPARK-16980|https://issues.apache.org/jira/browse/SPARK-16980] removed the 
> schema inference from the HiveMetastoreCatalog class when converting a 
> MetastoreRelation to a LogicalRelation (HadoopFsRelation, in this case) in 
> favor of simply using the schema returned by the metastore. This results in 
> an optimization, as the underlying file statuses no longer need to be resolved 
> until after the partition pruning step, reducing the number of files to be 
> touched significantly in some cases. The downside is that the data schema 
> used may no longer match the underlying file schema for case-sensitive 
> formats such as Parquet.
> [SPARK-17183|https://issues.apache.org/jira/browse/SPARK-17183] added support 
> for saving a case-sensitive copy of the schema in the metastore table 
> properties, which HiveExternalCatalog will read in as the table's schema if 
> it is present. If it is not present, it will fall back to the 
> case-insensitive metastore schema.
> Unfortunately, this silently breaks queries over tables where the underlying 
> data fields are case-sensitive but a case-sensitive schema wasn't written to 
> the table properties by Spark. This situation will occur for any Hive table 
> that wasn't created by Spark or that was created prior to Spark 2.1.0. If a 
> user attempts to run a query over such a table containing a case-sensitive 
> field name in the query projection or in the query filter, the query will 
> return 0 results in every case.
> The change we are proposing is to bring back the schema inference that was 
> used prior to Spark 2.1.0 if a case-sensitive schema can't be read from the 
> table properties.
> - INFER_AND_SAVE: Infer a schema from the data files if no case-sensitive 
> schema can be read from the table properties. Attempt to save the inferred 
> schema in the table properties to avoid future inference.
> - INFER_ONLY: Infer the schema if no case-sensitive schema can be read but 
> don't attempt to save it.
> - NEVER_INFER: Fall back to using the case-insensitive schema returned by the 
> Hive Metastore. Useful if the user knows that none of the underlying data is 
> case-sensitive.
> See the discussion on [PR #16797|https://github.com/apache/spark/pull/16797] 
> for more discussion around this issue and the proposed solution.
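
For example, the mode could be selected through a SQL configuration; a hedged 
sketch (the exact configuration key is an assumption here):
{code}
// Choose how Spark handles Hive tables that have no case-sensitive schema in
// their table properties (key name and table name are illustrative).
spark.conf.set("spark.sql.hive.caseSensitiveInferenceMode", "INFER_AND_SAVE")

// Queries over such tables would then infer the schema from the data files
// once and persist it back to the table properties.
spark.sql("SELECT CaseSensitiveField FROM some_hive_table").show()
{code}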



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19910) `stack` should not reject NULL values due to type mismatch

2017-03-10 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-19910:
-

 Summary: `stack` should not reject NULL values due to type mismatch
 Key: SPARK-19910
 URL: https://issues.apache.org/jira/browse/SPARK-19910
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.1.0, 2.0.1, 2.0.0
Reporter: Dongjoon Hyun


Since the `stack` function generates a table with nullable columns, it should allow 
mixed null values.

{code}
scala> sql("select stack(3, 1, 2, 3)").printSchema
root
 |-- col0: integer (nullable = true)

scala> sql("select stack(3, 1, 2, null)").printSchema
org.apache.spark.sql.AnalysisException: cannot resolve 'stack(3, 1, 2, NULL)' 
due to data type mismatch: Argument 1 (IntegerType) != Argument 3 (NullType); 
line 1 pos 7;
'Project [unresolvedalias(stack(3, 1, 2, null), None)]
+- OneRowRelation$
{code}
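
As a workaround sketch until this is fixed, the NULL's type can be made 
explicit with a cast (this should yield the same schema as the first example 
above; the ticket argues such a cast should not be required):
{code}
scala> sql("select stack(3, 1, 2, cast(null as int))").printSchema
{code}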



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19888) Seeing offsets not resetting even when reset policy is configured explicitly

2017-03-10 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-19888:
-
Component/s: (was: Spark Core)
 DStreams

> Seeing offsets not resetting even when reset policy is configured explicitly
> 
>
> Key: SPARK-19888
> URL: https://issues.apache.org/jira/browse/SPARK-19888
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Justin Miller
>
> I was told to post this in a Spark ticket from KAFKA-4396:
> I've been seeing a curious error with kafka 0.10 (spark 2.11), these may be 
> two separate errors, I'm not sure. What's puzzling is that I'm setting 
> auto.offset.reset to latest and it's still throwing an 
> OffsetOutOfRangeException, behavior that's contrary to the code. Please help! 
> :)
> {code}
> val kafkaParams = Map[String, Object](
>   "group.id" -> consumerGroup,
>   "bootstrap.servers" -> bootstrapServers,
>   "key.deserializer" -> classOf[ByteArrayDeserializer],
>   "value.deserializer" -> classOf[MessageRowDeserializer],
>   "auto.offset.reset" -> "latest",
>   "enable.auto.commit" -> (false: java.lang.Boolean),
>   "max.poll.records" -> persisterConfig.maxPollRecords,
>   "request.timeout.ms" -> persisterConfig.requestTimeoutMs,
>   "session.timeout.ms" -> persisterConfig.sessionTimeoutMs,
>   "heartbeat.interval.ms" -> persisterConfig.heartbeatIntervalMs,
>   "connections.max.idle.ms"-> persisterConfig.connectionsMaxIdleMs
> )
> {code}
> {code}
> 16/11/09 23:10:17 INFO BlockManagerInfo: Added broadcast_154_piece0 in memory 
> on xyz (size: 146.3 KB, free: 8.4 GB)
> 16/11/09 23:10:23 WARN TaskSetManager: Lost task 15.0 in stage 151.0 (TID 
> 38837, xyz): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: 
> Offsets out of range with no configured reset policy for partitions: 
> {topic=231884473}
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.parseFetchedData(Fetcher.java:588)
> at 
> org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:354)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1000)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
> at 
> org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:227)
> at 
> org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:193)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:438)
> at 
> org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
> at org.apache.spark.scheduler.Task.run(Task.scala:85)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 16/11/09 23:10:29 INFO TaskSetManager: Finished task 10.0 in stage 154.0 (TID 
> 39388) in 12043 ms on xyz (1/16)
> 16/11/09 23:10:31 INFO TaskSetManager: Finished task 0.0 in stage 154.0 (TID 
> 39375) in 13444 ms on xyz (2/16)
> 16/11/09 23:10:44 WARN TaskSetManager: Lost task 1.0 in stage 151.0 (TID 
> 38843, xyz): java.util.ConcurrentModificationException: KafkaConsumer is not 
> safe for multi-threaded access
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1431)
> at 
> org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:929)
>  
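For context, a minimal sketch (assumed names: ssc, topics, and the MessageRow value type implied by the reporter's deserializer) of how such params are typically wired into the 0-10 direct stream:
{code}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// kafkaParams is the map quoted above; topics is an Iterable[String]
val stream = KafkaUtils.createDirectStream[Array[Byte], MessageRow](
  ssc, PreferConsistent, Subscribe[Array[Byte], MessageRow](topics, kafkaParams))
{code}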

[jira] [Commented] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905641#comment-15905641
 ] 

Apache Spark commented on SPARK-19909:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/17248

> Batches will fail in case that temporary checkpoint dir is on local file 
> system while metadata dir is on HDFS
> -
>
> Key: SPARK-19909
> URL: https://issues.apache.org/jira/browse/SPARK-19909
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When we try to run Structured Streaming in local mode but use HDFS for the 
> storage, batches will fail with an error like the following.
> {code}
> val handle = stream.writeStream.format("console").start()
> 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
> StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
> /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kou, access=WRITE, 
> inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
> {code}
> This is because the temporary checkpoint directory is created on the local 
> file system, while the metadata, whose path is derived from the checkpoint 
> directory, is created on HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19909:


Assignee: (was: Apache Spark)

> Batches will fail in case that temporary checkpoint dir is on local file 
> system while metadata dir is on HDFS
> -
>
> Key: SPARK-19909
> URL: https://issues.apache.org/jira/browse/SPARK-19909
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> When we try to run Structured Streaming in local mode but use HDFS for the 
> storage, batches will fail with an error like the following.
> {code}
> val handle = stream.writeStream.format("console").start()
> 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
> StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
> /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kou, access=WRITE, 
> inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
> {code}
> This is because the temporary checkpoint directory is created on the local 
> file system, while the metadata, whose path is derived from the checkpoint 
> directory, is created on HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19909:


Assignee: Apache Spark

> Batches will fail in case that temporary checkpoint dir is on local file 
> system while metadata dir is on HDFS
> -
>
> Key: SPARK-19909
> URL: https://issues.apache.org/jira/browse/SPARK-19909
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> When we try to run Structured Streaming in local mode but use HDFS for the 
> storage, batches will fail with an error like the following.
> {code}
> val handle = stream.writeStream.format("console").start()
> 17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
> StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
> /private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=kou, access=WRITE, 
> inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
> {code}
> This is because the temporary checkpoint directory is created on the local 
> file system, while the metadata, whose path is derived from the checkpoint 
> directory, is created on HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19909) Batches will fail in case that temporary checkpoint dir is on local file system while metadata dir is on HDFS

2017-03-10 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-19909:
--

 Summary: Batches will fail in case that temporary checkpoint dir 
is on local file system while metadata dir is on HDFS
 Key: SPARK-19909
 URL: https://issues.apache.org/jira/browse/SPARK-19909
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Kousuke Saruta
Priority: Minor


When we try to run Structured Streaming in local mode but use HDFS for the 
storage, batches will fail with an error like the following.

{code}
val handle = stream.writeStream.format("console").start()
17/03/09 16:54:45 ERROR StreamMetadata: Error writing stream metadata 
StreamMetadata(fc07a0b1-5423-483e-a59d-b2206a49491e) to 
/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata
org.apache.hadoop.security.AccessControlException: Permission denied: user=kou, 
access=WRITE, 
inode="/private/var/folders/4y/tmspvv353y59p3w4lknrf7ccgn/T/temporary-79d4fe05-4301-4b6d-a902-dff642d0ddca/metadata":hdfs:supergroup:drwxr-xr-x
{code}

This is because the temporary checkpoint directory is created on the local file 
system, while the metadata, whose path is derived from the checkpoint directory, 
is created on HDFS.
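Until this is fixed, a possible workaround (a sketch; the HDFS path is a placeholder) is to set the checkpoint location explicitly so that it lives on the same file system as the metadata:
{code}
val handle = stream.writeStream
  .format("console")
  .option("checkpointLocation", "hdfs:///tmp/checkpoints/console-query")
  .start()
{code}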



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19905:


Assignee: Cheng Lian  (was: Apache Spark)

> Dataset.inputFiles is broken for Hive SerDe tables
> --
>
> Key: SPARK-19905
> URL: https://issues.apache.org/jira/browse/SPARK-19905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t")
> spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t")
> spark.table("u").inputFiles.foreach(println)
> {code}
> In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like
> {noformat}
> file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u
> {noformat}
> on my laptop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19905:


Assignee: Apache Spark  (was: Cheng Lian)

> Dataset.inputFiles is broken for Hive SerDe tables
> --
>
> Key: SPARK-19905
> URL: https://issues.apache.org/jira/browse/SPARK-19905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> The following snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t")
> spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t")
> spark.table("u").inputFiles.foreach(println)
> {code}
> In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like
> {noformat}
> file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u
> {noformat}
> on my laptop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905608#comment-15905608
 ] 

Apache Spark commented on SPARK-19905:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/17247

> Dataset.inputFiles is broken for Hive SerDe tables
> --
>
> Key: SPARK-19905
> URL: https://issues.apache.org/jira/browse/SPARK-19905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> The following snippet reproduces this issue:
> {code}
> spark.range(10).createOrReplaceTempView("t")
> spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t")
> spark.table("u").inputFiles.foreach(println)
> {code}
> In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like
> {noformat}
> file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u
> {noformat}
> on my laptop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19863) Whether or not to use CachedKafkaConsumer needs to be configurable when using DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application

2017-03-10 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905603#comment-15905603
 ] 

Cody Koeninger commented on SPARK-19863:


Isn't this basically a duplicate of
SPARK-19185
with the same implications?

> Whether or not to use CachedKafkaConsumer needs to be configurable when using 
> DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application
> 
>
> Key: SPARK-19863
> URL: https://issues.apache.org/jira/browse/SPARK-19863
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams, Input/Output
>Affects Versions: 2.1.0
>Reporter: LvDongrong
>
> Whether or not to use CachedKafkaConsumer should be configurable when using 
> DirectKafkaInputDStream to connect to Kafka in a Spark Streaming application. 
> In Spark 2.x, the Kafka consumer was replaced by CachedKafkaConsumer (cached 
> consumers keep their connections to the Kafka cluster open), and there is no 
> way to change this. In fact, KafkaRDD (used by DirectKafkaInputDStream to 
> connect to Kafka) provides the parameter useConsumerCache to choose whether to 
> use the CachedKafkaConsumer, but DirectKafkaInputDStream hard-codes the 
> parameter to true.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19908) Direct buffer memory OOM should not cause stage retries.

2017-03-10 Thread Zhan Zhang (JIRA)
Zhan Zhang created SPARK-19908:
--

 Summary: Direct buffer memory OOM should not cause stage retries.
 Key: SPARK-19908
 URL: https://issues.apache.org/jira/browse/SPARK-19908
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.1.0
Reporter: Zhan Zhang
Priority: Minor


Currently, if there is a java.lang.OutOfMemoryError: Direct buffer memory, the 
exception is converted to a FetchFailedException, causing stage retries.
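Independent of how the exception is classified, a common mitigation (a sketch; the size is only illustrative) is to raise the executor's direct memory limit, e.g. in spark-defaults.conf:
{code}
spark.executor.extraJavaOptions  -XX:MaxDirectMemorySize=2g
{code}
The reported stack trace: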

org.apache.spark.shuffle.FetchFailedException: Direct buffer memory
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:357)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:332)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:54)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:40)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:731)
at 
org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextOuterJoinRows(SortMergeJoinExec.scala:692)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceStream(SortMergeJoinExec.scala:854)
at 
org.apache.spark.sql.execution.joins.OneSideOuterIterator.advanceNext(SortMergeJoinExec.scala:887)
at 
org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:161)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:278)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:658)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:645)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:228)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:212)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:132)
at 
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:271)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:155)
at 
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:146)
at 
io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:107)
at 
io.netty.channel.AdaptiveRecvByteBufAllocator$HandleImpl.allocate(AdaptiveRecvByteBufAllocator.java:104)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:117)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 

[jira] [Updated] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website

2017-03-10 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger updated SPARK-19904:
---
Description: 
see
http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html

> SPIP Add Spark Project Improvement Proposal doc to website
> --
>
> Key: SPARK-19904
> URL: https://issues.apache.org/jira/browse/SPARK-19904
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Cody Koeninger
>  Labels: SPIP
>
> see
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-Improvement-Proposals-td19268.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-19907) Spark Submit Does not pick up the HBase Jars

2017-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-19907.
-

> Spark Submit Does not pick up the HBase Jars
> 
>
> Key: SPARK-19907
> URL: https://issues.apache.org/jira/browse/SPARK-19907
> Project: Spark
>  Issue Type: Task
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.0.0
> Environment: Linux, cloudera-jdk-1.7
>Reporter: Ramchandhar Rapolu
>Priority: Blocker
>
> Using properties file: 
> /opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf
> Adding default property: 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> Adding default property: 
> spark.yarn.jars=local:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/jars/*:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/hbase/jars/*
> Adding default property: spark.eventLog.enabled=true
> Adding default property: spark.hadoop.mapreduce.application.classpath=
> Adding default property: spark.shuffle.service.enabled=true
> Adding default property: 
> spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
> Adding default property: 
> spark.yarn.historyServer.address=http://instance-26765.bigstep.io:18089
> Adding default property: spark.ui.killEnabled=true
> Adding default property: 
> spark.sql.hive.metastore.jars=${env:HADOOP_COMMON_HOME}/../hive/lib/*:${env:HADOOP_COMMON_HOME}/client/*
> Adding default property: spark.dynamicAllocation.schedulerBacklogTimeout=1
> Adding default property: 
> spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
> Adding default property: spark.yarn.config.gatewayPath=/opt/cloudera/parcels
> Adding default property: 
> spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
> Adding default property: spark.submit.deployMode=client
> Adding default property: spark.shuffle.service.port=7337
> Adding default property: spark.master=yarn
> Adding default property: spark.authenticate=false
> Adding default property: 
> spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
> Adding default property: 
> spark.eventLog.dir=hdfs://instance-26765.bigstep.io:8020/user/spark/spark2ApplicationHistory
> Adding default property: spark.dynamicAllocation.enabled=true
> Adding default property: spark.sql.catalogImplementation=hive
> Adding default property: spark.hadoop.yarn.application.classpath=
> Adding default property: spark.dynamicAllocation.minExecutors=0
> Adding default property: spark.dynamicAllocation.executorIdleTimeout=60
> Adding default property: spark.sql.hive.metastore.version=1.1.0
> Parsed arguments:
>   master  yarn
>   deployMode  client
>   executorMemory  2g
>   executorCores   1
>   totalExecutorCores  null
>   propertiesFile  
> /opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf
>   driverMemory4g
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  
> /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   com.golfbreaks.spark.streaming.Test
>   primaryResource 
> file:/opt/golfbreaks/spark-jars/streaming-1.0-jar-with-dependencies.jar
>   namecom.golfbreaks.spark.streaming.Test
>   childArgs   []
>   jars
> 

[jira] [Resolved] (SPARK-19907) Spark Submit Does not pick up the HBase Jars

2017-03-10 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-19907.
---
  Resolution: Invalid
Target Version/s:   (was: 2.0.0)

A huge dump of your config and logs certainly isn't a suitable JIRA. Spark 
doesn't include HBase jars, no. This is invalid for several reasons. Please 
read http://spark.apache.org/contributing.html
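For anyone hitting the same problem, the usual approach (a sketch; the HBase jar paths are placeholders) is to pass the jars explicitly on submit instead of relying on the spark.yarn.jars glob:
{code}
spark2-submit \
  --master yarn --deploy-mode client \
  --jars /path/to/hbase-client.jar,/path/to/hbase-common.jar,/path/to/hbase-protocol.jar \
  --class com.golfbreaks.spark.streaming.Test \
  file:/opt/golfbreaks/spark-jars/streaming-1.0-jar-with-dependencies.jar
{code}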

> Spark Submit Does not pick up the HBase Jars
> 
>
> Key: SPARK-19907
> URL: https://issues.apache.org/jira/browse/SPARK-19907
> Project: Spark
>  Issue Type: Task
>  Components: DStreams, Spark Submit, YARN
>Affects Versions: 2.0.0
> Environment: Linux, cloudera-jdk-1.7
>Reporter: Ramchandhar Rapolu
>Priority: Blocker
>
> Using properties file: 
> /opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf
> Adding default property: 
> spark.serializer=org.apache.spark.serializer.KryoSerializer
> Adding default property: 
> spark.yarn.jars=local:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/jars/*:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/hbase/jars/*
> Adding default property: spark.eventLog.enabled=true
> Adding default property: spark.hadoop.mapreduce.application.classpath=
> Adding default property: spark.shuffle.service.enabled=true
> Adding default property: 
> spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
> Adding default property: 
> spark.yarn.historyServer.address=http://instance-26765.bigstep.io:18089
> Adding default property: spark.ui.killEnabled=true
> Adding default property: 
> spark.sql.hive.metastore.jars=${env:HADOOP_COMMON_HOME}/../hive/lib/*:${env:HADOOP_COMMON_HOME}/client/*
> Adding default property: spark.dynamicAllocation.schedulerBacklogTimeout=1
> Adding default property: 
> spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
> Adding default property: spark.yarn.config.gatewayPath=/opt/cloudera/parcels
> Adding default property: 
> spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
> Adding default property: spark.submit.deployMode=client
> Adding default property: spark.shuffle.service.port=7337
> Adding default property: spark.master=yarn
> Adding default property: spark.authenticate=false
> Adding default property: 
> spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
> Adding default property: 
> spark.eventLog.dir=hdfs://instance-26765.bigstep.io:8020/user/spark/spark2ApplicationHistory
> Adding default property: spark.dynamicAllocation.enabled=true
> Adding default property: spark.sql.catalogImplementation=hive
> Adding default property: spark.hadoop.yarn.application.classpath=
> Adding default property: spark.dynamicAllocation.minExecutors=0
> Adding default property: spark.dynamicAllocation.executorIdleTimeout=60
> Adding default property: spark.sql.hive.metastore.version=1.1.0
> Parsed arguments:
>   master  yarn
>   deployMode  client
>   executorMemory  2g
>   executorCores   1
>   totalExecutorCores  null
>   propertiesFile  
> /opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf
>   driverMemory4g
>   driverCores null
>   driverExtraClassPathnull
>   driverExtraLibraryPath  
> /opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
>   driverExtraJavaOptions  null
>   supervise   false
>   queue   null
>   numExecutorsnull
>   files   null
>   pyFiles null
>   archivesnull
>   mainClass   com.golfbreaks.spark.streaming.Test
>   primaryResource 
> file:/opt/golfbreaks/spark-jars/streaming-1.0-jar-with-dependencies.jar
>   namecom.golfbreaks.spark.streaming.Test
>   childArgs   []
>   jars
> 

[jira] [Assigned] (SPARK-19906) Add Documentation for Kafka Write paths

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19906:


Assignee: Apache Spark

> Add Documentation for Kafka Write paths
> ---
>
> Key: SPARK-19906
> URL: https://issues.apache.org/jira/browse/SPARK-19906
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>Assignee: Apache Spark
>
> We need documentation that describes how to write streaming and batch queries 
> to Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19907) Spark Submit Does not pick up the HBase Jars

2017-03-10 Thread Ramchandhar Rapolu (JIRA)
Ramchandhar Rapolu created SPARK-19907:
--

 Summary: Spark Submit Does not pick up the HBase Jars
 Key: SPARK-19907
 URL: https://issues.apache.org/jira/browse/SPARK-19907
 Project: Spark
  Issue Type: Task
  Components: DStreams, Spark Submit, YARN
Affects Versions: 2.0.0
 Environment: Linux, cloudera-jdk-1.7
Reporter: Ramchandhar Rapolu
Priority: Blocker


Using properties file: 
/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf
Adding default property: 
spark.serializer=org.apache.spark.serializer.KryoSerializer
Adding default property: 
spark.yarn.jars=local:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/jars/*:/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/hbase/jars/*
Adding default property: spark.eventLog.enabled=true
Adding default property: spark.hadoop.mapreduce.application.classpath=
Adding default property: spark.shuffle.service.enabled=true
Adding default property: 
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
Adding default property: 
spark.yarn.historyServer.address=http://instance-26765.bigstep.io:18089
Adding default property: spark.ui.killEnabled=true
Adding default property: 
spark.sql.hive.metastore.jars=${env:HADOOP_COMMON_HOME}/../hive/lib/*:${env:HADOOP_COMMON_HOME}/client/*
Adding default property: spark.dynamicAllocation.schedulerBacklogTimeout=1
Adding default property: 
spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
Adding default property: spark.yarn.config.gatewayPath=/opt/cloudera/parcels
Adding default property: 
spark.yarn.config.replacementPath={{HADOOP_COMMON_HOME}}/../../..
Adding default property: spark.submit.deployMode=client
Adding default property: spark.shuffle.service.port=7337
Adding default property: spark.master=yarn
Adding default property: spark.authenticate=false
Adding default property: 
spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
Adding default property: 
spark.eventLog.dir=hdfs://instance-26765.bigstep.io:8020/user/spark/spark2ApplicationHistory
Adding default property: spark.dynamicAllocation.enabled=true
Adding default property: spark.sql.catalogImplementation=hive
Adding default property: spark.hadoop.yarn.application.classpath=
Adding default property: spark.dynamicAllocation.minExecutors=0
Adding default property: spark.dynamicAllocation.executorIdleTimeout=60
Adding default property: spark.sql.hive.metastore.version=1.1.0
Parsed arguments:
  master  yarn
  deployMode  client
  executorMemory  2g
  executorCores   1
  totalExecutorCores  null
  propertiesFile  
/opt/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/conf/spark-defaults.conf
  driverMemory4g
  driverCores null
  driverExtraClassPathnull
  driverExtraLibraryPath  
/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib/hadoop/lib/native
  driverExtraJavaOptions  null
  supervise   false
  queue   null
  numExecutorsnull
  files   null
  pyFiles null
  archivesnull
  mainClass   com.golfbreaks.spark.streaming.Test
  primaryResource 
file:/opt/golfbreaks/spark-jars/streaming-1.0-jar-with-dependencies.jar
  namecom.golfbreaks.spark.streaming.Test
  childArgs   []
  jars

[jira] [Assigned] (SPARK-19906) Add Documentation for Kafka Write paths

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19906:


Assignee: (was: Apache Spark)

> Add Documentation for Kafka Write paths
> ---
>
> Key: SPARK-19906
> URL: https://issues.apache.org/jira/browse/SPARK-19906
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>
> We need documentation that describes how to write streaming and batch queries 
> to Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19906) Add Documentation for Kafka Write paths

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905580#comment-15905580
 ] 

Apache Spark commented on SPARK-19906:
--

User 'tcondie' has created a pull request for this issue:
https://github.com/apache/spark/pull/17246

> Add Documentation for Kafka Write paths
> ---
>
> Key: SPARK-19906
> URL: https://issues.apache.org/jira/browse/SPARK-19906
> Project: Spark
>  Issue Type: Documentation
>  Components: Structured Streaming
>Affects Versions: 2.1.0
>Reporter: Tyson Condie
>
> We need documentation that describes how to write streaming and batch queries 
> to Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19906) Add Documentation for Kafka Write paths

2017-03-10 Thread Tyson Condie (JIRA)
Tyson Condie created SPARK-19906:


 Summary: Add Documentation for Kafka Write paths
 Key: SPARK-19906
 URL: https://issues.apache.org/jira/browse/SPARK-19906
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Tyson Condie


We need documentation that describes how to write streaming and batch queries 
to Kafka.
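For reference, a rough sketch of the streaming write path such documentation would cover, assuming a streaming DataFrame df with key/value columns (bootstrap servers, topic, and checkpoint path are placeholders):
{code}
df.selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092,host2:9092")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint")
  .start()
{code}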



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19620) Incorrect exchange coordinator Id in physical plan

2017-03-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-19620:


Assignee: Carson Wang

> Incorrect exchange coordinator Id in physical plan
> --
>
> Key: SPARK-19620
> URL: https://issues.apache.org/jira/browse/SPARK-19620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Assignee: Carson Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> When adaptive execution is enabled, an exchange coordinator is used in the 
> Exchange operators. For a Join, the same exchange coordinator is used for its 
> two Exchanges, but the physical plan shows two different coordinator IDs, 
> which is confusing.
> Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>:- *Sort [key1#3L ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>: +- *Project [(id#0L % 500) AS key1#3L]
>:+- *Filter isnotnull((id#0L % 500))
>:   +- *Range (0, 1000, step=1, splits=Some(10))
>+- *Sort [key2#11L ASC NULLS FIRST], false, 0
>   +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
> +- *Filter isnotnull((id#8L % 500))
>+- *Range (0, 1000, step=1, splits=Some(10))
> {code}
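For context, a plan like the one above can be produced with adaptive execution turned on (a sketch; the configs are the standard adaptive-execution settings, values only illustrative):
{code}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.shuffle.targetPostShuffleInputSize", "67108864")

import spark.implicits._
val df1 = spark.range(0, 1000).selectExpr("id % 500 AS key1")
val df2 = spark.range(0, 1000).selectExpr("id % 500 AS key2", "id AS value2")
df1.join(df2, $"key1" === $"key2").select($"key1", $"value2").explain()
{code}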



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19620) Incorrect exchange coordinator Id in physical plan

2017-03-10 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-19620.
--
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 16952
[https://github.com/apache/spark/pull/16952]

> Incorrect exchange coordinator Id in physical plan
> --
>
> Key: SPARK-19620
> URL: https://issues.apache.org/jira/browse/SPARK-19620
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Carson Wang
>Priority: Minor
> Fix For: 2.2.0
>
>
> When adaptive execution is enabled, an exchange coordinator is used in the 
> Exchange operators. For a Join, the same exchange coordinator is used for its 
> two Exchanges, but the physical plan shows two different coordinator IDs, 
> which is confusing.
> Here is an example:
> {code}
> == Physical Plan ==
> *Project [key1#3L, value2#12L]
> +- *SortMergeJoin [key1#3L], [key2#11L], Inner
>:- *Sort [key1#3L ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 1804587700) hashpartitioning(key1#3L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>: +- *Project [(id#0L % 500) AS key1#3L]
>:+- *Filter isnotnull((id#0L % 500))
>:   +- *Range (0, 1000, step=1, splits=Some(10))
>+- *Sort [key2#11L ASC NULLS FIRST], false, 0
>   +- Exchange(coordinator id: 793927319) hashpartitioning(key2#11L, 10), 
> coordinator[target post-shuffle partition size: 67108864]
>  +- *Project [(id#8L % 500) AS key2#11L, id#8L AS value2#12L]
> +- *Filter isnotnull((id#8L % 500))
>+- *Range (0, 1000, step=1, splits=Some(10))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19885.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

> The config ignoreCorruptFiles doesn't work for CSV
> --
>
> Key: SPARK-19885
> URL: https://issues.apache.org/jira/browse/SPARK-19885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
> Fix For: 2.2.0
>
>
> CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
> "ignoreCorruptFiles" doesn't work.
> {code}
> java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1981)
>   at 

[jira] [Created] (SPARK-19905) Dataset.inputFiles is broken for Hive SerDe tables

2017-03-10 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-19905:
--

 Summary: Dataset.inputFiles is broken for Hive SerDe tables
 Key: SPARK-19905
 URL: https://issues.apache.org/jira/browse/SPARK-19905
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Cheng Lian
Assignee: Cheng Lian


The following snippet reproduces this issue:
{code}
spark.range(10).createOrReplaceTempView("t")
spark.sql("CREATE TABLE u STORED AS RCFILE AS SELECT * FROM t")
spark.table("u").inputFiles.foreach(println)
{code}
In Spark 2.2, it prints nothing, while in Spark 2.1, it prints something like
{noformat}
file:/Users/lian/local/var/lib/hive/warehouse_1.2.1/u
{noformat}
on my laptop.
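For comparison, a data source table created the same way is expected to report its files (a sketch, not verified here):
{code}
spark.sql("CREATE TABLE v USING PARQUET AS SELECT * FROM t")
spark.table("v").inputFiles.foreach(println)  // expected to print the table's data files
{code}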



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19885) The config ignoreCorruptFiles doesn't work for CSV

2017-03-10 Thread Wenchen Fan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905564#comment-15905564
 ] 

Wenchen Fan commented on SPARK-19885:
-

Oh, so this issue is already fixed by SPARK-18362 in Spark 2.2.

Since it's not a critical issue and SPARK-18362 is an optimization, we should 
not backport it. Let's just resolve this ticket.
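For reference, the flag under discussion is the standard SQL option (a sketch; the input path is a placeholder):
{code}
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val df = spark.read.option("header", "true").csv("/data/maybe-corrupt/")
{code}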

> The config ignoreCorruptFiles doesn't work for CSV
> --
>
> Key: SPARK-19885
> URL: https://issues.apache.org/jira/browse/SPARK-19885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Shixiong Zhu
> Fix For: 2.2.0
>
>
> CSVFileFormat.inferSchema doesn't use FileScanRDD so the SQL 
> "ignoreCorruptFiles" doesn't work.
> {code}
> java.io.EOFException: Unexpected end of input stream
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:145)
>   at 
> org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
>   at java.io.InputStream.read(InputStream.java:101)
>   at org.apache.hadoop.util.LineReader.fillBuffer(LineReader.java:180)
>   at 
> org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>   at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:248)
>   at 
> org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:48)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:266)
>   at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:211)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
>   at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
>   at 
> scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:214)
>   at scala.collection.AbstractIterator.aggregate(Iterator.scala:1336)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1112)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at 
> org.apache.spark.SparkContext$$anonfun$33.apply(SparkContext.scala:1980)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> 

[jira] [Updated] (SPARK-19893) should not run DataFrame set operations with map type

2017-03-10 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-19893:

Summary: should not run DataFrame set operations with map type  (was: Cannot 
run intersect/except/distinct with map type)

> should not run DataFrame set operations with map type
> 
>
> Key: SPARK-19893
> URL: https://issues.apache.org/jira/browse/SPARK-19893
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>
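Since the description is empty, a minimal illustration of the behavior being guarded against (a sketch, assuming spark-shell):
{code}
val df = spark.range(2).selectExpr("id", "map('a', id) AS m")
df.distinct()     // set operations over a map-typed column should be rejected up front
df.except(df)
df.intersect(df)
{code}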




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14453) Remove SPARK_JAVA_OPTS environment variable

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905518#comment-15905518
 ] 

Apache Spark commented on SPARK-14453:
--

User 'yongtang' has created a pull request for this issue:
https://github.com/apache/spark/pull/17212

> Remove SPARK_JAVA_OPTS environment variable
> ---
>
> Key: SPARK-14453
> URL: https://issues.apache.org/jira/browse/SPARK-14453
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core, YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> SPARK_JAVA_OPTS has been deprecated since 1.0. With the release of a major 
> version (2.0), I think it would be better to remove support for this env 
> variable.
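For anyone still setting it, the supported replacements (a sketch; the JVM flag is only illustrative) are the per-role options in spark-defaults.conf:
{code}
spark.driver.extraJavaOptions    -XX:+UseG1GC
spark.executor.extraJavaOptions  -XX:+UseG1GC
{code}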



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Description: 
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of a persisted partitioned 
table, this magic string is not interpreted as {{NULL}} but as a regular string.

  was:
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of a persisted partitioned 
table, this magic string is not interpreted as {{NULL}} but as a regular string.


> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{\_\_HIVE_DEFAULT_PARTITION\_\_}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned 
> table, this magic string is not interpreted as {{NULL}} but as a regular string.
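Until this is fixed, a possible read-side workaround (a sketch, not verified here) is to map the sentinel back to a real NULL:
{code}
import org.apache.spark.sql.functions._

val fixed = spark.table("test_null").withColumn(
  "a", when(col("a") === "__HIVE_DEFAULT_PARTITION__", lit(null).cast("string")).otherwise(col("a")))
{code}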



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, 

[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905492#comment-15905492
 ] 

yuhao yang commented on SPARK-19899:


also cc [~podongfeng] since I recall he mentioned using SetInputCol in the 
original PR. 

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> The current implementation extends {{HasFeaturesCol}}. Personally I find that 
> rather unfortunate. Up to this moment we have used consistent conventions: if 
> we mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array}} (and possibly for 
> {{array}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.
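To make the proposal concrete, a rough sketch of what such a trait could look like (names are illustrative only, not an agreed API):
{code}
import org.apache.spark.ml.param.{Param, Params}

private[ml] trait HasTransactionsCol extends Params {
  // column expected to hold the per-row arrays of transaction items
  final val transactionsCol: Param[String] =
    new Param[String](this, "transactionsCol", "column holding the transaction item arrays")
  final def getTransactionsCol: String = $(transactionsCol)
}
{code}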



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905484#comment-15905484
 ] 

yuhao yang commented on SPARK-19899:


Thanks for the reply. We can wait for some time to see if people like the idea.

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> The current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we have used consistent conventions - if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array<_>}} (and possibly for 
> {{array<array<_>>}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19887) __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in partitioned persisted tables

2017-03-10 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-19887:
---
Description: 
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned tables use the magic string 
{{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
partition directory names. However, in the case of a persisted partitioned table, 
this magic string is not interpreted as {{NULL}} but as a regular string.

  was:
The following Spark shell snippet under Spark 2.1 reproduces this issue:

{code}
val data = Seq(
  ("p1", 1, 1),
  ("p2", 2, 2),
  (null, 3, 3)
)

// Correct case: Saving partitioned data to file system.

val path = "/tmp/partitioned"

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  parquet(path)

spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
// +---+---+---+
// |c  |a  |b  |
// +---+---+---+
// |2  |p2 |2  |
// |1  |p1 |1  |
// +---+---+---+

// Incorrect case: Saving partitioned data as persisted table.

data.
  toDF("a", "b", "c").
  write.
  mode("overwrite").
  partitionBy("a", "b").
  saveAsTable("test_null")

spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
// +---+--+---+
// |c  |a |b  |
// +---+--+---+
// |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
// |1  |p1|1  |
// |2  |p2|2  |
// +---+--+---+
{code}

Hive-style partitioned table uses magic string {{"__HIVE_DEFAULT_PARTITION__"}} 
to indicate {{NULL}} partition values in partition directory names. However, in 
the case persisted partitioned table, this magic string is not interpreted as 
{{NULL}} but a regular string.



> __HIVE_DEFAULT_PARTITION__ is not interpreted as NULL partition value in 
> partitioned persisted tables
> -
>
> Key: SPARK-19887
> URL: https://issues.apache.org/jira/browse/SPARK-19887
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Cheng Lian
>
> The following Spark shell snippet under Spark 2.1 reproduces this issue:
> {code}
> val data = Seq(
>   ("p1", 1, 1),
>   ("p2", 2, 2),
>   (null, 3, 3)
> )
> // Correct case: Saving partitioned data to file system.
> val path = "/tmp/partitioned"
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   parquet(path)
> spark.read.parquet(path).filter($"a".isNotNull).show(truncate = false)
> // +---+---+---+
> // |c  |a  |b  |
> // +---+---+---+
> // |2  |p2 |2  |
> // |1  |p1 |1  |
> // +---+---+---+
> // Incorrect case: Saving partitioned data as persisted table.
> data.
>   toDF("a", "b", "c").
>   write.
>   mode("overwrite").
>   partitionBy("a", "b").
>   saveAsTable("test_null")
> spark.table("test_null").filter($"a".isNotNull).show(truncate = false)
> // +---+--+---+
> // |c  |a |b  |
> // +---+--+---+
> // |3  |__HIVE_DEFAULT_PARTITION__|3  | <-- This line should not be here
> // |1  |p1|1  |
> // |2  |p2|2  |
> // +---+--+---+
> {code}
> Hive-style partitioned tables use the magic string 
> {{__HIVE_DEFAULT_PARTITION__}} to indicate {{NULL}} partition values in 
> partition directory names. However, in the case of a persisted partitioned table, 
> this magic string is not interpreted as {{NULL}} but as a regular string.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-10 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905480#comment-15905480
 ] 

Maciej Szymkiewicz commented on SPARK-19899:


This is just an idea, but I would start with:

- {{featuresCol}} - {{VectorUDT}}
- {{transactionsCol}} - {{Array<\_>}} - for frequent (unordered) pattern mining.
- {{sequencesCol}} - {{Array<Array<\_>>}} - for sequential pattern mining.
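For illustration only, a rough Scala sketch of what such shared-param traits could look like (the trait and param names simply mirror the list above; nothing here is an existing Spark API):

{code}
import org.apache.spark.ml.param.{Param, Params}

// Hypothetical shared-param traits sketched after the naming proposal above;
// these are illustrations only, not existing Spark traits.
trait HasTransactionsCol extends Params {
  // Column holding one transaction (an array of items) per row.
  final val transactionsCol: Param[String] =
    new Param[String](this, "transactionsCol", "input column holding transactions")

  final def getTransactionsCol: String = $(transactionsCol)
}

trait HasSequencesCol extends Params {
  // Column holding one sequence (an array of itemsets) per row.
  final val sequencesCol: Param[String] =
    new Param[String](this, "sequencesCol", "input column holding sequences")

  final def getSequencesCol: String = $(sequencesCol)
}
{code}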



> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> The current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we have used consistent conventions - if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array<_>}} (and possibly for 
> {{array<array<_>>}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-10 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905451#comment-15905451
 ] 

yuhao yang commented on SPARK-19899:


{quote}
 if we mix-in HasFeaturesCol the featuresCol should be VectorUDT.
{quote}

I guess I misunderstood and thought you wanted to support vectors for FPGrowth. 
Using a SparseVector to represent records does not seem unreasonable to me, and 
supporting it would be easy and straightforward. But surely we don't need to 
support it until there's a clear requirement. May I know how you would like to 
name the new trait for array<array<_>>, since users will need to invoke 
setCol("...") during fitting? Then we can see if it's more intuitive.

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> The current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we have used consistent conventions - if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array<_>}} (and possibly for 
> {{array<array<_>>}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19904) SPIP Add Spark Project Improvement Proposal doc to website

2017-03-10 Thread Cody Koeninger (JIRA)
Cody Koeninger created SPARK-19904:
--

 Summary: SPIP Add Spark Project Improvement Proposal doc to website
 Key: SPARK-19904
 URL: https://issues.apache.org/jira/browse/SPARK-19904
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.1.0
Reporter: Cody Koeninger






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19786) Facilitate loop optimizations in a JIT compiler regarding range()

2017-03-10 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-19786.
---
   Resolution: Fixed
 Assignee: Kazuaki Ishizaki
Fix Version/s: 2.2.0

> Facilitate loop optimizations in a JIT compiler regarding range()
> -
>
> Key: SPARK-19786
> URL: https://issues.apache.org/jira/browse/SPARK-19786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Kazuaki Ishizaki
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0
>
>
> [This 
> article|https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html]
>  suggests that better generated code can improve performance by facilitating 
> compiler optimizations.
> This JIRA changes the generated code for {{range()}} to facilitate loop 
> optimizations in a JIT compiler and achieve better performance.
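A rough spark-shell sketch for observing the effect (the row count is arbitrary and the timing only indicative; this is not the benchmark from the article):

{code}
// Micro-benchmark sketch: a sum over a large range() exercises the generated
// loop whose shape this change makes friendlier to the JIT compiler.
val n = 1L << 30
val start = System.nanoTime()
spark.range(n).selectExpr("sum(id)").collect()
println(s"sum over $n rows took ${(System.nanoTime() - start) / 1e9} s")
{code}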



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19850) Support aliased expressions in function parameters

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19850:


Assignee: Herman van Hovell  (was: Apache Spark)

> Support aliased expressions in function parameters
> --
>
> Key: SPARK-19850
> URL: https://issues.apache.org/jira/browse/SPARK-19850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The SQL parser currently does not allow a user to pass an aliased expression 
> as a function parameter. This can be useful if we want to create a struct. For 
> example, {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a 
> struct with columns {{c}} and {{d}} instead of {{col1}} and {{col2}}.
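For comparison, a spark-shell sketch (assuming only a throwaway {{tbl_x}} with integer columns {{a}} and {{b}}) of how the field names can already be controlled today, which is roughly what the parser change would allow directly inside {{struct(...)}} in SQL:

{code}
// Hypothetical table used only for illustration.
Seq((1, 2), (3, 4)).toDF("a", "b").createOrReplaceTempView("tbl_x")

// DataFrame API today: column aliases become the struct field names.
import org.apache.spark.sql.functions.struct
spark.table("tbl_x")
  .select(struct(($"a" + 1).as("c"), ($"b" + 4).as("d")).as("s"))
  .printSchema()

// SQL today: named_struct spells out the field names explicitly.
spark.sql("SELECT named_struct('c', a + 1, 'd', b + 4) AS s FROM tbl_x").printSchema()
{code}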



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19899) FPGrowth input column naming

2017-03-10 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905405#comment-15905405
 ] 

Maciej Szymkiewicz commented on SPARK-19899:


In my opinion a trait for each input category ({{Vector}}, {{array<\_>}}, 
{{array<array<\_>>}}) is the way to go. The development overhead is low (these 
things are small and easy to test), it is unlikely we'll need much more any 
time soon, and this gives us some way to communicate the expected input.

I am strongly against using {{Vector}} - it is counterintuitive, requires a 
lot of additional effort, and without any supported way of mapping from a vector 
back to features (I don't count {{Column}} metadata) it will significantly degrade 
the user experience. Moreover, it won't be useful for {{PrefixSpan}} at all. I 
believe we should acknowledge that pattern mining techniques are 
significantly different from the common {{ml}} algorithms and not hesitate to 
reflect that in the API. 

> FPGrowth input column naming
> 
>
> Key: SPARK-19899
> URL: https://issues.apache.org/jira/browse/SPARK-19899
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Maciej Szymkiewicz
>
> The current implementation extends {{HasFeaturesCol}}. Personally I find it 
> rather unfortunate. Up to this moment we have used consistent conventions - if we 
> mix in {{HasFeaturesCol}}, the {{featuresCol}} should be a {{VectorUDT}}. 
> Using the same {{Param}} for an {{array<_>}} (and possibly for 
> {{array<array<_>>}} once {{PrefixSpan}} is ported to {{ml}}) will be 
> confusing for the users.
> I would like to suggest adding a new {{trait}} (let's say 
> {{HasTransactionsCol}}) to clearly indicate that the input type differs from 
> the other {{Estimators}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19850) Support aliased expressions in function parameters

2017-03-10 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-19850:


Assignee: Apache Spark  (was: Herman van Hovell)

> Support aliased expressions in function parameters
> --
>
> Key: SPARK-19850
> URL: https://issues.apache.org/jira/browse/SPARK-19850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> The SQL parser currently does not allow a user to pass an aliased expression 
> as a function parameter. This can be useful if we want to create a struct. For 
> example, {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a 
> struct with columns {{c}} and {{d}} instead of {{col1}} and {{col2}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19850) Support aliased expressions in function parameters

2017-03-10 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905404#comment-15905404
 ] 

Apache Spark commented on SPARK-19850:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/17245

> Support aliased expressions in function parameters
> --
>
> Key: SPARK-19850
> URL: https://issues.apache.org/jira/browse/SPARK-19850
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The SQL parser currently does not allow a user to pass an aliased expression 
> as a function parameter. This can be useful if we want to create a struct. For 
> example, {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a 
> struct with columns {{c}} and {{d}} instead of {{col1}} and {{col2}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


