[jira] [Created] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-27 Thread zenglinxi (JIRA)
zenglinxi created SPARK-21223:
-

 Summary: Thread-safety issue in FsHistoryProvider 
 Key: SPARK-21223
 URL: https://issues.apache.org/jira/browse/SPARK-21223
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.1
Reporter: zenglinxi


Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the 
FsHistoryProvider class to map each event log path to its attempt info. 
When a thread pool is used to replay the log files in the list and merge the list of 
old applications with the new ones, multiple threads may update fileToAppInfo at the 
same time, which can cause thread-safety issues.
{code:java}
for (file <- logInfos) {
  tasks += replayExecutor.submit(new Runnable {
    override def run(): Unit = mergeApplicationListing(file)
  })
}
{code}
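
For illustration, here is a minimal, self-contained Scala sketch (not the actual 
FsHistoryProvider code) of one way to make such concurrent updates safe: back the 
map with a java.util.concurrent.ConcurrentHashMap so each put from a replay task is 
atomic. The AttemptInfo case class and the string keys are stand-ins for the real 
event-log path and attempt metadata.

{code:java}
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

object ConcurrentMapSketch {
  // Stand-in for the real attempt metadata stored in fileToAppInfo.
  final case class AttemptInfo(appId: String)

  def main(args: Array[String]): Unit = {
    // A plain scala.collection.mutable.HashMap could lose or corrupt entries when
    // several replay tasks call put() concurrently; ConcurrentHashMap makes each
    // put() atomic.
    val fileToAppInfo = new ConcurrentHashMap[String, AttemptInfo]()

    val replayExecutor = Executors.newFixedThreadPool(4)
    (1 to 100).foreach { i =>
      replayExecutor.submit(new Runnable {
        override def run(): Unit =
          fileToAppInfo.put(s"eventlog-$i", AttemptInfo(s"app-$i"))
      })
    }
    replayExecutor.shutdown()
    replayExecutor.awaitTermination(10, TimeUnit.SECONDS)
    println(fileToAppInfo.size())  // 100: no updates lost with the concurrent map
  }
}
{code}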




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21223:


Assignee: Apache Spark

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>Assignee: Apache Spark
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the 
> FsHistoryProvider class to map each event log path to its attempt info. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with the new ones, multiple threads may update fileToAppInfo at the 
> same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21223:


Assignee: (was: Apache Spark)

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the 
> FsHistoryProvider class to map each event log path to its attempt info. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with the new ones, multiple threads may update fileToAppInfo at the 
> same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064406#comment-16064406
 ] 

Apache Spark commented on SPARK-21223:
--

User 'zenglinxi0615' has created a pull request for this issue:
https://github.com/apache/spark/pull/18430

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>
> Currently, the Spark HistoryServer uses a HashMap named fileToAppInfo in the 
> FsHistoryProvider class to map each event log path to its attempt info. 
> When a thread pool is used to replay the log files in the list and merge the list of 
> old applications with the new ones, multiple threads may update fileToAppInfo at the 
> same time, which can cause thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>   tasks += replayExecutor.submit(new Runnable {
>     override def run(): Unit = mergeApplicationListing(file)
>   })
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21224) Support a DDL-formatted string in DataFrameReader.schema in R

2017-06-27 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21224:


 Summary: Support a DDL-formatted string in DataFrameReader.schema 
in R
 Key: SPARK-21224
 URL: https://issues.apache.org/jira/browse/SPARK-21224
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Hyukjin Kwon
Priority: Minor


This might have to be a follow-up to SPARK-20431, but I just decided to make 
this separate for R specifically as it might be confused.

Please refer to the discussion in the PR and SPARK-20431.

In short, this JIRA describes support for a DDL-formatted string as the 
schema, as below:

{code}
mockLines <- c("{\"name\":\"Michael\"}",
   "{\"name\":\"Andy\", \"age\":30}",
   "{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)

df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
{code}
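
For comparison, a rough Scala sketch of the behaviour this JIRA asks to mirror in R, 
assuming a Spark version where DataFrameReader.schema accepts a DDL-formatted string 
(the Scala-side change tracked by SPARK-20431); the temp path below is illustrative.

{code:java}
import org.apache.spark.sql.SparkSession

object DdlSchemaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ddl-schema-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Write a tiny JSON dataset, mirroring the mockLines used in the R example.
    val jsonPath = java.nio.file.Files.createTempDirectory("ddl-schema-test").resolve("people").toString
    Seq("""{"name":"Michael"}""", """{"name":"Andy", "age":30}""", """{"name":"Justin", "age":19}""")
      .toDS().write.text(jsonPath)

    // Read it back with the schema given as a DDL-formatted string.
    val df = spark.read.schema("name STRING, age DOUBLE").json(jsonPath)
    df.show()

    spark.stop()
  }
}
{code}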



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21224:
-
Summary: Support a DDL-formatted string as schema in reading for R  (was: 
Support a DDL-formatted string in DataFrameReader.schema in R)

> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a follow-up to SPARK-20431, but I just decided to make 
> this separate for R specifically as it might be confused.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21208:


Assignee: (was: Apache Spark)

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> SparkR.
> I was not able to find a way to call *setLocalProperty* on sc.
> We need the ability to call *setLocalProperty* on the SparkContext (similar to 
> what is available for pyspark and scala).
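
For reference, a small Scala sketch of the existing SparkContext API the reporter 
wants exposed in SparkR; the scheduler-pool property used below is just a common 
example of a local property.

{code:java}
import org.apache.spark.{SparkConf, SparkContext}

object LocalPropertySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("local-property-sketch").setMaster("local[*]"))

    // setLocalProperty attaches a thread-local property to jobs submitted
    // from the current thread, e.g. to pick a fair-scheduler pool.
    sc.setLocalProperty("spark.scheduler.pool", "reporting")
    println(sc.getLocalProperty("spark.scheduler.pool"))  // prints "reporting"

    sc.stop()
  }
}
{code}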



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21224:
-
Description: 
This might have to be a follow-up to SPARK-20431, but I just decided to make 
this separate for R specifically as it might be confusing.

Please refer to the discussion in the PR and SPARK-20431.

In short, this JIRA describes support for a DDL-formatted string as the 
schema, as below:

{code}
mockLines <- c("{\"name\":\"Michael\"}",
   "{\"name\":\"Andy\", \"age\":30}",
   "{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)

df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
{code}

  was:
This might have to be a follow-up to SPARK-20431, but I just decided to make 
this separate for R specifically as it might be confused.

Please refer to the discussion in the PR and SPARK-20431.

In short, this JIRA describes support for a DDL-formatted string as the 
schema, as below:

{code}
mockLines <- c("{\"name\":\"Michael\"}",
   "{\"name\":\"Andy\", \"age\":30}",
   "{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)

df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
{code}


> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a follow-up to SPARK-20431, but I just decided to make 
> this separate for R specifically as it might be confusing.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21222) Move elimination of Distinct clause from analyzer to optimizer

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064441#comment-16064441
 ] 

Sean Owen commented on SPARK-21222:
---

Do you mean this is the same as "SELECT MAX(a) FROM ..." ? Your two examples 
are identical.

> Move elimination of Distinct clause from analyzer to optimizer
> --
>
> Key: SPARK-21222
> URL: https://issues.apache.org/jira/browse/SPARK-21222
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
>
> When the Distinct clause appears inside a MAX/MIN clause, 
> "Select MAX(distinct a) FROM src from"
> is equivalent to 
> "Select MAX(distinct a) FROM src from"
> However, this optimization is implemented in the analyzer. It should be in 
> the optimizer.
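
A small Scala sketch of the equivalence presumably intended here (the second example 
above looks like a copy/paste of the first): for MAX/MIN, DISTINCT has no effect, so 
both queries below should return the same result. Table and column names are 
illustrative.

{code:java}
import org.apache.spark.sql.SparkSession

object MaxDistinctSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("max-distinct-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq(1, 1, 2, 3).toDF("a").createOrReplaceTempView("src")

    // MAX ignores duplicates, so eliminating DISTINCT does not change the result.
    spark.sql("SELECT MAX(DISTINCT a) FROM src").show()  // 3
    spark.sql("SELECT MAX(a) FROM src").show()           // 3

    spark.stop()
  }
}
{code}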



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21208:


Assignee: Apache Spark

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>Assignee: Apache Spark
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> SparkR.
> I was not able to find a way to call *setLocalProperty* on sc.
> We need the ability to call *setLocalProperty* on the SparkContext (similar to 
> what is available for pyspark and scala).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064438#comment-16064438
 ] 

Apache Spark commented on SPARK-21208:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18431

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> SparkR.
> I was not able to find a way to call *setLocalProperty* on sc.
> We need the ability to call *setLocalProperty* on the SparkContext (similar to 
> what is available for pyspark and scala).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21208) Ability to "setLocalProperty" from sc, in sparkR

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064448#comment-16064448
 ] 

Hyukjin Kwon commented on SPARK-21208:
--

Oops, I linked the wrong JIRA. I renamed it back and removed the link.

> Ability to "setLocalProperty" from sc, in sparkR
> 
>
> Key: SPARK-21208
> URL: https://issues.apache.org/jira/browse/SPARK-21208
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.1
>Reporter: Karuppayya
>
> Checked the API 
> [documentation|https://spark.apache.org/docs/latest/api/R/index.html] for 
> SparkR.
> I was not able to find a way to call *setLocalProperty* on sc.
> We need the ability to call *setLocalProperty* on the SparkContext (similar to 
> what is available for pyspark and scala).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064446#comment-16064446
 ] 

Apache Spark commented on SPARK-21224:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/18431

> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a follow-up to SPARK-20431, but I just decided to make 
> this separate for R specifically as many PRs might be confusing.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-27 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-21224:
-
Description: 
This might have to be a follow-up to SPARK-20431, but I just decided to make 
this separate for R specifically as many PRs might be confusing.

Please refer to the discussion in the PR and SPARK-20431.

In short, this JIRA describes support for a DDL-formatted string as the 
schema, as below:

{code}
mockLines <- c("{\"name\":\"Michael\"}",
   "{\"name\":\"Andy\", \"age\":30}",
   "{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)

df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
{code}

  was:
This might have to be a follow-up to SPARK-20431, but I just decided to make 
this separate for R specifically as it might be confusing.

Please refer to the discussion in the PR and SPARK-20431.

In short, this JIRA describes support for a DDL-formatted string as the 
schema, as below:

{code}
mockLines <- c("{\"name\":\"Michael\"}",
   "{\"name\":\"Andy\", \"age\":30}",
   "{\"name\":\"Justin\", \"age\":19}")
jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(mockLines, jsonPath)

df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
collect(df)
{code}


> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a follow-up to SPARK-20431, but I just decided to make 
> this separate for R specifically as many PRs might be confusing.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21224:


Assignee: Apache Spark

> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Minor
>
> This might have to be a follow-up to SPARK-20431, but I just decided to make 
> this separate for R specifically as many PRs might be confusing.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21224) Support a DDL-formatted string as schema in reading for R

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21224:


Assignee: (was: Apache Spark)

> Support a DDL-formatted string as schema in reading for R
> -
>
> Key: SPARK-21224
> URL: https://issues.apache.org/jira/browse/SPARK-21224
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> This might have to be a follow-up to SPARK-20431, but I just decided to make 
> this separate for R specifically as many PRs might be confusing.
> Please refer to the discussion in the PR and SPARK-20431.
> In short, this JIRA describes support for a DDL-formatted string 
> as the schema, as below:
> {code}
> mockLines <- c("{\"name\":\"Michael\"}",
>"{\"name\":\"Andy\", \"age\":30}",
>"{\"name\":\"Justin\", \"age\":19}")
> jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
> writeLines(mockLines, jsonPath)
> df <- read.df(jsonPath, "json", "name STRING, age DOUBLE")
> collect(df)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20977) NPE in CollectionAccumulator

2017-06-27 Thread sharkd tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sharkd tu updated SPARK-20977:
--
Description: 

{code:java}
17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
heartbeat-receiver-event-loop-thread
java.lang.NullPointerException
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


Is this a bug in OpenJDK or in Spark? Has anybody ever hit this problem?

  was:
17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
heartbeat-receiver-event-loop-thread
java.lang.NullPointerException
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.Array

[jira] [Updated] (SPARK-20977) NPE in CollectionAccumulator

2017-06-27 Thread sharkd tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sharkd tu updated SPARK-20977:
--
Environment: (was: OpenJDK 64-Bit Server VM (25.71-b00) for linux-ppc64 
JRE (1.8.0-internal-centos_2017_04_25_01_11-b00), built on Apr 25 2017 01:24:21 
by "centos" with gcc 6.3.1 20170110 (Advance-Toolchain-at10.0) IBM AT 10 branch)

> NPE in CollectionAccumulator
> 
>
> Key: SPARK-20977
> URL: https://issues.apache.org/jira/browse/SPARK-20977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sharkd tu
>
> {code:java}
> 17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
> heartbeat-receiver-event-loop-thread
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
>   at 
> org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
>   at 
> org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Is this a bug in OpenJDK or in Spark? Has anybody ever hit this problem?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-20977) NPE in CollectionAccumulator

2017-06-27 Thread sharkd tu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sharkd tu updated SPARK-20977:
--
Description: 
{code:java}
17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
heartbeat-receiver-event-loop-thread
java.lang.NullPointerException
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:407)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2$$anonfun$run$2.apply$mcV$sp(HeartbeatReceiver.scala:129)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at 
org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1$$anon$2.run(HeartbeatReceiver.scala:128)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}


Is this a bug in Spark? Has anybody ever hit this problem?

  was:

{code:java}
17/06/03 13:39:31 ERROR Utils: Uncaught exception in thread 
heartbeat-receiver-event-loop-thread
java.lang.NullPointerException
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:464)
at 
org.apache.spark.util.CollectionAccumulator.value(AccumulatorV2.scala:439)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6$$anonfun$7.apply(TaskSchedulerImpl.scala:408)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:408)
at 
org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$6.apply(TaskSchedulerImpl.scala:407)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.Arra

[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-27 Thread Peter Bykov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064469#comment-16064469
 ] 

Peter Bykov commented on SPARK-21063:
-

[~q79969786] I have both the Spark hive-thriftserver and the Hive thriftserver, but 
nothing works.
I have several Spark installations, including HDP 2.6 with Spark2 and a launched 
Spark thrift server.
The same behavior is observed on each cluster in 100% of scenarios. 
I am definitely confused by this and don't even know where I should look.



> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21063) Spark return an empty result from remote hadoop cluster

2017-06-27 Thread Peter Bykov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064481#comment-16064481
 ] 

Peter Bykov commented on SPARK-21063:
-

[~q79969786] Also, I can receive data using a JDBC connection in Scala without the 
Spark API.
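
A sketch of the kind of plain-JDBC check described above, with placeholder connection 
details (the host, port, credentials, and table name are not taken from the report):

{code:java}
import java.sql.DriverManager

object PlainJdbcCheck {
  def main(args: Array[String]): Unit = {
    // Uses the hive-jdbc driver directly, bypassing the Spark DataFrame reader.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://remote.hive.local:10000/default", "user", "pass")  // placeholder URL
    try {
      val rs = conn.createStatement().executeQuery("SELECT * FROM test_table LIMIT 10")
      while (rs.next()) println(rs.getString(1))
    } finally {
      conn.close()
    }
  }
}
{code}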

> Spark return an empty result from remote hadoop cluster
> ---
>
> Key: SPARK-21063
> URL: https://issues.apache.org/jira/browse/SPARK-21063
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.1.0, 2.1.1
>Reporter: Peter Bykov
>
> Spark returns an empty result when querying a remote Hadoop cluster.
> All firewall settings have been removed.
> Querying via JDBC works properly using the hive-jdbc driver, version 1.1.1.
> Code snippet is:
> {code:java}
> val spark = SparkSession.builder
> .appName("RemoteSparkTest")
> .master("local")
> .getOrCreate()
> val df = spark.read
>   .option("url", "jdbc:hive2://remote.hive.local:1/default")
>   .option("user", "user")
>   .option("password", "pass")
>   .option("dbtable", "test_table")
>   .option("driver", "org.apache.hive.jdbc.HiveDriver")
>   .format("jdbc")
>   .load()
>  
> df.show()
> {code}
> Result:
> {noformat}
> +---+
> |test_table.test_col|
> +---+
> +---+
> {noformat}
> All manipulations like: 
> {code:java}
> df.select("*").show()
> {code}
> return an empty result too.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-21196:
---

Assignee: Gengliang Wang

> Split codegen info of query plan into sequence
> --
>
> Key: SPARK-21196
> URL: https://issues.apache.org/jira/browse/SPARK-21196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> The codegen info of a query plan can be very long. 
> In the debugging console / web page, it would be more readable if the subtrees 
> and their corresponding codegen were split into a sequence. 
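
For context, a minimal sketch of where such a codegen dump comes from, using the 
debug helpers in org.apache.spark.sql.execution.debug (the query itself is arbitrary):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._  // adds debugCodegen() to Dataset

object CodegenDumpSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("codegen-dump-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(1, 2, 3).toDF("id").filter($"id" > 1).groupBy("id").count()

    // Prints one (plan subtree, generated code) section per whole-stage-codegen
    // stage -- the output this issue proposes splitting into a sequence.
    df.debugCodegen()

    spark.stop()
  }
}
{code}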



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21196) Split codegen info of query plan into sequence

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21196.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18409
[https://github.com/apache/spark/pull/18409]

> Split codegen info of query plan into sequence
> --
>
> Key: SPARK-21196
> URL: https://issues.apache.org/jira/browse/SPARK-21196
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> The codegen info of a query plan can be very long. 
> In the debugging console / web page, it would be more readable if the subtrees 
> and their corresponding codegen were split into a sequence. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-27 Thread yangZhiguo (JIRA)
yangZhiguo created SPARK-21225:
--

 Summary: decrease the Mem using for variable 'tasks' in function 
resourceOffers
 Key: SPARK-21225
 URL: https://issues.apache.org/jira/browse/SPARK-21225
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.1.1, 2.1.0
Reporter: yangZhiguo
Priority: Minor


In the function 'resourceOffers', a variable 'tasks' is declared to store the 
tasks that have been allocated an executor. It is declared like this:
*{color:#d04437}val tasks = shuffledOffers.map(o => new 
ArrayBuffer[TaskDescription](o.cores)){color}*

But I think this code only considers the situation of one task per core. If 
the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so much 
space. I think it can be modified as follows:

val tasks = shuffledOffers.map(o => new 
ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))
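
A self-contained sketch of the sizing argument, with a simplified WorkerOffer and a 
hard-coded CPUS_PER_TASK standing in for the real scheduler types and the 
"spark.task.cpus" setting:

{code:java}
object TaskBufferSizingSketch {
  // Simplified stand-in for the scheduler's WorkerOffer.
  final case class WorkerOffer(executorId: String, host: String, cores: Int)

  def main(args: Array[String]): Unit = {
    val CPUS_PER_TASK = 2  // value of "spark.task.cpus"
    val shuffledOffers = Seq(WorkerOffer("1", "hostA", 8), WorkerOffer("2", "hostB", 6))

    // Current sizing: one buffer slot per core.
    val current = shuffledOffers.map(_.cores)                                              // List(8, 6)
    // Proposed sizing: one slot per schedulable task on that executor.
    val proposed = shuffledOffers.map(o => math.ceil(o.cores * 1.0 / CPUS_PER_TASK).toInt) // List(4, 3)

    println(s"current capacities:  $current")
    println(s"proposed capacities: $proposed")
  }
}
{code}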



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
06-02-53.png
Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 
06-03-29.png

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: Starscream Console on 
> OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 
> 06-03-29.png, Starscream Console on 
> OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
> 06-02-53.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
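
As a rough illustration of the two query shapes being compared, a sketch in Scala 
(the Parquet path is a placeholder; only the column name C1 and the values 10 and 20 
come from the report):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object InVsOrSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("in-vs-or-sketch").master("local[*]").getOrCreate()
    val df = spark.read.parquet("/path/to/data.parquet")  // placeholder path

    // The IN form and its hand-rewritten OR-of-equalities form; the proposal is for
    // the optimizer to push the IN form down to Parquet as the OR form already is.
    val viaIn = df.filter(col("C1").isin(10, 20))
    val viaOr = df.filter(col("C1") === 10 || col("C1") === 20)

    viaIn.explain()
    viaOr.explain()

    spark.stop()
  }
}
{code}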



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: IN Predicate.png
OR Predicate.png

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png, Starscream Console 
> on OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 
> 2017-06-27 06-03-29.png, Starscream Console on 
> OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
> 06-02-53.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: (was: Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 0 2017-06-27 
06-03-29.png)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-21218:
---
Attachment: (was: Starscream Console on 
OTT---Michael-Styles---MBP-15-inch-Mid-2015 - Details for Query 1 2017-06-27 
06-02-53.png)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ] 

Michael Styles commented on SPARK-21218:


By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Pedicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 10:25 AM:
--

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Pedicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 10:30 AM:
--

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see 
attachments). Also, the IN predicate test took about 1.1 minutes, while the OR 
predicate test took about 16 seconds. 


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}
!IN Predicate.png|thumbnail!

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}
!OR Predicate.png|thumbnail!

Notice the difference in the number of output rows for the scan. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21226) Save empty dataframe in pyspark prints nothing

2017-06-27 Thread Carlos M. Casas (JIRA)
Carlos M. Casas created SPARK-21226:
---

 Summary: Save empty dataframe in pyspark prints nothing
 Key: SPARK-21226
 URL: https://issues.apache.org/jira/browse/SPARK-21226
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0, 2.0.0
Reporter: Carlos M. Casas


I try the following:

schema = whatever schema you want
df1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df1.write.parquet("as1")

and I just get a directory as1 with a _SUCCESS file in it. If I try to read 
that file, I get an exception.

On the other hand, if I run:

schema = whatever schema you want
df2 = sqlContext.createDataFrame([], schema)
df2.write.parquet("as2")

I get a directory as2 with some files in it (representing field type 
information?). If I try to read it, it works: it reads an empty df with the 
proper schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-27 Thread yangZhiguo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yangZhiguo updated SPARK-21225:
---
Description: 
In the function 'resourceOffers', a variable 'tasks' is declared to store the 
tasks that have been allocated an executor. It is declared like this:
*{color:#d04437}val tasks = shuffledOffers.map(o => new 
ArrayBuffer[TaskDescription](o.cores)){color}*

But I think this code only considers the situation of one task per core. If 
the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so much 
space. I think it can be modified as follows:

{color:#14892c}*val tasks = shuffledOffers.map(o => new 
ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}

  was:
In the function 'resourceOffers', a variable 'tasks' is declared to store the 
tasks that have been allocated an executor. It is declared like this:
*{color:#d04437}val tasks = shuffledOffers.map(o => new 
ArrayBuffer[TaskDescription](o.cores)){color}*

But I think this code only considers the situation of one task per core. If 
the user configures "spark.task.cpus" as 2 or 3, it really doesn't need so much 
space. I think it can be modified as follows:

val tasks = shuffledOffers.map(o => new 
ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))


> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store 
> the tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> However, I think this code only considers the case of one task per core. 
> If the user configures "spark.task.cpus" to 2 or 3, it really doesn't need 
> so much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064658#comment-16064658
 ] 

Apache Spark commented on SPARK-21225:
--

User 'JackYangzg' has created a pull request for this issue:
https://github.com/apache/spark/pull/18434

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store 
> the tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> However, I think this code only considers the case of one task per core. 
> If the user configures "spark.task.cpus" to 2 or 3, it really doesn't need 
> so much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21225:


Assignee: Apache Spark

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store 
> the tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> However, I think this code only considers the case of one task per core. 
> If the user configures "spark.task.cpus" to 2 or 3, it really doesn't need 
> so much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21225:


Assignee: (was: Apache Spark)

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store 
> the tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> However, I think this code only considers the case of one task per core. 
> If the user configures "spark.task.cpus" to 2 or 3, it really doesn't need 
> so much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21226) Save empty dataframe in pyspark prints nothing

2017-06-27 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21226:
--
Priority: Minor  (was: Major)

What is the error?

> Save empty dataframe in pyspark prints nothing
> --
>
> Key: SPARK-21226
> URL: https://issues.apache.org/jira/browse/SPARK-21226
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Carlos M. Casas
>Priority: Minor
>
> I try the following:
> schema = whatever schema you want
> df1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)
> df1.write.parquet("as1")
> and I just get a directory as1 with only a _SUCCESS file in it. If I try to 
> read that directory back, I get an exception.
> On the other hand, if I run:
> schema = whatever schema you want
> df2 = sqlContext.createDataFrame([], schema)
> df2.write.parquet("as2")
> I get a directory as2 with some files in it (presumably holding the field 
> type information?). If I try to read it, it works: it reads an empty df with 
> the proper schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21223) Thread-safety issue in FsHistoryProvider

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064676#comment-16064676
 ] 

Sean Owen commented on SPARK-21223:
---

[~gostop_zlx] this overlaps a lot with SPARK-21078. Can you look at that one 
and possibly address the comment there too?

> Thread-safety issue in FsHistoryProvider 
> -
>
> Key: SPARK-21223
> URL: https://issues.apache.org/jira/browse/SPARK-21223
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: zenglinxi
>
> Currently, Spark HistoryServer use a HashMap named fileToAppInfo in class 
> FsHistoryProvider to store the map of eventlog path and attemptInfo. 
> When use ThreadPool to Replay the log files in the list and merge the list of 
> old applications with new ones, multi thread may update fileToAppInfo at the 
> same time, which may cause Thread-safety issues.
> {code:java}
> for (file <- logInfos) {
>tasks += replayExecutor.submit(new Runnable {
> override def run(): Unit = mergeApplicationListing(file)
>  })
>  }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21225) decrease the Mem using for variable 'tasks' in function resourceOffers

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064682#comment-16064682
 ] 

Apache Spark commented on SPARK-21225:
--

User 'JackYangzg' has created a pull request for this issue:
https://github.com/apache/spark/pull/18435

> decrease the Mem using for variable 'tasks' in function resourceOffers
> --
>
> Key: SPARK-21225
> URL: https://issues.apache.org/jira/browse/SPARK-21225
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1
>Reporter: yangZhiguo
>Priority: Minor
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In the function 'resourceOffers', a variable 'tasks' is declared to store 
> the tasks that have been allocated an executor. It is declared like this:
> *{color:#d04437}val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](o.cores)){color}*
> However, I think this code only considers the case of one task per core. 
> If the user configures "spark.task.cpus" to 2 or 3, it really doesn't need 
> so much space. I think it can be modified as follows:
> {color:#14892c}*val tasks = shuffledOffers.map(o => new 
> ArrayBuffer[TaskDescription](Math.ceil(o.cores*1.0/CPUS_PER_TASK).toInt))*{color}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21226) Save empty dataframe in pyspark prints nothing

2017-06-27 Thread Carlos M. Casas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064706#comment-16064706
 ] 

Carlos M. Casas commented on SPARK-21226:
-

The error is that two apparently similar (empty) dataframes end up being 
written out in different ways.

The problem is that, depending on how we create the empty dataframe, 
df.write.parquet either creates or does not create the actual .parquet files. 
In either case, it writes the _SUCCESS file. If the parquet files are written, 
the output can be read back; if they are not, it can't. Why does this method 
depend on how we created the dataframe?

Although irrelevant, the exception I see when reading the parquet folder with 
just the _SUCCESS file is:
pyspark.sql.utils.AnalysisException: u'Unable to infer schema for Parquet. It 
must be specified manually.;'
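
In case it is useful to others hitting this, supplying the schema explicitly on 
read seems to sidestep the inference error (a sketch, assuming `spark` is the 
active session and `schema` is the schema the dataframe was created with):

{code:python}
# Skip Parquet schema inference by providing the schema up front.
empty_df = spark.read.schema(schema).parquet("as1")
empty_df.printSchema()
empty_df.count()  # 0
{code}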



> Save empty dataframe in pyspark prints nothing
> --
>
> Key: SPARK-21226
> URL: https://issues.apache.org/jira/browse/SPARK-21226
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Carlos M. Casas
>Priority: Minor
>
> I try the following:
> schema = whatever schema you want
> df1 = sqlContext.createDataFrame(sc.emptyRDD(), schema)
> df1.write.parquet("as1")
> and I just get a directory as1 with only a _SUCCESS file in it. If I try to 
> read that directory back, I get an exception.
> On the other hand, if I run:
> schema = whatever schema you want
> df2 = sqlContext.createDataFrame([], schema)
> df2.write.parquet("as2")
> I get a directory as2 with some files in it (presumably holding the field 
> type information?). If I try to read it, it works: it reads an empty df with 
> the proper schema.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064729#comment-16064729
 ] 

Apache Spark commented on SPARK-20073:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/18436

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay, let's try to construct the exact same variantCount table manually
> // so it has no relationship to the original.
> val variantCountRowsRDD = sc.parallelize(Seq(
> Row("Fred", 2),
> Row(null, 2),
> Row("Amy", 2)))
> 
> val variantCountSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("

[jira] [Assigned] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20073:


Assignee: (was: Apache Spark)

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay, let's try to construct the exact same variantCount table manually
> // so it has no relationship to the original.
> val variantCountRowsRDD = sc.parallelize(Seq(
> Row("Fred", 2),
> Row(null, 2),
> Row("Amy", 2)))
> 
> val variantCountSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("variant_count", IntegerType, nullable = true)))
> 
> val manualVariantCounts = spark.createDataFrame(variantC

[jira] [Assigned] (SPARK-20073) Unexpected Cartesian product when using eqNullSafe in join with a derived table

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-20073:


Assignee: Apache Spark

> Unexpected Cartesian product when using eqNullSafe in join with a derived 
> table
> ---
>
> Key: SPARK-20073
> URL: https://issues.apache.org/jira/browse/SPARK-20073
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.0.2, 2.1.0
>Reporter: Everett Anderson
>Assignee: Apache Spark
>  Labels: correctness
>
> It appears that if you try to join tables A and B when B is derived from A 
> and you use the eqNullSafe / <=> operator for the join condition, Spark 
> performs a Cartesian product.
> However, if you perform the join on tables of the same data when they don't 
> have a relationship, the expected non-Cartesian product join occurs.
> {noformat}
> // Create some fake data.
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.Dataset
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions
> val peopleRowsRDD = sc.parallelize(Seq(
> Row("Fred", 8, 1),
> Row("Fred", 8, 2),
> Row(null, 10, 3),
> Row(null, 10, 4),
> Row("Amy", 12, 5),
> Row("Amy", 12, 6)))
> 
> val peopleSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("group", IntegerType, nullable = true),
> StructField("data", IntegerType, nullable = true)))
> 
> val people = spark.createDataFrame(peopleRowsRDD, peopleSchema)
> people.createOrReplaceTempView("people")
> scala> people.show
> +----+-----+----+
> |name|group|data|
> +----+-----+----+
> |Fred|    8|   1|
> |Fred|    8|   2|
> |null|   10|   3|
> |null|   10|   4|
> | Amy|   12|   5|
> | Amy|   12|   6|
> +----+-----+----+
> // Now create a derived table from that table. It doesn't matter much what.
> val variantCounts = spark.sql("select name, count(distinct(name, group, 
> data)) as variant_count from people group by name having variant_count > 1")
> variantCounts.show
> +----+-------------+
> |name|variant_count|
> +----+-------------+
> |Fred|            2|
> |null|            2|
> | Amy|            2|
> +----+-------------+
> // Now try an inner join using the regular equalTo that drops nulls. This 
> works fine.
> val innerJoinEqualTo = variantCounts.join(people, 
> variantCounts("name").equalTo(people("name")))
> innerJoinEqualTo.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay now lets switch to the <=> operator
> //
> // If you haven't set spark.sql.crossJoin.enabled=true, you'll get an error 
> like
> // "Cartesian joins could be prohibitively expensive and are disabled by 
> default. To explicitly enable them, please set spark.sql.crossJoin.enabled = 
> true;"
> //
> // if you have enabled them, you'll get the table below.
> //
> // However, we really don't want or expect a Cartesian product!
> val innerJoinSqlNullSafeEqOp = variantCounts.join(people, 
> variantCounts("name")<=>(people("name")))
> innerJoinSqlNullSafeEqOp.show
> +----+-------------+----+-----+----+
> |name|variant_count|name|group|data|
> +----+-------------+----+-----+----+
> |Fred|            2|Fred|    8|   1|
> |Fred|            2|Fred|    8|   2|
> |Fred|            2|null|   10|   3|
> |Fred|            2|null|   10|   4|
> |Fred|            2| Amy|   12|   5|
> |Fred|            2| Amy|   12|   6|
> |null|            2|Fred|    8|   1|
> |null|            2|Fred|    8|   2|
> |null|            2|null|   10|   3|
> |null|            2|null|   10|   4|
> |null|            2| Amy|   12|   5|
> |null|            2| Amy|   12|   6|
> | Amy|            2|Fred|    8|   1|
> | Amy|            2|Fred|    8|   2|
> | Amy|            2|null|   10|   3|
> | Amy|            2|null|   10|   4|
> | Amy|            2| Amy|   12|   5|
> | Amy|            2| Amy|   12|   6|
> +----+-------------+----+-----+----+
> // Okay, let's try to construct the exact same variantCount table manually
> // so it has no relationship to the original.
> val variantCountRowsRDD = sc.parallelize(Seq(
> Row("Fred", 2),
> Row(null, 2),
> Row("Amy", 2)))
> 
> val variantCountSchema = StructType(Seq(
> StructField("name", StringType, nullable = true),
> StructField("variant_count", IntegerType, nullable = true)))
> 
> val manualVariantCounts = spark

[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064608#comment-16064608
 ] 

Michael Styles edited comment on SPARK-21218 at 6/27/17 12:17 PM:
--

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.
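
For completeness, a minimal sketch of the equivalence being exploited here (the 
column name and values are just placeholders, not tied to the dataset above):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "C1")

# IN predicate as written by the user.
in_rows = df.filter(F.col("C1").isin(42, 139)).collect()

# Equivalent chain of equality predicates, which Parquet pushdown handles today.
or_rows = df.filter((F.col("C1") == 42) | (F.col("C1") == 139)).collect()

# Both filters select exactly the same rows.
assert sorted(r.C1 for r in in_rows) == sorted(r.C1 for r in or_rows)
{code}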


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

Notice the difference in the number of output rows for the scans (see 
attachments). Also, the IN predicate test took about 1.1 minutes, while the OR 
predicate test took about 16 seconds. 

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064747#comment-16064747
 ] 

Apache Spark commented on SPARK-21176:
--

User 'IngoSchuster' has created a pull request for this issue:
https://github.com/apache/spark/pull/18437

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server; if that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.
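
Putting numbers on the description above (a rough sketch, not Spark or Jetty 
code; the 7-executor / 88-CPU figures come from this report and 200 is the 
default QueuedThreadPool size mentioned in it):

{code:python}
def selector_threads(num_proxy_servlets, available_processors):
    # Jetty's HttpClient transport default: max(1, CPUs / 2) selectors per client.
    per_servlet = max(1, available_processors // 2)
    return num_proxy_servlets * per_servlet

needed = selector_threads(num_proxy_servlets=7, available_processors=88)
print(needed)        # 308
print(needed > 200)  # True -> the default 200-thread pool is exhausted
{code}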



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21176:


Assignee: (was: Apache Spark)

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server; if that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21176) Master UI hangs with spark.ui.reverseProxy=true if the master node has many CPUs

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21176:


Assignee: Apache Spark

> Master UI hangs with spark.ui.reverseProxy=true if the master node has many 
> CPUs
> 
>
> Key: SPARK-21176
> URL: https://issues.apache.org/jira/browse/SPARK-21176
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.2.1
> Environment: ppc64le GNU/Linux, POWER8, only master node is reachable 
> externally other nodes are in an internal network
>Reporter: Ingo Schuster
>Assignee: Apache Spark
>  Labels: network, web-ui
>
> In reverse proxy mode, Spark exhausts the Jetty thread pool if the master 
> node has too many CPUs or the cluster has too many executors:
> For each ProxyServlet, Jetty creates Selector threads: minimum 4, maximum 
> half the number of available CPUs:
> {{this(Math.max(1, Runtime.getRuntime().availableProcessors() / 2));}}
> (see 
> https://github.com/eclipse/jetty.project/blob/0c8273f2ca1f9bf2064cd9c4c939d2546443f759/jetty-client/src/main/java/org/eclipse/jetty/client/http/HttpClientTransportOverHTTP.java)
> In reverse proxy mode, a proxy servlet is set up for each executor.
> I have a system with 7 executors and 88 CPUs on the master node. Jetty tries 
> to instantiate 7*44 = 308 selector threads just for the reverse proxy 
> servlets, but since the QueuedThreadPool is initialized with 200 threads by 
> default, the UI gets stuck.
> I have patched JettyUtils.scala to extend the thread pool ( {{val pool = new 
> QueuedThreadPool(400)}}). With this hack, the UI works.
> Obviously, the Jetty defaults are meant for a real web server; if that has 88 
> CPUs, you certainly expect a lot of traffic.
> For the Spark admin UI however, there will rarely be concurrent accesses for 
> the same application or the same executor.
> I therefore propose to dramatically reduce the number of selector threads 
> that get instantiated - at least by default.
> I will propose a fix in a pull request.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Seydou Dia (JIRA)
Seydou Dia created SPARK-21227:
--

 Summary: Unicode in Json field causes AnalysisException when 
selecting from Dataframe
 Key: SPARK-21227
 URL: https://issues.apache.org/jira/browse/SPARK-21227
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.1.0
Reporter: Seydou Dia


Hi,

please find below the steps to reproduce the issue I am facing:
$ pyspark


{code:python}
Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , '{"city_name": "rome"}'
... , '{"city_name": "berlin"}'
... , '{"cıty_name": "new-york"}'
... , '{"cıty_name": "toronto"}'
... , '{"cıty_name": "chicago"}'
... , '{"cıty_name": "dubai"}']
>>> myRDD = sc.parallelize(js)
>>> myDF = spark.read.json(myRDD)
>>> myDF.printSchema()  
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)

>>> myDF.select(myDF['city_name'])
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
: org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
could be: city_name#29, city_name#30.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
__getitem__
jc = self._jdf.apply(item)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, could 
be: city_name#29, city_name#30.;"



{code}
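
A possible workaround until this is fixed is to rename the columns positionally, 
so that nothing has to resolve the two near-identical names (a sketch; the 
replacement column names are arbitrary):

{code:python}
# DataFrame.toDF() assigns new names by position, bypassing name resolution.
renamed = myDF.toDF("city_name_ascii", "city_name_dotless")
renamed.select(renamed["city_name_ascii"]).show()
{code}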




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Seydou Dia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seydou Dia updated SPARK-21227:
---
Description: 
Hi,

please find below the steps to reproduce the issue I am facing:


{code:python}
$ pyspark

Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , '{"city_name": "rome"}'
... , '{"city_name": "berlin"}'
... , '{"cıty_name": "new-york"}'
... , '{"cıty_name": "toronto"}'
... , '{"cıty_name": "chicago"}'
... , '{"cıty_name": "dubai"}']
>>> myRDD = sc.parallelize(js)
>>> myDF = spark.read.json(myRDD)
>>> myDF.printSchema()  
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)

>>> myDF.select(myDF['city_name'])
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
: org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
could be: city_name#29, city_name#30.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
__getitem__
jc = self._jdf.apply(item)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, could 
be: city_name#29, city_name#30.;"



{code}


  was:
Hi,

please find below the step to reproduce the issue I am facing,
$ pyspark


{code:python}
Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , 

[jira] [Updated] (SPARK-21227) Unicode in Json field causes AnalysisException when selecting from Dataframe

2017-06-27 Thread Seydou Dia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Seydou Dia updated SPARK-21227:
---
Description: 
Hi,

please find below the steps to reproduce the issue I am facing.
First I create a json document with 2 fields:

* city_name
* cıty_name

The first one is plain ASCII, while the second contains a Unicode character 
(ı, a dotless i).
When I try to select from the dataframe I get an 
{noformat}AnalysisException{noformat}.







{code:python}
$ pyspark

Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 3.4.3 (default, Sep  1 2016 23:33:38)
SparkSession available as 'spark'.

>>> sc=spark.sparkContext
>>> js = ['{"city_name": "paris"}'
... , '{"city_name": "rome"}'
... , '{"city_name": "berlin"}'
... , '{"cıty_name": "new-york"}'
... , '{"cıty_name": "toronto"}'
... , '{"cıty_name": "chicago"}'
... , '{"cıty_name": "dubai"}']
>>> myRDD = sc.parallelize(js)
>>> myDF = spark.read.json(myRDD)
>>> myDF.printSchema()  
root
 |-- city_name: string (nullable = true)
 |-- cıty_name: string (nullable = true)

>>> myDF.select(myDF['city_name'])
Traceback (most recent call last):
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 
319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o226.apply.
: org.apache.spark.sql.AnalysisException: Reference 'city_name' is ambiguous, 
could be: city_name#29, city_name#30.;
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:264)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:168)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:217)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1073)
at org.apache.spark.sql.Dataset.apply(Dataset.scala:1059)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "", line 1, in 
  File "/usr/lib/spark/python/pyspark/sql/dataframe.py", line 943, in 
__getitem__
jc = self._jdf.apply(item)
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", 
line 1133, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 69, in deco
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: "Reference 'city_name' is ambiguous, could 
be: city_name#29, city_name#30.;"



{code}


  was:
Hi,

please find below the steps to reproduce the issue I am facing:


{code:python}
$ pyspark

Python 3.4.3 (default, Sep  1 2016, 23:33:38) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4040. 
Attempting port 4041.
17/06/27 12:29:05 WARN Utils: Service 'SparkUI' could not bind on port 4041. 
Attempting port 4042.
17/06/27 12:29:08 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive 
is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
    __
 / __/__  ___ _/ /__

[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-06-27 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064833#comment-16064833
 ] 

Dominic Ricard commented on SPARK-21067:


[~q79969786], yes. As stated in the description, ours is set to 
"/tmp/hive-staging/{user.name}"

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.spark.sql.Dataset.<init>(Dataset.scala:185)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala

[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-06-27 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064833#comment-16064833
 ] 

Dominic Ricard edited comment on SPARK-21067 at 6/27/17 1:31 PM:
-

[~q79969786], yes. As stated in the description, ours is set to 
"/tmp/hive-staging/\{user.name\}"


was (Author: dricard):
[~q79969786], yes. As stated in the description, ours is set to 
"/tmp/hive-staging/{user.name}"

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.executio

[jira] [Created] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)
Bogdan Raducanu created SPARK-21228:
---

 Summary: InSet.doCodeGen incorrect handling of structs
 Key: SPARK-21228
 URL: https://issues.apache.org/jira/browse/SPARK-21228
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Bogdan Raducanu


In InSet it's possible that hset contains GenericInternalRows while the child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains, which 
will always return false in this case.

The following code reproduces the problem:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
```
In.doCodeGen appears to be correct:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-+
| minA|
+-+
|[1,1]|
+-+
```

A solution could be either to do a safe<->unsafe conversion in InSet.doCodeGen 
or to not trigger the InSet optimization at all in this case.
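A minimal sketch of the first option (a hypothetical helper, not an actual 
patch), assuming the struct case is handled by normalizing the elements of hset:
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.StructType

// Normalize every struct element of hset to the UnsafeRow format so that
// hset.contains(...) compares rows in the same representation as the child.
def normalizeHset(hset: Set[Any], elementType: StructType): Set[Any] = {
  val toUnsafe = UnsafeProjection.create(elementType)
  hset.map {
    case row: InternalRow => toUnsafe(row).copy() // copy(): the projection reuses its output buffer
    case other => other                           // non-struct values are left untouched
  }
}
{code}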



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
```
In.doCodeGen appears to be correct:
```
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-+
| minA|
+-+
|[1,1]|
+-+
```

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show -- the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show -- the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.

  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show -- the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet.doCodeGen incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.


> InSet.doCodeGen incorrect handling of structs
> -
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Summary: InSet incorrect handling of structs  (was: InSet.doCodeGen 
incorrect handling of structs)

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
> will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen appears to be correct:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
> not trigger InSet optimization at all in this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bogdan Raducanu updated SPARK-21228:

Description: 
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
doCodeGen and eval) which will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}

In.doCodeGen uses compareStructs and seems to work. In.eval might not work but 
not sure how to reproduce.

{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet or not trigger 
InSet optimization at all in this case.
Need to investigate if In.eval is affected.


  was:
In InSet it's possible that hset contains GenericInternalRows while child 
returns UnsafeRows (and vice versa). InSet.doCodeGen uses hset.contains which 
will always be false in this case.

The following code reproduces the problem:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
default is 10 which requires a longer query text to repro

spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
a").createOrReplaceTempView("A")

sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
UnsafeRows while the list of structs that will become hset will be 
GenericInternalRows
++
|minA|
++
++
{code}
In.doCodeGen appears to be correct:
{code}
spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
will not use InSet
sql("select * from (select min(a) as minA from A) A where minA in 
(named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
2L),named_struct('a', 3L, 'b', 3L))").show

+-+
| minA|
+-+
|[1,1]|
+-+
{code}

Solution could be either to do safe<->unsafe conversion in InSet.doCodeGen or 
not trigger InSet optimization at all in this case.



> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval) which will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet or not 
> trigger InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: is

[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064880#comment-16064880
 ] 

Hyukjin Kwon commented on SPARK-21218:
--

Yea, I support this for what it's worth. Let's resolve this as a duplicate, 
BTW. You could link that JIRA in your PR title.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
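> A minimal sketch of such a rewrite over the public data source filter API 
> (illustrative only, not the proposed patch):
> {code}
> import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}
> // C1 IN (10, 20)  ==>  (C1 = 10) OR (C1 = 20)
> def rewriteIn(in: In): Option[Filter] =
>   in.values.toSeq match {
>     case Seq() => None
>     case vs    => Some(vs.map(v => EqualTo(in.attribute, v): Filter)
>                          .reduce((l, r) => Or(l, r)))
>   }
> {code}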



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064945#comment-16064945
 ] 

Bogdan Raducanu commented on SPARK-21228:
-

I tested manually (since there is no flag to disable codegen for expressions) 
and In.eval also fails, so only In.doCodeGen appears to be correct.

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval) which will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet or not 
> trigger InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20002) Add support for unions between streaming and batch datasets

2017-06-27 Thread Leon Pham (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064947#comment-16064947
 ] 

Leon Pham commented on SPARK-20002:
---

We're actually reading data from two different sources, and one of them is a 
batch source that doesn't make sense as a streaming one. We're trying to union 
the batch data with the streaming datasets so the combined data can be analyzed 
together.
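A minimal sketch of the kind of union we mean (illustrative only; the rate 
source and the literal range stand in for our real sources):
{code}
// Streaming side: built-in "rate" source, keeping one Long column named "value".
val streaming = spark.readStream.format("rate").load().select("value")

// Batch side: a plain DataFrame with a matching schema.
val batch = spark.range(10).toDF("value")

// This is what we would like to write; today the analyzer rejects a
// union between a streaming and a batch Dataset.
val combined = streaming.union(batch)
{code}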

> Add support for unions between streaming and batch datasets
> ---
>
> Key: SPARK-20002
> URL: https://issues.apache.org/jira/browse/SPARK-20002
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 2.0.2
>Reporter: Leon Pham
>
> Currently unions between streaming datasets and batch datasets are not 
> supported.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21228) InSet incorrect handling of structs

2017-06-27 Thread Bogdan Raducanu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064970#comment-16064970
 ] 

Bogdan Raducanu commented on SPARK-21228:
-

InSubquery.doCodeGen is using InSet directly (although InSubquery itself is 
never used) so a fix should consider this too.

> InSet incorrect handling of structs
> ---
>
> Key: SPARK-21228
> URL: https://issues.apache.org/jira/browse/SPARK-21228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Bogdan Raducanu
>
> In InSet it's possible that hset contains GenericInternalRows while child 
> returns UnsafeRows (and vice versa). InSet uses hset.contains (both in 
> doCodeGen and eval) which will always be false in this case.
> The following code reproduces the problem:
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "2") // the 
> default is 10 which requires a longer query text to repro
> spark.range(1, 10).selectExpr("named_struct('a', id, 'b', id) as 
> a").createOrReplaceTempView("A")
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show // the Aggregate here will return 
> UnsafeRows while the list of structs that will become hset will be 
> GenericInternalRows
> ++
> |minA|
> ++
> ++
> {code}
> In.doCodeGen uses compareStructs and seems to work. In.eval might not work 
> but not sure how to reproduce.
> {code}
> spark.conf.set("spark.sql.optimizer.inSetConversionThreshold", "3") // now it 
> will not use InSet
> sql("select * from (select min(a) as minA from A) A where minA in 
> (named_struct('a', 1L, 'b', 1L),named_struct('a', 2L, 'b', 
> 2L),named_struct('a', 3L, 'b', 3L))").show
> +-+
> | minA|
> +-+
> |[1,1]|
> +-+
> {code}
> Solution could be either to do safe<->unsafe conversion in InSet or not 
> trigger InSet optimization at all in this case.
> Need to investigate if In.eval is affected.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20226) Call to sqlContext.cacheTable takes an incredibly long time in some cases

2017-06-27 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064972#comment-16064972
 ] 

Barry Becker commented on SPARK-20226:
--

Calling cache() on the dataframe after the addColumn used to make this run 
fast, but around the time we upgraded to Spark 2.1.1 it got very slow again. 
Calling cache() on the dataframe does not seem to help anymore.

If I hardcode the addColumn column expression to be 
{code}
(((CAST(Plate AS STRING) + CAST(State AS STRING)) + CAST(License Type 
AS STRING)) + CAST(Violation Time AS STRING)) + CAST(Violation AS STRING)) + 
CAST(Judgment Entry Date AS STRING)) + CAST(Issue Date AS STRING)) + 
CAST(Summons Number AS STRING)) + CAST(Fine Amount AS STRING)) + CAST(Penalty 
Amount AS STRING)) + CAST(Interest Amount AS STRING)) + CAST(Violation AS 
STRING))
{code}
instead of 
{code}
CAST(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(Plate, State), License Type), 
Violation Time), Violation), UDF(Judgment Entry Date)), UDF(Issue Date)), 
UDF(Summons Number)), UDF(Fine Amount)), UDF(Penalty Amount)), UDF(Interest 
Amount)), Violation) AS STRING)
{code}
which is what is generated by our expression parser, then the time goes from 70 
seconds down to 10 seconds. Still slow, but not nearly as slow.
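For reference, a sketch of how the hardcoded variant can be produced with 
built-in functions instead of the UDF chain (assuming a DataFrame {{df}} that 
has these columns; this is not our actual parser output):
{code}
import org.apache.spark.sql.functions.{col, concat}

val cols = Seq("Plate", "State", "License Type", "Violation Time", "Violation",
  "Judgment Entry Date", "Issue Date", "Summons Number", "Fine Amount",
  "Penalty Amount", "Interest Amount")

// Cast every column to string and concatenate with the built-in concat,
// which avoids the nested UDF calls shown above.
val withKey = df.withColumn("concatenated",
  concat(cols.map(c => col(c).cast("string")): _*))
{code}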

> Call to sqlContext.cacheTable takes an incredibly long time in some cases
> -
>
> Key: SPARK-20226
> URL: https://issues.apache.org/jira/browse/SPARK-20226
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: linux or windows
>Reporter: Barry Becker
>  Labels: cache
> Attachments: profile_indexer2.PNG, xyzzy.csv
>
>
> I have a case where the call to sqlContext.cacheTable can take an arbitrarily 
> long time depending on the number of columns that are referenced in a 
> withColumn expression applied to a dataframe.
> The dataset is small (20 columns 7861 rows). The sequence to reproduce is the 
> following:
> 1) add a new column that references 8 - 14 of the columns in the dataset. 
>- If I add 8 columns, then the call to cacheTable is fast - like *5 
> seconds*
>- If I add 11 columns, then it is slow - like *60 seconds*
>- and if I add 14 columns, then it basically *takes forever* - I gave up 
> after 10 minutes or so.
>   The Column expression that is added, is basically just concatenating 
> the columns together in a single string. If a number is concatenated on a 
> string (or vice versa) the number is first converted to a string.
>   The expression looks something like this:
> {code}
> `Plate` + `State` + `License Type` + `Summons Number` + `Issue Date` + 
> `Violation Time` + `Violation` + `Judgment Entry Date` + `Fine Amount` + 
> `Penalty Amount` + `Interest Amount`
> {code}
> which we then convert to a Column expression that looks like this:
> {code}
> UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF(UDF('Plate, 'State), 'License Type), 
> UDF('Summons Number)), UDF('Issue Date)), 'Violation Time), 'Violation), 
> UDF('Judgment Entry Date)), UDF('Fine Amount)), UDF('Penalty Amount)), 
> UDF('Interest Amount))
> {code}
>where the UDFs are very simple functions that basically call toString 
> and + as needed.
> 2) apply a pipeline that includes some transformers that was saved earlier. 
> Here are the steps of the pipeline (extracted from parquet)
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200603,"sparkVersion":"2.1.0","uid":"strIdx_aeb04d2777cc","paramMap":{"handleInvalid":"skip","outputCol":"State_IDX__","inputCol":"State_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333200837,"sparkVersion":"2.1.0","uid":"strIdx_0164c4c13979","paramMap":{"inputCol":"License
>  Type_CLEANED__","handleInvalid":"skip","outputCol":"License 
> Type_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201068,"sparkVersion":"2.1.0","uid":"strIdx_25b6cbd02751","paramMap":{"inputCol":"Violation_CLEANED__","handleInvalid":"skip","outputCol":"Violation_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201282,"sparkVersion":"2.1.0","uid":"strIdx_aa12df0354d9","paramMap":{"handleInvalid":"skip","inputCol":"County_CLEANED__","outputCol":"County_IDX__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201552,"sparkVersion":"2.1.0","uid":"strIdx_babb120f3cc1","paramMap":{"handleInvalid":"skip","outputCol":"Issuing
>  Agency_IDX__","inputCol":"Issuing Agency_CLEANED__"}}{code}
>  - 
> {code}{"class":"org.apache.spark.ml.feature.StringIndexerModel","timestamp":1491333201759

[jira] [Commented] (SPARK-18294) Implement commit protocol to support `mapred` package's committer

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065023#comment-16065023
 ] 

Apache Spark commented on SPARK-18294:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/18438

> Implement commit protocol to support `mapred` package's committer
> -
>
> Key: SPARK-18294
> URL: https://issues.apache.org/jira/browse/SPARK-18294
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Jiang Xingbo
>
> The current `FileCommitProtocol` is based on the `mapreduce` package; we 
> should implement a `HadoopMapRedCommitProtocol` that supports the older 
> `mapred` package's committer.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-21229:
---

 Summary: remove QueryPlan.preCanonicalized
 Key: SPARK-21229
 URL: https://issues.apache.org/jira/browse/SPARK-21229
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065118#comment-16065118
 ] 

Apache Spark commented on SPARK-21229:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/18440

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21229:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21229) remove QueryPlan.preCanonicalized

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21229:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove QueryPlan.preCanonicalized
> -
>
> Key: SPARK-21229
> URL: https://issues.apache.org/jira/browse/SPARK-21229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-19104.
-
   Resolution: Fixed
Fix Version/s: 2.2.0

Issue resolved by pull request 18418
[https://github.com/apache/spark/pull/18418]

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
> Fix For: 2.2.0
>
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyE

[jira] [Assigned] (SPARK-19104) CompileException with Map and Case Class in Spark 2.1.0

2017-06-27 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-19104:
---

Assignee: Liang-Chi Hsieh

>  CompileException with Map and Case Class in Spark 2.1.0
> 
>
> Key: SPARK-19104
> URL: https://issues.apache.org/jira/browse/SPARK-19104
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Nils Grabbert
>Assignee: Liang-Chi Hsieh
> Fix For: 2.2.0
>
>
> The following code will run with Spark 2.0.2 but not with Spark 2.1.0:
> {code}
> case class InnerData(name: String, value: Int)
> case class Data(id: Int, param: Map[String, InnerData])
> val data = Seq.tabulate(10)(i => Data(1, Map("key" -> InnerData("name", i + 
> 100))))
> val ds   = spark.createDataset(data)
> {code}
> Exception:
> {code}
> Caused by: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 63, Column 46: Expression 
> "ExternalMapToCatalyst_value_isNull1" is not an rvalue 
>   at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:11004) 
>   at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6639)
>  
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5001) 
>   at org.codehaus.janino.UnitCompiler.access$10500(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$13.visitAmbiguousName(UnitCompiler.java:4984)
>  
>   at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:3633) 
>   at org.codehaus.janino.Java$Lvalue.accept(Java.java:3563) 
>   at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4956) 
>   at org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4925) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:3189) 
>   at org.codehaus.janino.UnitCompiler.access$5100(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3143) 
>   at 
> org.codehaus.janino.UnitCompiler$9.visitAssignment(UnitCompiler.java:3139) 
>   at org.codehaus.janino.Java$Assignment.accept(Java.java:3847) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3139) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:2112) 
>   at org.codehaus.janino.UnitCompiler.access$1700(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$6.visitExpressionStatement(UnitCompiler.java:1370)
>  
>   at org.codehaus.janino.Java$ExpressionStatement.accept(Java.java:2558) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1370) 
>   at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1450) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:2811) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1262)
>  
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1234)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:538) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:890) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:894) 
>   at org.codehaus.janino.UnitCompiler.access$600(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:377)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1128) 
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1209)
>  
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:564) 
>   at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:420) 
>   at org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:206) 
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:374)
>  
>   at 
> org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:369)
>  
>   at 
> org.codehaus.janino.Java$AbstractPackageMemberClassDeclaration.accept(Java.java:1309)
>  
>   at org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:369) 
>   at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:345) 
>   at 
> org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:396)
>  
>   at 
> org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:311)
>  
>   at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:229) 
>   at org.codehaus.janino.SimpleCompil

[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065142#comment-16065142
 ] 

Michael Styles commented on SPARK-21218:


[~hyukjin.kwon] Not sure I understand what you want me to do with my PR, 
assuming this JIRA is resolved as a duplicate?

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
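
For readers following along, below is a minimal sketch of the rewrite described above, 
expressed against Spark's public org.apache.spark.sql.sources filter classes. It is only 
an illustration of the idea, not the code from the attached PR; the class and method names 
(InToOrRewrite, rewriteInToOr) are made up for the example.

import org.apache.spark.sql.sources.EqualTo;
import org.apache.spark.sql.sources.Filter;
import org.apache.spark.sql.sources.In;
import org.apache.spark.sql.sources.Or;

public final class InToOrRewrite {
    // C1 IN (10, 20)  ==>  (C1 = 10) OR (C1 = 20); non-IN filters pass through unchanged.
    public static Filter rewriteInToOr(Filter filter) {
        if (!(filter instanceof In)) {
            return filter;
        }
        In in = (In) filter;
        Object[] values = in.values();
        if (values.length == 0) {
            return filter; // nothing to rewrite; leave an empty IN to the caller
        }
        Filter result = new EqualTo(in.attribute(), values[0]);
        for (int i = 1; i < values.length; i++) {
            result = new Or(result, new EqualTo(in.attribute(), values[i]));
        }
        return result;
    }
}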



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Andrew Duffy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065177#comment-16065177
 ] 

Andrew Duffy commented on SPARK-21218:
--

Curious, I wonder what the previous benchmarks were lacking.

 Have you tried disjunction push-down on other datatypes, e.g. strings? In any 
case, I'm also on board with this change, if it is in fact useful.

I think [~hyukjin.kwon] was saying we should close this as a dupe, and rename 
the PR with the original ticket number (#17091). You can re-open that issue and 
say that you've done tests where this now looks like it will be a big 
improvement.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Andrew Duffy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065177#comment-16065177
 ] 

Andrew Duffy edited comment on SPARK-21218 at 6/27/17 5:39 PM:
---

Curious, I wonder what the previous benchmarks were lacking.

 Have you tried disjunction push-down on other datatypes, e.g. strings? In any 
case, I'm also on board with this change, if it is in fact useful.

I think [~hyukjin.kwon] was saying we should close this as a dupe, and rename 
the PR with the original ticket number 
[17091|https://issues.apache.org/jira/browse/SPARK-17091]. You can re-open that 
issue and say that you've done tests where this now looks like it will be a big 
improvement.


was (Author: andreweduffy):
Curious, I wonder what the previous benchmarks were lacking.

 Have you tried disjunction push-down on other datatypes, e.g. strings? In any 
case, I'm also on board with this change, if it is in fact useful.

I think [~hyukjin.kwon] was saying we should close this as a dupe, and rename 
the PR with the original ticket number (#17091). You can re-open that issue and 
say that you've done tests where this now looks like it will be a big 
improvement.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065210#comment-16065210
 ] 

Michael Styles commented on SPARK-21218:


In Parquet 1.7, there was a bug involving corrupt statistics on binary columns 
(https://issues.apache.org/jira/browse/PARQUET-251). This bug prevented earlier 
versions of Spark from generating Parquet filters on any string columns. Spark 
2.1 has moved up to Parquet 1.8.2, so this issue no longer exists.
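
As a quick, hedged way to confirm on Spark 2.1 that a string IN predicate actually reaches 
Parquet, one can look for "PushedFilters" in the physical plan. The sketch below assumes a 
hypothetical Parquet path and column name and is not taken from the attached benchmarks.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public final class PushdownCheck {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("pushdown-check").getOrCreate();
        // Hypothetical path and column; inspect "PushedFilters" in the printed physical plan.
        Dataset<Row> df = spark.read().parquet("/tmp/example");
        df.filter(col("c1").isin("a", "b")).explain();
        spark.stop();
    }
}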

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065329#comment-16065329
 ] 

Michael Kunkel commented on SPARK-21215:


The "resolution for this by [~sowen] was to put this on the spark mailing list.
But I am sure this is just a scam to toss questions because the mailing list is 
no longer accepting emails.
When sending a email to the mailing list, a reply is given

Hello, 

This employee can no longer access email on this account.  Your email will not 
be forwarded.


So, this is not resolved.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transform

[jira] [Created] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error

2017-06-27 Thread Michael Kunkel (JIRA)
Michael Kunkel created SPARK-21230:
--

 Summary: Spark Encoder with mysql Enum and data truncated Error
 Key: SPARK-21230
 URL: https://issues.apache.org/jira/browse/SPARK-21230
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.1
 Environment: macosX
Reporter: Michael Kunkel




I am using Spark via Java for a MySQL/ML (machine learning) project.

In the MySQL database, I have a column "status_change_type" of type ENUM('broke', 'fixed') 
in a table called "status_change" in a DB called "test".

I have an object StatusChangeDB that models the structure of the table; for 
"status_change_type", I declared the field as a String. I know a MySQL ENUM and a Java 
String are represented very differently, but I am using Spark, whose encoder does not 
recognize enums properly, so a String seemed the practical choice. However, when I try to 
set the value of the enum column via a Java String, I receive the "data truncated" error:

h5. org.apache.spark.SparkException: Job aborted due to stage failure: Task 
0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 
(TID 9, localhost, executor driver): java.sql.BatchUpdateException: Data 
truncated for column 'status_change_type' at row 1 at 
com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)

I have also tried using a Java enum for "status_change_type"; however, it fails with the 
following stack trace:

h5. Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException 
at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
 at 
org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
 at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...
h5. 
I have tried the JDBC setting "jdbcCompliantTruncation=false", but it does nothing; I get 
the same "data truncated" error as first stated. Here is my JDBC options map, in case I am 
using "jdbcCompliantTruncation=false" incorrectly.

public static Map<String, String> jdbcOptions() {
    Map<String, String> jdbcOptions = new HashMap<>();
    jdbcOptions.put("url", "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false");
    jdbcOptions.put("driver", "com.mysql.jdbc.Driver");
    jdbcOptions.put("dbtable", "status_change");
    jdbcOptions.put("user", "root");
    jdbcOptions.put("password", "");
    return jdbcOptions;
}
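
For context, a map like this is normally handed straight to the JDBC data source. Below is 
a minimal, hypothetical usage sketch (it assumes an existing SparkSession named spark and 
the usual org.apache.spark.sql imports), separate from the write path shown below:

// Hypothetical read usage of the options map above; "spark" is assumed to be an
// already-created SparkSession.
Dataset<Row> statusChanges = spark.read().format("jdbc").options(jdbcOptions()).load();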

Here is the Spark method for inserting into the MySQL DB:

private void insertMYSQLQuery(Dataset<Row> changeDF) {
    try {
        changeDF.write().mode(SaveMode.Append).jdbc(SparkManager.jdbcAppendOptions(),
                "status_change", new java.util.Properties());
    } catch (Exception e) {
        System.out.println(e);
    }
}

where jdbcAppendOptions uses the jdbcOptions map
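
As a debugging sketch only (an assumption about the cause, not a confirmed fix): MySQL 
reports "Data truncated" when an inserted string does not match one of the column's ENUM 
labels, so printing the distinct values about to be written can rule that cause in or out 
before blaming the encoder. The helper below reuses the same Dataset<Row> and imports as 
insertMYSQLQuery above.

// Debugging sketch (assumption, not a confirmed fix): every value written to
// status_change_type must be exactly 'broke' or 'fixed', or MySQL truncates it on insert.
private void printStatusChangeTypes(Dataset<Row> changeDF) {
    changeDF.select("status_change_type").distinct().show(false);
}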

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065336#comment-16065336
 ] 

Sean Owen commented on SPARK-21215:
---

I'm not sure what you're referring to. The user@ list works fine. Sometimes you 
get weird auto-responses from somebody's account on the list, but that has 
nothing to do with the list.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
> bq.   at 
> org.apac

[jira] [Commented] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065338#comment-16065338
 ] 

Sean Owen commented on SPARK-21230:
---

This also does not look like a useful JIRA. It looks like a question about 
using MySQL and JDBC. Until it's narrowed down to a Spark issue, we'd generally 
close this.

> Spark Encoder with mysql Enum and data truncated Error
> --
>
> Key: SPARK-21230
> URL: https://issues.apache.org/jira/browse/SPARK-21230
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
> Environment: macosX
>Reporter: Michael Kunkel
>
> I am using Spark via Java for a MYSQL/ML(machine learning) project.
> In the mysql database, I have a column "status_change_type" of type enum = 
> {broke, fixed} in a table called "status_change" in a DB called "test".
> I have an object StatusChangeDB that constructs the needed structure for the 
> table, however for the "status_change_type", I constructed it as a String. I 
> know the bytes from MYSQL enum to Java string are much different, but I am 
> using Spark, so the encoder does not recognize enums properly. However when I 
> try to set the value of the enum via a Java string, I receive the "data 
> truncated" error
> h5. org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: 
> Data truncated for column 'status_change_type' at row 1 at 
> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)
> I have tried to use enum for "status_change_type", however it fails with a 
> stack trace of
> h5. Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException 
> at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...
> h5. 
> I have tried to use the jdbc setting "jdbcCompliantTruncation=false" but this 
> does nothing as I get the same error of "data truncated" as first stated. 
> Here are my jdbc options map, in case I am using the 
> "jdbcCompliantTruncation=false" incorrectly.
> public static Map jdbcOptions() {
> Map jdbcOptions = new HashMap();
> jdbcOptions.put("url", 
> "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false");
> jdbcOptions.put("driver", "com.mysql.jdbc.Driver");
> jdbcOpti

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065345#comment-16065345
 ] 

Michael Kunkel commented on SPARK-21215:


I looked at a few months' worth of posts, and it seems that the mailing list is 
not accepting new mail. According to the auto-reply from the Yahoo mail server, 
the mailing list is no longer valid.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(Q

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065354#comment-16065354
 ] 

Michael Kunkel commented on SPARK-21215:


The posts go onto the list, but the owner ASF does not have access to it, as 
the reply states. If they have no access to it, how can they be of help?

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:276)
> bq.   at 
> org.apache.spark.sql.catalyst.p

[jira] [Commented] (SPARK-21230) Spark Encoder with mysql Enum and data truncated Error

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065347#comment-16065347
 ] 

Michael Kunkel commented on SPARK-21230:


The problem is with the Spark Encoder for enum types, so the problem is Spark, as 
the post attempts to point out.

> Spark Encoder with mysql Enum and data truncated Error
> --
>
> Key: SPARK-21230
> URL: https://issues.apache.org/jira/browse/SPARK-21230
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
> Environment: macosX
>Reporter: Michael Kunkel
>
> I am using Spark via Java for a MYSQL/ML(machine learning) project.
> In the mysql database, I have a column "status_change_type" of type enum = 
> {broke, fixed} in a table called "status_change" in a DB called "test".
> I have an object StatusChangeDB that constructs the needed structure for the 
> table, however for the "status_change_type", I constructed it as a String. I 
> know the bytes from MYSQL enum to Java string are much different, but I am 
> using Spark, so the encoder does not recognize enums properly. However when I 
> try to set the value of the enum via a Java string, I receive the "data 
> truncated" error
> h5. org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in 
> stage 4.0 (TID 9, localhost, executor driver): java.sql.BatchUpdateException: 
> Data truncated for column 'status_change_type' at row 1 at 
> com.mysql.jdbc.PreparedStatement.executeBatchSerially(PreparedStatement.java:2055)
> I have tried to use enum for "status_change_type", however it fails with a 
> stack trace of
> h5. Exception in thread "AWT-EventQueue-0" java.lang.NullPointerException 
> at org.spark_project.guava.reflect.TypeToken.method(TypeToken.java:465) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:126)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) at 
> org.apache.spark.sql.catalyst.JavaTypeInference$.org$apache$spark$sql$catalyst$JavaTypeInference$$inferDataType(JavaTypeInference.scala:125)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:127)
>  at 
> org.apache.spark.sql.catalyst.JavaTypeInference$$anonfun$2.apply(JavaTypeInference.scala:125)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:234) at 
> scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) ... ...
> h5. 
> I have tried to use the jdbc setting "jdbcCompliantTruncation=false" but this 
> does nothing as I get the same error of "data truncated" as first stated. 
> Here are my jdbc options map, in case I am using the 
> "jdbcCompliantTruncation=false" incorrectly.
> public static Map jdbcOptions() {
> Map jdbcOptions = new HashMap();
> jdbcOptions.put("url", 
> "jdbc:mysql://localhost:3306/test?jdbcCompliantTruncation=false");
> jdbcOptions.put("driver", "com.mysql.jdbc.Driver");
> jdbcOptions.put("dbtable", "status_change");
> jdbc

[jira] [Comment Edited] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065354#comment-16065354
 ] 

Michael Kunkel edited comment on SPARK-21215 at 6/27/17 7:40 PM:
-

The posts go onto the list, but the owner, ASF, does not have access to it, as 
the reply states. If they have no access to it, how can they be of help?


was (Author: michaelkunkel):
The posts go onto the list, but the owner ASF does not have access to it, as 
the reply states. If they have no access to it, how can they be the help?

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.tra

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065351#comment-16065351
 ] 

Michael Kunkel commented on SPARK-21215:


[~sowen] I am not attempting to argue the facts. When I try to post to the 
ASF, I instantly get a reply from yahoo, with the subject matter exactly as the 
mail I sent. The reply states that the owner no longer has access to the 
mailing list.
Try it please


> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$cata

[jira] [Commented] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065348#comment-16065348
 ] 

Sean Owen commented on SPARK-21215:
---

Not sure what you're looking at, but the mailing list has posts from 10 minutes 
ago:
http://apache-spark-user-list.1001560.n3.nabble.com/

I'm not sure what you mean about Yahoo, because nothing about this project or 
the ASF uses Yahoo.

> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsUp$1.apply(QueryPlan.scala:255)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:266)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveT

[jira] [Comment Edited] (SPARK-21215) Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve

2017-06-27 Thread Michael Kunkel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065351#comment-16065351
 ] 

Michael Kunkel edited comment on SPARK-21215 at 6/27/17 7:37 PM:
-

[~sowen] I am not attempting to argue the facts. When I try to post to the ASF, 
I instantly get a reply from yahoo, with the subject matter exactly as the mail 
I sent. The reply states that the owner no longer has access to the mailing 
list.
Try it please



was (Author: michaelkunkel):
[~sowen] I am not attempting to argue that facts. When I try to post to the 
ASF, I instantly get a reply from yahoo, with the subject matter exactly as the 
mail I sent. The reply states that the owner no longer has access to the 
mailing list.
Try it please


> Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve
> -
>
> Key: SPARK-21215
> URL: https://issues.apache.org/jira/browse/SPARK-21215
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, SQL
>Affects Versions: 2.1.1
> Environment: macOSX
>Reporter: Michael Kunkel
>
> First Spark project.
> I have a Java method that returns a Dataset. I want to convert this to a 
> Dataset, where the Object is named StatusChangeDB. I have created a 
> POJO StatusChangeDB.java and coded it with all the query objects found in the 
> mySQL table.
> I then create a Encoder and convert the Dataset to a 
> Dataset. However when I try to .show() the values of the 
> Dataset I receive the error
> bq. 
> bq. Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot 
> resolve '`hvpinid_quad`' given input columns: [status_change_type, 
> superLayer, loclayer, sector, locwire];
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:86)
> bq.   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:83)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:290)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:289)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4$$anonfun$apply$10.apply(TreeNode.scala:324)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at 
> scala.collection.MapLike$MappedValues$$anonfun$iterator$3.apply(MapLike.scala:246)
> bq.   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> bq.   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
> bq.   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
> bq.   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> bq.   at scala.collection.IterableLike$$anon$1.foreach(IterableLike.scala:311)
> bq.   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
> bq.   at 
> scala.collection.mutable.MapBuilder.$plus$plus$eq(MapBuilder.scala:25)
> bq.   at 
> scala.collection.TraversableViewLike$class.force(TraversableViewLike.scala:88)
> bq.   at scala.collection.IterableLike$$anon$1.force(IterableLike.scala:311)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:332)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305)
> bq.   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:287)
> bq.   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsU

[jira] [Reopened] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles reopened SPARK-17091:


> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065413#comment-16065413
 ] 

Michael Styles commented on SPARK-17091:


By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.
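
As a user-level workaround, here is a minimal Scala sketch of the same rewrite
generalized to an arbitrary value list; the column name, values, and helper
names are illustrative, not part of any Spark API:

{code:scala}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.{Column, DataFrame}

// Rewrite col IN (v1, v2, ...) as (col = v1) OR (col = v2) OR ..., so that the
// individual equality predicates can be pushed down to Parquet.
def inAsOr(colName: String, values: Seq[Any]): Column =
  values.map(col(colName) === _).reduce(_ || _)

def filterIn(df: DataFrame, colName: String, values: Seq[Any]): DataFrame =
  df.filter(inAsOr(colName, values))

// e.g. filterIn(df, "C1", Seq(42, 139)).collect()
{code}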

> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17091) ParquetFilters rewrite IN to OR of Eq

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17091:
---
Attachment: IN Predicate.png
OR Predicate.png

> ParquetFilters rewrite IN to OR of Eq
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles updated SPARK-17091:
---
Summary: Convert IN predicate to equivalent Parquet filter  (was: 
ParquetFilters rewrite IN to OR of Eq)

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17091) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065413#comment-16065413
 ] 

Michael Styles edited comment on SPARK-17091 at 6/27/17 8:26 PM:
-

By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement. See attachments.


was (Author: ptkool):
By not pushing the filter to Parquet, are we not preventing Parquet from 
skipping blocks during read operations? I have tests that show big improvements 
when applying this transformation.

For instance, I have a Parquet file with 162,456,394 rows which is sorted on 
column C1.

*IN Predicate*
{noformat}
df.filter(df['C1'].isin([42, 139])).collect()
{noformat}

*OR Predicate*
{noformat}
df.filter((df['C1'] == 42) | (df['C1'] == 139)).collect()
{noformat}

I'm seeing about a 50-75% improvement.

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-17091
> URL: https://issues.apache.org/jira/browse/SPARK-17091
> Project: Spark
>  Issue Type: Bug
>Reporter: Andrew Duffy
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Past attempts at pushing down the InSet operation for Parquet relied on 
> user-defined predicates. It would be simpler to rewrite an IN clause into the 
> corresponding OR union of a set of equality conditions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21218) Convert IN predicate to equivalent Parquet filter

2017-06-27 Thread Michael Styles (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Styles resolved SPARK-21218.

Resolution: Duplicate

> Convert IN predicate to equivalent Parquet filter
> -
>
> Key: SPARK-21218
> URL: https://issues.apache.org/jira/browse/SPARK-21218
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.1.1
>Reporter: Michael Styles
> Attachments: IN Predicate.png, OR Predicate.png
>
>
> Convert IN predicate to equivalent expression involving equality conditions 
> to allow the filter to be pushed down to Parquet.
> For instance,
> C1 IN (10, 20) is rewritten as (C1 = 10) OR (C1 = 20)
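
For illustration, a minimal sketch of this rewrite at the data source {{Filter}}
level (not the actual ParquetFilters change), assuming a non-empty IN list:

{code:scala}
import org.apache.spark.sql.sources.{EqualTo, Filter, In, Or}

// Turn In(attr, values) into a chain of Or(EqualTo(...)) filters so that each
// equality condition can be mapped onto a Parquet predicate.
def rewriteIn(filter: Filter): Filter = filter match {
  case In(attribute, values) if values.nonEmpty =>
    values.map(v => EqualTo(attribute, v): Filter).reduce(Or(_, _))
  case other => other
}

// rewriteIn(In("C1", Array(10, 20))) gives Or(EqualTo("C1", 10), EqualTo("C1", 20))
{code}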



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065423#comment-16065423
 ] 

Steve Loughran commented on SPARK-21137:


Looking at this.

Something is trying to get the permissions for every file, and that is handled
by forking an external process, with all the overhead that implies. Looking at
the code, it's in the constructor of {{LocatedFileStatus}}, which builds it from
another {{FileStatus}}. Normally that's just a simple field copy (fast,
efficient), but on {{RawLocalFileSystem}} it actually triggers an on-demand
exec. This has been around for a long time (HADOOP-2288); it surfaces here
because you're working with the local FS, whereas for all other filesystems it's
a quick operation.

I think this is an issue: I don't think anybody thought this would be a 
problem, as it's just viewed as a marshalling of a LocatedFileStatus, which is 
what you get back from {{FileSystem.listLocatedStatus}}. Normally that's the 
higher performing one, not just on object stores, but because it scales better, 
being able to incrementally send back data in batches, rather than needing to 
enumerate an entire directory of files (possibly in the millions) and then send 
them around as arrays of FileStatus.  Here, it's clearly not.

What to do? I think we could consider whether it'd be possible to add this to
the Hadoop native libs and so make a fast API call. There's also the option of
"allowing us to completely disable permissions entirely". That one appeals to
me more from a Windows perspective, where you could get rid of the Hadoop
native lib and still have (most) things work there...but as it's an incomplete
"most", it's probably an optimistic goal.



> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065428#comment-16065428
 ] 

Steve Loughran commented on SPARK-21137:


PS: for now, do the listing in parallel with
{{mapreduce.input.fileinputformat.list-status.num-threads}}. Even there, though,
the fact that every thread will be exec()ing code can make it expensive (it
looks like a full POSIX spawn on a Mac).
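
For example, a minimal sketch of setting that property from Spark; the thread
count is illustrative and should be tuned for the filesystem:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parallel-listing").getOrCreate()

// Let FileInputFormat enumerate input paths with multiple threads.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.list-status.num-threads", "16")
{code}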



> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065445#comment-16065445
 ] 

Steve Loughran commented on SPARK-21137:


Filed HADOOP-14600.

Looks like a very old codepath that hasn't been looked at in a while... and
since then the native-lib fstat call should be able to do this, with the old
code retained only for when {{NativeCodeLoader.isNativeCodeLoaded() == false}}.



Sam, you said

bq. it's likely that the underlying Hadoop APIs have some yucky code that does 
something silly, I have delved down their before and my stomach cannot handle it

Oh, it's not so bad. At least you don't have to go near the assembly-code bit
that we are all scared of. Or, worse, Kerberos.

Anyway, patches welcome there, with tests
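
As a quick check of whether you are on that exec-based fallback path, here is a
minimal sketch (the logging is illustrative) of testing the native-library
status:

{code:scala}
import org.apache.hadoop.util.NativeCodeLoader

// Without the Hadoop native library there is no fast native stat() available,
// so local-filesystem permission lookups fall back to forking a process, which
// is the slow path discussed above.
if (NativeCodeLoader.isNativeCodeLoaded()) {
  println("Hadoop native library loaded: fast native calls available")
} else {
  println("Hadoop native library NOT loaded: expect exec()-based fallbacks")
}
{code}
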

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2017-06-27 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065448#comment-16065448
 ] 

Steve Loughran commented on SPARK-12868:


I think this is a case of HADOOP-14598: once the JVM's URL stream handler
factory has been set to {{FsUrlStreamHandlerFactory}} in
{{org.apache.spark.sql.internal.SharedState}}, you can't talk to Azure.
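
For background, a minimal sketch of the registration that makes hdfs:// URLs
resolvable through java.net.URL; this is illustrative and not the actual
SharedState code:

{code:scala}
import java.net.URL
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// The JVM allows the factory to be installed only once per process. Once set,
// java.net.URL can resolve Hadoop filesystem schemes such as hdfs://.
val hadoopConf = new Configuration()
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory(hadoopConf))

// Without the factory, this line throws
// java.net.MalformedURLException: unknown protocol: hdfs
val jarUrl = new URL("hdfs:///tmp/foo.jar")
{code}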

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>Assignee: Weiqing Yang
> Fix For: 2.2.0
>
>
> When trying to add a jar with an HDFS URI, i.e.
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> Via the spark sql JDBC interface it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.<init>(URL.java:593)
> at java.net.URL.<init>(URL.java:483)
> at java.net.URL.<init>(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21137:


Assignee: Apache Spark

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Assignee: Apache Spark
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065476#comment-16065476
 ] 

Apache Spark commented on SPARK-21137:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/18441

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21137) Spark reads many small files slowly

2017-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16065475#comment-16065475
 ] 

Sean Owen commented on SPARK-21137:
---

OK, so it is something that could be optimized in the Hadoop API, and seems 
somewhat specific to a local filesystem.
I opened the change we both mention here in this thread as a PR.

> Spark reads many small files slowly
> ---
>
> Key: SPARK-21137
> URL: https://issues.apache.org/jira/browse/SPARK-21137
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1
>Reporter: sam
>Priority: Minor
>
> A very common use case in big data is to read a large number of small files.  
> For example the Enron email dataset has 1,227,645 small files.
> When one tries to read this data using Spark one will hit many issues.  
> Firstly, even if the data is small (each file only say 1K) any job can take a 
> very long time (I have a simple job that has been running for 3 hours and has 
> not yet got to the point of starting any tasks, I doubt if it will ever 
> finish).
> It seems all the code in Spark that manages file listing is single threaded 
> and not well optimised.  When I hand crank the code and don't use Spark, my 
> job runs much faster.
> Is it possible that I'm missing some configuration option? It seems kinda 
> surprising to me that Spark cannot read Enron data given that it's such a 
> quintessential example.
> So it takes 1 hour to output a line "1,227,645 input paths to process", it 
> then takes another hour to output the same line. Then it outputs a CSV of all 
> the input paths (so creates a text storm).
> Now it's been stuck on the following:
> {code}
> 17/06/19 09:31:07 INFO LzoCodec: Successfully loaded & initialized native-lzo 
> library [hadoop-lzo rev 154f1ef53e2d6ed126b0957d7995e0a610947608]
> {code}
> for 2.5 hours.
> So I've provided full reproduce steps here (including code and cluster setup) 
> https://github.com/samthebest/scenron, scroll down to "Bug In Spark". You can 
> easily just clone, and follow the README to reproduce exactly!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


