[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696596#comment-14696596
 ] 

Reynold Xin commented on SPARK-9213:


We just need to handle it in the analyzer to rewrite it, and also pattern match 
it in the optimizer. No need to handle it everywhere else, since the analyzer 
will take care of the rewrite.


> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF-8-encoded bytes into a Java String, 
> run the regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF-8-encoded 
> bytes. One prominent example is joni, which is used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala
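
As a rough sketch of the idea (this is not Spark code; it assumes the joni and 
jcodings libraries are on the classpath), joni can compile and match a pattern 
directly on UTF-8 bytes, with no java.lang.String round trip:

{code}
import org.jcodings.specific.UTF8Encoding
import org.joni.{Matcher, Option, Regex}

// Compile the pattern once, directly from UTF-8 bytes.
val patternBytes = "a.*b".getBytes("UTF-8")
val regex = new Regex(patternBytes, 0, patternBytes.length, Option.NONE, UTF8Encoding.INSTANCE)

// Search the subject bytes without materializing a java.lang.String.
val subject = "xx_a_YY_b_zz".getBytes("UTF-8")
val matcher: Matcher = regex.matcher(subject)
val matchStart = matcher.search(0, subject.length, Option.DEFAULT)  // -1 if there is no match
{code}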






[jira] [Commented] (SPARK-9213) Improve regular expression performance (via joni)

2015-08-14 Thread Yadong Qi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696630#comment-14696630
 ] 

Yadong Qi commented on SPARK-9213:
--

I know, thanks!

> Improve regular expression performance (via joni)
> -
>
> Key: SPARK-9213
> URL: https://issues.apache.org/jira/browse/SPARK-9213
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Reporter: Reynold Xin
>
> I'm creating an umbrella ticket to improve regular expression performance for 
> string expressions. Right now our use of regular expressions is inefficient 
> for two reasons:
> 1. Java regex in general is slow.
> 2. We have to convert everything from UTF-8-encoded bytes into a Java String, 
> run the regex on it, and then convert it back.
> There are libraries in Java that provide regex support directly on UTF-8-encoded 
> bytes. One prominent example is joni, which is used in JRuby.
> Note: all regex functions are in 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala






[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9961:
---

Assignee: Apache Spark

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to 
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.
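
A minimal sketch of the proposed shape (illustrative only, not the actual Spark 
classes; the trait name, column names, and evaluator choices below are 
assumptions):

{code}
import org.apache.spark.ml.evaluation.{Evaluator, MulticlassClassificationEvaluator, RegressionEvaluator}

// Hypothetical trait standing in for the abstract method on Predictor/PredictionModel.
trait HasDefaultEvaluator {
  def defaultEvaluator: Evaluator
}

// A regressor would return a RegressionEvaluator wired to its output columns.
class MyRegressionModel extends HasDefaultEvaluator {
  override def defaultEvaluator: Evaluator =
    new RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction")
}

// A classifier would return a classification evaluator instead.
class MyClassificationModel extends HasDefaultEvaluator {
  override def defaultEvaluator: Evaluator =
    new MulticlassClassificationEvaluator().setLabelCol("label").setPredictionCol("prediction")
}
{code}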






[jira] [Commented] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696640#comment-14696640
 ] 

Apache Spark commented on SPARK-9961:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8190

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to 
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.






[jira] [Assigned] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9961:
---

Assignee: (was: Apache Spark)

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to 
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.






[jira] [Commented] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696641#comment-14696641
 ] 

Apache Spark commented on SPARK-9661:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8190

> ML 1.5 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-9661
> URL: https://issues.apache.org/jira/browse/SPARK-9661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
> Fix For: 1.5.0
>
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Reopened] (SPARK-9661) ML 1.5 QA: API: Java compatibility, docs

2015-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-9661:
--

re-open for some clean-ups

> ML 1.5 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-9661
> URL: https://issues.apache.org/jira/browse/SPARK-9661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Manoj Kumar
> Fix For: 1.5.0
>
>
> Check Java compatibility for MLlib for this release.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities. 
>  Some items to look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (The correctness can be checked in Scala.)
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here.
> Note that we should not break APIs from previous releases.  So if you find a 
> problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).






[jira] [Issue Comment Deleted] (SPARK-9961) ML prediction abstractions should have defaultEvaluator fields

2015-08-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-9961:
-
Comment: was deleted

(was: User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8190)

> ML prediction abstractions should have defaultEvaluator fields
> --
>
> Key: SPARK-9961
> URL: https://issues.apache.org/jira/browse/SPARK-9961
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Predictor and PredictionModel should have abstract defaultEvaluator methods 
> which return Evaluators.  Subclasses like Regressor, Classifier, etc. should 
> all provide natural evaluators, set to use the correct input columns and 
> metrics.  Concrete classes may later be modified to 
> The initial implementation should be marked as DeveloperApi since we may need 
> to change the defaults later on.






[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9899:
---

Assignee: Apache Spark  (was: Cheng Lian)

> JSON/Parquet writing on retry or speculation broken with direct output 
> committer
> 
>
> Key: SPARK-9899
> URL: https://issues.apache.org/jira/browse/SPARK-9899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> If the first task fails, all subsequent tasks will too.  We probably need to set 
> a different boolean when calling create.
> {code}
> java.io.IOException: File already exists: ...
> ...
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
>   at 
> org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JSONRelation.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> {code}
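
The boolean in question is presumably the overwrite flag of Hadoop's 
FileSystem.create. A rough sketch of the difference (the path below is made up 
for illustration):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val path = new Path("/tmp/job-output/part-00000")  // illustrative path only

// With overwrite = false, a retried or speculative task that writes to the same
// file fails with "File already exists", which is the failure in the stack trace.
// val out = fs.create(path, false)

// With overwrite = true, the retry can replace the partial output left behind by
// the failed attempt, which is what a direct output committer would need.
val out = fs.create(path, true)
out.close()
{code}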






[jira] [Assigned] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9899:
---

Assignee: Cheng Lian  (was: Apache Spark)

> JSON/Parquet writing on retry or speculation broken with direct output 
> committer
> 
>
> Key: SPARK-9899
> URL: https://issues.apache.org/jira/browse/SPARK-9899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
>
> If the first task fails, all subsequent tasks will too.  We probably need to set 
> a different boolean when calling create.
> {code}
> java.io.IOException: File already exists: ...
> ...
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
>   at 
> org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JSONRelation.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> {code}






[jira] [Commented] (SPARK-9899) JSON/Parquet writing on retry or speculation broken with direct output committer

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696672#comment-14696672
 ] 

Apache Spark commented on SPARK-9899:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8191

> JSON/Parquet writing on retry or speculation broken with direct output 
> committer
> 
>
> Key: SPARK-9899
> URL: https://issues.apache.org/jira/browse/SPARK-9899
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>Priority: Blocker
>
> If the first task fails, all subsequent tasks will too.  We probably need to set 
> a different boolean when calling create.
> {code}
> java.io.IOException: File already exists: ...
> ...
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:564)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:545)
>   at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:452)
>   at 
> org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:128)
>   at 
> org.apache.spark.sql.execution.datasources.json.JsonOutputWriter.<init>(JSONRelation.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.json.JSONRelation$$anon$1.newInstance(JSONRelation.scala:160)
>   at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:217)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> {code}






[jira] [Created] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client

2015-08-14 Thread Saisai Shao (JIRA)
Saisai Shao created SPARK-9969:
--

 Summary: Remove old Yarn MR classpath api support for Spark Yarn 
client
 Key: SPARK-9969
 URL: https://issues.apache.org/jira/browse/SPARK-9969
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Saisai Shao
Priority: Minor


Since we now only support the stable YARN API, I propose removing the old 
MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH API support in 2.0.






[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-08-14 Thread ding (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696677#comment-14696677
 ] 

ding commented on SPARK-5556:
-

The code can be found at https://github.com/intel-analytics/TopicModeling.

There is an example in the package; you can try Gibbs sampling LDA or online 
LDA by setting --optimizer to "gibbs" or "online".


> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>







[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9969:
---

Assignee: Apache Spark

> Remove old Yarn MR classpath api support for Spark Yarn client
> --
>
> Key: SPARK-9969
> URL: https://issues.apache.org/jira/browse/SPARK-9969
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Saisai Shao
>Assignee: Apache Spark
>Priority: Minor
>
> Since we now only support the stable YARN API, I propose removing the old 
> MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH API support in 2.0.






[jira] [Commented] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696681#comment-14696681
 ] 

Apache Spark commented on SPARK-9969:
-

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/8192

> Remove old Yarn MR classpath api support for Spark Yarn client
> --
>
> Key: SPARK-9969
> URL: https://issues.apache.org/jira/browse/SPARK-9969
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Since we now only support the stable YARN API, I propose removing the old 
> MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH API support in 2.0.






[jira] [Assigned] (SPARK-9969) Remove old Yarn MR classpath api support for Spark Yarn client

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9969:
---

Assignee: (was: Apache Spark)

> Remove old Yarn MR classpath api support for Spark Yarn client
> --
>
> Key: SPARK-9969
> URL: https://issues.apache.org/jira/browse/SPARK-9969
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Reporter: Saisai Shao
>Priority: Minor
>
> Since we now only support the stable YARN API, I propose removing the old 
> MRConfig.DEFAULT_MAPREDUCE_APPLICATION_CLASSPATH API support in 2.0.






[jira] [Created] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names

2015-08-14 Thread JIRA
Maciej Bryński created SPARK-9970:
-

 Summary: SQLContext.createDataFrame failed to properly determine 
column names
 Key: SPARK-9970
 URL: https://issues.apache.org/jira/browse/SPARK-9970
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.0
Reporter: Maciej Bryński
Priority: Minor


Hi,
I'm trying to do a "nested join" of tables.
After the first join everything is OK, but the second join loses some of the 
column names.

My code is the following:
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
joinType = "left_outer"):
print "Joining tables %s %s" % (namestr(tableLeft), namestr(tableRight))
sys.stdout.flush()
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
tmpTable["joinColumn"], joinType).drop("joinColumn")

user = sqlContext.read.json(path + "user.json")
user.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

order =  sqlContext.read.json(path + "order.json")
order.printSchema();
root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)

lines =  sqlContext.read.json(path + "lines.json")
lines.printSchema();
root
 |-- id: long (nullable = true)
 |-- orderid: long (nullable = true)
 |-- product: string (nullable = true)

orders = joinTable(order, lines, "id", "orderid", "lines")
orders.printSchema()

root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)
 |-- lines: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- orderid: long (nullable = true)
 |||-- product: string (nullable = true)

clients = joinTable(user, orders, "id", "userid", "orders")
clients.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- price: double (nullable = true)
 |||-- userid: long (nullable = true)
 |||-- lines: array (nullable = true)
 ||||-- element: struct (containsNull = true)
 |||||-- _1: long (nullable = true)
 |||||-- _2: long (nullable = true)
 |||||-- _3: string (nullable = true)


I tried to check whether groupBy is the root of the problem, but it looks right:
grouped =  orders.rdd.groupBy(lambda r: r.userid)
print grouped.map(lambda x: list(x[1])).collect()

[[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), 
Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, 
lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, 
userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]]







[jira] [Commented] (SPARK-9725) spark sql query string field return empty/garbled string

2015-08-14 Thread Zhongshuai Pei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696684#comment-14696684
 ] 

Zhongshuai Pei commented on SPARK-9725:
---

In my environment,"offset" did not change according to BYTE_ARRAY_OFFSET.
when set spark.executor.memory=36g, "BYTE_ARRAY_OFFSET" is 24, "offset" is 16 , 
but set spark.executor.memory=30g, "BYTE_ARRAY_OFFSET" is 16,  "offset" is 16 
too. So "if (offset == BYTE_ARRAY_OFFSET && base instanceof byte[]  && 
((byte[]) base).length == numBytes)" will get different result. [~rxin] 
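
For reference, a small sketch of where BYTE_ARRAY_OFFSET comes from (it uses 
sun.misc.Unsafe directly rather than Spark's own unsafe wrappers):

{code}
// sun.misc.Unsafe has no public accessor, so grab it reflectively.
val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
field.setAccessible(true)
val unsafe = field.get(null).asInstanceOf[sun.misc.Unsafe]

// Typically 16 with compressed oops (heap < 32g) and 24 without them (heap >= 32g),
// which is why an equality check against an offset captured elsewhere is fragile.
val BYTE_ARRAY_OFFSET = unsafe.arrayBaseOffset(classOf[Array[Byte]])
println(BYTE_ARRAY_OFFSET)
{code}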

> spark sql query string field return empty/garbled string
> 
>
> Key: SPARK-9725
> URL: https://issues.apache.org/jira/browse/SPARK-9725
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Fei Wang
>Assignee: Davies Liu
>Priority: Blocker
>
> To reproduce it:
> 1. Deploy Spark in cluster mode; I use standalone mode locally.
> 2. Set executor memory >= 32g, i.e. set the following config in spark-default.xml:
>spark.executor.memory  36g
> 3. Run spark-sql.sh with "show tables"; it returns an empty/garbled string.






[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names

2015-08-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-9970:
--
Description: 
Hi,
I'm trying to do a "nested join" of tables.
After the first join everything is OK, but the second join loses some of the 
column names.

My code is the following:

{code:java}
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
joinType = "left_outer"):
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
tmpTable["joinColumn"], joinType).drop("joinColumn")


user = sqlContext.read.json(path + "user.json")
user.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

order =  sqlContext.read.json(path + "order.json")
order.printSchema();
root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)

lines =  sqlContext.read.json(path + "lines.json")
lines.printSchema();
root
 |-- id: long (nullable = true)
 |-- orderid: long (nullable = true)
 |-- product: string (nullable = true)
{code}

And joining code:

{code:java}
orders = joinTable(order, lines, "id", "orderid", "lines")
orders.printSchema()

root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)
 |-- lines: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- orderid: long (nullable = true)
 |||-- product: string (nullable = true)

clients = joinTable(user, orders, "id", "userid", "orders")
clients.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- price: double (nullable = true)
 |||-- userid: long (nullable = true)
 |||-- lines: array (nullable = true)
 ||||-- element: struct (containsNull = true)
 |||||-- _1: long (nullable = true)
 |||||-- _2: long (nullable = true)
 |||||-- _3: string (nullable = true)
{code}

I tried to check whether groupBy is the root of the problem, but it looks right:
{code:java}
grouped =  orders.rdd.groupBy(lambda r: r.userid)
print grouped.map(lambda x: list(x[1])).collect()

[[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), 
Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, 
lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, 
userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]]
{code}

So I assume that the problem is with createDataFrame.

  was:
Hi,
I'm trying to do a "nested join" of tables.
After the first join everything is OK, but the second join loses some of the 
column names.

My code is the following:
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
joinType = "left_outer"):
print "Joining tables %s %s" % (namestr(tableLeft), namestr(tableRight))
sys.stdout.flush()
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
tmpTable["joinColumn"], joinType).drop("joinColumn")

user = sqlContext.read.json(path + "user.json")
user.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

order =  sqlContext.read.json(path + "order.json")
order.printSchema();
root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)

lines =  sqlContext.read.json(path + "lines.json")
lines.printSchema();
root
 |-- id: long (nullable = true)
 |-- orderid: long (nullable = true)
 |-- product: string (nullable = true)

orders = joinTable(order, lines, "id", "orderid", "lines")
orders.printSchema()

root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)
 |-- lines: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- orderid: long (nullable = true)
 |||-- product: string (nullable = true)

clients = joinTable(user, orders, "id", "userid", "orders")
clients.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- price: double (nullable = true)
 |||-- userid: long (nullable = true)
 |||-- lines: array (nullable = true)
 ||

[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names

2015-08-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-9970:
--
Description: 
Hi,
I'm trying to do a "nested join" of tables.
After the first join everything is OK, but the second join loses some of the 
column names.

My code is the following:

{code:java}
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
joinType = "left_outer"):
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
tmpTable["joinColumn"], joinType).drop("joinColumn")


user = sqlContext.read.json(path + "user.json")
user.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

order =  sqlContext.read.json(path + "order.json")
order.printSchema();
root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)

lines =  sqlContext.read.json(path + "lines.json")
lines.printSchema();
root
 |-- id: long (nullable = true)
 |-- orderid: long (nullable = true)
 |-- product: string (nullable = true)
{code}

And joining code:
Please look at the result of the 2nd join: there are columns _1, _2, _3 where 
there should be 'id', 'orderid', 'product'.

{code:java}
orders = joinTable(order, lines, "id", "orderid", "lines")
orders.printSchema()

root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)
 |-- lines: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- orderid: long (nullable = true)
 |||-- product: string (nullable = true)

clients = joinTable(user, orders, "id", "userid", "orders")
clients.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- price: double (nullable = true)
 |||-- userid: long (nullable = true)
 |||-- lines: array (nullable = true)
 ||||-- element: struct (containsNull = true)
 |||||-- _1: long (nullable = true)
 |||||-- _2: long (nullable = true)
 |||||-- _3: string (nullable = true)
{code}

I tried to check whether groupBy is the root of the problem, but it looks right:
{code:java}
grouped =  orders.rdd.groupBy(lambda r: r.userid)
print grouped.map(lambda x: list(x[1])).collect()

[[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), 
Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, 
lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, 
userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]]
{code}

So I assume that the problem is with createDataFrame.

  was:
Hi,
I'm trying to do a "nested join" of tables.
After the first join everything is OK, but the second join loses some of the 
column names.

My code is the following:

{code:java}
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
joinType = "left_outer"):
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
tmpTable["joinColumn"], joinType).drop("joinColumn")


user = sqlContext.read.json(path + "user.json")
user.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

order =  sqlContext.read.json(path + "order.json")
order.printSchema();
root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)

lines =  sqlContext.read.json(path + "lines.json")
lines.printSchema();
root
 |-- id: long (nullable = true)
 |-- orderid: long (nullable = true)
 |-- product: string (nullable = true)
{code}

And joining code:

{code:java}
orders = joinTable(order, lines, "id", "orderid", "lines")
orders.printSchema()

root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)
 |-- lines: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- orderid: long (nullable = true)
 |||-- product: string (nullable = true)

clients = joinTable(user, orders, "id", "userid", "orders")
clients.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- price: double (nullable = true)
 |||-- userid: long (nullable = true)

[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696700#comment-14696700
 ] 

Apache Spark commented on SPARK-6624:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8193

> Convert filters into CNF for data sources
> -
>
> Key: SPARK-6624
> URL: https://issues.apache.org/jira/browse/SPARK-6624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We should turn filters into conjunctive normal form (CNF) before we pass them 
> to data sources. Otherwise, filters are not very useful if there is a single 
> filter with a bunch of ORs.
> Note that we already try to do some of these in BooleanSimplification, but I 
> think we should just formalize it to use CNF.
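
To illustrate what the conversion does (a toy sketch only, with a made-up 
expression ADT rather than Catalyst's classes; the naive distribution below can 
blow up exponentially on deeply nested filters):

{code}
sealed trait Expr
case class Pred(name: String) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Push ORs below ANDs by distributing OR over AND.
def toCNF(e: Expr): Expr = e match {
  case And(l, r) => And(toCNF(l), toCNF(r))
  case Or(l, r) => (toCNF(l), toCNF(r)) match {
    case (And(a, b), c) => And(toCNF(Or(a, c)), toCNF(Or(b, c)))
    case (a, And(b, c)) => And(toCNF(Or(a, b)), toCNF(Or(a, c)))
    case (a, b)         => Or(a, b)
  }
  case leaf => leaf
}

// (a AND b) OR c  ==>  (a OR c) AND (b OR c), i.e. two conjuncts that a data
// source can evaluate or push down independently.
toCNF(Or(And(Pred("a"), Pred("b")), Pred("c")))
{code}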






[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6624:
---

Assignee: (was: Apache Spark)

> Convert filters into CNF for data sources
> -
>
> Key: SPARK-6624
> URL: https://issues.apache.org/jira/browse/SPARK-6624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We should turn filters into conjunctive normal form (CNF) before we pass them 
> to data sources. Otherwise, filters are not very useful if there is a single 
> filter with a bunch of ORs.
> Note that we already try to do some of these in BooleanSimplification, but I 
> think we should just formalize it to use CNF.






[jira] [Assigned] (SPARK-6624) Convert filters into CNF for data sources

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6624:
---

Assignee: Apache Spark

> Convert filters into CNF for data sources
> -
>
> Key: SPARK-6624
> URL: https://issues.apache.org/jira/browse/SPARK-6624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> We should turn filters into conjunctive normal form (CNF) before we pass them 
> to data sources. Otherwise, filters are not very useful if there is a single 
> filter with a bunch of ORs.
> Note that we already try to do some of these in BooleanSimplification, but I 
> think we should just formalize it to use CNF.






[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names

2015-08-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-9970:
--
Description: 
Hi,
I'm trying to do a "nested join" of tables.
After the first join everything is OK, but the second join loses some of the 
column names.

My code is the following:

{code:java}
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
joinType = "left_outer"):
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
tmpTable["joinColumn"], joinType).drop("joinColumn")


user = sqlContext.read.json(path + "user.json")
user.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

order =  sqlContext.read.json(path + "order.json")
order.printSchema();
root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)

lines =  sqlContext.read.json(path + "lines.json")
lines.printSchema();
root
 |-- id: long (nullable = true)
 |-- orderid: long (nullable = true)
 |-- product: string (nullable = true)
{code}

And joining code:
Please look at the result of the 2nd join: there are columns _1, _2, _3 where 
there should be 'id', 'orderid', 'product'.

{code:java}
orders = joinTable(order, lines, "id", "orderid", "lines")
orders.printSchema()

root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)
 |-- lines: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- orderid: long (nullable = true)
 |||-- product: string (nullable = true)

clients = joinTable(user, orders, "id", "userid", "orders")
clients.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- price: double (nullable = true)
 |||-- userid: long (nullable = true)
 |||-- lines: array (nullable = true)
 ||||-- element: struct (containsNull = true)
 |||||-- _1: long (nullable = true)
 |||||-- _2: long (nullable = true)
 |||||-- _3: string (nullable = true)
{code}

I tried to check whether groupBy is the cause of the problem, but it looks right:
{code:java}
grouped =  orders.rdd.groupBy(lambda r: r.userid)
print grouped.map(lambda x: list(x[1])).collect()

[[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, product=u'XXX'), 
Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, price=343.99, userid=1, 
lines=[Row(id=3, orderid=2, product=u'XXX')])], [Row(id=3, price=399.99, 
userid=2, lines=[Row(id=4, orderid=3, product=u'XXX')])]]
{code}

So I assume that the problem is with createDataFrame.

  was:
Hi,
I'm trying to do a "nested join" of tables.
After the first join everything is OK, but the second join loses some of the 
column names.

My code is the following:

{code:java}
def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
joinType = "left_outer"):
tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
r.asDict()[columnRight]))
tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
tmpTable._2.data.alias(columnNested))
return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
tmpTable["joinColumn"], joinType).drop("joinColumn")


user = sqlContext.read.json(path + "user.json")
user.printSchema()
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

order =  sqlContext.read.json(path + "order.json")
order.printSchema();
root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)

lines =  sqlContext.read.json(path + "lines.json")
lines.printSchema();
root
 |-- id: long (nullable = true)
 |-- orderid: long (nullable = true)
 |-- product: string (nullable = true)
{code}

And joining code:
Please look at the result of the 2nd join: there are columns _1, _2, _3 where 
there should be 'id', 'orderid', 'product'.

{code:java}
orders = joinTable(order, lines, "id", "orderid", "lines")
orders.printSchema()

root
 |-- id: long (nullable = true)
 |-- price: double (nullable = true)
 |-- userid: long (nullable = true)
 |-- lines: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (nullable = true)
 |||-- orderid: long (nullable = true)
 |||-- product: string (nullable = true)

clients = joinTable(user, orders, "id", "userid", "orders")
clients.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- orders: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- id: long (null

[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter

2015-08-14 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696706#comment-14696706
 ] 

Yu Ishikawa commented on SPARK-9871:


I think it would be better to deal with {{struct}} in another issue, since 
{{collect}} doesn't work with a DataFrame that has a column of Struct type. So 
we need to improve the {{collect}} method.

When I tried to implement the {{struct}} function, a struct column converted by 
{{dfToCols}} consists of {{jobj}}s:
{noformat}
List of 1
 $ structed:List of 2
  ..$ :Class 'jobj' 
  ..$ :Class 'jobj' 
{noformat}

> Add expression functions into SparkR which have a variable parameter
> 
>
> Key: SPARK-9871
> URL: https://issues.apache.org/jira/browse/SPARK-9871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add expression functions into SparkR which have a variable number of 
> parameters, like {{concat}}






[jira] [Created] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Frank Rosner (JIRA)
Frank Rosner created SPARK-9971:
---

 Summary: MaxFunction not working correctly with columns containing 
Double.NaN
 Key: SPARK-9971
 URL: https://issues.apache.org/jira/browse/SPARK-9971
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: Frank Rosner


h5. Problem Description

When using the {{max}} function on a {{DoubleType}} column that contains 
{{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 

This is because it compares all values with the running maximum. However, {{x < 
Double.NaN}} will always evaluate to false for all {{x: Double}}, as will {{x > 
Double.NaN}}.

h5. How to Reproduce

{code}
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))
dataFrame.select(max("col")).first
// returns org.apache.spark.sql.Row = [NaN]
{code}

h5. Solution

The {{max}} and {{min}} functions should ignore NaN values, as they are not 
numbers. If a column contains only NaN values, then the maximum and minimum is 
not defined.
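
Until that is decided, a possible workaround is to drop NaN values before 
aggregating. A sketch on the DataFrame's underlying RDD (it assumes the double 
column is at index 0):

{code}
// Filter out NaN explicitly, then fold with math.max.
val maxIgnoringNaN = dataFrame.rdd
  .map(_.getDouble(0))
  .filter(!_.isNaN)
  .fold(Double.NegativeInfinity)(math.max)
{code}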






[jira] [Created] (SPARK-9973) wrong buffer size

2015-08-14 Thread xukun (JIRA)
xukun created SPARK-9973:


 Summary: wrong buffer size
 Key: SPARK-9973
 URL: https://issues.apache.org/jira/browse/SPARK-9973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: xukun









[jira] [Created] (SPARK-9972) Add `struct` function in SparkR

2015-08-14 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-9972:
--

 Summary: Add `struct` function in SparkR
 Key: SPARK-9972
 URL: https://issues.apache.org/jira/browse/SPARK-9972
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa


Support the {{struct}} function on a DataFrame in SparkR. However, I think we need 
to improve the {{collect}} function in SparkR in order to implement the {{struct}} 
function.






[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-9971:

Priority: Minor  (was: Major)

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h5. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always evaluate to false for all {{x: Double}}, as will {{x > 
> Double.NaN}}.
> h5. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.functions.max
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h5. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.






[jira] [Updated] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-9971:

Description: 
h4. Problem Description

When using the {{max}} function on a {{DoubleType}} column that contains 
{{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 

This is because it compares all values with the running maximum. However, {{x < 
Double.NaN}} will always evaluate to false for all {{x: Double}}, as will {{x > 
Double.NaN}}.

h4. How to Reproduce

{code}
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))
dataFrame.select(max("col")).first
// returns org.apache.spark.sql.Row = [NaN]
{code}

h4. Solution

The {{max}} and {{min}} functions should ignore NaN values, as they are not 
numbers. If a column contains only NaN values, then the maximum and minimum is 
not defined.

  was:
h5. Problem Description

When using the {{max}} function on a {{DoubleType}} column that contains 
{{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 

This is because it compares all values with the running maximum. However, {{x < 
Double.NaN}} will always evaluate to false for all {{x: Double}}, as will {{x > 
Double.NaN}}.

h5. How to Reproduce

{code}
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))
dataFrame.select(max("col")).first
// returns org.apache.spark.sql.Row = [NaN]
{code}

h5. Solution

The {{max}} and {{min}} functions should ignore NaN values, as they are not 
numbers. If a column contains only NaN values, then the maximum and minimum is 
not defined.


> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always evaluate to false for all {{x: Double}}, as will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.functions.max
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.






[jira] [Updated] (SPARK-9973) wrong buffer size

2015-08-14 Thread xukun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xukun updated SPARK-9973:
-
Description: The buffer is allocated with the wrong size (4 + size * 
columnType.defaultSize * columnType.defaultSize); the right size is 4 + size * 
columnType.defaultSize.
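
To make the difference concrete (an illustrative calculation only, not the 
actual Spark code): for size = 1024 values of a LONG column (defaultSize = 8), 
the wrong formula allocates 4 + 1024 * 8 * 8 = 65540 bytes, while the correct 
one allocates 4 + 1024 * 8 = 8196 bytes:

{code}
val size = 1024            // number of values the builder expects
val defaultSize = 8        // e.g. LONG's columnType.defaultSize
val wrongSize = 4 + size * defaultSize * defaultSize  // 65540
val rightSize = 4 + size * defaultSize                // 8196
{code}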

> wrong buffer size
> -
>
> Key: SPARK-9973
> URL: https://issues.apache.org/jira/browse/SPARK-9973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: xukun
>
> The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * 
> columnType.defaultSize); the right size is 4 + size * 
> columnType.defaultSize.






[jira] [Assigned] (SPARK-9973) wrong buffer size

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9973:
---

Assignee: Apache Spark

> wrong buffer size
> -
>
> Key: SPARK-9973
> URL: https://issues.apache.org/jira/browse/SPARK-9973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: xukun
>Assignee: Apache Spark
>
> The buffer is allocated with the wrong size (4 + size * columnType.defaultSize * 
> columnType.defaultSize); the right size is 4 + size * 
> columnType.defaultSize.






[jira] [Commented] (SPARK-9973) wrong buffle size

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696714#comment-14696714
 ] 

Apache Spark commented on SPARK-9973:
-

User 'viper-kun' has created a pull request for this issue:
https://github.com/apache/spark/pull/8189

> wrong buffle size
> -
>
> Key: SPARK-9973
> URL: https://issues.apache.org/jira/browse/SPARK-9973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: xukun
>
> allocate wrong buffer size (4+ size * columnType.defaultSize  * 
> columnType.defaultSize), right buffer size is  4+ size * 
> columnType.defaultSize.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9973) wrong buffle size

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9973:
---

Assignee: (was: Apache Spark)

> wrong buffle size
> -
>
> Key: SPARK-9973
> URL: https://issues.apache.org/jira/browse/SPARK-9973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: xukun
>
> allocate wrong buffer size (4+ size * columnType.defaultSize  * 
> columnType.defaultSize), right buffer size is  4+ size * 
> columnType.defaultSize.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696716#comment-14696716
 ] 

Sean Owen commented on SPARK-9971:
--

My instinct is that this should in fact result in NaN; NaN is not in general 
ignored. For example {{math.max(1.0, Double.NaN)}} and {{math.min(1.0, 
Double.NaN)}} are both NaN.

But are you ready for some weird?

{code}
scala> Seq(1.0, Double.NaN).max
res23: Double = NaN

scala> Seq(Double.NaN, 1.0).max
res24: Double = 1.0

scala> Seq(5.0, Double.NaN, 1.0).max
res25: Double = 1.0

scala> Seq(5.0, Double.NaN, 1.0, 6.0).max
res26: Double = 6.0

scala> Seq(5.0, Double.NaN, 1.0, 6.0, Double.NaN).max
res27: Double = NaN
{code}

Not sure what to make of that, other than the Scala collection library isn't a 
good reference for behavior. Java?

{code}
scala> java.util.Collections.max(java.util.Arrays.asList(new 
java.lang.Double(1.0), new java.lang.Double(Double.NaN)))
res33: Double = NaN

scala> java.util.Collections.max(java.util.Arrays.asList(new 
java.lang.Double(Double.NaN), new java.lang.Double(1.0)))
res34: Double = NaN
{code}

Makes more sense at least. I think this is correct behavior and you would 
filter NaN if you want to ignore them, as it is generally not something the 
language ignores.
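
For reference, a minimal sketch of the filtering approach, assuming the 
{{dataFrame}} from the reproduction below and a NaN test on the column 
({{Column.isNaN}} in newer releases; on 1.4 a small UDF would be needed):
{code}
import org.apache.spark.sql.functions.max

// Drop NaN rows explicitly, then aggregate; max/min semantics stay unchanged.
dataFrame.filter(!dataFrame("col").isNaN).select(max("col")).first
// expected: [0.0] for the example data, since the NaN row is excluded
{code}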

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9973) wrong buffle size

2015-08-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696717#comment-14696717
 ] 

Sean Owen commented on SPARK-9973:
--

[~xukun] This would be more useful if you added a descriptive title and a more 
complete description. What code is affected? What buffer? You explained this in 
the PR to some degree, so a brief note here about the problem to be solved is 
appropriate. Right now someone reading this doesn't know what it refers to.

> wrong buffle size
> -
>
> Key: SPARK-9973
> URL: https://issues.apache.org/jira/browse/SPARK-9973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: xukun
>
> allocate wrong buffer size (4+ size * columnType.defaultSize  * 
> columnType.defaultSize), right buffer size is  4+ size * 
> columnType.defaultSize.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9871:
---

Assignee: (was: Apache Spark)

> Add expression functions into SparkR which have a variable parameter
> 
>
> Key: SPARK-9871
> URL: https://issues.apache.org/jira/browse/SPARK-9871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add expression functions into SparkR which has a variable parameter, like 
> {{concat}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9871) Add expression functions into SparkR which have a variable parameter

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9871:
---

Assignee: Apache Spark

> Add expression functions into SparkR which have a variable parameter
> 
>
> Key: SPARK-9871
> URL: https://issues.apache.org/jira/browse/SPARK-9871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>Assignee: Apache Spark
>
> Add expression functions into SparkR which has a variable parameter, like 
> {{concat}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9871) Add expression functions into SparkR which have a variable parameter

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696720#comment-14696720
 ] 

Apache Spark commented on SPARK-9871:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/8194

> Add expression functions into SparkR which have a variable parameter
> 
>
> Key: SPARK-9871
> URL: https://issues.apache.org/jira/browse/SPARK-9871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add expression functions into SparkR which has a variable parameter, like 
> {{concat}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696760#comment-14696760
 ] 

Frank Rosner commented on SPARK-9971:
-

I would like to provide a patch to make the following unit tests in 
{{DataFrameFunctionsSuite}} succeed:

{code}
  test("max function ignoring Double.NaN") {
val df = Seq(
  (Double.NaN, Double.NaN),
  (-10d, Double.NaN),
  (10d, Double.NaN)
).toDF("col1", "col2")
checkAnswer(
  df.select(max("col1")),
  Seq(Row(10d))
)
checkAnswer(
  df.select(max("col2")),
  Seq(Row(Double.NaN))
)
  }

  test("min function ignoring Double.NaN") {
val df = Seq(
  (Double.NaN, Double.NaN),
  (-10d, Double.NaN),
  (10d, Double.NaN)
).toDF("col1", "col2")
checkAnswer(
  df.select(min("col1")),
  Seq(Row(-10d))
)
checkAnswer(
  df.select(min("col1")),
  Seq(Row(Double.NaN))
)
  }
{code}

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696762#comment-14696762
 ] 

Sean Owen commented on SPARK-9971:
--

Hey Frank, as I say, I don't think that's the correct answer, at least in that 
it's not consistent with the Java/Scala APIs... well, the Scala APIs are kind of 
confused, but none of them consistently ignore NaN.

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696760#comment-14696760
 ] 

Frank Rosner edited comment on SPARK-9971 at 8/14/15 9:34 AM:
--

I would like to provide a patch to make the following unit tests in 
{{DataFrameFunctionsSuite}} succeed:

{code}
  test("max function ignoring Double.NaN") {
val df = Seq(
  (Double.NaN, Double.NaN),
  (-10d, Double.NaN),
  (10d, Double.NaN)
).toDF("col1", "col2")
checkAnswer(
  df.select(max("col1")),
  Seq(Row(10d))
)
checkAnswer(
  df.select(max("col2")),
  Seq(Row(null))
)
  }

  test("min function ignoring Double.NaN") {
val df = Seq(
  (Double.NaN, Double.NaN),
  (-10d, Double.NaN),
  (10d, Double.NaN)
).toDF("col1", "col2")
checkAnswer(
  df.select(min("col1")),
  Seq(Row(-10d))
)
checkAnswer(
  df.select(min("col1")),
  Seq(Row(null))
)
  }
{code}


was (Author: frosner):
I would like to provide a patch to make the following unit tests in 
{{DataFrameFunctionsSuite}} succeed:

{code}
  test("max function ignoring Double.NaN") {
val df = Seq(
  (Double.NaN, Double.NaN),
  (-10d, Double.NaN),
  (10d, Double.NaN)
).toDF("col1", "col2")
checkAnswer(
  df.select(max("col1")),
  Seq(Row(10d))
)
checkAnswer(
  df.select(max("col2")),
  Seq(Row(Double.NaN))
)
  }

  test("min function ignoring Double.NaN") {
val df = Seq(
  (Double.NaN, Double.NaN),
  (-10d, Double.NaN),
  (10d, Double.NaN)
).toDF("col1", "col2")
checkAnswer(
  df.select(min("col1")),
  Seq(Row(-10d))
)
checkAnswer(
  df.select(min("col1")),
  Seq(Row(Double.NaN))
)
  }
{code}

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696766#comment-14696766
 ] 

Frank Rosner commented on SPARK-9971:
-

[~srowen] I get your point. But given your "weird" examples, I don't think that 
being consistent with the Scala collection library is desirable. When I work on 
a data set that contains values which are not a number (e.g. from a division by 
zero) and I want to compute the maximum, I find it convenient if the result is 
not NaN. NaN is not a number, so it cannot be the maximum by definition.

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696768#comment-14696768
 ] 

Sean Owen commented on SPARK-9971:
--

My argument is about being consistent with Java, really, and with the not-weird 
parts of Scala. In either language, NaNs are not ignored; products and sums 
involving NaN become NaN. Neither appears to ignore NaN, so this behavior would 
not be consistent with anything else. I get that it's convenient, but it's also 
easy to filter.

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9780) In case of invalid initialization of KafkaDirectStream, NPE is thrown

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9780.
--
   Resolution: Fixed
 Assignee: Cody Koeninger
Fix Version/s: 1.5.0

> In case of invalid initialization of KafkaDirectStream, NPE is thrown
> -
>
> Key: SPARK-9780
> URL: https://issues.apache.org/jira/browse/SPARK-9780
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.1, 1.4.1
>Reporter: Grigory Turunov
>Assignee: Cody Koeninger
>Priority: Minor
> Fix For: 1.5.0
>
>
> [o.a.s.streaming.kafka.KafkaRDD.scala#L143|https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaRDD.scala#L143]
> During the initialization of KafkaRDDIterator, a TaskCompletionListener is added 
> to the context which calls close() on the consumer, even though the consumer is 
> not initialized yet (it is initialized 12 lines later).
> If anything fails within those 12 lines (in my case the valueDecoder had a 
> private constructor), the thrown exception triggers context.markTaskCompleted() in
> [o.a.s.scheduler.Task.scala#L90|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L90]
> which throws a NullPointerException when it tries to call close() on the 
> non-initialized consumer.
> This masks the original exception, so it is very hard to understand what is 
> happening.
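
A minimal sketch of the defensive pattern being described, with hypothetical 
types rather than the actual KafkaRDD code: only close the consumer if it was 
actually created, so an early failure is not masked.
{code}
trait ConsumerLike { def close(): Unit }

class KafkaIteratorSketch(onTaskCompletion: (() => Unit) => Unit,
                          connect: () => ConsumerLike) {
  private var consumer: ConsumerLike = _        // initialized a few lines below

  // Registered before initialization, so it must tolerate a null consumer.
  onTaskCompletion(() => Option(consumer).foreach(_.close()))

  // ... validation that may throw (e.g. constructing the value decoder) ...
  consumer = connect()
}
{code}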



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar

2015-08-14 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-9974:
-

 Summary: SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not 
packaged into the assembly jar
 Key: SPARK-9974
 URL: https://issues.apache.org/jira/browse/SPARK-9974
 Project: Spark
  Issue Type: Bug
  Components: Build, SQL
Affects Versions: 1.5.0
Reporter: Cheng Lian
Priority: Blocker


Git commit: 
[69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd]

Build with SBT and check the assembly jar for 
{{parquet.hadoop.api.WriteSupport}}:
{noformat}
$ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
clean assembly/assembly
...
$ jar tf 
assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
fgrep "parquet/hadoop/api"
org/apache/parquet/hadoop/api/
org/apache/parquet/hadoop/api/DelegatingReadSupport.class
org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
org/apache/parquet/hadoop/api/InitContext.class
org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
org/apache/parquet/hadoop/api/ReadSupport.class
org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
org/apache/parquet/hadoop/api/WriteSupport.class
{noformat}
Only classes from {{org.apache.parquet:parquet-mr:1.7.0}} are included. Note that 
classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the 
{{org.apache}} namespace.

Build with Maven and check the assembly jar for 
{{parquet.hadoop.api.WriteSupport}}:
{noformat}
$ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
-DskipTests clean package
$ jar tf 
assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
fgrep "parquet/hadoop/api"
org/apache/parquet/hadoop/api/
org/apache/parquet/hadoop/api/DelegatingReadSupport.class
org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
org/apache/parquet/hadoop/api/InitContext.class
org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
org/apache/parquet/hadoop/api/ReadSupport.class
org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
org/apache/parquet/hadoop/api/WriteSupport.class
parquet/hadoop/api/
parquet/hadoop/api/DelegatingReadSupport.class
parquet/hadoop/api/DelegatingWriteSupport.class
parquet/hadoop/api/InitContext.class
parquet/hadoop/api/ReadSupport$ReadContext.class
parquet/hadoop/api/ReadSupport.class
parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
parquet/hadoop/api/WriteSupport$WriteContext.class
parquet/hadoop/api/WriteSupport.class
{noformat}
Expected classes are packaged properly.

One of the consequences of this issue is that Parquet tables created in Hive are 
not accessible from Spark SQL built with SBT. To reproduce this issue, first 
create a Parquet table with Hive (say 0.13.1):
{noformat}
hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1;
{noformat}
Build Spark assembly jar with the SBT command above, start {{spark-shell}}:
{noformat}
scala> sqlContext.table("parquet_test").show()
15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test
15/08/14 17:52:50 INFO audit: ugi=lian  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=parquet_test
java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at 
org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298)
at 
org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
at 
org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:397)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spar

[jira] [Updated] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar

2015-08-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-9974:
--
Description: 
One of the consequences of this issue is that Parquet tables created in Hive are 
not accessible from Spark SQL built with SBT. The Maven build is OK.

Git commit: 
[69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd]

Build with SBT and check the assembly jar for 
{{parquet.hadoop.api.WriteSupport}}:
{noformat}
$ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
clean assembly/assembly
...
$ jar tf 
assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
fgrep "parquet/hadoop/api"
org/apache/parquet/hadoop/api/
org/apache/parquet/hadoop/api/DelegatingReadSupport.class
org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
org/apache/parquet/hadoop/api/InitContext.class
org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
org/apache/parquet/hadoop/api/ReadSupport.class
org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
org/apache/parquet/hadoop/api/WriteSupport.class
{noformat}
Only classes from {{org.apache.parquet:parquet-mr:1.7.0}} are included. Note that 
classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the 
{{org.apache}} namespace.

Build with Maven and check the assembly jar for 
{{parquet.hadoop.api.WriteSupport}}:
{noformat}
$ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
-DskipTests clean package
$ jar tf 
assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
fgrep "parquet/hadoop/api"
org/apache/parquet/hadoop/api/
org/apache/parquet/hadoop/api/DelegatingReadSupport.class
org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
org/apache/parquet/hadoop/api/InitContext.class
org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
org/apache/parquet/hadoop/api/ReadSupport.class
org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
org/apache/parquet/hadoop/api/WriteSupport.class
parquet/hadoop/api/
parquet/hadoop/api/DelegatingReadSupport.class
parquet/hadoop/api/DelegatingWriteSupport.class
parquet/hadoop/api/InitContext.class
parquet/hadoop/api/ReadSupport$ReadContext.class
parquet/hadoop/api/ReadSupport.class
parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
parquet/hadoop/api/WriteSupport$WriteContext.class
parquet/hadoop/api/WriteSupport.class
{noformat}
Expected classes are packaged properly.

To reproduce the Parquet table access issue, first create a Parquet table with 
Hive (say 0.13.1):
{noformat}
hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1;
{noformat}
Build Spark assembly jar with the SBT command above, start {{spark-shell}}:
{noformat}
scala> sqlContext.table("parquet_test").show()
15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default tbl=parquet_test
15/08/14 17:52:50 INFO audit: ugi=lian  ip=unknown-ip-addr  cmd=get_table : 
db=default tbl=parquet_test
java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:270)
at 
org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303)
at scala.Option.map(Option.scala:145)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
at 
org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
at 
org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
at 
org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
at 
org.apache.spark.sql.hive.client.ClientWrapper.getTableOption(ClientWrapper.scala:298)
at 
org.apache.spark.sql.hive.client.ClientInterface$class.getTable(ClientInterface.scala:123)
at 
org.apache.spark.sql.hive.client.ClientWrapper.getTable(ClientWrapper.scala:60)
at 
org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:397)
at 
org.apache.spark.sql.hive.HiveContext$$anon$1.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(HiveContext.scala:403)
at 
org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:170)
at 

[jira] [Created] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX

2015-08-14 Thread Kenny Bastani (JIRA)
Kenny Bastani created SPARK-9975:


 Summary: Add Normalized Closeness Centrality to Spark GraphX
 Key: SPARK-9975
 URL: https://issues.apache.org/jira/browse/SPARK-9975
 Project: Spark
  Issue Type: New Feature
  Components: GraphX
Reporter: Kenny Bastani
Priority: Minor


“Closeness centrality” is also defined as a proportion. First, the distance of 
a vertex from all other vertices in the network is counted. Normalization is 
achieved by defining closeness centrality as the number of other vertices 
divided by this sum (De Nooy et al., 2005, p. 127). Because of this 
normalization, closeness centrality provides a global measure about the 
position of a vertex in the network, while betweenness centrality is defined 
with reference to the local position of a vertex. -- Cited from 
http://arxiv.org/pdf/0911.2719.pdf

This request is to add normalized closeness centrality as a core graph 
algorithm in the GraphX library. I implemented this algorithm for a graph 
processing extension to Neo4j 
(https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I would 
like to put it up for review for inclusion in Spark. This algorithm is very 
straightforward and builds on top of the ShortestPaths (SSSP) algorithm already 
included in the library.
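
For illustration, a rough sketch of how this could sit on top of 
{{ShortestPaths}}, assuming all vertices are used as landmarks and hop counts as 
distances; the actual pull request may differ:
{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.ShortestPaths

// Normalized closeness: (number of other vertices) / (sum of shortest-path distances).
def closeness[VD, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Double] = {
  val landmarks = graph.vertices.map(_._1).collect().toSeq  // feasible for small graphs only
  val n = landmarks.size
  ShortestPaths.run(graph, landmarks).vertices.mapValues { dists =>
    val total = dists.values.sum                            // unreachable vertices are absent
    if (total > 0) (n - 1).toDouble / total else 0.0
  }
}
{code}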



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9975:
---

Assignee: Apache Spark

> Add Normalized Closeness Centrality to Spark GraphX
> ---
>
> Key: SPARK-9975
> URL: https://issues.apache.org/jira/browse/SPARK-9975
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Kenny Bastani
>Assignee: Apache Spark
>Priority: Minor
>  Labels: features
>
> “Closeness centrality” is also defined as a proportion. First, the distance 
> of a vertex from all other vertices in the network is counted. Normalization 
> is achieved by defining closeness centrality as the number of other vertices 
> divided by this sum (De Nooy et al., 2005, p. 127). Because of this 
> normalization, closeness centrality provides a global measure about the 
> position of a vertex in the network, while betweenness centrality is defined 
> with reference to the local position of a vertex. -- Cited from 
> http://arxiv.org/pdf/0911.2719.pdf
> This request is to add normalized closeness centrality as a core graph 
> algorithm in the GraphX library. I implemented this algorithm for a graph 
> processing extension to Neo4j 
> (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I 
> would like to put it up for review for inclusion into Spark. This algorithm 
> is very straight forward and builds on top of the included ShortestPaths 
> (SSSP) algorithm already in the library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696799#comment-14696799
 ] 

Apache Spark commented on SPARK-9975:
-

User 'kbastani' has created a pull request for this issue:
https://github.com/apache/spark/pull/8195

> Add Normalized Closeness Centrality to Spark GraphX
> ---
>
> Key: SPARK-9975
> URL: https://issues.apache.org/jira/browse/SPARK-9975
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Kenny Bastani
>Priority: Minor
>  Labels: features
>
> “Closeness centrality” is also defined as a proportion. First, the distance 
> of a vertex from all other vertices in the network is counted. Normalization 
> is achieved by defining closeness centrality as the number of other vertices 
> divided by this sum (De Nooy et al., 2005, p. 127). Because of this 
> normalization, closeness centrality provides a global measure about the 
> position of a vertex in the network, while betweenness centrality is defined 
> with reference to the local position of a vertex. -- Cited from 
> http://arxiv.org/pdf/0911.2719.pdf
> This request is to add normalized closeness centrality as a core graph 
> algorithm in the GraphX library. I implemented this algorithm for a graph 
> processing extension to Neo4j 
> (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I 
> would like to put it up for review for inclusion into Spark. This algorithm 
> is very straight forward and builds on top of the included ShortestPaths 
> (SSSP) algorithm already in the library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9975) Add Normalized Closeness Centrality to Spark GraphX

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9975:
---

Assignee: (was: Apache Spark)

> Add Normalized Closeness Centrality to Spark GraphX
> ---
>
> Key: SPARK-9975
> URL: https://issues.apache.org/jira/browse/SPARK-9975
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Reporter: Kenny Bastani
>Priority: Minor
>  Labels: features
>
> “Closeness centrality” is also defined as a proportion. First, the distance 
> of a vertex from all other vertices in the network is counted. Normalization 
> is achieved by defining closeness centrality as the number of other vertices 
> divided by this sum (De Nooy et al., 2005, p. 127). Because of this 
> normalization, closeness centrality provides a global measure about the 
> position of a vertex in the network, while betweenness centrality is defined 
> with reference to the local position of a vertex. -- Cited from 
> http://arxiv.org/pdf/0911.2719.pdf
> This request is to add normalized closeness centrality as a core graph 
> algorithm in the GraphX library. I implemented this algorithm for a graph 
> processing extension to Neo4j 
> (https://github.com/kbastani/neo4j-mazerunner#supported-algorithms) and I 
> would like to put it up for review for inclusion into Spark. This algorithm 
> is very straight forward and builds on top of the included ShortestPaths 
> (SSSP) algorithm already in the library.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8922) Add @since tags to mllib.evaluation

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-8922:
-
Assignee: Xiangrui Meng

> Add @since tags to mllib.evaluation
> ---
>
> Key: SPARK-8922
> URL: https://issues.apache.org/jira/browse/SPARK-8922
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>  Labels: starter
> Fix For: 1.5.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9726) PySpark Regression: DataFrame join no longer accepts None as join expression

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9726:
-
Assignee: Brennan Ashton

> PySpark Regression: DataFrame join no longer accepts None as join expression
> 
>
> Key: SPARK-9726
> URL: https://issues.apache.org/jira/browse/SPARK-9726
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Brennan Ashton
>Assignee: Brennan Ashton
> Fix For: 1.5.0
>
>
> The patch that added methods to support equi-joins broke joins where {{on}} is 
> None. Rather than ending that branch with {{jdf = self._jdf.join(other._jdf)}}, 
> the code continues into another branch, {{if isinstance(on[0], basestring):}}, 
> which fails because you cannot index into None.
> This was valid in 1.4.1:
> {{df3 = df.join(df2, on=None)}}
> This is a trivial fix in the attached PR.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9826) Cannot use custom classes in log4j.properties

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9826:
-
Assignee: Michel Lemay

> Cannot use custom classes in log4j.properties
> -
>
> Key: SPARK-9826
> URL: https://issues.apache.org/jira/browse/SPARK-9826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Michel Lemay
>Assignee: Michel Lemay
>Priority: Minor
>  Labels: regression
> Fix For: 1.4.2, 1.5.0
>
>
> log4j is initialized before spark class loader is set on the thread context.
> Therefore it cannot use classes embedded in fat-jars submitted to spark.
> While parsing arguments, spark calls methods on Utils class and triggers 
> ShutdownHookManager static initialization.  This then leads to log4j being 
> initialized before spark gets the chance to specify custom class 
> MutableURLClassLoader on the thread context.
> See detailed explanation here:
> http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9936) decimal precision lost when loading DataFrame from RDD

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9936:
-
Assignee: Liang-Chi Hsieh

> decimal precision lost when loading DataFrame from RDD
> --
>
> Key: SPARK-9936
> URL: https://issues.apache.org/jira/browse/SPARK-9936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Tzach Zohar
>Assignee: Liang-Chi Hsieh
> Fix For: 1.5.0
>
>
> It seems that when converting an RDD that contains BigDecimals into a 
> DataFrame (using SQLContext.createDataFrame without specifying schema), 
> precision info is lost, which means saving as Parquet file will fail (Parquet 
> tries to verify precision < 18, so fails if it's unset).
> This seems to be similar to 
> [SPARK-7196|https://issues.apache.org/jira/browse/SPARK-7196], which fixed 
> the same issue for DataFrames created via JDBC.
> To reproduce:
> {code:none}
> scala> val rdd: RDD[(String, BigDecimal)] = sc.parallelize(Seq(("a", 
> BigDecimal.valueOf(0.234))))
> rdd: org.apache.spark.rdd.RDD[(String, BigDecimal)] = 
> ParallelCollectionRDD[0] at parallelize at <console>:23
> scala> val df: DataFrame = new SQLContext(rdd.context).createDataFrame(rdd)
> df: org.apache.spark.sql.DataFrame = [_1: string, _2: decimal(10,0)]
> scala> df.write.parquet("/data/parquet-file")
> 15/08/13 10:30:07 ERROR executor.Executor: Exception in task 0.0 in stage 0.0 
> (TID 0)
> java.lang.RuntimeException: Unsupported datatype DecimalType()
> {code}
> To verify this is indeed caused by the precision being lost, I've tried 
> manually changing the schema to include precision (by traversing the 
> StructFields and replacing the DecimalTypes with altered DecimalTypes), 
> creating a new DataFrame using this updated schema - and indeed it fixes the 
> problem.
> I'm using Spark 1.4.0.
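
A hedged sketch of the manual schema fix described above (the precision and 
scale values are illustrative, and {{df}} is the DataFrame from the 
reproduction):
{code}
import org.apache.spark.sql.types._

// Rebuild the schema with an explicit precision/scale so Parquet can verify it.
val fixedSchema = StructType(df.schema.map {
  case StructField(name, _: DecimalType, nullable, meta) =>
    StructField(name, DecimalType(18, 6), nullable, meta)
  case other => other
})
val fixedDf = df.sqlContext.createDataFrame(df.rdd, fixedSchema)
fixedDf.write.parquet("/data/parquet-file")
{code}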



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9931) Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. test_training_and_prediction

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9931:
-
Component/s: Tests
 PySpark

> Flaky test: mllib/tests.py StreamingLogisticRegressionWithSGDTests. 
> test_training_and_prediction
> 
>
> Key: SPARK-9931
> URL: https://issues.apache.org/jira/browse/SPARK-9931
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Tests
>Reporter: Davies Liu
>Priority: Critical
>
> {code}
> FAIL: test_training_and_prediction 
> (__main__.StreamingLogisticRegressionWithSGDTests)
> Test that the model improves on toy data with no. of batches
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/workspace/NewSparkPullRequestBuilder/python/pyspark/mllib/tests.py",
>  line 1250, in test_training_and_prediction
> self.assertTrue(errors[1] - errors[-1] > 0.3)
> AssertionError: False is not true
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9934) Deprecate NIO ConnectionManager

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9934:
-
Component/s: Spark Core

> Deprecate NIO ConnectionManager
> ---
>
> Key: SPARK-9934
> URL: https://issues.apache.org/jira/browse/SPARK-9934
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> We should deprecate ConnectionManager in 1.5 before removing it in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9878) ReduceByKey + FullOuterJoin return 0 element if using an empty RDD

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9878:
-
Component/s: Spark Core

>  ReduceByKey + FullOuterJoin return 0 element if using an empty RDD
> ---
>
> Key: SPARK-9878
> URL: https://issues.apache.org/jira/browse/SPARK-9878
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: linux ubuntu 64b spark-hadoop
> launched with Local[2]
>Reporter: durand remi
>Priority: Minor
>
> Code to reproduce:
> println("ok: " + 
> sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])]).count)
> println("ko: " + 
> sc.parallelize(List((3,4),(4,5))).fullOuterJoin(sc.emptyRDD[(Int,Seq[Int])].reduceByKey((e1, e2) => e1 ++ e2)).count)
> What I expect:
> ok: 2
> ko: 2
> But what I get:
> ok: 2
> ko: 0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9637) Add interface for implementing scheduling algorithm for standalone deployment

2015-08-14 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-9637.
--
Resolution: Duplicate

> Add interface for implementing scheduling algorithm for standalone deployment
> -
>
> Key: SPARK-9637
> URL: https://issues.apache.org/jira/browse/SPARK-9637
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>
> We want to abstract the interface of scheduling algorithm for standalone 
> deployment mode. It can benefit for implementing different scheduling 
> algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9396) Spark yarn allocator does not call "removeContainerRequest" for allocated Container requests, resulting in bloated ask[] toYarn RM.

2015-08-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-9396.
--
Resolution: Won't Fix

This is fixed in 1.3+ by other changes; unless a decision is taken to make 
another 1.2.x release I think this is a WontFix.

> Spark yarn allocator does not call "removeContainerRequest" for allocated 
> Container requests, resulting in bloated ask[] toYarn RM.
> ---
>
> Key: SPARK-9396
> URL: https://issues.apache.org/jira/browse/SPARK-9396
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.2.1, 1.2.2
> Environment: Spark-1.2.1 on hadoop-yarn-2.4.0 cluster. All servers in 
> cluster running Linux version 2.6.32.
>Reporter: prakhar jauhari
>Priority: Minor
>
> Note: the attached logs contain log lines that I added (on the Spark YARN 
> allocator side and the YARN client side) for debugging purposes.
> My Spark job is configured for 2 executors, yet on killing 1 executor the ask 
> is for 3!
> Resource request logs on killing an executor:
> *Killed container: ask for 3 containers, instead of 1*
> 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Will allocate 1 executor 
> containers, each with 2432 MB memory including 384 MB overhead
> 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: numExecutors: 1
> 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: host preferences is empty
> 15/07/15 10:49:01 INFO yarn.YarnAllocationHandler: Container request (host: 
> Any, priority: 1, capability: 
> 15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : 
> allocate: this.ask = [{Priority: 1, Capability: , # 
> Containers: 3, Location: *, Relax Locality: true}]
> 15/07/15 10:49:01 INFO impl.AMRMClientImpl: prakhar : AMRMClientImpl : 
> allocate: allocateRequest = ask { priority{ priority: 1 } resource_name: "*" 
> capability { memory: 2432 virtual_cores: 4 } num_containers: 3 
> relax_locality: true } blacklist_request { } response_id: 354 progress: 0.1
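
For context, a hedged sketch of what removing a satisfied request could look 
like against the public YARN AMRMClient API; this is not the actual Spark code, 
which addressed the problem through other changes in 1.3+:
{code}
import org.apache.hadoop.yarn.api.records.Container
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest

// When a container is granted, drop the matching outstanding request so the
// ask sent on the next allocate() heartbeat does not keep growing.
def markRequestSatisfied(amClient: AMRMClient[ContainerRequest],
                         container: Container): Unit = {
  val matching = amClient.getMatchingRequests(
    container.getPriority, "*", container.getResource)
  if (!matching.isEmpty && !matching.get(0).isEmpty) {
    amClient.removeContainerRequest(matching.get(0).iterator().next())
  }
}
{code}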



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8118) Turn off noisy log output produced by Parquet 1.7.0

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696849#comment-14696849
 ] 

Apache Spark commented on SPARK-8118:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8196

> Turn off noisy log output produced by Parquet 1.7.0
> ---
>
> Key: SPARK-8118
> URL: https://issues.apache.org/jira/browse/SPARK-8118
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: 1.5.0
>
>
> Parquet 1.7.0 renames its package to "org.apache.parquet", so 
> {{ParquetRelation.enableLogForwarding}} needs to be adjusted accordingly to 
> avoid noisy log output.
> A better approach than simply muting these log lines is to redirect Parquet 
> logs via SLF4J, so that we can handle them consistently. In general these logs 
> are very useful, especially when diagnosing Parquet memory issues and filter 
> push-down.
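
A minimal sketch of the SLF4J redirection idea, assuming the 
{{org.slf4j:jul-to-slf4j}} bridge is on the classpath (Parquet logs through 
java.util.logging):
{code}
import org.slf4j.bridge.SLF4JBridgeHandler

// Remove the default java.util.logging console handlers and install the SLF4J
// bridge so Parquet's log output follows the existing log4j/SLF4J configuration.
SLF4JBridgeHandler.removeHandlersForRootLogger()
SLF4JBridgeHandler.install()
{code}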



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9826) Cannot use custom classes in log4j.properties

2015-08-14 Thread Michel Lemay (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michel Lemay closed SPARK-9826.
---

Verified working with a local build of branch-1.4 (1.4.2-snapshot)

> Cannot use custom classes in log4j.properties
> -
>
> Key: SPARK-9826
> URL: https://issues.apache.org/jira/browse/SPARK-9826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: Michel Lemay
>Assignee: Michel Lemay
>Priority: Minor
>  Labels: regression
> Fix For: 1.4.2, 1.5.0
>
>
> log4j is initialized before spark class loader is set on the thread context.
> Therefore it cannot use classes embedded in fat-jars submitted to spark.
> While parsing arguments, spark calls methods on Utils class and triggers 
> ShutdownHookManager static initialization.  This then leads to log4j being 
> initialized before spark gets the chance to specify custom class 
> MutableURLClassLoader on the thread context.
> See detailed explanation here:
> http://apache-spark-user-list.1001560.n3.nabble.com/log4j-custom-appender-ClassNotFoundException-with-spark-1-4-1-tt24159.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696889#comment-14696889
 ] 

Frank Rosner commented on SPARK-9971:
-

Ok so shall we close it as a wontfix?

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always lead false for all {{x: Double}}, so will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> is not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9906:
---

Assignee: Manoj Kumar  (was: Apache Spark)

> User guide for LogisticRegressionSummary
> 
>
> Key: SPARK-9906
> URL: https://issues.apache.org/jira/browse/SPARK-9906
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Manoj Kumar
>
> SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model 
> statistics to ML pipeline logistic regression models. This feature is not 
> present in mllib and should be documented within {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9906) User guide for LogisticRegressionSummary

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9906:
---

Assignee: Apache Spark  (was: Manoj Kumar)

> User guide for LogisticRegressionSummary
> 
>
> Key: SPARK-9906
> URL: https://issues.apache.org/jira/browse/SPARK-9906
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Apache Spark
>
> SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model 
> statistics to ML pipeline logistic regression models. This feature is not 
> present in mllib and should be documented within {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9906) User guide for LogisticRegressionSummary

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696891#comment-14696891
 ] 

Apache Spark commented on SPARK-9906:
-

User 'MechCoder' has created a pull request for this issue:
https://github.com/apache/spark/pull/8197

> User guide for LogisticRegressionSummary
> 
>
> Key: SPARK-9906
> URL: https://issues.apache.org/jira/browse/SPARK-9906
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: Manoj Kumar
>
> SPARK-9112 introduces {{LogisticRegressionSummary}} to provide R-like model 
> statistics to ML pipeline logistic regression models. This feature is not 
> present in mllib and should be documented within {{ml-guide}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9971) MaxFunction not working correctly with columns containing Double.NaN

2015-08-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696894#comment-14696894
 ] 

Sean Owen commented on SPARK-9971:
--

I'd always prefer to leave it open at least ~1 week to see if anyone else has 
comments. If not, yes I personally would argue that the current behavior is the 
right one.
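
For anyone who needs a NaN-ignoring maximum today, a minimal workaround sketch 
in Scala (assuming it is acceptable to drop rows containing NaN, e.g. via 
{{DataFrameNaFunctions.drop}}) could look like this:

{code}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.functions.max
import org.apache.spark.sql.types._

val sql = new SQLContext(sc)
val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
val dataFrame = sql.createDataFrame(rdd, StructType(List(
  StructField("col", DoubleType, false)
)))

// na.drop() removes rows containing null or NaN values, so the running
// maximum never compares against NaN. Note it considers every column,
// which is fine for this single-column frame.
dataFrame.na.drop().select(max("col")).first
// expected: [0.0] instead of [NaN]
{code}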

> MaxFunction not working correctly with columns containing Double.NaN
> 
>
> Key: SPARK-9971
> URL: https://issues.apache.org/jira/browse/SPARK-9971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Frank Rosner
>Priority: Minor
>
> h4. Problem Description
> When using the {{max}} function on a {{DoubleType}} column that contains 
> {{Double.NaN}} values, the returned maximum value will be {{Double.NaN}}. 
> This is because it compares all values with the running maximum. However, {{x 
> < Double.NaN}} will always be false for all {{x: Double}}, as will {{x > 
> Double.NaN}}.
> h4. How to Reproduce
> {code}
> import org.apache.spark.sql.{SQLContext, Row}
> import org.apache.spark.sql.types._
> val sql = new SQLContext(sc)
> val rdd = sc.makeRDD(List(Row(Double.NaN), Row(-10d), Row(0d)))
> val dataFrame = sql.createDataFrame(rdd, StructType(List(
>   StructField("col", DoubleType, false)
> )))
> dataFrame.select(max("col")).first
> // returns org.apache.spark.sql.Row = [NaN]
> {code}
> h4. Solution
> The {{max}} and {{min}} functions should ignore NaN values, as they are not 
> numbers. If a column contains only NaN values, then the maximum and minimum 
> are not defined.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9974:
---

Assignee: Apache Spark

> SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the 
> assembly jar
> 
>
> Key: SPARK-9974
> URL: https://issues.apache.org/jira/browse/SPARK-9974
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Blocker
>
> One of the consequences of this issue is that Parquet tables created in Hive 
> are not accessible from Spark SQL built with SBT. The Maven build is OK.
> Git commit: 
> [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd]
> Build with SBT and check the assembly jar for 
> {{parquet.hadoop.api.WriteSupport}}:
> {noformat}
> $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
> clean assembly/assembly
> ...
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
> fgrep "parquet/hadoop/api"
> org/apache/parquet/hadoop/api/
> org/apache/parquet/hadoop/api/DelegatingReadSupport.class
> org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
> org/apache/parquet/hadoop/api/InitContext.class
> org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
> org/apache/parquet/hadoop/api/ReadSupport.class
> org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport.class
> {noformat}
> Only classes from {{org.apache.parquet:parquet-mr:1.7.0}} are included. Note 
> that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the 
> {{org.apache}} namespace.
> Build with Maven and check the assembly jar for 
> {{parquet.hadoop.api.WriteSupport}}:
> {noformat}
> $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
> -DskipTests clean package
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
> fgrep "parquet/hadoop/api"
> org/apache/parquet/hadoop/api/
> org/apache/parquet/hadoop/api/DelegatingReadSupport.class
> org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
> org/apache/parquet/hadoop/api/InitContext.class
> org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
> org/apache/parquet/hadoop/api/ReadSupport.class
> org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport.class
> parquet/hadoop/api/
> parquet/hadoop/api/DelegatingReadSupport.class
> parquet/hadoop/api/DelegatingWriteSupport.class
> parquet/hadoop/api/InitContext.class
> parquet/hadoop/api/ReadSupport$ReadContext.class
> parquet/hadoop/api/ReadSupport.class
> parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> parquet/hadoop/api/WriteSupport$WriteContext.class
> parquet/hadoop/api/WriteSupport.class
> {noformat}
> Expected classes are packaged properly.
> To reproduce the Parquet table access issue, first create a Parquet table 
> with Hive (say 0.13.1):
> {noformat}
> hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1;
> {noformat}
> Build Spark assembly jar with the SBT command above, start {{spark-shell}}:
> {noformat}
> scala> sqlContext.table("parquet_test").show()
> 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=parquet_test
> 15/08/14 17:52:50 INFO audit: ugi=lian  ip=unknown-ip-addr  cmd=get_table 
> : db=default tbl=parquet_test
> java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientW

[jira] [Commented] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696919#comment-14696919
 ] 

Apache Spark commented on SPARK-9974:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/8198

> SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the 
> assembly jar
> 
>
> Key: SPARK-9974
> URL: https://issues.apache.org/jira/browse/SPARK-9974
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> One of the consequences of this issue is that Parquet tables created in Hive 
> are not accessible from Spark SQL built with SBT. The Maven build is OK.
> Git commit: 
> [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd]
> Build with SBT and check the assembly jar for 
> {{parquet.hadoop.api.WriteSupport}}:
> {noformat}
> $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
> clean assembly/assembly
> ...
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
> fgrep "parquet/hadoop/api"
> org/apache/parquet/hadoop/api/
> org/apache/parquet/hadoop/api/DelegatingReadSupport.class
> org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
> org/apache/parquet/hadoop/api/InitContext.class
> org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
> org/apache/parquet/hadoop/api/ReadSupport.class
> org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport.class
> {noformat}
> Only classes from {{org.apache.parquet:parquet-mr:1.7.0}} are included. Note 
> that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the 
> {{org.apache}} namespace.
> Build with Maven and check the assembly jar for 
> {{parquet.hadoop.api.WriteSupport}}:
> {noformat}
> $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
> -DskipTests clean package
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
> fgrep "parquet/hadoop/api"
> org/apache/parquet/hadoop/api/
> org/apache/parquet/hadoop/api/DelegatingReadSupport.class
> org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
> org/apache/parquet/hadoop/api/InitContext.class
> org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
> org/apache/parquet/hadoop/api/ReadSupport.class
> org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport.class
> parquet/hadoop/api/
> parquet/hadoop/api/DelegatingReadSupport.class
> parquet/hadoop/api/DelegatingWriteSupport.class
> parquet/hadoop/api/InitContext.class
> parquet/hadoop/api/ReadSupport$ReadContext.class
> parquet/hadoop/api/ReadSupport.class
> parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> parquet/hadoop/api/WriteSupport$WriteContext.class
> parquet/hadoop/api/WriteSupport.class
> {noformat}
> Expected classes are packaged properly.
> To reproduce the Parquet table access issue, first create a Parquet table 
> with Hive (say 0.13.1):
> {noformat}
> hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1;
> {noformat}
> Build Spark assembly jar with the SBT command above, start {{spark-shell}}:
> {noformat}
> scala> sqlContext.table("parquet_test").show()
> 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=parquet_test
> 15/08/14 17:52:50 INFO audit: ugi=lian  ip=unknown-ip-addr  cmd=get_table 
> : db=default tbl=parquet_test
> java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala

[jira] [Assigned] (SPARK-9974) SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the assembly jar

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9974:
---

Assignee: (was: Apache Spark)

> SBT build: com.twitter:parquet-hadoop-bundle:1.6.0 is not packaged into the 
> assembly jar
> 
>
> Key: SPARK-9974
> URL: https://issues.apache.org/jira/browse/SPARK-9974
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 1.5.0
>Reporter: Cheng Lian
>Priority: Blocker
>
> One of the consequences of this issue is that Parquet tables created in Hive 
> are not accessible from Spark SQL built with SBT. The Maven build is OK.
> Git commit: 
> [69930310115501f0de094fe6f5c6c60dade342bd|https://github.com/apache/spark/commit/69930310115501f0de094fe6f5c6c60dade342bd]
> Build with SBT and check the assembly jar for 
> {{parquet.hadoop.api.WriteSupport}}:
> {noformat}
> $ ./build/sbt -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
> clean assembly/assembly
> ...
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
> fgrep "parquet/hadoop/api"
> org/apache/parquet/hadoop/api/
> org/apache/parquet/hadoop/api/DelegatingReadSupport.class
> org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
> org/apache/parquet/hadoop/api/InitContext.class
> org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
> org/apache/parquet/hadoop/api/ReadSupport.class
> org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport.class
> {noformat}
> Only classes from {{org.apache.parquet:parquet-mr:1.7.0}} are included. Note 
> that classes in {{com.twitter:parquet-hadoop-bundle:1.6.0}} are not under the 
> {{org.apache}} namespace.
> Build with Maven and check the assembly jar for 
> {{parquet.hadoop.api.WriteSupport}}:
> {noformat}
> $ ./build/mvn -Phive -Phive-thriftserver -Phadoop-1 -Dhadoop.version=1.2.1 
> -DskipTests clean package
> $ jar tf 
> assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar | 
> fgrep "parquet/hadoop/api"
> org/apache/parquet/hadoop/api/
> org/apache/parquet/hadoop/api/DelegatingReadSupport.class
> org/apache/parquet/hadoop/api/DelegatingWriteSupport.class
> org/apache/parquet/hadoop/api/InitContext.class
> org/apache/parquet/hadoop/api/ReadSupport$ReadContext.class
> org/apache/parquet/hadoop/api/ReadSupport.class
> org/apache/parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport$WriteContext.class
> org/apache/parquet/hadoop/api/WriteSupport.class
> parquet/hadoop/api/
> parquet/hadoop/api/DelegatingReadSupport.class
> parquet/hadoop/api/DelegatingWriteSupport.class
> parquet/hadoop/api/InitContext.class
> parquet/hadoop/api/ReadSupport$ReadContext.class
> parquet/hadoop/api/ReadSupport.class
> parquet/hadoop/api/WriteSupport$FinalizedWriteContext.class
> parquet/hadoop/api/WriteSupport$WriteContext.class
> parquet/hadoop/api/WriteSupport.class
> {noformat}
> Expected classes are packaged properly.
> To reproduce the Parquet table access issue, first create a Parquet table 
> with Hive (say 0.13.1):
> {noformat}
> hive> CREATE TABLE parquet_test STORED AS PARQUET AS SELECT 1;
> {noformat}
> Build Spark assembly jar with the SBT command above, start {{spark-shell}}:
> {noformat}
> scala> sqlContext.table("parquet_test").show()
> 15/08/14 17:52:50 INFO HiveMetaStore: 0: get_table : db=default 
> tbl=parquet_test
> 15/08/14 17:52:50 INFO audit: ugi=lian  ip=unknown-ip-addr  cmd=get_table 
> : db=default tbl=parquet_test
> java.lang.NoClassDefFoundError: parquet/hadoop/api/WriteSupport
> at java.lang.Class.forName0(Native Method)
> at java.lang.Class.forName(Class.java:270)
> at 
> org.apache.hadoop.hive.ql.metadata.Table.getOutputFormatClass(Table.java:328)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:320)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1$$anonfun$2.apply(ClientWrapper.scala:303)
> at scala.Option.map(Option.scala:145)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:303)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$getTableOption$1.apply(ClientWrapper.scala:298)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper$$anonfun$withHiveState$1.apply(ClientWrapper.scala:256)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.retryLocked(ClientWrapper.scala:211)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.withHiveState(ClientWrapper.scala:248)
>  

[jira] [Updated] (SPARK-6624) Convert filters into CNF for data sources

2015-08-14 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-6624:
--
Assignee: Yijie Shen

> Convert filters into CNF for data sources
> -
>
> Key: SPARK-6624
> URL: https://issues.apache.org/jira/browse/SPARK-6624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>
> We should turn filters into conjunctive normal form (CNF) before we pass them 
> to data sources. Otherwise, filters are not very useful if there is a single 
> filter with a bunch of ORs.
> Note that we already try to do some of this in BooleanSimplification, but I 
> think we should just formalize it to use CNF.
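
As an illustrative sketch only (a toy boolean AST, not Catalyst's actual 
expression classes), CNF conversion boils down to pushing negations to the 
leaves and then distributing OR over AND:

{code}
// Toy expression tree; class names are stand-ins, not Catalyst's.
sealed trait Expr
case class Leaf(name: String) extends Expr
case class Not(e: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

// Step 1: negation normal form (push NOT down to the leaves).
def nnf(e: Expr): Expr = e match {
  case Not(Not(x))    => nnf(x)
  case Not(And(l, r)) => Or(nnf(Not(l)), nnf(Not(r)))
  case Not(Or(l, r))  => And(nnf(Not(l)), nnf(Not(r)))
  case And(l, r)      => And(nnf(l), nnf(r))
  case Or(l, r)       => Or(nnf(l), nnf(r))
  case other          => other
}

// Step 2: distribute OR over AND so the result is an AND of ORs.
def distribute(e: Expr): Expr = e match {
  case Or(And(a, b), c) => And(distribute(Or(a, c)), distribute(Or(b, c)))
  case Or(a, And(b, c)) => And(distribute(Or(a, b)), distribute(Or(a, c)))
  case And(l, r)        => And(distribute(l), distribute(r))
  case Or(l, r) =>
    val (dl, dr) = (distribute(l), distribute(r))
    if (dl != l || dr != r) distribute(Or(dl, dr)) else Or(dl, dr)
  case other => other
}

def toCNF(e: Expr): Expr = distribute(nnf(e))

// (a AND b) OR c  ==>  (a OR c) AND (b OR c)
toCNF(Or(And(Leaf("a"), Leaf("b")), Leaf("c")))
{code}

A real rule would of course operate on Catalyst's {{And}}/{{Or}}/{{Not}} 
expressions and would need a size cap, since CNF can blow up exponentially.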



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696928#comment-14696928
 ] 

Apache Spark commented on SPARK-9966:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8199

> Handle a couple of corner cases in the PID rate estimator
> -
>
> Key: SPARK-9966
> URL: https://issues.apache.org/jira/browse/SPARK-9966
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> 1. The rate estimator should not estimate any rate when there are no records 
> in the batch, as there is no data to estimate the rate. In the current state, 
> it estimates the rate and sets it to zero. That is incorrect.
> 2. The rate estimator should never set the rate to zero under any 
> circumstances. Otherwise the system will stop receiving data, and stop 
> generating useful estimates (see reason 1). So the fix is to define a 
> parameter that sets a lower bound on the estimated rate, so that the system 
> always receives some data.
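
A minimal sketch of the two guards described above (the function and the 
{{minRate}} parameter are stand-in names, not the actual PIDRateEstimator code):

{code}
// Sketch only: skip estimation for empty batches and clamp to a lower bound.
def boundedEstimate(numElements: Long,
                    rawEstimate: Double,
                    minRate: Double): Option[Double] = {
  if (numElements <= 0) {
    None                                  // guard 1: no records, no new estimate
  } else {
    Some(math.max(rawEstimate, minRate))  // guard 2: the rate never drops to zero
  }
}

// boundedEstimate(0, 0.0, 100.0)   == None
// boundedEstimate(500, 0.0, 100.0) == Some(100.0)
{code}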



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9966:
---

Assignee: Tathagata Das  (was: Apache Spark)

> Handle a couple of corner cases in the PID rate estimator
> -
>
> Key: SPARK-9966
> URL: https://issues.apache.org/jira/browse/SPARK-9966
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> 1. The rate estimator should not estimate any rate when there are no records 
> in the batch, as there is no data to estimate the rate. In the current state, 
> it estimates the rate and sets it to zero. That is incorrect.
> 2. The rate estimator should never set the rate to zero under any 
> circumstances. Otherwise the system will stop receiving data, and stop 
> generating useful estimates (see reason 1). So the fix is to define a 
> parameter that sets a lower bound on the estimated rate, so that the system 
> always receives some data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9966) Handle a couple of corner cases in the PID rate estimator

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9966:
---

Assignee: Apache Spark  (was: Tathagata Das)

> Handle a couple of corner cases in the PID rate estimator
> -
>
> Key: SPARK-9966
> URL: https://issues.apache.org/jira/browse/SPARK-9966
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Blocker
>
> 1. The rate estimator should not estimate any rate when there are no records 
> in the batch, as there is no data to estimate the rate. In the current state, 
> it estimates the rate and sets it to zero. That is incorrect.
> 2. The rate estimator should never set the rate to zero under any 
> circumstances. Otherwise the system will stop receiving data, and stop 
> generating useful estimates (see reason 1). So the fix is to define a 
> parameter that sets a lower bound on the estimated rate, so that the system 
> always receives some data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696936#comment-14696936
 ] 

Apache Spark commented on SPARK-6624:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8200

> Convert filters into CNF for data sources
> -
>
> Key: SPARK-6624
> URL: https://issues.apache.org/jira/browse/SPARK-6624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>
> We should turn filters into conjunctive normal form (CNF) before we pass them 
> to data sources. Otherwise, filters are not very useful if there is a single 
> filter with a bunch of ORs.
> Note that we already try to do some of this in BooleanSimplification, but I 
> think we should just formalize it to use CNF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8887) Explicitly define which data types can be used as dynamic partition columns

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696937#comment-14696937
 ] 

Apache Spark commented on SPARK-8887:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8201

> Explicitly define which data types can be used as dynamic partition columns
> ---
>
> Key: SPARK-8887
> URL: https://issues.apache.org/jira/browse/SPARK-8887
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> {{InsertIntoHadoopFsRelation}} implements Hive-compatible dynamic 
> partitioning insertion, which uses {{String.valueOf}} to encode 
> partition column values into dynamic partition directories. This actually 
> limits the data types that can be used as partition columns. For example, the 
> string representation of {{StructType}} values is not well defined. However, 
> this limitation is not explicitly enforced.
> There are several things we can improve:
> # Enforce dynamic partition column data type requirements by adding analysis 
> rules that throw {{AnalysisException}} when a violation occurs.
> # Abstract away the string representation of various data types, so that we 
> don't need to convert internal representation types (e.g. {{UTF8String}}) to 
> external types (e.g. {{String}}). A set of Hive-compatible implementations 
> should be provided to ensure compatibility with Hive.
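
A rough sketch of what an explicit whitelist for dynamic partition column 
types might look like (the helper name and the exact set of allowed types are 
assumptions, not the rule that ends up in Spark):

{code}
import org.apache.spark.sql.types._

// Sketch: only atomic types with a well-defined string form are accepted as
// dynamic partition columns; an analysis rule would reject anything else
// with an AnalysisException.
def isSupportedPartitionType(dt: DataType): Boolean = dt match {
  case StringType | BooleanType | DateType | TimestampType => true
  case ByteType | ShortType | IntegerType | LongType       => true
  case FloatType | DoubleType | _: DecimalType             => true
  case _ => false  // e.g. StructType, ArrayType, MapType, BinaryType
}

// isSupportedPartitionType(StringType)      == true
// isSupportedPartitionType(StructType(Nil)) == false
{code}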



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696938#comment-14696938
 ] 

Apache Spark commented on SPARK-8670:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8202

> Nested columns can't be referenced (but they can be selected)
> -
>
> Key: SPARK-8670
> URL: https://issues.apache.org/jira/browse/SPARK-8670
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Nicholas Chammas
>Assignee: Wenchen Fan
>Priority: Blocker
>
> This is strange and looks like a regression from 1.3.
> {code}
> import json
> daterz = [
>   {
> 'name': 'Nick',
> 'stats': {
>   'age': 28
> }
>   },
>   {
> 'name': 'George',
> 'stats': {
>   'age': 31
> }
>   }
> ]
> df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
> df.select('stats.age').show()
> df['stats.age']  # 1.4 fails on this line
> {code}
> On 1.3 this works and yields:
> {code}
> age
> 28 
> 31 
> Out[1]: Column
> {code}
> On 1.4, however, this gives an error on the last line:
> {code}
> +---+
> |age|
> +---+
> | 28|
> | 31|
> +---+
> ---
> IndexErrorTraceback (most recent call last)
>  in ()
>  19 
>  20 df.select('stats.age').show()
> ---> 21 df['stats.age']
> /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
> 678 if isinstance(item, basestring):
> 679 if item not in self.columns:
> --> 680 raise IndexError("no such column: %s" % item)
> 681 jc = self._jdf.apply(item)
> 682 return Column(jc)
> IndexError: no such column: stats.age
> {code}
> This means, among other things, that you can't join DataFrames on nested 
> columns.
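
The PySpark {{__getitem__}} path is what fails here; on the Scala side the 
nested field can still be referenced through the {{Column}} API. A small 
sketch, assuming a DataFrame {{df}} with the same {{stats.age}} layout as above:

{code}
import org.apache.spark.sql.functions.col

// Either form yields a Column for the nested field and can be used in
// select(), filter(), or as a join key.
val ageByName  = col("stats.age")
val ageByField = df("stats").getField("age")

df.select(ageByName).show()
df.filter(ageByField > 30).show()
{code}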



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9970) SQLContext.createDataFrame failed to properly determine column names

2015-08-14 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-9970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-9970:
--
Priority: Major  (was: Minor)

> SQLContext.createDataFrame failed to properly determine column names
> 
>
> Key: SPARK-9970
> URL: https://issues.apache.org/jira/browse/SPARK-9970
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.0
>Reporter: Maciej Bryński
>
> Hi,
> I'm trying to do a "nested join" of tables.
> After the first join everything is OK, but the second join loses some of the 
> column names.
> My code is the following:
> {code:java}
> def joinTable(tableLeft, tableRight, columnLeft, columnRight, columnNested, 
> joinType = "left_outer"):
> tmpTable = sqlCtx.createDataFrame(tableRight.rdd.groupBy(lambda r: 
> r.asDict()[columnRight]))
> tmpTable = tmpTable.select(tmpTable._1.alias("joinColumn"), 
> tmpTable._2.data.alias(columnNested))
> return tableLeft.join(tmpTable, tableLeft[columnLeft] == 
> tmpTable["joinColumn"], joinType).drop("joinColumn")
> user = sqlContext.read.json(path + "user.json")
> user.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- name: string (nullable = true)
> order =  sqlContext.read.json(path + "order.json")
> order.printSchema();
> root
>  |-- id: long (nullable = true)
>  |-- price: double (nullable = true)
>  |-- userid: long (nullable = true)
> lines =  sqlContext.read.json(path + "lines.json")
> lines.printSchema();
> root
>  |-- id: long (nullable = true)
>  |-- orderid: long (nullable = true)
>  |-- product: string (nullable = true)
> {code}
> And joining code:
> Please look at the result of the 2nd join: there are columns _1, _2, _3 where 
> there should be 'id', 'orderid', 'product'.
> {code:java}
> orders = joinTable(order, lines, "id", "orderid", "lines")
> orders.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- price: double (nullable = true)
>  |-- userid: long (nullable = true)
>  |-- lines: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- id: long (nullable = true)
>  |||-- orderid: long (nullable = true)
>  |||-- product: string (nullable = true)
> clients = joinTable(user, orders, "id", "userid", "orders")
> clients.printSchema()
> root
>  |-- id: long (nullable = true)
>  |-- name: string (nullable = true)
>  |-- orders: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- id: long (nullable = true)
>  |||-- price: double (nullable = true)
>  |||-- userid: long (nullable = true)
>  |||-- lines: array (nullable = true)
>  ||||-- element: struct (containsNull = true)
>  |||||-- _1: long (nullable = true)
>  |||||-- _2: long (nullable = true)
>  |||||-- _3: string (nullable = true)
> {code}
> I tried to check whether groupBy is the cause of the problem, but its output 
> looks right:
> {code:java}
> grouped =  orders.rdd.groupBy(lambda r: r.userid)
> print grouped.map(lambda x: list(x[1])).collect()
> [[Row(id=1, price=202.3, userid=1, lines=[Row(id=1, orderid=1, 
> product=u'XXX'), Row(id=2, orderid=1, product=u'YYY')]), Row(id=2, 
> price=343.99, userid=1, lines=[Row(id=3, orderid=2, product=u'XXX')])], 
> [Row(id=3, price=399.99, userid=2, lines=[Row(id=4, orderid=3, 
> product=u'XXX')])]]
> {code}
> So I assume that the problem is with createDataFrame.
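
One possible workaround, sketched in Scala with made-up sample data (the 
{{grouped}} RDD mirrors the groupBy in the report): pass an explicit schema to 
{{createDataFrame}} so the nested element names are not lost.

{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val lines = sqlContext.createDataFrame(Seq(
  (1L, 1L, "XXX"), (2L, 1L, "YYY"), (3L, 2L, "XXX")
)).toDF("id", "orderid", "product")

// Explicit schema for the nested elements, so the struct fields keep their
// names instead of degrading to _1, _2, _3.
val lineType = StructType(Seq(
  StructField("id", LongType),
  StructField("orderid", LongType),
  StructField("product", StringType)))

val groupedSchema = StructType(Seq(
  StructField("joinColumn", LongType),
  StructField("lines", ArrayType(lineType))))

val grouped = lines.rdd.groupBy(_.getLong(1))  // key = orderid
val rows = grouped.map { case (key, lineRows) => Row(key, lineRows.toSeq) }
sqlContext.createDataFrame(rows, groupedSchema).printSchema()
{code}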



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9976) create function do not work

2015-08-14 Thread cen yuhai (JIRA)
cen yuhai created SPARK-9976:


 Summary: create function do not work
 Key: SPARK-9976
 URL: https://issues.apache.org/jira/browse/SPARK-9976
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1, 1.4.0, 1.5.0
 Environment: spark 1.4.1 yarn 2.2.0
Reporter: cen yuhai
 Fix For: 1.4.2, 1.5.0


I use beeline to connect to the Thrift server, but ADD JAR does not work, so I 
use CREATE FUNCTION instead; see the link below.
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html

I do as below:
create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 
'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; 

It returns OK, and when I connect to the metastore I see records in the FUNCS table.

select gdecodeorder(t1)  from tableX  limit 1;
It returns the error 'Couldn't find function default.gdecodeorder'.



15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException 
as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: 
java.lang.RuntimeException: Couldn't find function default.gdecodeorder
15/08/14 15:04:47 ERROR RetryingHMSHandler: 
MetaException(message:NoSuchObjectException(message:Function 
default.t_gdecodeorder does not exist))
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740)
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
at com.sun.proxy.$Proxy21.get_function(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721)
at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy22.getFunction(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
at 
org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
at 
org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
at 
org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
at 
org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.Arra

[jira] [Commented] (SPARK-9955) TPCDS Q8 failed in 1.5

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696975#comment-14696975
 ] 

Apache Spark commented on SPARK-9955:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/8203

> TPCDS Q8 failed in 1.5 
> ---
>
> Key: SPARK-9955
> URL: https://issues.apache.org/jira/browse/SPARK-9955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>Priority: Critical
>
> {code}
> -- start query 1 in stream 0 using template query8.tpl
> select
>   s_store_name,
>   sum(ss_net_profit)
> from
>   store_sales
>   join store on (store_sales.ss_store_sk = store.s_store_sk)
>   -- join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
>   join
>   (select
> a.ca_zip
>   from
> (select
>   substr(ca_zip, 1, 5) ca_zip,
>   count( *) cnt
> from
>   customer_address
>   join customer on (customer_address.ca_address_sk = 
> customer.c_current_addr_sk)
> where
>   c_preferred_cust_flag = 'Y'
> group by
>   ca_zip
> having
>   count( *) > 10
> ) a
> left semi join
> (select
>   substr(ca_zip, 1, 5) ca_zip
> from
>   customer_address
> where
>   substr(ca_zip, 1, 5) in ('89436', '30868', '65085', '22977', '83927', 
> '77557', '58429', '40697', '80614', '10502', '32779',
>   '91137', '61265', '98294', '17921', '18427', '21203', '59362', '87291', 
> '84093', '21505', '17184', '10866', '67898', '25797',
>   '28055', '18377', '80332', '74535', '21757', '29742', '90885', '29898', 
> '17819', '40811', '25990', '47513', '89531', '91068',
>   '10391', '18846', '99223', '82637', '41368', '83658', '86199', '81625', 
> '26696', '89338', '88425', '32200', '81427', '19053',
>   '77471', '36610', '99823', '43276', '41249', '48584', '83550', '82276', 
> '18842', '78890', '14090', '38123', '40936', '34425',
>   '19850', '43286', '80072', '79188', '54191', '11395', '50497', '84861', 
> '90733', '21068', '57666', '37119', '25004', '57835',
>   '70067', '62878', '95806', '19303', '18840', '19124', '29785', '16737', 
> '16022', '49613', '89977', '68310', '60069', '98360',
>   '48649', '39050', '41793', '25002', '27413', '39736', '47208', '16515', 
> '94808', '57648', '15009', '80015', '42961', '63982',
>   '21744', '71853', '81087', '67468', '34175', '64008', '20261', '11201', 
> '51799', '48043', '45645', '61163', '48375', '36447',
>   '57042', '21218', '41100', '89951', '22745', '35851', '83326', '61125', 
> '78298', '80752', '49858', '52940', '96976', '63792',
>   '11376', '53582', '18717', '90226', '50530', '94203', '99447', '27670', 
> '96577', '57856', '56372', '16165', '23427', '54561',
>   '28806', '44439', '22926', '30123', '61451', '92397', '56979', '92309', 
> '70873', '13355', '21801', '46346', '37562', '56458',
>   '28286', '47306', '99555', '69399', '26234', '47546', '49661', '88601', 
> '35943', '39936', '25632', '24611', '44166', '56648',
>   '30379', '59785', '0', '14329', '93815', '52226', '71381', '13842', 
> '25612', '63294', '14664', '21077', '82626', '18799',
>   '60915', '81020', '56447', '76619', '11433', '13414', '42548', '92713', 
> '70467', '30884', '47484', '16072', '38936', '13036',
>   '88376', '45539', '35901', '19506', '65690', '73957', '71850', '49231', 
> '14276', '20005', '18384', '76615', '11635', '38177',
>   '55607', '41369', '95447', '58581', '58149', '91946', '33790', '76232', 
> '75692', '95464', '22246', '51061', '56692', '53121',
>   '77209', '15482', '10688', '14868', '45907', '73520', '72666', '25734', 
> '17959', '24677', '66446', '94627', '53535', '15560',
>   '41967', '69297', '11929', '59403', '33283', '52232', '57350', '43933', 
> '40921', '36635', '10827', '71286', '19736', '80619',
>   '25251', '95042', '15526', '36496', '55854', '49124', '81980', '35375', 
> '49157', '63512', '28944', '14946', '36503', '54010',
>   '18767', '23969', '43905', '66979', '33113', '21286', '58471', '59080', 
> '13395', '79144', '70373', '67031', '38360', '26705',
>   '50906', '52406', '26066', '73146', '15884', '31897', '30045', '61068', 
> '45550', '92454', '13376', '14354', '19770', '22928',
>   '97790', '50723', '46081', '30202', '14410', '20223', '88500', '67298', 
> '13261', '14172', '81410', '93578', '83583', '46047',
>   '94167', '82564', '21156', '15799', '86709', '37931', '74703', '83103', 
> '23054', '70470', '72008', '49247', '91911', '69998',
>   '20961', '70070', '63197', '54853', '88191', '91830', '49521', '19454', 
> '81450', '89091', '62378', '25683', '61869', '51744',
>   '36580', '85778', '36871', '48121', '28810', '83712', '45486', '67393', 
> '26935', '42393', '20132', '

[jira] [Assigned] (SPARK-9955) TPCDS Q8 failed in 1.5

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9955:
---

Assignee: Apache Spark  (was: Wenchen Fan)

> TPCDS Q8 failed in 1.5 
> ---
>
> Key: SPARK-9955
> URL: https://issues.apache.org/jira/browse/SPARK-9955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Critical
>
> {code}
> -- start query 1 in stream 0 using template query8.tpl
> select
>   s_store_name,
>   sum(ss_net_profit)
> from
>   store_sales
>   join store on (store_sales.ss_store_sk = store.s_store_sk)
>   -- join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
>   join
>   (select
> a.ca_zip
>   from
> (select
>   substr(ca_zip, 1, 5) ca_zip,
>   count( *) cnt
> from
>   customer_address
>   join customer on (customer_address.ca_address_sk = 
> customer.c_current_addr_sk)
> where
>   c_preferred_cust_flag = 'Y'
> group by
>   ca_zip
> having
>   count( *) > 10
> ) a
> left semi join
> (select
>   substr(ca_zip, 1, 5) ca_zip
> from
>   customer_address
> where
>   substr(ca_zip, 1, 5) in ('89436', '30868', '65085', '22977', '83927', 
> '77557', '58429', '40697', '80614', '10502', '32779',
>   '91137', '61265', '98294', '17921', '18427', '21203', '59362', '87291', 
> '84093', '21505', '17184', '10866', '67898', '25797',
>   '28055', '18377', '80332', '74535', '21757', '29742', '90885', '29898', 
> '17819', '40811', '25990', '47513', '89531', '91068',
>   '10391', '18846', '99223', '82637', '41368', '83658', '86199', '81625', 
> '26696', '89338', '88425', '32200', '81427', '19053',
>   '77471', '36610', '99823', '43276', '41249', '48584', '83550', '82276', 
> '18842', '78890', '14090', '38123', '40936', '34425',
>   '19850', '43286', '80072', '79188', '54191', '11395', '50497', '84861', 
> '90733', '21068', '57666', '37119', '25004', '57835',
>   '70067', '62878', '95806', '19303', '18840', '19124', '29785', '16737', 
> '16022', '49613', '89977', '68310', '60069', '98360',
>   '48649', '39050', '41793', '25002', '27413', '39736', '47208', '16515', 
> '94808', '57648', '15009', '80015', '42961', '63982',
>   '21744', '71853', '81087', '67468', '34175', '64008', '20261', '11201', 
> '51799', '48043', '45645', '61163', '48375', '36447',
>   '57042', '21218', '41100', '89951', '22745', '35851', '83326', '61125', 
> '78298', '80752', '49858', '52940', '96976', '63792',
>   '11376', '53582', '18717', '90226', '50530', '94203', '99447', '27670', 
> '96577', '57856', '56372', '16165', '23427', '54561',
>   '28806', '44439', '22926', '30123', '61451', '92397', '56979', '92309', 
> '70873', '13355', '21801', '46346', '37562', '56458',
>   '28286', '47306', '99555', '69399', '26234', '47546', '49661', '88601', 
> '35943', '39936', '25632', '24611', '44166', '56648',
>   '30379', '59785', '0', '14329', '93815', '52226', '71381', '13842', 
> '25612', '63294', '14664', '21077', '82626', '18799',
>   '60915', '81020', '56447', '76619', '11433', '13414', '42548', '92713', 
> '70467', '30884', '47484', '16072', '38936', '13036',
>   '88376', '45539', '35901', '19506', '65690', '73957', '71850', '49231', 
> '14276', '20005', '18384', '76615', '11635', '38177',
>   '55607', '41369', '95447', '58581', '58149', '91946', '33790', '76232', 
> '75692', '95464', '22246', '51061', '56692', '53121',
>   '77209', '15482', '10688', '14868', '45907', '73520', '72666', '25734', 
> '17959', '24677', '66446', '94627', '53535', '15560',
>   '41967', '69297', '11929', '59403', '33283', '52232', '57350', '43933', 
> '40921', '36635', '10827', '71286', '19736', '80619',
>   '25251', '95042', '15526', '36496', '55854', '49124', '81980', '35375', 
> '49157', '63512', '28944', '14946', '36503', '54010',
>   '18767', '23969', '43905', '66979', '33113', '21286', '58471', '59080', 
> '13395', '79144', '70373', '67031', '38360', '26705',
>   '50906', '52406', '26066', '73146', '15884', '31897', '30045', '61068', 
> '45550', '92454', '13376', '14354', '19770', '22928',
>   '97790', '50723', '46081', '30202', '14410', '20223', '88500', '67298', 
> '13261', '14172', '81410', '93578', '83583', '46047',
>   '94167', '82564', '21156', '15799', '86709', '37931', '74703', '83103', 
> '23054', '70470', '72008', '49247', '91911', '69998',
>   '20961', '70070', '63197', '54853', '88191', '91830', '49521', '19454', 
> '81450', '89091', '62378', '25683', '61869', '51744',
>   '36580', '85778', '36871', '48121', '28810', '83712', '45486', '67393', 
> '26935', '42393', '20132', '55349', '86057', '21309',
>   '80218', '10094', '11357', '48819', '39734', '40758', '30432', '21204',

[jira] [Assigned] (SPARK-9955) TPCDS Q8 failed in 1.5

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9955:
---

Assignee: Wenchen Fan  (was: Apache Spark)

> TPCDS Q8 failed in 1.5 
> ---
>
> Key: SPARK-9955
> URL: https://issues.apache.org/jira/browse/SPARK-9955
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Wenchen Fan
>Priority: Critical
>
> {code}
> -- start query 1 in stream 0 using template query8.tpl
> select
>   s_store_name,
>   sum(ss_net_profit)
> from
>   store_sales
>   join store on (store_sales.ss_store_sk = store.s_store_sk)
>   -- join date_dim on (store_sales.ss_sold_date_sk = date_dim.d_date_sk)
>   join
>   (select
> a.ca_zip
>   from
> (select
>   substr(ca_zip, 1, 5) ca_zip,
>   count( *) cnt
> from
>   customer_address
>   join customer on (customer_address.ca_address_sk = 
> customer.c_current_addr_sk)
> where
>   c_preferred_cust_flag = 'Y'
> group by
>   ca_zip
> having
>   count( *) > 10
> ) a
> left semi join
> (select
>   substr(ca_zip, 1, 5) ca_zip
> from
>   customer_address
> where
>   substr(ca_zip, 1, 5) in ('89436', '30868', '65085', '22977', '83927', 
> '77557', '58429', '40697', '80614', '10502', '32779',
>   '91137', '61265', '98294', '17921', '18427', '21203', '59362', '87291', 
> '84093', '21505', '17184', '10866', '67898', '25797',
>   '28055', '18377', '80332', '74535', '21757', '29742', '90885', '29898', 
> '17819', '40811', '25990', '47513', '89531', '91068',
>   '10391', '18846', '99223', '82637', '41368', '83658', '86199', '81625', 
> '26696', '89338', '88425', '32200', '81427', '19053',
>   '77471', '36610', '99823', '43276', '41249', '48584', '83550', '82276', 
> '18842', '78890', '14090', '38123', '40936', '34425',
>   '19850', '43286', '80072', '79188', '54191', '11395', '50497', '84861', 
> '90733', '21068', '57666', '37119', '25004', '57835',
>   '70067', '62878', '95806', '19303', '18840', '19124', '29785', '16737', 
> '16022', '49613', '89977', '68310', '60069', '98360',
>   '48649', '39050', '41793', '25002', '27413', '39736', '47208', '16515', 
> '94808', '57648', '15009', '80015', '42961', '63982',
>   '21744', '71853', '81087', '67468', '34175', '64008', '20261', '11201', 
> '51799', '48043', '45645', '61163', '48375', '36447',
>   '57042', '21218', '41100', '89951', '22745', '35851', '83326', '61125', 
> '78298', '80752', '49858', '52940', '96976', '63792',
>   '11376', '53582', '18717', '90226', '50530', '94203', '99447', '27670', 
> '96577', '57856', '56372', '16165', '23427', '54561',
>   '28806', '44439', '22926', '30123', '61451', '92397', '56979', '92309', 
> '70873', '13355', '21801', '46346', '37562', '56458',
>   '28286', '47306', '99555', '69399', '26234', '47546', '49661', '88601', 
> '35943', '39936', '25632', '24611', '44166', '56648',
>   '30379', '59785', '0', '14329', '93815', '52226', '71381', '13842', 
> '25612', '63294', '14664', '21077', '82626', '18799',
>   '60915', '81020', '56447', '76619', '11433', '13414', '42548', '92713', 
> '70467', '30884', '47484', '16072', '38936', '13036',
>   '88376', '45539', '35901', '19506', '65690', '73957', '71850', '49231', 
> '14276', '20005', '18384', '76615', '11635', '38177',
>   '55607', '41369', '95447', '58581', '58149', '91946', '33790', '76232', 
> '75692', '95464', '22246', '51061', '56692', '53121',
>   '77209', '15482', '10688', '14868', '45907', '73520', '72666', '25734', 
> '17959', '24677', '66446', '94627', '53535', '15560',
>   '41967', '69297', '11929', '59403', '33283', '52232', '57350', '43933', 
> '40921', '36635', '10827', '71286', '19736', '80619',
>   '25251', '95042', '15526', '36496', '55854', '49124', '81980', '35375', 
> '49157', '63512', '28944', '14946', '36503', '54010',
>   '18767', '23969', '43905', '66979', '33113', '21286', '58471', '59080', 
> '13395', '79144', '70373', '67031', '38360', '26705',
>   '50906', '52406', '26066', '73146', '15884', '31897', '30045', '61068', 
> '45550', '92454', '13376', '14354', '19770', '22928',
>   '97790', '50723', '46081', '30202', '14410', '20223', '88500', '67298', 
> '13261', '14172', '81410', '93578', '83583', '46047',
>   '94167', '82564', '21156', '15799', '86709', '37931', '74703', '83103', 
> '23054', '70470', '72008', '49247', '91911', '69998',
>   '20961', '70070', '63197', '54853', '88191', '91830', '49521', '19454', 
> '81450', '89091', '62378', '25683', '61869', '51744',
>   '36580', '85778', '36871', '48121', '28810', '83712', '45486', '67393', 
> '26935', '42393', '20132', '55349', '86057', '21309',
>   '80218', '10094', '11357', '48819', '39734', '40758', '30432', '21204', 

[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9968:
---

Assignee: Apache Spark  (was: Tathagata Das)

> BlockGenerator lock structure can cause lock starvation of the block updating 
> thread
> 
>
> Key: SPARK-9968
> URL: https://issues.apache.org/jira/browse/SPARK-9968
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>
> When the rate limiter is actually limiting the rate at which data is inserted 
> into the buffer, the synchronized block of BlockGenerator.addData stays 
> blocked for a long time. This causes the thread switching the buffer and 
> generating blocks (synchronized with addData) to starve and not generate 
> blocks for seconds. The correct solution is to not block on the rate limiter 
> within the synchronized block for adding data to the buffer. 
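
A schematic sketch of the fix described above (the class and method names are 
simplified stand-ins, not the actual BlockGenerator code): wait on the rate 
limiter before taking the lock, so the buffer-swapping thread is never starved 
by the wait.

{code}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in; only the locking pattern matters here.
class BufferSketch(waitToPush: () => Unit) {
  private val buffer = new ArrayBuffer[Any]

  def addData(data: Any): Unit = {
    waitToPush()      // rate-limit wait happens OUTSIDE the lock
    synchronized {    // the lock is held only for the cheap append
      buffer += data
    }
  }

  def swapBuffer(): Seq[Any] = synchronized {
    val out = buffer.toList   // copy, then clear, all under the same lock
    buffer.clear()
    out
  }
}
{code}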



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9968:
---

Assignee: Tathagata Das  (was: Apache Spark)

> BlockGenerator lock structure can cause lock starvation of the block updating 
> thread
> 
>
> Key: SPARK-9968
> URL: https://issues.apache.org/jira/browse/SPARK-9968
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When the rate limiter is actually limiting the rate at which data is inserted 
> into the buffer, the synchronized block of BlockGenerator.addData stays 
> blocked for a long time. This causes the thread switching the buffer and 
> generating blocks (synchronized with addData) to starve and not generate 
> blocks for seconds. The correct solution is to not block on the rate limiter 
> within the synchronized block for adding data to the buffer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9968) BlockGenerator lock structure can cause lock starvation of the block updating thread

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696987#comment-14696987
 ] 

Apache Spark commented on SPARK-9968:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/8204

> BlockGenerator lock structure can cause lock starvation of the block updating 
> thread
> 
>
> Key: SPARK-9968
> URL: https://issues.apache.org/jira/browse/SPARK-9968
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>
> When the rate limiter is actually limiting the rate at which data is inserted 
> into the buffer, the synchronized block of BlockGenerator.addData stays 
> blocked for a long time. This causes the thread switching the buffer and 
> generating blocks (synchronized with addData) to starve and not generate 
> blocks for seconds. The correct solution is to not block on the rate limiter 
> within the synchronized block for adding data to the buffer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6624) Convert filters into CNF for data sources

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14696988#comment-14696988
 ] 

Apache Spark commented on SPARK-6624:
-

User 'yjshen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8193

> Convert filters into CNF for data sources
> -
>
> Key: SPARK-6624
> URL: https://issues.apache.org/jira/browse/SPARK-6624
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Yijie Shen
>
> We should turn filters into conjunctive normal form (CNF) before we pass them 
> to data sources. Otherwise, filters are not very useful if there is a single 
> filter with a bunch of ORs.
> Note that we already try to do some of this in BooleanSimplification, but I 
> think we should just formalize it to use CNF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9734) java.lang.IllegalArgumentException: Don't know how to save StructField(sal,DecimalType(7,2),true) to JDBC

2015-08-14 Thread Rama Mullapudi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697003#comment-14697003
 ] 

Rama Mullapudi commented on SPARK-9734:
---

The current 1.5 build still gives an error when creating a table with 
DataFrame.write.jdbc.

The CREATE statement issued by Spark looks like this:
CREATE TABLE foo (TKT_GID DECIMAL(10},0}) NOT NULL)

There are stray closing braces (}) in the DECIMAL type string, which causes the 
database to throw an error.

I looked into the code on GitHub and found that the schemaString function in 
the JdbcUtils class has the extra closing braces, which is causing the issue.


  /**
   * Compute the schema string for this RDD.
   */
  def schemaString(df: DataFrame, url: String): String = {
    ...
        case BooleanType => "BIT(1)"
        case StringType => "TEXT"
        case BinaryType => "BLOB"
        case TimestampType => "TIMESTAMP"
        case DateType => "DATE"
        case t: DecimalType => s"DECIMAL(${t.precision}},${t.scale}})"
        case _ => throw new IllegalArgumentException(s"Don't know how to save $field to JDBC")
      })
  }
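
For reference, the corrected interpolation would close each placeholder with a single brace; a minimal sketch (not the exact upstream patch):

{code}
import org.apache.spark.sql.types.DecimalType

// Each ${...} placeholder already supplies its own closing brace, so no extra
// '}' belongs in the literal text: DecimalType(10, 0) maps to "DECIMAL(10,0)".
def decimalTypeSql(t: DecimalType): String = s"DECIMAL(${t.precision},${t.scale})"
{code}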

> java.lang.IllegalArgumentException: Don't know how to save 
> StructField(sal,DecimalType(7,2),true) to JDBC
> -
>
> Key: SPARK-9734
> URL: https://issues.apache.org/jira/browse/SPARK-9734
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Greg Rahn
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>
> When using a basic example of reading the EMP table from Redshift via 
> spark-redshift, and writing the data back to Redshift, Spark fails with the 
> below error, related to Numeric/Decimal data types.
> Redshift table:
> {code}
> testdb=# \d emp
>   Table "public.emp"
>   Column  | Type  | Modifiers
> --+---+---
>  empno| integer   |
>  ename| character varying(10) |
>  job  | character varying(9)  |
>  mgr  | integer   |
>  hiredate | date  |
>  sal  | numeric(7,2)  |
>  comm | numeric(7,2)  |
>  deptno   | integer   |
> testdb=# select * from emp;
>  empno | ename  |job| mgr  |  hiredate  |   sal   |  comm   | deptno
> ---++---+--++-+-+
>   7369 | SMITH  | CLERK | 7902 | 1980-12-17 |  800.00 |NULL | 20
>   7521 | WARD   | SALESMAN  | 7698 | 1981-02-22 | 1250.00 |  500.00 | 30
>   7654 | MARTIN | SALESMAN  | 7698 | 1981-09-28 | 1250.00 | 1400.00 | 30
>   7782 | CLARK  | MANAGER   | 7839 | 1981-06-09 | 2450.00 |NULL | 10
>   7839 | KING   | PRESIDENT | NULL | 1981-11-17 | 5000.00 |NULL | 10
>   7876 | ADAMS  | CLERK | 7788 | 1983-01-12 | 1100.00 |NULL | 20
>   7902 | FORD   | ANALYST   | 7566 | 1981-12-03 | 3000.00 |NULL | 20
>   7499 | ALLEN  | SALESMAN  | 7698 | 1981-02-20 | 1600.00 |  300.00 | 30
>   7566 | JONES  | MANAGER   | 7839 | 1981-04-02 | 2975.00 |NULL | 20
>   7698 | BLAKE  | MANAGER   | 7839 | 1981-05-01 | 2850.00 |NULL | 30
>   7788 | SCOTT  | ANALYST   | 7566 | 1982-12-09 | 3000.00 |NULL | 20
>   7844 | TURNER | SALESMAN  | 7698 | 1981-09-08 | 1500.00 |0.00 | 30
>   7900 | JAMES  | CLERK | 7698 | 1981-12-03 |  950.00 |NULL | 30
>   7934 | MILLER | CLERK | 7782 | 1982-01-23 | 1300.00 |NULL | 10
> (14 rows)
> {code}
> Spark Code:
> {code}
> val url = "jdbc:redshift://rshost:5439/testdb?user=xxx&password=xxx"
> val driver = "com.amazon.redshift.jdbc41.Driver"
> val t = 
> sqlContext.read.format("com.databricks.spark.redshift").option("jdbcdriver", 
> driver).option("url", url).option("dbtable", "emp").option("tempdir", 
> "s3n://spark-temp-dir").load()
> t.registerTempTable("SparkTempTable")
> val t1 = sqlContext.sql("select * from SparkTempTable")
> t1.write.format("com.databricks.spark.redshift").option("driver", 
> driver).option("url", url).option("dbtable", "t1").option("tempdir", 
> "s3n://spark-temp-dir").option("avrocompression", 
> "snappy").mode("error").save()
> {code}
> Error Stack:
> {code}
> java.lang.IllegalArgumentException: Don't know how to save 
> StructField(sal,DecimalType(7,2),true) to JDBC
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:149)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1$$anonfun$2.apply(jdbc.scala:136)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.sql.jdbc.package$JDBCWriteDetails$$anonfun$schemaString$1.apply(jdbc.scala:135)
>   at 
> org.apache.spark.sql.jdbc.package$JDBC

[jira] [Created] (SPARK-9977) The usage of a label generated by StringIndexer

2015-08-14 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-9977:
-

 Summary: The usage of a label generated by StringIndexer
 Key: SPARK-9977
 URL: https://issues.apache.org/jira/browse/SPARK-9977
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.4.1
Reporter: Kai Sasaki
Priority: Trivial


By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a 
downstream estimator should use this new column through the pipeline if it wants to 
use the string-indexed label. 
I think it is better to make this explicit in the documentation. 
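
A minimal sketch of what that looks like in a pipeline (column names and the choice of classifier are only illustrative):

{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StringIndexer

// Index the string label into a new column.
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// The downstream estimator must be pointed at the indexed column explicitly.
val lr = new LogisticRegression()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, lr))
// val model = pipeline.fit(training)  // training has "label" and "features" columns
{code}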



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9976) create function do not work

2015-08-14 Thread cen yuhai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

cen yuhai updated SPARK-9976:
-
Description: 
I use beeline to connect to the Thrift server, but ADD JAR does not work, so I use 
CREATE FUNCTION instead; see the link below.
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html

I do as below:
create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 
'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; 

It returns OK, and when I connect to the metastore I can see records in the FUNCS table.

select gdecodeorder(t1) from tableX limit 1;
It returns the error 'Couldn't find function default.gdecodeorder'.


This is the exception:
15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException 
as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: 
java.lang.RuntimeException: Couldn't find function default.gdecodeorder
15/08/14 15:04:47 ERROR RetryingHMSHandler: 
MetaException(message:NoSuchObjectException(message:Function 
default.t_gdecodeorder does not exist))
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613)
at 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740)
at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
at com.sun.proxy.$Proxy21.get_function(Unknown Source)
at 
org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721)
at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
at com.sun.proxy.$Proxy22.getFunction(Unknown Source)
at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645)
at 
org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
at 
org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
at 
org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
at 
org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at 
org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
at scala.Option.getOrElse(Option.scala:120)
at 
org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
at 
org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.Trave

[jira] [Created] (SPARK-9978) Window functions require partitionBy to work as expected

2015-08-14 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-9978:
-

 Summary: Window functions require partitionBy to work as expected
 Key: SPARK-9978
 URL: https://issues.apache.org/jira/browse/SPARK-9978
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
Reporter: Maciej Szymkiewicz


I am trying to reproduce following query:

{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}

using PySpark API. Unfortunately it doesn't work as expected:

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()

++--+
|   x|rn|
++--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
++--+
{code}

As a workaround, it is possible to call partitionBy without additional arguments:

{code}
df.select(
df["x"],
rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}

but as far as I can tell this is not documented and is rather counterintuitive 
considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75))
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9978) Window functions require partitionBy to work as expected

2015-08-14 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-9978:
--
Description: 
I am trying to reproduce following SQL query:

{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}

using PySpark API. Unfortunately it doesn't work as expected:

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()

++--+
|   x|rn|
++--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
++--+
{code}

As a workaround, it is possible to call partitionBy without additional arguments:

{code}
df.select(
df["x"],
rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}

but as far as I can tell this is not documented and is rather counterintuitive 
considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75))
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}


  was:
I am trying to reproduce following query:

{code}
df.registerTempTable("df")
sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM df").show()

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}

using PySpark API. Unfortunately it doesn't work as expected:

{code}
from pyspark.sql.window import Window
from pyspark.sql.functions import rowNumber

df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()

++--+
|   x|rn|
++--+
| 0.5| 1|
|0.25| 1|
|0.75| 1|
++--+
{code}

As a workaround, it is possible to call partitionBy without additional arguments:

{code}
df.select(
df["x"],
rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
).show()

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}

but as far as I can tell this is not documented and is rather counterintuitive 
considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

case class Record(x: Double)
val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: Record(0.75))
df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show

++--+
|   x|rn|
++--+
|0.25| 1|
| 0.5| 2|
|0.75| 3|
++--+
{code}



> Window functions require partitionBy to work as expected
> 
>
> Key: SPARK-9978
> URL: https://issues.apache.org/jira/browse/SPARK-9978
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Maciej Szymkiewicz
>
> I am trying to reproduce following SQL query:
> {code}
> df.registerTempTable("df")
> sqlContext.sql("SELECT x, row_number() OVER (ORDER BY x) as rn FROM 
> df").show()
> ++--+
> |   x|rn|
> ++--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> ++--+
> {code}
> using PySpark API. Unfortunately it doesn't work as expected:
> {code}
> from pyspark.sql.window import Window
> from pyspark.sql.functions import rowNumber
> df = sqlContext.createDataFrame([{"x": 0.25}, {"x": 0.5}, {"x": 0.75}])
> df.select(df["x"], rowNumber().over(Window.orderBy("x")).alias("rn")).show()
> ++--+
> |   x|rn|
> ++--+
> | 0.5| 1|
> |0.25| 1|
> |0.75| 1|
> ++--+
> {code}
> As a workaround, it is possible to call partitionBy without additional 
> arguments:
> {code}
> df.select(
> df["x"],
> rowNumber().over(Window.partitionBy().orderBy("x")).alias("rn")
> ).show()
> ++--+
> |   x|rn|
> ++--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> ++--+
> {code}
> but as far as I can tell this is not documented and is rather counterintuitive 
> considering the SQL syntax. Moreover, this problem doesn't affect the Scala API:
> {code}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.functions.rowNumber
> case class Record(x: Double)
> val df = sqlContext.createDataFrame(Record(0.25) :: Record(0.5) :: 
> Record(0.75))
> df.select($"x", rowNumber().over(Window.orderBy($"x")).alias("rn")).show
> ++--+
> |   x|rn|
> ++--+
> |0.25| 1|
> | 0.5| 2|
> |0.75| 3|
> ++--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697098#comment-14697098
 ] 

Cody Koeninger commented on SPARK-9947:
---

You already have access to offsets and can save them however you want.  You can 
provide those offsets on restart, regardless of whether checkpointing was 
enabled.
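
A minimal sketch of that pattern, assuming the Kafka direct stream API from spark-streaming-kafka (loadOffsets/saveOffsets are hypothetical helpers backed by whatever store you choose):

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object SelfManagedOffsetsSketch {
  // Hypothetical helpers backed by your own store (ZooKeeper, a database, HDFS, ...).
  def loadOffsets(): Map[TopicAndPartition, Long] = Map.empty
  def saveOffsets(offsets: Map[TopicAndPartition, Long]): Unit = ()

  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("offsets-sketch"), Seconds(10))
    val kafkaParams = Map("metadata.broker.list" -> "broker:9092")

    // On (re)start, begin exactly where the previous deployment left off.
    val fromOffsets = loadOffsets()
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets,
      (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message))

    // Each batch exposes the offsets it covers; persist them after processing.
    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      saveOffsets(ranges.map(r => TopicAndPartition(r.topic, r.partition) -> r.untilOffset).toMap)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
{code}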

> Separate Metadata and State Checkpoint Data
> ---
>
> Key: SPARK-9947
> URL: https://issues.apache.org/jira/browse/SPARK-9947
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey and 24/7 operation functionality, you encounter 
> the problem where you might like to maintain state data between restarts but 
> delete the metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting of the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployment as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9977:
---

Assignee: Apache Spark

> The usage of a label generated by StringIndexer
> ---
>
> Key: SPARK-9977
> URL: https://issues.apache.org/jira/browse/SPARK-9977
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.1
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: documentaion
>
> By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a 
> downstream estimator should use this new column through the pipeline if it wants 
> to use the string-indexed label. 
> I think it is better to make this explicit in the documentation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9977) The usage of a label generated by StringIndexer

2015-08-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9977:
---

Assignee: (was: Apache Spark)

> The usage of a label generated by StringIndexer
> ---
>
> Key: SPARK-9977
> URL: https://issues.apache.org/jira/browse/SPARK-9977
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.1
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: documentaion
>
> By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a 
> downstream estimator should use this new column through the pipeline if it wants 
> to use the string-indexed label. 
> I think it is better to make this explicit in the documentation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9977) The usage of a label generated by StringIndexer

2015-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697104#comment-14697104
 ] 

Apache Spark commented on SPARK-9977:
-

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/8205

> The usage of a label generated by StringIndexer
> ---
>
> Key: SPARK-9977
> URL: https://issues.apache.org/jira/browse/SPARK-9977
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.4.1
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: documentaion
>
> By using {{StringIndexer}}, we can obtain an indexed label in a new column, so a 
> downstream estimator should use this new column through the pipeline if it wants 
> to use the string-indexed label. 
> I think it is better to make this explicit in the documentation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9979) Unable to request addition to "powered by spark"

2015-08-14 Thread Jeff Palmucci (JIRA)
Jeff Palmucci created SPARK-9979:


 Summary: Unable to request addition to "powered by spark"
 Key: SPARK-9979
 URL: https://issues.apache.org/jira/browse/SPARK-9979
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Jeff Palmucci
Priority: Minor


The "powered by spark" page asks to submit new listing requests to 
"u...@apache.spark.org". However, when I do that, I get:
{quote}
Hi. This is the qmail-send program at apache.org.
I'm afraid I wasn't able to deliver your message to the following addresses.
This is a permanent error; I've given up. Sorry it didn't work out.
:
Must be sent from an @apache.org address or a subscriber address or an address 
in LDAP.
{quote}

The project I wanted to list is here: 
http://engineering.tripadvisor.com/using-apache-spark-for-massively-parallel-nlp/




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697157#comment-14697157
 ] 

Dan Dutrow commented on SPARK-9947:
---

The desire is to continue using checkpointing for everything but to allow 
selective deletion of certain types of checkpoint data. Using an external 
database to duplicate that functionality is not desired.

> Separate Metadata and State Checkpoint Data
> ---
>
> Key: SPARK-9947
> URL: https://issues.apache.org/jira/browse/SPARK-9947
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey and 24/7 operation functionality, you encounter 
> the problem where you might like to maintain state data between restarts but 
> delete the metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting of the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployment as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697166#comment-14697166
 ] 

Cody Koeninger commented on SPARK-9947:
---

You can't re-use checkpoint data across application upgrades anyway, so
that honestly seems kind of pointless.







> Separate Metadata and State Checkpoint Data
> ---
>
> Key: SPARK-9947
> URL: https://issues.apache.org/jira/browse/SPARK-9947
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey and 24/7 operation functionality, you encounter 
> the problem where you might like to maintain state data between restarts but 
> delete the metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting of the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployment as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Dan Dutrow (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697180#comment-14697180
 ] 

Dan Dutrow commented on SPARK-9947:
---

I want to maintain the data in updateStateByKey between upgrades. This data can 
be recovered between upgrades so long as objects don't change.
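
For what it's worth, a minimal sketch of re-seeding that state on restart via updateStateByKey's initialRDD parameter (paths and the word-count state are only illustrative):

{code}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(new SparkConf().setAppName("state-reseed"), Seconds(10))
ssc.checkpoint("hdfs:///my-app/checkpoint")

// State persisted by the previous deployment (hypothetical path and format).
val initialState = ssc.sparkContext.objectFile[(String, Long)]("hdfs:///my-app/saved-state")

val updateFunc: (Seq[Long], Option[Long]) => Option[Long] =
  (values, state) => Some(values.sum + state.getOrElse(0L))

val counts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map((_, 1L))
  .updateStateByKey(updateFunc, new HashPartitioner(ssc.sparkContext.defaultParallelism), initialState)

// Persist the state after every batch so the next deployment can pick it up,
// independently of the checkpoint directory.
counts.foreachRDD(_.saveAsObjectFile("hdfs:///my-app/saved-state-" + System.currentTimeMillis))

ssc.start()
ssc.awaitTermination()
{code}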

> Separate Metadata and State Checkpoint Data
> ---
>
> Key: SPARK-9947
> URL: https://issues.apache.org/jira/browse/SPARK-9947
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey and 24/7 operation functionality, you encounter 
> the problem where you might like to maintain state data between restarts but 
> delete the metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting of the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployment as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9947) Separate Metadata and State Checkpoint Data

2015-08-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14697193#comment-14697193
 ] 

Cody Koeninger commented on SPARK-9947:
---

Didn't you already say that you were saving updateStateByKey state yourself?




> Separate Metadata and State Checkpoint Data
> ---
>
> Key: SPARK-9947
> URL: https://issues.apache.org/jira/browse/SPARK-9947
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey and 24/7 operation functionality, you encounter 
> the problem where you might like to maintain state data between restarts but 
> delete the metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting of the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployment as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


