[jira] [Commented] (SPARK-12113) Add timing metrics to blocking phases for spark sql

2016-06-15 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333160#comment-15333160
 ] 

Takeshi Yamamuro commented on SPARK-12113:
--

[~rxin] okay, I'll rework based on the #10116 patch.

> Add timing metrics to blocking phases for spark sql
> ---
>
> Key: SPARK-12113
> URL: https://issues.apache.org/jira/browse/SPARK-12113
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nong Li
>
> It's currently not easy to look at the SQL page and get any sense of how long 
> different parts of the plan take. This is in general difficult to do with 
> row-at-a-time pipelining. We can, however, include timing information for the 
> blocking phases. Including these will be useful for getting a sense of what is 
> going on.






[jira] [Updated] (SPARK-15983) Remove FileFormat.prepareRead()

2016-06-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15983:
---
Summary: Remove FileFormat.prepareRead()  (was: Remove 
FileFormat.prepareRead)

> Remove FileFormat.prepareRead()
> ---
>
> Key: SPARK-15983
> URL: https://issues.apache.org/jira/browse/SPARK-15983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Interface method {{FileFormat.prepareRead()}} was added in [PR 
> #12088|https://github.com/apache/spark/pull/12088] to handle a special case 
> in the LibSVM data source.
> However, the semantics of this interface method isn't intuitive: it returns a 
> modified version of the data source options map. Considering that the LibSVM 
> case can be easily handled using schema metadata inside {{inferSchema}}, we 
> can remove this interface method to keep the {{FileFormat}} interface clean.






[jira] [Assigned] (SPARK-15983) Remove FileFormat.prepareRead

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15983:


Assignee: Cheng Lian  (was: Apache Spark)

> Remove FileFormat.prepareRead
> -
>
> Key: SPARK-15983
> URL: https://issues.apache.org/jira/browse/SPARK-15983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Interface method {{FileFormat.prepareRead()}} was added in [PR 
> #12088|https://github.com/apache/spark/pull/12088] to handle a special case 
> in the LibSVM data source.
> However, the semantics of this interface method isn't intuitive: it returns a 
> modified version of the data source options map. Considering that the LibSVM 
> case can be easily handled using schema metadata inside {{inferSchema}}, we 
> can remove this interface method to keep the {{FileFormat}} interface clean.






[jira] [Assigned] (SPARK-15983) Remove FileFormat.prepareRead

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15983:


Assignee: Apache Spark  (was: Cheng Lian)

> Remove FileFormat.prepareRead
> -
>
> Key: SPARK-15983
> URL: https://issues.apache.org/jira/browse/SPARK-15983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>
> Interface method {{FileFormat.prepareRead()}} was added in [PR 
> #12088|https://github.com/apache/spark/pull/12088] to handle a special case 
> in the LibSVM data source.
> However, the semantics of this interface method isn't intuitive: it returns a 
> modified version of the data source options map. Considering that the LibSVM 
> case can be easily handled using schema metadata inside {{inferSchema}}, we 
> can remove this interface method to keep the {{FileFormat}} interface clean.






[jira] [Commented] (SPARK-15983) Remove FileFormat.prepareRead

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333141#comment-15333141
 ] 

Apache Spark commented on SPARK-15983:
--

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13698

> Remove FileFormat.prepareRead
> -
>
> Key: SPARK-15983
> URL: https://issues.apache.org/jira/browse/SPARK-15983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Interface method {{FileFormat.prepareRead()}} was added in [PR 
> #12088|https://github.com/apache/spark/pull/12088] to handle a special case 
> in the LibSVM data source.
> However, the semantics of this interface method isn't intuitive: it returns a 
> modified version of the data source options map. Considering that the LibSVM 
> case can be easily handled using schema metadata inside {{inferSchema}}, we 
> can remove this interface method to keep the {{FileFormat}} interface clean.






[jira] [Updated] (SPARK-15983) Remove FileFormat.prepareRead

2016-06-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15983:
---
Description: 
Interface method {{FileFormat.prepareRead()}} was added in [PR 
#12088|https://github.com/apache/spark/pull/12088] to handle a special case in 
the LibSVM data source.

However, the semantics of this interface method isn't intuitive: it returns a 
modified version of the data source options map. Considering that the LibSVM 
case can be easily handled using schema metadata inside {{inferSchema}}, we can 
remove this interface method to keep the {{FileFormat}} interface clean.


  was:
Interface method {{FileFormat.prepareRead()}} was added in [PR 
#12088|https://github.com/apache/spark/pull/12088] to handle a special case in 
the LibSVM data source.

However, the semantics of this interface method isn't intuitive: it returns a 
modified version of the data source options map. Considering that the LibSVM 
case can be easily handled using schema metadata inside inferSchema, we can 
remove this interface method to keep the FileFormat interface clean.



> Remove FileFormat.prepareRead
> -
>
> Key: SPARK-15983
> URL: https://issues.apache.org/jira/browse/SPARK-15983
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>
> Interface method {{FileFormat.prepareRead()}} was added in [PR 
> #12088|https://github.com/apache/spark/pull/12088] to handle a special case 
> in the LibSVM data source.
> However, the semantics of this interface method isn't intuitive: it returns a 
> modified version of the data source options map. Considering that the LibSVM 
> case can be easily handled using schema metadata inside {{inferSchema}}, we 
> can remove this interface method to keep the {{FileFormat}} interface clean.






[jira] [Created] (SPARK-15983) Remove FileFormat.prepareRead

2016-06-15 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-15983:
--

 Summary: Remove FileFormat.prepareRead
 Key: SPARK-15983
 URL: https://issues.apache.org/jira/browse/SPARK-15983
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian


Interface method {{FileFormat.prepareRead()}} was added in [PR 
#12088|https://github.com/apache/spark/pull/12088] to handle a special case in 
the LibSVM data source.

However, the semantics of this interface method isn't intuitive: it returns a 
modified version of the data source options map. Considering that the LibSVM 
case can be easily handled using schema metadata inside inferSchema, we can 
remove this interface method to keep the FileFormat interface clean.
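
A rough sketch of that alternative (an illustration, not the actual patch): the LibSVM-specific value that prepareRead() currently threads through the options map could instead be attached to the inferred schema as column metadata inside inferSchema(). The column name and metadata key below are assumptions.

{code}
import org.apache.spark.sql.types._

// Hypothetical sketch: record the number of features discovered during schema
// inference as metadata on the "features" column, instead of returning a
// modified options map from prepareRead().
def withNumFeatures(schema: StructType, numFeatures: Long): StructType = {
  val meta = new MetadataBuilder().putLong("numFeatures", numFeatures).build()
  StructType(schema.map {
    case f if f.name == "features" => f.copy(metadata = meta)
    case f                         => f
  })
}
{code}

inferSchema() would return the schema produced by withNumFeatures(), and the read path could recover the value from the field's metadata instead of from the options map.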







[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-15 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333125#comment-15333125
 ] 

Narine Kokhlikyan edited comment on SPARK-12922 at 6/16/16 5:25 AM:


FYI, [~olarayej], [~aloknsingh], [~vijayrb] :)


was (Author: narine):
FYI, [~olarayej], [~aloknsingh], [~vijayrb]!

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key values and a local data.frame holding the 
> grouped data. 
> R function output: a local data.frame.
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users 
> can do map-side combination via dapply().






[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-15 Thread Narine Kokhlikyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333125#comment-15333125
 ] 

Narine Kokhlikyan commented on SPARK-12922:
---

FYI, [~olarayej], [~aloknsingh], [~vijayrb]!

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key values and a local data.frame holding the 
> grouped data. 
> R function output: a local data.frame.
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users 
> can do map-side combination via dapply().






[jira] [Commented] (SPARK-15919) DStream "saveAsTextFile" doesn't update the prefix after each checkpoint

2016-06-15 Thread Aamir Abbas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333119#comment-15333119
 ] 

Aamir Abbas commented on SPARK-15919:
-

I have tried the solution you suggested, i.e. the window() function. Here's my code.

{code}
Duration batchInterval = new Duration(5 * 60 * 1000); // 5 minutes
javaStream.window(batchInterval, batchInterval)
  .dstream()
  .saveAsTextFiles(getBaseOutputPath(), "");
{code}

The actual output of this snippet is that it fetches the base output path once, 
creates folders under that path, and saves each record of the RDDs as a 
separate file.

The expected output was to get a new base output path every time the window() 
function is applied, and to save all the records of an RDD in a single file.

Please let me know if I am applying the window() function incorrectly, and how 
to do it correctly.
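
For comparison, a common way to get a prefix that is recomputed for every interval is to build the path inside foreachRDD, which runs once per batch. A minimal Scala sketch, assuming a DStream named stream and reusing the getBaseOutputPath() helper from the snippet above:

{code}
stream.foreachRDD { (rdd, batchTime) =>
  // foreachRDD runs once per batch, so the path is recomputed every interval.
  val path = s"${getBaseOutputPath()}/${batchTime.milliseconds}"
  rdd.coalesce(1).saveAsTextFile(path)  // coalesce(1) produces a single part file
}
{code}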

> DStream "saveAsTextFile" doesn't update the prefix after each checkpoint
> 
>
> Key: SPARK-15919
> URL: https://issues.apache.org/jira/browse/SPARK-15919
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.6.1
> Environment: Amazon EMR
>Reporter: Aamir Abbas
>
> I have a Spark streaming job that reads a data stream and saves it as a text 
> file after a predefined time interval, using 
> stream.dstream().repartition(1).saveAsTextFiles(getOutputPath(), "");
> The function getOutputPath() generates a new path every time it is called, 
> depending on the current system time.
> However, the output path prefix remains the same for all the batches, which 
> effectively means the function is not called again for the next batch of the 
> stream, although the files are being saved after each checkpoint interval. 






[jira] [Assigned] (SPARK-15981) Fix bug in python DataStreamReader

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15981:


Assignee: Tathagata Das  (was: Apache Spark)

> Fix bug in python DataStreamReader
> --
>
> Key: SPARK-15981
> URL: https://issues.apache.org/jira/browse/SPARK-15981
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> A bug in the Python DataStreamReader API made it unusable. 






[jira] [Commented] (SPARK-15981) Fix bug in python DataStreamReader

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333113#comment-15333113
 ] 

Apache Spark commented on SPARK-15981:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/13703

> Fix bug in python DataStreamReader
> --
>
> Key: SPARK-15981
> URL: https://issues.apache.org/jira/browse/SPARK-15981
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Blocker
>
> A bug in the Python DataStreamReader API made it unusable. 






[jira] [Assigned] (SPARK-15981) Fix bug in python DataStreamReader

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15981:


Assignee: Apache Spark  (was: Tathagata Das)

> Fix bug in python DataStreamReader
> --
>
> Key: SPARK-15981
> URL: https://issues.apache.org/jira/browse/SPARK-15981
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Streaming
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Blocker
>
> A bug in the Python DataStreamReader API made it unusable. 






[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread MIN-FU YANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MIN-FU YANG updated SPARK-15906:

Description: 
Improve the Naive Bayes algorithm on skewed data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations.

https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper is referenced by other papers & books 600+ times, so I think 
its results are solid.
https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567=high=1=UTF-8=1197073324019480518

  was:
Improve the Naive Bayes algorithm on skewed data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations.

https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper is referenced by other papers & books 600+ times, so I think 
its results are solid.


> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations.
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper is referenced by other papers & books 600+ times, so I think 
> its results are solid.
> https://scholar.google.com.tw/scholar?rlz=1C5CHFA_enTW567TW567=high=1=UTF-8=1197073324019480518






[jira] [Updated] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread MIN-FU YANG (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

MIN-FU YANG updated SPARK-15906:

Description: 
Improve the Naive Bayes algorithm on skewed data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf

Mahout & WEKA both have Complementary Naive Bayes implementations.

https://mahout.apache.org/users/classification/bayesian.html
http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html

Besides, this paper is referenced by other papers & books 600+ times, so I think 
its results are solid.

  was:
Improve the Naive Bayes algorithm on skewed data according to 
"Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf


> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations.
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper is referenced by other papers & books 600+ times, so I think 
> its results are solid.






[jira] [Commented] (SPARK-15906) Complementary Naive Bayes Algorithm Implementation

2016-06-15 Thread MIN-FU YANG (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333098#comment-15333098
 ] 

MIN-FU YANG commented on SPARK-15906:
-

OK, the description is updated.

> Complementary Naive Bayes Algorithm Implementation
> --
>
> Key: SPARK-15906
> URL: https://issues.apache.org/jira/browse/SPARK-15906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: MIN-FU YANG
>Priority: Minor
>
> Improve the Naive Bayes algorithm on skewed data according to 
> "Tackling the Poor Assumptions of Naive Bayes Text Classifiers", chapter 3.2:
> http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
> Mahout & WEKA both have Complementary Naive Bayes implementations.
> https://mahout.apache.org/users/classification/bayesian.html
> http://weka.sourceforge.net/doc.packages/complementNaiveBayes/weka/classifiers/bayes/ComplementNaiveBayes.html
> Besides, this paper is referenced by other papers & books 600+ times, so I think 
> its results are solid.
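
For reference, the complement estimate at the core of the paper's CNB variant (section 3.2) is roughly the following, where f_i is the count of word i in the document being classified, f_ji is the count of word i in training document j, and alpha is a smoothing prior (a sketch of the cited formula, not material from this issue):

{noformat}
\hat{\theta}_{\tilde{c}i} = \frac{\alpha_i + \sum_{j : y_j \neq c} f_{ji}}
                                 {\alpha + \sum_{j : y_j \neq c} \sum_k f_{jk}},
\qquad
w_{ci} = \log \hat{\theta}_{\tilde{c}i},
\qquad
\hat{y}(d) = \arg\min_c \sum_i f_i \, w_{ci}
{noformat}

That is, the parameters for class c are estimated from all training documents that do not belong to c, and the predicted class is the one whose complement matches the document least.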






[jira] [Created] (SPARK-15982) DataFrameReader.orc() should support varargs like json, csv, and parquet

2016-06-15 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15982:
-

 Summary: DataFrameReader.orc() should support varargs like json, 
csv, and parquet
 Key: SPARK-15982
 URL: https://issues.apache.org/jira/browse/SPARK-15982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Tathagata Das
Assignee: Tathagata Das









[jira] [Resolved] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-15 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12922.
---
   Resolution: Fixed
     Assignee: Narine Kokhlikyan
Fix Version/s: 2.0.0

Resolved by https://github.com/apache/spark/pull/12836

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key values and a local data.frame holding the 
> grouped data. 
> R function output: a local data.frame.
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported; users 
> can do map-side combination via dapply().






[jira] [Created] (SPARK-15981) Fix bug in python DataStreamReader

2016-06-15 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-15981:
-

 Summary: Fix bug in python DataStreamReader
 Key: SPARK-15981
 URL: https://issues.apache.org/jira/browse/SPARK-15981
 Project: Spark
  Issue Type: Sub-task
  Components: SQL, Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Blocker


A bug in the Python DataStreamReader API made it unusable. 








[jira] [Assigned] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15980:


Assignee: (was: Apache Spark)

> Add PushPredicateThroughObjectConsumer rule to Optimizer.
> -
>
> Key: SPARK-15980
> URL: https://issues.apache.org/jira/browse/SPARK-15980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates 
> through {{ObjectConsumer}}.
> And as an example, I implemented push-down typed filter through 
> {{SerializeFromObject}}.






[jira] [Assigned] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15980:


Assignee: Apache Spark

> Add PushPredicateThroughObjectConsumer rule to Optimizer.
> -
>
> Key: SPARK-15980
> URL: https://issues.apache.org/jira/browse/SPARK-15980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>
> I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates 
> through {{ObjectConsumer}}.
> And as an example, I implemented push-down typed filter through 
> {{SerializeFromObject}}.






[jira] [Commented] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333089#comment-15333089
 ] 

Apache Spark commented on SPARK-15980:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13702

> Add PushPredicateThroughObjectConsumer rule to Optimizer.
> -
>
> Key: SPARK-15980
> URL: https://issues.apache.org/jira/browse/SPARK-15980
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Takuya Ueshin
>
> I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates 
> through {{ObjectConsumer}}.
> And as an example, I implemented push-down typed filter through 
> {{SerializeFromObject}}.






[jira] [Created] (SPARK-15980) Add PushPredicateThroughObjectConsumer rule to Optimizer.

2016-06-15 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-15980:
-

 Summary: Add PushPredicateThroughObjectConsumer rule to Optimizer.
 Key: SPARK-15980
 URL: https://issues.apache.org/jira/browse/SPARK-15980
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Takuya Ueshin


I added {{PushPredicateThroughObjectConsumer}} rule to push-down predicates 
through {{ObjectConsumer}}.
And as an example, I implemented push-down typed filter through 
{{SerializeFromObject}}.
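
A small sketch of the user-facing pattern this rule targets, assuming a local SparkSession (the Event class is made up): a typed filter sitting on top of a typed map, i.e. on top of a {{SerializeFromObject}} node, which the rule would let run on the object side before serialization.

{code}
import org.apache.spark.sql.SparkSession

case class Event(id: Long, payload: String)  // hypothetical example type

val spark = SparkSession.builder().master("local[*]").appName("sketch").getOrCreate()
import spark.implicits._

val ds = Seq(Event(1L, "a"), Event(-1L, "b")).toDS()
val result = ds
  .map(e => e.copy(payload = e.payload.toUpperCase))  // plan ends in SerializeFromObject
  .filter(e => e.id > 0)                              // typed filter that could be pushed down
{code}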






[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-15 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333076#comment-15333076
 ] 

Simeon Simeonov commented on SPARK-14048:
-

Yes, I get the exact same failure with 1.6.1. 

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Comment Edited] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-15 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333076#comment-15333076
 ] 

Simeon Simeonov edited comment on SPARK-14048 at 6/16/16 4:46 AM:
--

Yes, I get the exact same failure with 1.6.1 running on Databricks.


was (Author: simeons):
Yes, I get the exact same failure with 1.6.1. 

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-15817) Spark client picking hive 1.2.1 by default which failed to alter a table name

2016-06-15 Thread Nataraj Gorantla (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333069#comment-15333069
 ] 

Nataraj Gorantla commented on SPARK-15817:
--

Can someone please provide an update? 

Thanks,
Nataraj 

> Spark client picking hive 1.2.1 by default which failed to alter a table name
> -
>
> Key: SPARK-15817
> URL: https://issues.apache.org/jira/browse/SPARK-15817
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.6.1
>Reporter: Nataraj Gorantla
>
> Some of our Scala scripts are failing with the error below:
> FAILED: Execution Error, return code 1 from
> org.apache.hadoop.hive.ql.exec.DDLTask. Unable to alter table. Invalid
> method name: 'alter_table_with_cascade'
> msg: org.apache.spark.sql.execution.QueryExecutionException: FAILED:
> Spark, when invoked, tries to initialize Hive 1.2.1 by default. We have Hive 
> 0.14 installed. Some background investigation from our side explained this. 
> Analysis
> The "alter_table_with_cascade" error occurs because of a metastore version 
> mismatch in Spark. 
> To correct this error, set the proper metastore version in the Spark config.
> I tried to add a couple of parameters to the spark-defaults.conf file: 
> spark.sql.hive.metastore.version 0.14.0
> #spark.sql.hive.metastore.jars maven
> spark.sql.hive.metastore.jars =/usr/hdp/current/hive-client/lib
> I still see issues. Can you please let me know if you have any alternative way 
> to fix this issue? 
> Thanks,
> Nataraj G
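
For what it's worth, entries in spark-defaults.conf are whitespace-separated (no '=' sign), and when spark.sql.hive.metastore.jars is given a path it has to be a JVM classpath containing the Hive client jars plus the matching Hadoop jars. A sketch of what the two lines could look like; the exact jar locations are an assumption for an HDP layout:

{noformat}
spark.sql.hive.metastore.version   0.14.0
spark.sql.hive.metastore.jars      /usr/hdp/current/hive-client/lib/*:/usr/hdp/current/hadoop-client/*
{noformat}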






[jira] [Resolved] (SPARK-15824) Run 'with ... insert ... select' failed when use spark thriftserver

2016-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15824.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13678
[https://github.com/apache/spark/pull/13678]

> Run 'with ... insert ... select' failed when use spark thriftserver
> ---
>
> Key: SPARK-15824
> URL: https://issues.apache.org/jira/browse/SPARK-15824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weizhong
>Priority: Minor
> Fix For: 2.0.0
>
>
> {code:sql}
> create table src(k int, v int);
> create table src_parquet(k int, v int);
> with v as (select 1, 2) insert into table src_parquet from src;
> {code}
> Will throw exception: spark.sql.execution.id is already set.






[jira] [Updated] (SPARK-15824) Run 'with ... insert ... select' failed when use spark thriftserver

2016-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15824:

Assignee: Herman van Hovell

> Run 'with ... insert ... select' failed when use spark thriftserver
> ---
>
> Key: SPARK-15824
> URL: https://issues.apache.org/jira/browse/SPARK-15824
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Weizhong
>Assignee: Herman van Hovell
>Priority: Minor
> Fix For: 2.0.0
>
>
> {code:sql}
> create table src(k int, v int);
> create table src_parquet(k int, v int);
> with v as (select 1, 2) insert into table src_parquet from src;
> {code}
> Will throw exception: spark.sql.execution.id is already set.






[jira] [Updated] (SPARK-12492) SQL page of Spark-sql is always blank

2016-06-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12492:
-
Issue Type: Improvement  (was: Bug)

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Reporter: meiyoula
>Assignee: KaiXinXIaoLei
> Fix For: 2.0.0
>
> Attachments: screenshot-1.png
>
>
> When I run a SQL query in spark-sql, the Execution page of the SQL tab is always 
> blank, but it is not blank for the JDBCServer.






[jira] [Resolved] (SPARK-12492) SQL page of Spark-sql is always blank

2016-06-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12492.
--
   Resolution: Fixed
     Assignee: KaiXinXIaoLei
Fix Version/s: 2.0.0

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
>Assignee: KaiXinXIaoLei
> Fix For: 2.0.0
>
> Attachments: screenshot-1.png
>
>
> When I run a SQL query in spark-sql, the Execution page of the SQL tab is always 
> blank, but it is not blank for the JDBCServer.






[jira] [Commented] (SPARK-15639) Try to push down filter at RowGroups level for parquet reader

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333030#comment-15333030
 ] 

Apache Spark commented on SPARK-15639:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/13701

> Try to push down filter at RowGroups level for parquet reader
> -
>
> Key: SPARK-15639
> URL: https://issues.apache.org/jira/browse/SPARK-15639
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>
> When we use the vectorized parquet reader, although the base reader (i.e., 
> SpecificParquetRecordReaderBase) will retrieve pushed-down filters for 
> RowGroups-level filtering, we do not seem to actually set up the filters to be 
> pushed down.
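
For context, a minimal sketch of how a row-group filter is registered at the parquet-mr level (the column name and threshold below are made up): once a FilterPredicate is set in the Hadoop configuration, the reader can skip row groups whose statistics cannot satisfy it.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.hadoop.ParquetInputFormat

val hadoopConf = new Configuration()
// Hypothetical predicate: keep only rows whose int column "age" is greater than 21.
val predicate = FilterApi.gt(FilterApi.intColumn("age"), Integer.valueOf(21))
ParquetInputFormat.setFilterPredicate(hadoopConf, predicate)
{code}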






[jira] [Assigned] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15977:


Assignee: Herman van Hovell  (was: Apache Spark)

> TRUNCATE TABLE does not work with Datasource tables outside of Hive
> ---
>
> Key: SPARK-15977
> URL: https://issues.apache.org/jira/browse/SPARK-15977
> Project: Spark
>  Issue Type: Bug
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The {{TRUNCATE TABLE}} command does not work with datasource tables without 
> Hive support. For example the following doesn't work:
> {noformat}
> DROP TABLE IF EXISTS test
> CREATE TABLE test(a INT, b STRING) USING JSON
> INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c')
> SELECT * FROM test
> TRUNCATE TABLE test
> SELECT * FROM test
> {noformat}






[jira] [Commented] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15333022#comment-15333022
 ] 

Apache Spark commented on SPARK-15977:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/13697

> TRUNCATE TABLE does not work with Datasource tables outside of Hive
> ---
>
> Key: SPARK-15977
> URL: https://issues.apache.org/jira/browse/SPARK-15977
> Project: Spark
>  Issue Type: Bug
>Reporter: Herman van Hovell
>Assignee: Herman van Hovell
>
> The {{TRUNCATE TABLE}} command does not work with datasource tables without 
> Hive support. For example the following doesn't work:
> {noformat}
> DROP TABLE IF EXISTS test
> CREATE TABLE test(a INT, b STRING) USING JSON
> INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c')
> SELECT * FROM test
> TRUNCATE TABLE test
> SELECT * FROM test
> {noformat}






[jira] [Assigned] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15977:


Assignee: Apache Spark  (was: Herman van Hovell)

> TRUNCATE TABLE does not work with Datasource tables outside of Hive
> ---
>
> Key: SPARK-15977
> URL: https://issues.apache.org/jira/browse/SPARK-15977
> Project: Spark
>  Issue Type: Bug
>Reporter: Herman van Hovell
>Assignee: Apache Spark
>
> The {{TRUNCATE TABLE}} command does not work with datasource tables without 
> Hive support. For example the following doesn't work:
> {noformat}
> DROP TABLE IF EXISTS test
> CREATE TABLE test(a INT, b STRING) USING JSON
> INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c')
> SELECT * FROM test
> TRUNCATE TABLE test
> SELECT * FROM test
> {noformat}






[jira] [Commented] (SPARK-15979) Rename various Parquet support classes

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332993#comment-15332993
 ] 

Apache Spark commented on SPARK-15979:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13700

> Rename various Parquet support classes
> --
>
> Key: SPARK-15979
> URL: https://issues.apache.org/jira/browse/SPARK-15979
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This patch renames various Parquet support classes from CatalystAbc to 
> ParquetAbc. This new naming makes more sense for two reasons:
> 1. These are not optimizer-related (i.e., Catalyst) classes.
> 2. We are in the Spark code base, and as a result it'd be clearer to call out 
> that these are Parquet support classes, rather than some Spark classes.






[jira] [Resolved] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15851.
-
   Resolution: Fixed
     Assignee: Reynold Xin
Fix Version/s: 2.0.0

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH.
> 2) Change line 350 of core/pom.xml:
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml






[jira] [Resolved] (SPARK-13498) JDBCRDD should update some input metrics

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-13498.
-
   Resolution: Fixed
     Assignee: Wayne Song
Fix Version/s: 2.0.0

> JDBCRDD should update some input metrics
> 
>
> Key: SPARK-13498
> URL: https://issues.apache.org/jira/browse/SPARK-13498
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wayne Song
>Assignee: Wayne Song
>Priority: Minor
> Fix For: 2.0.0
>
>
> The JDBCRDD does not update any input metrics, which makes it difficult to 
> see its progress in the web UI.  It should be simple to at least update 
> recordsRead.






[jira] [Assigned] (SPARK-15979) Rename various Parquet support classes

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15979:


Assignee: Apache Spark  (was: Reynold Xin)

> Rename various Parquet support classes
> --
>
> Key: SPARK-15979
> URL: https://issues.apache.org/jira/browse/SPARK-15979
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> This patch renames various Parquet support classes from CatalystAbc to 
> ParquetAbc. This new naming makes more sense for two reasons:
> 1. These are not optimizer-related (i.e., Catalyst) classes.
> 2. We are in the Spark code base, and as a result it'd be clearer to call out 
> that these are Parquet support classes, rather than some Spark classes.






[jira] [Assigned] (SPARK-15979) Rename various Parquet support classes

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15979:


Assignee: Reynold Xin  (was: Apache Spark)

> Rename various Parquet support classes
> --
>
> Key: SPARK-15979
> URL: https://issues.apache.org/jira/browse/SPARK-15979
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This patch renames various Parquet support classes from CatalystAbc to 
> ParquetAbc. This new naming makes more sense for two reasons:
> 1. These are not optimizer-related (i.e., Catalyst) classes.
> 2. We are in the Spark code base, and as a result it'd be clearer to call out 
> that these are Parquet support classes, rather than some Spark classes.






[jira] [Created] (SPARK-15979) Rename various Parquet support classes

2016-06-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-15979:
---

 Summary: Rename various Parquet support classes
 Key: SPARK-15979
 URL: https://issues.apache.org/jira/browse/SPARK-15979
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


This patch renames various Parquet support classes from CatalystAbc to 
ParquetAbc. This new naming makes more sense for two reasons:

1. These are not optimizer-related (i.e., Catalyst) classes.
2. We are in the Spark code base, and as a result it'd be clearer to call out 
that these are Parquet support classes, rather than some Spark classes.







[jira] [Assigned] (SPARK-15958) Make initial buffer size for the Sorter configurable

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15958:


Assignee: Apache Spark

> Make initial buffer size for the Sorter configurable
> 
>
> Key: SPARK-15958
> URL: https://issues.apache.org/jira/browse/SPARK-15958
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>Assignee: Apache Spark
>
> Currently the initial buffer size in the sorter is hard-coded inside the code 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88)
>  and is too small for large workloads. As a result, the sorter spends 
> significant time expanding the buffer size and copying the data. It would be 
> useful to have it configurable. 
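
A minimal sketch of the kind of change being proposed, assuming a hypothetical 
config key spark.sql.sort.initialBufferSize and a placeholder default; the 
actual key, default, and wiring in the eventual patch may differ:

{code}
import org.apache.spark.SparkConf

object SorterBufferSizeSketch {
  // Placeholder standing in for the value currently hard-coded in UnsafeExternalRowSorter.
  val DefaultInitialBufferSize = 4096

  // Read the initial buffer size from the conf, falling back to the current default.
  def initialBufferSize(conf: SparkConf): Int =
    conf.getInt("spark.sql.sort.initialBufferSize", DefaultInitialBufferSize)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(false)
      .set("spark.sql.sort.initialBufferSize", (4 * 1024 * 1024).toString)
    println(initialBufferSize(conf)) // 4194304 instead of the hard-coded default
  }
}
{code}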



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15979) Rename various Parquet support classes

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15979.
-
Resolution: Fixed

> Rename various Parquet support classes
> --
>
> Key: SPARK-15979
> URL: https://issues.apache.org/jira/browse/SPARK-15979
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This patch renames various Parquet support classes from CatalystAbc to 
> ParquetAbc. This new naming makes more sense for two reasons:
> 1. These are not optimizer-related (i.e., Catalyst) classes.
> 2. We are in the Spark code base, so it'd be clearer to call out that these 
> are Parquet support classes rather than generic Spark classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15979) Rename various Parquet support classes

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332968#comment-15332968
 ] 

Apache Spark commented on SPARK-15979:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13696

> Rename various Parquet support classes
> --
>
> Key: SPARK-15979
> URL: https://issues.apache.org/jira/browse/SPARK-15979
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>
> This patch renames various Parquet support classes from CatalystAbc to 
> ParquetAbc. This new naming makes more sense for two reasons:
> 1. These are not optimizer-related (i.e., Catalyst) classes.
> 2. We are in the Spark code base, so it'd be clearer to call out that these 
> are Parquet support classes rather than generic Spark classes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15958) Make initial buffer size for the Sorter configurable

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15958:


Assignee: (was: Apache Spark)

> Make initial buffer size for the Sorter configurable
> 
>
> Key: SPARK-15958
> URL: https://issues.apache.org/jira/browse/SPARK-15958
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently the initial buffer size in the sorter is hard-coded 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88)
>  and is too small for large workloads. As a result, the sorter spends 
> significant time expanding the buffer and copying data. It would be 
> useful to make it configurable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15958) Make initial buffer size for the Sorter configurable

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332966#comment-15332966
 ] 

Apache Spark commented on SPARK-15958:
--

User 'sitalkedia' has created a pull request for this issue:
https://github.com/apache/spark/pull/13699

> Make initial buffer size for the Sorter configurable
> 
>
> Key: SPARK-15958
> URL: https://issues.apache.org/jira/browse/SPARK-15958
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Currently the initial buffer size in the sorter is hard-coded 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/execution/UnsafeExternalRowSorter.java#L88)
>  and is too small for large workloads. As a result, the sorter spends 
> significant time expanding the buffer and copying data. It would be 
> useful to make it configurable. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-15 Thread Joseph Fourny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332960#comment-15332960
 ] 

Joseph Fourny edited comment on SPARK-15690 at 6/16/16 3:00 AM:


I am trying to develop single-node clusters on large servers (30+ CPU cores) 
with 2-3 TB of RAM. Our use cases involve small to medium-size datasets, but 
with a huge number of concurrent jobs (shared, multi-tenant environments). 
Efficiency and sub-second response times are the primary requirements. The 
shuffle between stages is the current bottleneck. Writing anything to disk is 
just a waste of time if all computations are done in the same JVM (or a small 
set of JVMs on the same machine). We tried using RAMFS to avoid disk I/O, but 
a lot of CPU time is still spent in compression and serialization. I would 
rather not hack my way out of this one. Is it wishful thinking to have this 
officially supported?


was (Author: josephfourny):
+1 on this. I am trying to develop single-node clusters on large servers (30+ 
CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-size 
datasets, but with a huge number of concurrent jobs (shared, multi-tenant 
environments). Efficiency and sub-second response times are the primary 
requirements. The shuffle between stages is the current bottleneck. Writing 
anything to disk is just a waste of time if all computations are done in the 
same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to 
avoid disk I/O, but a lot of CPU time is still spent in compression and 
serialization. I would rather not hack my way out of this one. Is it wishful 
thinking to have this officially supported?

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by 
> partition id and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix 
> sort to do data shuffling on a single node, and still gracefully fall back to 
> disk if the data does not fit in memory. Given that the number of partitions 
> is usually small (say less than 256), it'd require only a single pass to do 
> the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15690) Fast single-node (single-process) in-memory shuffle

2016-06-15 Thread Joseph Fourny (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332960#comment-15332960
 ] 

Joseph Fourny commented on SPARK-15690:
---

+1 on this. I am trying to develop single-node clusters on large servers (30+ 
CPU cores) with 2-3 TB of RAM. Our use cases involve small to medium-size 
datasets, but with a huge number of concurrent jobs (shared, multi-tenant 
environments). Efficiency and sub-second response times are the primary 
requirements. The shuffle between stages is the current bottleneck. Writing 
anything to disk is just a waste of time if all computations are done in the 
same JVM (or a small set of JVMs on the same machine). We tried using RAMFS to 
avoid disk I/O, but a lot of CPU time is still spent in compression and 
serialization. I would rather not hack my way out of this one. Is it wishful 
thinking to have this officially supported?

> Fast single-node (single-process) in-memory shuffle
> ---
>
> Key: SPARK-15690
> URL: https://issues.apache.org/jira/browse/SPARK-15690
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, SQL
>Reporter: Reynold Xin
>
> Spark's current shuffle implementation sorts all intermediate data by 
> partition id and then writes the data to disk. This is not a big bottleneck 
> because the network throughput on commodity clusters tends to be low. However, 
> an increasing number of Spark users are using the system to process data on a 
> single node. When operating on a single node against intermediate data that 
> fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use an in-memory radix 
> sort to do data shuffling on a single node, and still gracefully fall back to 
> disk if the data does not fit in memory. Given that the number of partitions 
> is usually small (say less than 256), it'd require only a single pass to do 
> the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This 
> ticket has a smaller scope (single-process), and aims to actually 
> productionize this code.
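
To illustrate the single-pass idea from the description above: with a small, 
fixed number of partitions, records can be bucketed by partition id in one 
counting-sort pass (the one-digit case of the radix sort mentioned). This is 
only a sketch of the technique, not Spark's shuffle code; it sorts bare 
partition ids rather than real records.

{code}
object PartitionSortSketch {
  // Sort values (here, the partition ids themselves) by partition id in a single pass.
  def sortByPartition(partitionIds: Array[Int], numPartitions: Int): Array[Int] = {
    val counts = new Array[Int](numPartitions + 1)
    partitionIds.foreach(p => counts(p + 1) += 1)              // count records per partition
    for (i <- 1 to numPartitions) counts(i) += counts(i - 1)   // prefix sums = bucket start offsets
    val out = new Array[Int](partitionIds.length)
    partitionIds.foreach { p => out(counts(p)) = p; counts(p) += 1 }
    out
  }

  def main(args: Array[String]): Unit = {
    println(sortByPartition(Array(3, 0, 2, 0, 1, 3), numPartitions = 4).mkString(",")) // 0,0,1,2,3,3
  }
}
{code}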



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15978) Some improvement of "Show Tables"

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15978:


Assignee: Apache Spark

> Some improvement of "Show Tables"
> -
>
> Key: SPARK-15978
> URL: https://issues.apache.org/jira/browse/SPARK-15978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>Assignee: Apache Spark
>Priority: Minor
>
> I've found some minor issues in the "show tables" command:
> 1. In SessionCatalog.scala, the listTables(db: String) method calls 
> listTables(formatDatabaseName(db), "*") to list all the tables for a given 
> db, but in the method listTables(db: String, pattern: String) this db name 
> is formatted once more. So I think we should remove the formatDatabaseName() 
> call in the caller.
> 2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, 
> just like listDatabases().
> I will make a PR shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15978) Some improvement of "Show Tables"

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15978:


Assignee: (was: Apache Spark)

> Some improvement of "Show Tables"
> -
>
> Key: SPARK-15978
> URL: https://issues.apache.org/jira/browse/SPARK-15978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>Priority: Minor
>
> I've found some minor issues in the "show tables" command:
> 1. In SessionCatalog.scala, the listTables(db: String) method calls 
> listTables(formatDatabaseName(db), "*") to list all the tables for a given 
> db, but in the method listTables(db: String, pattern: String) this db name 
> is formatted once more. So I think we should remove the formatDatabaseName() 
> call in the caller.
> 2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, 
> just like listDatabases().
> I will make a PR shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15978) Some improvement of "Show Tables"

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15978:


Assignee: Apache Spark

> Some improvement of "Show Tables"
> -
>
> Key: SPARK-15978
> URL: https://issues.apache.org/jira/browse/SPARK-15978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>Assignee: Apache Spark
>Priority: Minor
>
> I've found some minor issues in the "show tables" command:
> 1. In SessionCatalog.scala, the listTables(db: String) method calls 
> listTables(formatDatabaseName(db), "*") to list all the tables for a given 
> db, but in the method listTables(db: String, pattern: String) this db name 
> is formatted once more. So I think we should remove the formatDatabaseName() 
> call in the caller.
> 2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, 
> just like listDatabases().
> I will make a PR shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15978) Some improvement of "Show Tables"

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332802#comment-15332802
 ] 

Apache Spark commented on SPARK-15978:
--

User 'bomeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/13695

> Some improvement of "Show Tables"
> -
>
> Key: SPARK-15978
> URL: https://issues.apache.org/jira/browse/SPARK-15978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Bo Meng
>Priority: Minor
>
> I've found some minor issues in the "show tables" command:
> 1. In SessionCatalog.scala, the listTables(db: String) method calls 
> listTables(formatDatabaseName(db), "*") to list all the tables for a given 
> db, but in the method listTables(db: String, pattern: String) this db name 
> is formatted once more. So I think we should remove the formatDatabaseName() 
> call in the caller.
> 2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, 
> just like listDatabases().
> I will make a PR shortly. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Nezih Yigitbasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332797#comment-15332797
 ] 

Nezih Yigitbasi commented on SPARK-15782:
-

Reopened; will submit a PR including Marcelo's fix on top of mine.

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.
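
For context, a rough sketch of the mechanism the description refers to: the 
jars resolved for --packages are exposed through the spark.jars setting, which 
the REPL reads to build its classpath. The jar paths below are made up, and 
this is not the actual SparkSubmit/repl code, just an illustration of the 
handoff that the removed line used to perform.

{code}
import org.apache.spark.SparkConf

object PackagesClasspathSketch {
  def main(args: Array[String]): Unit = {
    // Pretend these are the jars that --packages resolved via Ivy (illustrative paths).
    val resolvedJars = Seq("/tmp/ivy/jars/foo_2.11-1.0.jar", "/tmp/ivy/jars/bar_2.11-2.0.jar")

    // Client mode used to publish them under "spark.jars"...
    val conf = new SparkConf(false).set("spark.jars", resolvedJars.mkString(","))

    // ...and the REPL main class would read the setting back to extend its classpath.
    val replClasspathEntries = conf.get("spark.jars", "").split(",").filter(_.nonEmpty)
    replClasspathEntries.foreach(println)
  }
}
{code}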



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Nezih Yigitbasi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nezih Yigitbasi reopened SPARK-15782:
-

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15782:


Assignee: Apache Spark  (was: Nezih Yigitbasi)

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Apache Spark
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15782:


Assignee: Nezih Yigitbasi  (was: Apache Spark)

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15978) Some improvement of "Show Tables"

2016-06-15 Thread Bo Meng (JIRA)
Bo Meng created SPARK-15978:
---

 Summary: Some improvement of "Show Tables"
 Key: SPARK-15978
 URL: https://issues.apache.org/jira/browse/SPARK-15978
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Bo Meng
Priority: Minor


I've found some minor issues in the "show tables" command:
1. In SessionCatalog.scala, the listTables(db: String) method calls 
listTables(formatDatabaseName(db), "*") to list all the tables for a given db, 
but in the method listTables(db: String, pattern: String) this db name is 
formatted once more. So I think we should remove the formatDatabaseName() call 
in the caller (see the sketch below).
2. I suggest adding a sort to listTables(db: String) in InMemoryCatalog.scala, 
just like listDatabases().

I will make a PR shortly. 
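
A simplified sketch of the double formatting in item 1. This is not the actual 
SessionCatalog code: formatDatabaseName is reduced to lower-casing, the table 
list is hard-coded, and pattern matching is elided.

{code}
object ShowTablesSketch {
  // Stand-in for SessionCatalog.formatDatabaseName.
  def formatDatabaseName(db: String): String = db.toLowerCase

  // listTables(db, pattern) already formats the database name itself...
  def listTables(db: String, pattern: String): Seq[String] = {
    val name = formatDatabaseName(db)
    Seq(s"$name.t1", s"$name.t2")   // pattern matching elided
  }

  // ...so formatting again in this caller is redundant.
  def listTables(db: String): Seq[String] = listTables(formatDatabaseName(db), "*")

  def main(args: Array[String]): Unit = println(listTables("MyDB"))
}
{code}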



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13498) JDBCRDD should update some input metrics

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332776#comment-15332776
 ] 

Apache Spark commented on SPARK-13498:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13694

> JDBCRDD should update some input metrics
> 
>
> Key: SPARK-13498
> URL: https://issues.apache.org/jira/browse/SPARK-13498
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wayne Song
>Priority: Minor
>
> The JDBCRDD does not update any input metrics, which makes it difficult to 
> see its progress in the web UI.  It should be simple to at least update 
> recordsRead.
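
As a rough, hedged sketch of the bookkeeping involved (not the actual JDBCRDD 
change): a wrapping iterator can count rows as they are consumed, and the 
onRead callback below stands in for updating the task's input metrics such as 
recordsRead.

{code}
// Hedged sketch, not JDBCRDD internals: count rows as the result iterator is consumed.
class CountingIterator[T](underlying: Iterator[T], onRead: Long => Unit) extends Iterator[T] {
  private var recordsRead = 0L
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = {
    val row = underlying.next()
    recordsRead += 1
    onRead(recordsRead)   // would update the task's input metrics in the real change
    row
  }
}

object CountingIteratorDemo {
  def main(args: Array[String]): Unit = {
    val rows = new CountingIterator(Iterator("r1", "r2", "r3"), n => println(s"recordsRead = $n"))
    rows.foreach(_ => ())
  }
}
{code}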



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14048) Aggregation operations on structs fail when the structs have fields with special characters

2016-06-15 Thread Sean Zhong (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332775#comment-15332775
 ] 

Sean Zhong commented on SPARK-14048:


[~simeons]  

Are you still able to reproduce this? I cannot reproduce it on 1.6 using the 
following script on Databricks Cloud community edition.

{code}
val rdd = sc.makeRDD(
  """{"st": {"x.y": 1}, "age": 10}""" :: """{"st": {"x.y": 2}, "age": 10}""" :: 
"""{"st": {"x.y": 2}, "age": 20}""" :: Nil)
sqlContext.read.json(rdd).registerTempTable("test")
sqlContext.sql("select first(st) as st from test group by age").show()
{code}

> Aggregation operations on structs fail when the structs have fields with 
> special characters
> ---
>
> Key: SPARK-14048
> URL: https://issues.apache.org/jira/browse/SPARK-14048
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Databricks w/ 1.6.0
>Reporter: Simeon Simeonov
>  Labels: sql
> Attachments: bug_structs_with_backticks.html
>
>
> Consider a schema where a struct has field names with special characters, 
> e.g.,
> {code}
>  |-- st: struct (nullable = true)
>  ||-- x.y: long (nullable = true)
> {code}
> Schemas such as these are frequently generated by the JSON schema generator, 
> which seems to never want to map JSON data to {{MapType}}, always preferring 
> to use {{StructType}}. 
> In SparkSQL, referring to these fields requires backticks, e.g., 
> {{st.`x.y`}}. There is no problem manipulating these structs unless one is 
> using an aggregation function. It seems that, under the covers, the code is 
> not escaping fields with special characters correctly.
> For example, 
> {code}
> select first(st) as st from tbl group by something
> {code}
> generates
> {code}
> org.apache.spark.sql.catalyst.util.DataTypeException: Unsupported dataType: 
> struct. If you have a struct and a field name of it has any 
> special characters, please use backticks (`) to quote that field name, e.g. 
> `x+y`. Please note that backtick itself is not supported in a field name.
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$class.toDataType(DataTypeParser.scala:100)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$$anon$1.toDataType(DataTypeParser.scala:112)
>   at 
> org.apache.spark.sql.catalyst.util.DataTypeParser$.parse(DataTypeParser.scala:116)
>   at 
> org.apache.spark.sql.hive.HiveMetastoreTypes$.toDataType(HiveMetastoreCatalog.scala:884)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:395)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$$anonfun$toJsonSchema$1.apply(OutputAggregator.scala:394)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.toJsonSchema(OutputAggregator.scala:394)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.maybeApplyOutputAggregation(OutputAggregator.scala:122)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:82)
>   at 
> com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.executeSql(DriverLocal.scala:306)
>   at 
> com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$3.apply(DriverWrapper.scala:467)
>   at scala.util.Try$.apply(Try.scala:161)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:464)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:365)
>   at 
> com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:196)
>   at java.lang.Thread.run(Thread.java:745)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15977) TRUNCATE TABLE does not work with Datasource tables outside of Hive

2016-06-15 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-15977:
-

 Summary: TRUNCATE TABLE does not work with Datasource tables 
outside of Hive
 Key: SPARK-15977
 URL: https://issues.apache.org/jira/browse/SPARK-15977
 Project: Spark
  Issue Type: Bug
Reporter: Herman van Hovell
Assignee: Herman van Hovell


The {{TRUNCATE TABLE}} command does not work with datasource tables when Hive 
support is not enabled. For example, the following doesn't work:
{noformat}
DROP TABLE IF EXISTS test
CREATE TABLE test(a INT, b STRING) USING JSON
INSERT INTO test VALUES (1, 'a'), (2, 'b'), (3, 'c')
SELECT * FROM test
TRUNCATE TABLE test
SELECT * FROM test
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7848) Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" information.

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7848.

   Resolution: Fixed
 Assignee: Nirman Narang
Fix Version/s: 2.0.0

> Update SparkStreaming docs to incorporate FAQ and/or bullets w/ "knobs" 
> information.
> 
>
> Key: SPARK-7848
> URL: https://issues.apache.org/jira/browse/SPARK-7848
> Project: Spark
>  Issue Type: Documentation
>  Components: Streaming
>Reporter: jay vyas
>Assignee: Nirman Narang
> Fix For: 2.0.0
>
>
> A recent email on the mailing list detailed a bunch of great "knobs" to 
> remember for Spark Streaming. 
> Let's integrate this into the docs where appropriate.
> I'll paste the raw text in a comment below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8813) Combine files when there're many small files in table

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8813.

   Resolution: Fixed
 Assignee: Michael Armbrust
Fix Version/s: 2.0.0

> Combine files when there're many small files in table
> -
>
> Key: SPARK-8813
> URL: https://issues.apache.org/jira/browse/SPARK-8813
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Yadong Qi
>Assignee: Michael Armbrust
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15974) Create a socket on YARN AM start-up

2016-06-15 Thread Mingyu Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332749#comment-15332749
 ] 

Mingyu Kim commented on SPARK-15974:


I agree this is not ideal. A lot of the time, though, setting up a server with 
a Socket won't be an unreasonable thing to do.

The alternative would be to have the Spark program pass some information to the 
Spark AM during start-up. (Having the Spark program set the port on YARN is not 
possible, as discussed in the thread linked above.) This can probably be done 
through the use of static variables in the Spark program class. None of these 
sound particularly great to me, but here are some options I can think of:

- The Spark program class optionally has a Map initialize() method, which 
returns some named objects back to the Spark AM. "rpc-port" could be one of the 
key names supported, and we can imagine adding more keys later. The Spark 
program class will need to store some information (in the case of the RPC port, 
a Server object or Socket) as a static var for the main method to use.
- Pass something like a SettableFuture to the main method so that the Spark AM 
can wait for some initialization to be done. This means either that the command 
line args need to be augmented with this one extra thing, which is confusing, or 
that the SettableFuture needs to be passed to the Spark program class through 
some other method and then stored as a static var in the Spark program class 
for the main method to use.

Another option would be to change the way spark-submitted applications are 
written so that the class implements an interface with an explicit initialize 
method, as opposed to a class with the main method. This would let us avoid 
playing with static variables, but it would be a pretty big compatibility break 
for Spark.
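
For reference, the core of the ServerSocket idea discussed above is small: bind 
an ephemeral port (port 0) and read back the port the OS chose, which is the 
number the AM would register with YARN instead of the dummy 0. The sketch below 
is illustrative only and is not Spark's ApplicationMaster code.

{code}
import java.net.ServerSocket

object AmPortSketch {
  def main(args: Array[String]): Unit = {
    val server = new ServerSocket(0)    // port 0: let the OS pick a free port
    val rpcPort = server.getLocalPort   // the port the AM would register with YARN
    println(s"would register RPC port $rpcPort with the YARN RM")
    server.close()
  }
}
{code}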

> Create a socket on YARN AM start-up
> ---
>
> Key: SPARK-15974
> URL: https://issues.apache.org/jira/browse/SPARK-15974
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Mingyu Kim
>
> YARN provides a way for the ApplicationMaster to register an RPC port so that 
> a client outside the YARN cluster can reach the application for any RPCs, but 
> Spark’s YARN AMs simply register a dummy port number of 0. For Spark 
> programs that start up a server, this makes it hard for the submitter to 
> discover the server port securely. Spark's ApplicationMaster should 
> optionally create a ServerSocket and pass it to the Spark user program. This 
> socket initialization should be disabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12114) ColumnPruning rule fails in case of "Project <- Filter <- Join"

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12114.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> ColumnPruning rule fails in case of "Project <- Filter <- Join"
> ---
>
> Key: SPARK-12114
> URL: https://issues.apache.org/jira/browse/SPARK-12114
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Min Qiu
> Fix For: 2.0.0
>
>
> For the query
> {code}
> SELECT c_name, c_custkey, o_orderkey, o_orderdate, 
>o_totalprice, sum(l_quantity) 
> FROM customer join orders join lineitem 
>   on c_custkey = o_custkey AND o_orderkey = l_orderkey 
>  left outer join (SELECT l_orderkey tmp_orderkey 
>   FROM lineitem 
>   GROUP BY l_orderkey 
>   HAVING sum(l_quantity) > 300) tmp 
>   on o_orderkey = tmp_orderkey 
> WHERE tmp_orderkey IS NOT NULL 
> GROUP BY c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice 
> ORDER BY o_totalprice DESC, o_orderdate
> {code}
> The optimizedPlan is 
> {code}
> Sort \[o_totalprice#48 DESC,o_orderdate#49 ASC]
>  
>  Aggregate 
> \[c_name#38,c_custkey#37,o_orderkey#45,o_orderdate#49,o_totalprice#48], 
> \[c_name#38,c_custkey#37,o_orderkey#45,
> o_orderdate#49,o_totalprice#48,SUM(l_quantity#58) AS _c5#36]
>   {color: green}Project 
> \[c_name#38,o_orderdate#49,c_custkey#37,o_orderkey#45,o_totalprice#48,l_quantity#58]
>Filter IS NOT NULL tmp_orderkey#35
> Join LeftOuter, Some((o_orderkey#45 = tmp_orderkey#35)){color}
>  Join Inner, Some((c_custkey#37 = o_custkey#46))
>   MetastoreRelation default, customer, None
>   Join Inner, Some((o_orderkey#45 = l_orderkey#54))
>MetastoreRelation default, orders, None
>MetastoreRelation default, lineitem, None
>  Project \[tmp_orderkey#35]
>   Filter havingCondition#86
>Aggregate \[l_orderkey#70], \[(SUM(l_quantity#74) > 300.0) AS 
> havingCondition#86,l_orderkey#70 AS tmp_orderkey#35]
> Project \[l_orderkey#70,l_quantity#74]
>  MetastoreRelation default, lineitem, None
> {code}
> Due to the pattern highlighted in green, which the ColumnPruning rule fails 
> to deal with, all columns of the lineitem and orders tables are scanned. The 
> unneeded columns are also involved in the data shuffling. Performance is 
> extremely bad if either of the two tables is big.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12032) Filter can't be pushed down to correct Join because of bad order of Join

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332721#comment-15332721
 ] 

Apache Spark commented on SPARK-12032:
--

User 'flyson' has created a pull request for this issue:
https://github.com/apache/spark/pull/10258

> Filter can't be pushed down to correct Join because of bad order of Join
> 
>
> Key: SPARK-12032
> URL: https://issues.apache.org/jira/browse/SPARK-12032
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 2.0.0
>
>
> For this query:
> {code}
>   select d.d_year, count(*) cnt
>FROM store_sales, date_dim d, customer c
>WHERE ss_customer_sk = c.c_customer_sk AND c.c_first_shipto_date_sk = 
> d.d_date_sk
>group by d.d_year
> {code}
> Current optimized plan is
> {code}
> == Optimized Logical Plan ==
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some(((ss_customer_sk#283 = c_customer_sk#101) && 
> (c_first_shipto_date_sk#106 = d_date_sk#141)))
>Project [d_date_sk#141,d_year#147,ss_customer_sk#283]
> Join Inner, None
>  Project [ss_customer_sk#283]
>   Relation[] ParquetRelation[store_sales]
>  Project [d_date_sk#141,d_year#147]
>   Relation[] ParquetRelation[date_dim]
>Project [c_customer_sk#101,c_first_shipto_date_sk#106]
> Relation[] ParquetRelation[customer]
> {code}
> It will join store_sales and date_dim together without any condition; the 
> condition c.c_first_shipto_date_sk = d.d_date_sk is not pushed to it because 
> of the bad order of joins.
> The optimizer should re-order the joins, joining date_dim after customer; 
> then it can push down the condition correctly.
> The plan should be 
> {code}
> Aggregate [d_year#147], [d_year#147,(count(1),mode=Complete,isDistinct=false) 
> AS cnt#425L]
>  Project [d_year#147]
>   Join Inner, Some((c_first_shipto_date_sk#106 = d_date_sk#141))
>Project [c_first_shipto_date_sk#106]
> Join Inner, Some((ss_customer_sk#283 = c_customer_sk#101))
>  Project [ss_customer_sk#283]
>   Relation[store_sales]
>  Project [c_first_shipto_date_sk#106,c_customer_sk#101]
>   Relation[customer]
>Project [d_year#147,d_date_sk#141]
> Relation[date_dim]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12329) spark-sql prints out SET commands to stdout instead of stderr

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12329:


Assignee: Apache Spark

> spark-sql prints out SET commands to stdout instead of stderr
> -
>
> Key: SPARK-12329
> URL: https://issues.apache.org/jira/browse/SPARK-12329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ashwin Shankar
>Assignee: Apache Spark
>Priority: Minor
>
> When I run "$spark-sql -f ", I see that few "SET key value" messages 
> get printed on stdout instead of stderr. These messages should go to stderr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12329) spark-sql prints out SET commands to stdout instead of stderr

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12329:


Assignee: (was: Apache Spark)

> spark-sql prints out SET commands to stdout instead of stderr
> -
>
> Key: SPARK-12329
> URL: https://issues.apache.org/jira/browse/SPARK-12329
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Ashwin Shankar
>Priority: Minor
>
> When I run "$spark-sql -f ", I see that few "SET key value" messages 
> get printed on stdout instead of stderr. These messages should go to stderr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9689) Cache doesn't refresh for HadoopFsRelation based table

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9689.

   Resolution: Fixed
 Assignee: (was: Cheng Hao)
Fix Version/s: 2.0.0

I think this one has been fixed in 2.0 already.


> Cache doesn't refresh for HadoopFsRelation based table
> --
>
> Key: SPARK-9689
> URL: https://issues.apache.org/jira/browse/SPARK-9689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheng Hao
> Fix For: 2.0.0
>
>
> {code:title=example|borderStyle=solid}
> // create a HadoopFsRelation based table
> sql(s"""
> |CREATE TEMPORARY TABLE jsonTable (a int, b string)
> |USING org.apache.spark.sql.json.DefaultSource
> |OPTIONS (
> |  path '${path.toString}'
> |)""".stripMargin)
>   
> // give the value from table jt
> sql(
>   s"""
>   |INSERT OVERWRITE TABLE jsonTable SELECT a, b FROM jt
> """.stripMargin)
> // cache the HadoopFsRelation Table
> sqlContext.cacheTable("jsonTable")
>
> // update the HadoopFsRelation Table
> sql(
>   s"""
> |INSERT OVERWRITE TABLE jsonTable SELECT a * 2, b FROM jt
>   """.stripMargin)
> // Even this will fail
>  sql("SELECT a, b FROM jsonTable").collect()
> // This will fail, as the cache doesn't refresh
> checkAnswer(
>   sql("SELECT a, b FROM jsonTable"),
>   sql("SELECT a * 2, b FROM jt").collect())
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"

2016-06-15 Thread Derek Sabry (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332704#comment-15332704
 ] 

Derek Sabry commented on SPARK-4944:


This email account is inactive. Please contact another person at the company or 
pe...@fb.com.


> Table Not Found exception in "Create Table Like registered RDD table"
> -
>
> Key: SPARK-4944
> URL: https://issues.apache.org/jira/browse/SPARK-4944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> {code}
> rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
> hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
> hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION 
> '/user/spark/my_data.parquet'")
> {code}
> {noformat}
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not 
> found rdd_table
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4944.

Resolution: Auto Closed

> Table Not Found exception in "Create Table Like registered RDD table"
> -
>
> Key: SPARK-4944
> URL: https://issues.apache.org/jira/browse/SPARK-4944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> {code}
> rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
> hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
> hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION 
> '/user/spark/my_data.parquet'")
> {code}
> {noformat}
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not 
> found rdd_table
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15976) Make Stage Numbering deterministic

2016-06-15 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332686#comment-15332686
 ] 

Imran Rashid commented on SPARK-15976:
--

cc [~kayousterhout] [~markhamstra]

> Make Stage Numbering deterministic
> -
>
> Key: SPARK-15976
> URL: https://issues.apache.org/jira/browse/SPARK-15976
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>
> Stage numbering in Spark is non-deterministic.  It never was deterministic, 
> but it *appeared* to be so in most cases.  After SPARK-15927, it is far more 
> random.  Reliable stage numbering would be helpful for internal unit tests, 
> and also for any client code which uses {{SparkListener}} to monitor a job 
> and gauge progress.
> FWIW, I had never even realized that the order was non-deterministic before, 
> and have written plenty of code which assumes some stage numbering.  I expect 
> users may be bitten by this too.  We might even want to try to restore the 
> "usual" ordering from before SPARK-15927.
> Finally, it would be nice to restore some of the tests turned off here if 
> possible: https://github.com/apache/spark/pull/13688



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R

2016-06-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15177:
--
Issue Type: Documentation  (was: Improvement)

> SparkR 2.0 QA: New R APIs and API docs for mllib.R
> --
>
> Key: SPARK-15177
> URL: https://issues.apache.org/jira/browse/SPARK-15177
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Priority: Blocker
>
> Audit new public R APIs in mllib.R.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15129) Clarify conventions for calling Spark and MLlib from R

2016-06-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15129:
--
Assignee: Gayathri Murali

> Clarify conventions for calling Spark and MLlib from R
> --
>
> Key: SPARK-15129
> URL: https://issues.apache.org/jira/browse/SPARK-15129
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Gayathri Murali
>Priority: Blocker
>
> Since some R API modifications happened in 2.0, we need to make the new 
> standards clear in the user guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15976) Make Stage Numbering deterministic

2016-06-15 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-15976:


 Summary: Make Stage Numbering deterministic
 Key: SPARK-15976
 URL: https://issues.apache.org/jira/browse/SPARK-15976
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 2.0.0
Reporter: Imran Rashid


Stage numbering in Spark is non-deterministic.  It never was deterministic, but 
it *appeared* to be so in most cases.  After SPARK-15927, it is far more 
random.  Reliable stage numbering would be helpful for internal unit tests, and 
also for any client code which uses {{SparkListener}} to monitor a job and 
gauge progress.

FWIW, I had never even realized that the order was non-deterministic before, 
and have written plenty of code which assumes some stage numbering.  I expect 
users may be bitten by this too.  We might even want to try to restore the 
"usual" ordering from before SPARK-15927.

Finally, it would be nice to restore some of the tests turned off here if 
possible: https://github.com/apache/spark/pull/13688
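
As an illustration of the client code the description has in mind (a sketch, 
not taken from Spark's test suite): a SparkListener that records stage ids, 
registered via sc.addSparkListener, will observe ids in an unpredictable order 
if stage numbering is non-deterministic, so any logic keyed on specific stage 
ids becomes unreliable.

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import scala.collection.mutable.ArrayBuffer

// Collects stage ids as stages complete; register with sc.addSparkListener(new StageIdListener).
class StageIdListener extends SparkListener {
  val completedStageIds = ArrayBuffer.empty[Int]
  override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
    completedStageIds += event.stageInfo.stageId
  }
}
{code}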



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332674#comment-15332674
 ] 

Apache Spark commented on SPARK-15782:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13693

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}} the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15974) Create a socket on YARN AM start-up

2016-06-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332637#comment-15332637
 ] 

Marcelo Vanzin commented on SPARK-15974:


bq.  Spark's ApplicationMaster should optionally create a ServerSocket and pass 
it to the Spark user program.

That makes a ton of assumptions about how the user code starts to listen for 
connections. If Spark is to support something like this, there should be some 
other way of telling Spark (or YARN directly) what the port is.

> Create a socket on YARN AM start-up
> ---
>
> Key: SPARK-15974
> URL: https://issues.apache.org/jira/browse/SPARK-15974
> Project: Spark
>  Issue Type: New Feature
>  Components: YARN
>Reporter: Mingyu Kim
>
> YARN provides a way for the ApplicationMaster to register an RPC port so that 
> a client outside the YARN cluster can reach the application for any RPCs, but 
> Spark’s YARN AMs simply register a dummy port number of 0. For Spark 
> programs that start up a server, this makes it hard for the submitter to 
> discover the server port securely. Spark's ApplicationMaster should 
> optionally create a ServerSocket and pass it to the Spark user program. This 
> socket initialization should be disabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15934) Return binary mode in ThriftServer

2016-06-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-15934.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13667
[https://github.com/apache/spark/pull/13667]

> Return binary mode in ThriftServer
> --
>
> Key: SPARK-15934
> URL: https://issues.apache.org/jira/browse/SPARK-15934
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Critical
> Fix For: 2.0.0
>
>
> In the spark-2.0.0 preview, binary mode was turned off (SPARK-15095). 
> This was a greatly irresponsible step, given that binary mode was the default 
> in 1.6.1 and is turned off in 2.0.0.
> Just to describe the magnitude of harm that not fixing this bug would do in 
> my organization:
> * Tableau works only through the Thrift Server and only with the binary 
> format. Tableau would not work with spark-2.0.0 at all!
> * I have a bunch of analysts in my organization with configured SQL clients 
> (DataGrip and Squirrel). I would need to go one by one to change the 
> connection string for them (DataGrip). Squirrel simply does not work with 
> http - some jar hell in my case.
> * Let me not mention all the other stuff which connects to our data 
> infrastructure through the ThriftServer as a gateway. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15776) Type coercion incorrect

2016-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-15776.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13651
[https://github.com/apache/spark/pull/13651]

> Type coercion incorrect
> ---
>
> Key: SPARK-15776
> URL: https://issues.apache.org/jira/browse/SPARK-15776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark based on commit 
> 26c1089c37149061f838129bb53330ded68ff4c9
>Reporter: Weizhong
>Priority: Minor
> Fix For: 2.0.0
>
>
> {code:sql}
> CREATE TABLE cdr (
>   debet_dt  int  ,
>   srv_typ_cdstring   ,
>   b_brnd_cd smallint ,
>   call_dur  int
> )
> ROW FORMAT delimited fields terminated by ','
> STORED AS TEXTFILE;
> {code}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) ELSE 0 
> END): bigint
> Project [debet_dt#16, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) 
> ELSE 0 END)#27L]
> +- Sort [debet_dt#16 ASC], true
>+- Aggregate [debet_dt#16], [debet_dt#16, sum(cast(CASE WHEN srv_typ_cd#18 
> LIKE 0% THEN (cast(call_dur#21 as double) / cast(60 as double)) ELSE cast(0 
> as double) END as bigint)) AS sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur 
> / 60) ELSE 0 END)#27L]
>   +- MetastoreRelation default, cdr
> {noformat}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN b_brnd_cd IN(1) THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS INT))) 
> THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) 
> END): double
> Project [debet_dt#76, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS 
> INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS 
> DOUBLE) END)#87]
> +- Sort [debet_dt#76 ASC], true
>+- Aggregate [debet_dt#76], [debet_dt#76, sum(CASE WHEN cast(b_brnd_cd#80 
> as int) IN (cast(1 as int)) THEN (cast(call_dur#81 as double) / cast(60 as 
> double)) ELSE cast(0 as double) END) AS sum(CASE WHEN (CAST(b_brnd_cd AS INT) 
> IN (CAST(1 AS INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) 
> ELSE CAST(0 AS DOUBLE) END)#87]
>   +- MetastoreRelation default, cdr
> {noformat}
> The only difference between the two queries is the WHEN condition, yet they 
> produce different output column types (one is bigint, the other is double). 
> We need to apply "Division" before "FunctionArgumentConversion", like below:
> {code:java}
> val typeCoercionRules =
> PropagateTypes ::
>   InConversion ::
>   WidenSetOperationTypes ::
>   PromoteStrings ::
>   DecimalPrecision ::
>   BooleanEquality ::
>   StringToIntegralCasts ::
>   Division ::
>   FunctionArgumentConversion ::
>   CaseWhenCoercion ::
>   IfCoercion ::
>   PropagateTypes ::
>   ImplicitTypeCasts ::
>   DateTimeOperations ::
>   Nil
> {code}
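A quick way to see the inconsistency (and to confirm the fix) is to compare the
analyzed result types of the two queries. This is only a sketch against the cdr
table defined above, using the 2.0 {{spark}} session:

{code}
// Sketch: both aggregates should report the same result type once Division
// runs before FunctionArgumentConversion; before the fix the first query
// reports bigint and the second double.
val q1 = spark.sql(
  "SELECT SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END) FROM cdr")
val q2 = spark.sql(
  "SELECT SUM(CASE WHEN b_brnd_cd IN (1) THEN call_dur / 60 ELSE 0 END) FROM cdr")

println(q1.schema.head.dataType)  // expected after the fix: DoubleType
println(q2.schema.head.dataType)  // DoubleType
{code}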



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15776) Type coercion incorrect

2016-06-15 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-15776:

Assignee: Sean Zhong

> Type coercion incorrect
> ---
>
> Key: SPARK-15776
> URL: https://issues.apache.org/jira/browse/SPARK-15776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
> Environment: Spark based on commit 
> 26c1089c37149061f838129bb53330ded68ff4c9
>Reporter: Weizhong
>Assignee: Sean Zhong
>Priority: Minor
> Fix For: 2.0.0
>
>
> {code:sql}
> CREATE TABLE cdr (
>   debet_dt  int  ,
>   srv_typ_cdstring   ,
>   b_brnd_cd smallint ,
>   call_dur  int
> )
> ROW FORMAT delimited fields terminated by ','
> STORED AS TEXTFILE;
> {code}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN srv_typ_cd LIKE '0%' THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) ELSE 0 
> END): bigint
> Project [debet_dt#16, sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur / 60) 
> ELSE 0 END)#27L]
> +- Sort [debet_dt#16 ASC], true
>+- Aggregate [debet_dt#16], [debet_dt#16, sum(cast(CASE WHEN srv_typ_cd#18 
> LIKE 0% THEN (cast(call_dur#21 as double) / cast(60 as double)) ELSE cast(0 
> as double) END as bigint)) AS sum(CASE WHEN srv_typ_cd LIKE 0% THEN (call_dur 
> / 60) ELSE 0 END)#27L]
>   +- MetastoreRelation default, cdr
> {noformat}
> {code:sql}
> SELECT debet_dt,
>SUM(CASE WHEN b_brnd_cd IN(1) THEN call_dur / 60 ELSE 0 END)
> FROM cdr
> GROUP BY debet_dt
> ORDER BY debet_dt;
> {code}
> {noformat}
> == Analyzed Logical Plan ==
> debet_dt: int, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS INT))) 
> THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS DOUBLE) 
> END): double
> Project [debet_dt#76, sum(CASE WHEN (CAST(b_brnd_cd AS INT) IN (CAST(1 AS 
> INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) ELSE CAST(0 AS 
> DOUBLE) END)#87]
> +- Sort [debet_dt#76 ASC], true
>+- Aggregate [debet_dt#76], [debet_dt#76, sum(CASE WHEN cast(b_brnd_cd#80 
> as int) IN (cast(1 as int)) THEN (cast(call_dur#81 as double) / cast(60 as 
> double)) ELSE cast(0 as double) END) AS sum(CASE WHEN (CAST(b_brnd_cd AS INT) 
> IN (CAST(1 AS INT))) THEN (CAST(call_dur AS DOUBLE) / CAST(60 AS DOUBLE)) 
> ELSE CAST(0 AS DOUBLE) END)#87]
>   +- MetastoreRelation default, cdr
> {noformat}
> The only difference between the two queries is the WHEN condition, yet they 
> produce different output column types (one is bigint, the other is double). 
> We need to apply "Division" before "FunctionArgumentConversion", like below:
> {code:java}
> val typeCoercionRules =
> PropagateTypes ::
>   InConversion ::
>   WidenSetOperationTypes ::
>   PromoteStrings ::
>   DecimalPrecision ::
>   BooleanEquality ::
>   StringToIntegralCasts ::
>   Division ::
>   FunctionArgumentConversion ::
>   CaseWhenCoercion ::
>   IfCoercion ::
>   PropagateTypes ::
>   ImplicitTypeCasts ::
>   DateTimeOperations ::
>   Nil
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15319) Fix SparkR doc layout for corr and other DataFrame stats functions

2016-06-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15319:
--
Issue Type: Documentation  (was: Bug)

> Fix SparkR doc layout for corr and other DataFrame stats functions
> --
>
> Key: SPARK-15319
> URL: https://issues.apache.org/jira/browse/SPARK-15319
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Felix Cheung
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15319) Fix SparkR doc layout for corr and other DataFrame stats functions

2016-06-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15319:
--
Affects Version/s: 2.0.0

> Fix SparkR doc layout for corr and other DataFrame stats functions
> --
>
> Key: SPARK-15319
> URL: https://issues.apache.org/jira/browse/SPARK-15319
> Project: Spark
>  Issue Type: Documentation
>  Components: SparkR
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Felix Cheung
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15974) Create a socket on YARN AM start-up

2016-06-15 Thread Mingyu Kim (JIRA)
Mingyu Kim created SPARK-15974:
--

 Summary: Create a socket on YARN AM start-up
 Key: SPARK-15974
 URL: https://issues.apache.org/jira/browse/SPARK-15974
 Project: Spark
  Issue Type: New Feature
  Components: YARN
Reporter: Mingyu Kim


YARN provides a way for the ApplicationMaster to register an RPC port so that a 
client outside the YARN cluster can reach the application for any RPCs, but 
Spark’s YARN AMs simply register a dummy port number of 0. For Spark programs 
that start up a server, this makes it hard for the submitter to securely 
discover the server port. Spark's ApplicationMaster should optionally create a 
ServerSocket and pass it to the Spark user program. This socket initialization 
should be disabled by default.

Some discussion on dev@spark thread: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Utilizing-YARN-AM-RPC-port-field-td17892.html
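A minimal sketch of the idea (not Spark's actual AM code): bind an ephemeral
port up front so its number can be registered with the ResourceManager instead
of the dummy 0, and hand the bound socket to the user program:

{code}
import java.net.{InetAddress, ServerSocket}

// Bind port 0 so the OS picks a free ephemeral port; the chosen port is what
// would be reported in the AM's RPC-port field instead of the dummy value 0.
val amSocket = new ServerSocket(0, 50, InetAddress.getLocalHost)
val rpcPort  = amSocket.getLocalPort

// The user program would receive amSocket (or its port) and serve on it;
// here we only show that the port is now discoverable by the submitter.
println(s"AM server socket bound to ${InetAddress.getLocalHost.getHostName}:$rpcPort")
amSocket.close()
{code}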



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15851) Spark 2.0 does not compile in Windows 7

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332612#comment-15332612
 ] 

Apache Spark commented on SPARK-15851:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13691

> Spark 2.0 does not compile in Windows 7
> ---
>
> Key: SPARK-15851
> URL: https://issues.apache.org/jira/browse/SPARK-15851
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.0.0
> Environment: Windows 7
>Reporter: Alexander Ulanov
>
> Spark does not compile in Windows 7.
> "mvn compile" fails on spark-core due to trying to execute a bash script 
> spark-build-info.
> Workaround:
> 1) Install win-bash and put it on the PATH
> 2) Change line 350 of core/pom.xml
> 
>   
>   
>   
> 
> Error trace:
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-antrun-plugin:1.8:run (default) on project 
> spark-core_2.11: An Ant BuildException has occured: Execute failed: 
> java.io.IOException: Cannot run program 
> "C:\dev\spark\core\..\build\spark-build-info" (in directory 
> "C:\dev\spark\core"): CreateProcess error=193, %1 is not a valid Win32 
> application
> [ERROR] around Ant part ... executable="C:\dev\spark\core/../build/spark-build-info">... @ 4:73 in 
> C:\dev\spark\core\target\antrun\build-main.xml



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests

2016-06-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-15975:
--

 Summary: Improper Popen.wait() return code handling in 
dev/run-tests
 Key: SPARK-15975
 URL: https://issues.apache.org/jira/browse/SPARK-15975
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 1.6.0
Reporter: Josh Rosen
Assignee: Josh Rosen


In dev/run-tests.py there's a line where we effectively do

{code}
retcode = some_popen_instance.wait()
if retcode > 0:
  err
# else do nothing
{code}

but this code is subtly wrong because Popen's return code will be negative if 
the child process was terminated by a signal: 
https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode

We should change this to {{retcode != 0}} so that we properly error out and 
exit due to termination by signal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests

2016-06-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332620#comment-15332620
 ] 

Apache Spark commented on SPARK-15975:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/13692

> Improper Popen.wait() return code handling in dev/run-tests
> ---
>
> Key: SPARK-15975
> URL: https://issues.apache.org/jira/browse/SPARK-15975
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In dev/run-tests.py there's a line where we effectively do
> {code}
> retcode = some_popen_instance.wait()
> if retcode > 0:
>   err
> # else do nothing
> {code}
> but this code is subtly wrong because Popen's return code will be negative 
> if the child process was terminated by a signal: 
> https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode
> We should change this to {{retcode != 0}} so that we properly error out and 
> exit due to termination by signal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15975:


Assignee: Josh Rosen  (was: Apache Spark)

> Improper Popen.wait() return code handling in dev/run-tests
> ---
>
> Key: SPARK-15975
> URL: https://issues.apache.org/jira/browse/SPARK-15975
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> In dev/run-tests.py there's a line where we effectively do
> {code}
> retcode = some_popen_instance.wait()
> if retcode > 0:
>   err
> # else do nothing
> {code}
> but this code is subtly wrong because Popen's return code will be negative 
> if the child process was terminated by a signal: 
> https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode
> We should change this to {{retcode != 0}} so that we properly error out and 
> exit due to termination by signal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15975) Improper Popen.wait() return code handling in dev/run-tests

2016-06-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15975:


Assignee: Apache Spark  (was: Josh Rosen)

> Improper Popen.wait() return code handling in dev/run-tests
> ---
>
> Key: SPARK-15975
> URL: https://issues.apache.org/jira/browse/SPARK-15975
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 1.6.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> In dev/run-tests.py there's a line where we effectively do
> {code}
> retcode = some_popen_instance.wait()
> if retcode > 0:
>   err
> # else do nothing
> {code}
> but this code is subtly wrong because Popen's return code will be negative 
> if the child process was terminated by a signal: 
> https://docs.python.org/2/library/subprocess.html#subprocess.Popen.returncode
> We should change this to {{retcode != 0}} so that we properly error out and 
> exit due to termination by signal.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-15 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332588#comment-15332588
 ] 

Gayathri Murali commented on SPARK-15930:
-

[~yuhaoyan] If you havent already started working on this, I can send the PR. 

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.
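For context, a sketch of the extra pass users currently have to make with
MLlib's RDD-based API; the {{transactions}} RDD is a placeholder for whatever
data the model is trained on:

{code}
import org.apache.spark.mllib.fpm.FPGrowth
import org.apache.spark.rdd.RDD

// transactions: RDD[Array[String]] of itemsets (placeholder data source).
def trainWithCount(transactions: RDD[Array[String]]) = {
  val model = new FPGrowth().setMinSupport(0.2).run(transactions)
  // Today the record count has to be recomputed with a second pass over the
  // data, even though training already scanned it; the request here is to
  // expose that count as a property of the model instead.
  val numRecords = transactions.count()
  (model, numRecords)
}
{code}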



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3847) Enum.hashCode is only consistent within the same JVM

2016-06-15 Thread Brett Stime (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332587#comment-15332587
 ] 

Brett Stime commented on SPARK-3847:


Another option: have the default behavior be 'safe' and not share hashCodes 
between JVMs. If passing hashCodes really does significantly improve 
performance (when used outside of arrays and enums), there could be a special 
configuration setting to enable inter-JVM hashCodes. E.g., something like 
spark.shuffle.i_solemnly_swear_my_keys_have_consistent_hashes, which could be 
set to true to enable the performant behavior. This would provide discoverable 
documentation of the issue and make it relatively easy to compare and test 
results between the two modes.

> Enum.hashCode is only consistent within the same JVM
> 
>
> Key: SPARK-3847
> URL: https://issues.apache.org/jira/browse/SPARK-3847
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Oracle JDK 7u51 64bit on Ubuntu 12.04
>Reporter: Nathan Bijnens
>  Labels: enum
>
> When using Java enums as keys in some operations, the results will be very 
> unexpected. The issue is that Java's Enum.hashCode returns the memory 
> position, which is different on each JVM. 
> {code}
> messages.filter(_.getHeader.getKind == Kind.EVENT).count
> >> 503650
> val tmp = messages.filter(_.getHeader.getKind == Kind.EVENT)
> tmp.map(_.getHeader.getKind).countByValue
> >> Map(EVENT -> 1389)
> {code}
> Because it's actually a JVM issue, we should either reject enums as keys with 
> an error or implement a workaround.
> A good writeup of the issue can be found here (and a workaround):
> http://dev.bizo.com/2014/02/beware-enums-in-spark.html
> Somewhat more on the hash codes and Enum's:
> https://stackoverflow.com/questions/4885095/what-is-the-reason-behind-enum-hashcode
> And some issues (most of them rejected) at the Oracle Bug Java database:
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=8050217
> - http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7190798
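A sketch of one such workaround, using the {{messages}} and {{Kind}} objects
from the example above: key by a stable representation of the enum (its name)
rather than the enum value itself, so the hash is consistent across JVMs.

{code}
// Unsafe: Kind's hashCode is identity-based and differs per JVM, so shuffles
// that hash the enum itself can split one logical key across partitions.
// messages.map(m => (m.getHeader.getKind, 1)).reduceByKey(_ + _)

// Safer: key by the enum's name (or ordinal), whose hashCode is stable.
val countsByKind = messages
  .map(m => (m.getHeader.getKind.name, 1))
  .reduceByKey(_ + _)
{code}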



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15447) Performance test for ALS in Spark 2.0

2016-06-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332578#comment-15332578
 ] 

Reynold Xin commented on SPARK-15447:
-

We can close this one now, can't we?


> Performance test for ALS in Spark 2.0
> -
>
> Key: SPARK-15447
> URL: https://issues.apache.org/jira/browse/SPARK-15447
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Nick Pentreath
>Priority: Critical
>  Labels: QA
>
> We made several changes to ALS in 2.0. It is necessary to run some tests to 
> avoid performance regression. We should test (synthetic) datasets from 1 
> million ratings to 1 billion ratings.
> cc [~mlnick] [~holdenk] Do you have time to run some large-scale performance 
> tests?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET

2016-06-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-15901:
---
Assignee: Xiao Li

> Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
> --
>
> Key: SPARK-15901
> URL: https://issues.apache.org/jira/browse/SPARK-15901
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.0.0
>
>
> So far, we do not have test cases verifying that the external parameters 
> CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly when users 
> use non-default values. Add test cases to avoid regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15901) Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET

2016-06-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-15901.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13622
[https://github.com/apache/spark/pull/13622]

> Test Cases for CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET
> --
>
> Key: SPARK-15901
> URL: https://issues.apache.org/jira/browse/SPARK-15901
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
> Fix For: 2.0.0
>
>
> So far, we do not have test cases verifying that the external parameters 
> CONVERT_METASTORE_ORC and CONVERT_METASTORE_PARQUET work properly when users 
> use non-default values. Add test cases to avoid regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12173) Consider supporting DataSet API in SparkR

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12173.
---
Resolution: Won't Fix

Marking as won't fix. R doesn't have compile-time type safety, and as a result 
it doesn't really make sense to have the type-safe Dataset API.


> Consider supporting DataSet API in SparkR
> -
>
> Key: SPARK-12173
> URL: https://issues.apache.org/jira/browse/SPARK-12173
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Felix Cheung
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15782) --packages doesn't work with the spark-shell

2016-06-15 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-15782.

   Resolution: Fixed
 Assignee: Nezih Yigitbasi
Fix Version/s: 2.0.0

> --packages doesn't work with the spark-shell
> 
>
> Key: SPARK-15782
> URL: https://issues.apache.org/jira/browse/SPARK-15782
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0
>Reporter: Nezih Yigitbasi
>Assignee: Nezih Yigitbasi
>Priority: Blocker
> Fix For: 2.0.0
>
>
> When {{--packages}} is specified with {{spark-shell}}, the classes from those 
> packages cannot be found, which I think is due to some of the changes in 
> {{SPARK-12343}}. In particular, {{SPARK-12343}} removes a line that sets the 
> {{spark.jars}} system property in client mode, which is used by the repl main 
> class to set the classpath.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14953) LocalBackend should revive offers periodically

2016-06-15 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332560#comment-15332560
 ] 

Cheng Lian commented on SPARK-14953:


Marked this as won't fix since this isn't causing any actual problems right 
now. We hit this issue because there was a locality bug in an intermediate 
version of [PR #12527|https://github.com/apache/spark/pull/12527]. See 
[here|https://github.com/apache/spark/pull/12527#issuecomment-213034425].

> LocalBackend should revive offers periodically
> --
>
> Key: SPARK-14953
> URL: https://issues.apache.org/jira/browse/SPARK-14953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{LocalBackend}} only revives offers when tasks are submitted, succeed, or 
> fail. This may lead to deadlock due to delayed scheduling. A case study is 
> provided in [this PR 
> comment|https://github.com/apache/spark/pull/12527#issuecomment-213034425].
> Basically, a job may have a task whose scheduling is delayed due to a 
> locality mismatch. The default delay timeout is 3s. If all other tasks finish 
> during this period, {{LocalBackend}} won't revive any offers after the 
> timeout, since no tasks are submitted, succeed, or fail after that point. 
> Thus, the delayed task will never be scheduled again and the job never 
> completes.
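A minimal sketch of what periodic revival could look like (not Spark's actual
code; {{reviveOffers}} stands in for the backend's existing revive call):

{code}
import java.util.concurrent.{Executors, TimeUnit}

// Revive offers on a fixed interval so a locality-delayed task is re-offered
// even when no task submission/success/failure event arrives to trigger it.
class PeriodicReviver(reviveOffers: () => Unit, intervalMs: Long = 1000L) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    scheduler.scheduleAtFixedRate(
      new Runnable { override def run(): Unit = reviveOffers() },
      intervalMs, intervalMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = scheduler.shutdownNow()
}
{code}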



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15930) Add Row count property to FPGrowth model

2016-06-15 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332559#comment-15332559
 ] 

Joseph K. Bradley commented on SPARK-15930:
---

This seems like a reasonable field to add to almost every model (perhaps in a 
model summary if one exists).

> Add Row count property to FPGrowth model
> 
>
> Key: SPARK-15930
> URL: https://issues.apache.org/jira/browse/SPARK-15930
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.1
>Reporter: John Aherne
>Priority: Minor
>  Labels: fp-growth, mllib
>
> Add a row count property to MLlib's FPGrowth model. 
> When using the model from FPGrowth, a count of the total number of records is 
> often necessary. 
> It appears that the function already calculates that value when training the 
> model, so it would save time not having to do it again outside the model. 
> Sorry if this is the wrong place for this kind of stuff. I am new to Jira, 
> Spark, and making suggestions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14953) LocalBackend should revive offers periodically

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14953:

Target Version/s:   (was: 2.0.0)

> LocalBackend should revive offers periodically
> --
>
> Key: SPARK-14953
> URL: https://issues.apache.org/jira/browse/SPARK-14953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{LocalBackend}} only revives offers when tasks are submitted, succeed, or 
> fail. This may lead to deadlock due to delayed scheduling. A case study is 
> provided in [this PR 
> comment|https://github.com/apache/spark/pull/12527#issuecomment-213034425].
> Basically, a job may have a task whose scheduling is delayed due to a 
> locality mismatch. The default delay timeout is 3s. If all other tasks finish 
> during this period, {{LocalBackend}} won't revive any offers after the 
> timeout, since no tasks are submitted, succeed, or fail after that point. 
> Thus, the delayed task will never be scheduled again and the job never 
> completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14953) LocalBackend should revive offers periodically

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-14953:

Affects Version/s: 2.0.0

> LocalBackend should revive offers periodically
> --
>
> Key: SPARK-14953
> URL: https://issues.apache.org/jira/browse/SPARK-14953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{LocalBackend}} only revives offers when tasks are submitted, succeed, or 
> fail. This may lead to deadlock due to delayed scheduling. A case study is 
> provided in [this PR 
> comment|https://github.com/apache/spark/pull/12527#issuecomment-213034425].
> Basically, a job may have a task whose scheduling is delayed due to a 
> locality mismatch. The default delay timeout is 3s. If all other tasks finish 
> during this period, {{LocalBackend}} won't revive any offers after the 
> timeout, since no tasks are submitted, succeed, or fail after that point. 
> Thus, the delayed task will never be scheduled again and the job never 
> completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14953) LocalBackend should revive offers periodically

2016-06-15 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-14953.

Resolution: Won't Fix

> LocalBackend should revive offers periodically
> --
>
> Key: SPARK-14953
> URL: https://issues.apache.org/jira/browse/SPARK-14953
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> {{LocalBackend}} only revives offers when tasks are submitted, succeed, or 
> fail. This may lead to deadlock due to delayed scheduling. A case study is 
> provided in [this PR 
> comment|https://github.com/apache/spark/pull/12527#issuecomment-213034425].
> Basically, a job may have a task whose scheduling is delayed due to a 
> locality mismatch. The default delay timeout is 3s. If all other tasks finish 
> during this period, {{LocalBackend}} won't revive any offers after the 
> timeout, since no tasks are submitted, succeed, or fail after that point. 
> Thus, the delayed task will never be scheduled again and the job never 
> completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15715) Altering partition storage information doesn't work in Hive

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-15715.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Altering partition storage information doesn't work in Hive
> ---
>
> Key: SPARK-15715
> URL: https://issues.apache.org/jira/browse/SPARK-15715
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 2.0.0
>
>
> In HiveClientImpl
> {code}
>   private def toHivePartition(
>   p: CatalogTablePartition,
>   ht: HiveTable): HivePartition = {
> new HivePartition(ht, p.spec.asJava, p.storage.locationUri.map { l => new 
> Path(l) }.orNull)
>   }
> {code}
> Other than the location, we don't even store any of the storage information 
> in the metastore: output format, input format, serde, serde props. The result 
> is that doing something like the following doesn't actually do anything:
> {code}
> ALTER TABLE boxes PARTITION (width=3)
> SET SERDE 'com.sparkbricks.serde.ColumnarSerDe'
> WITH SERDEPROPERTIES ('compress'='true')
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12978) Skip unnecessary final group-by when input data already clustered with group-by keys

2016-06-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12978:

Target Version/s: 2.1.0  (was: 2.0.0)

> Skip unnecessary final group-by when input data already clustered with 
> group-by keys
> 
>
> Key: SPARK-12978
> URL: https://issues.apache.org/jira/browse/SPARK-12978
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>
> This ticket targets an optimization that skips the unnecessary final 
> group-by operation, as shown below:
> Without opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Final,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Partial,isDistinct=false),(avg(col2#161),mode=Partial,isDistinct=false)],
>  output=[col0#159,sum#200,sum#201,count#202L])
>+- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], 
> InMemoryRelation [col0#159,col1#160,col2#161], true, 1, 
> StorageLevel(true, true, false, true, 1), ConvertToUnsafe, None
> {code}
> With opt.:
> {code}
> == Physical Plan ==
> TungstenAggregate(key=[col0#159], 
> functions=[(sum(col1#160),mode=Complete,isDistinct=false),(avg(col2#161),mode=Final,isDistinct=false)],
>  output=[col0#159,sum(col1)#177,avg(col2)#178])
> +- TungstenExchange hashpartitioning(col0#159,200), None
>   +- InMemoryColumnarTableScan [col0#159,col1#160,col2#161], InMemoryRelation 
> [col0#159,col1#160,col2#161], true, 1, StorageLevel(true, true, false, 
> true, 1), ConvertToUnsafe, None
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


