[jira] [Assigned] (SPARK-10331) Update user guide to address minor comments during code review

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10331:


Assignee: Apache Spark  (was: Xiangrui Meng)

> Update user guide to address minor comments during code review
> --
>
> Key: SPARK-10331
> URL: https://issues.apache.org/jira/browse/SPARK-10331
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Clean up the user guides to address some minor comments in:
> https://github.com/apache/spark/pull/8304
> https://github.com/apache/spark/pull/8487






[jira] [Assigned] (SPARK-10331) Update user guide to address minor comments during code review

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10331:


Assignee: Xiangrui Meng  (was: Apache Spark)

> Update user guide to address minor comments during code review
> --
>
> Key: SPARK-10331
> URL: https://issues.apache.org/jira/browse/SPARK-10331
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Clean up the user guides to address some minor comments in:
> https://github.com/apache/spark/pull/8304
> https://github.com/apache/spark/pull/8487






[jira] [Commented] (SPARK-10331) Update user guide to address minor comments during code review

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721007#comment-14721007
 ] 

Apache Spark commented on SPARK-10331:
--

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/8518

> Update user guide to address minor comments during code review
> --
>
> Key: SPARK-10331
> URL: https://issues.apache.org/jira/browse/SPARK-10331
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Clean up the user guides to address some minor comments in:
> https://github.com/apache/spark/pull/8304
> https://github.com/apache/spark/pull/8487






[jira] [Updated] (SPARK-10175) Enhance spark doap file

2015-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10175:
--
Shepherd:   (was: Matei Zaharia)
Assignee: Sean Owen
Target Version/s: 1.5.0
Priority: Minor  (was: Major)

No problem, I can get this one in as I am familiar with updating the site in 
SVN.

> Enhance spark doap file
> ---
>
> Key: SPARK-10175
> URL: https://issues.apache.org/jira/browse/SPARK-10175
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Sean Owen
>Priority: Minor
> Attachments: SPARK-10175
>
>
> The Spark doap file has broken links and is also missing entries for the issue 
> tracker and mailing lists. This affects the listing on projects.apache.org and 
> also on the main Apache website.






[jira] [Resolved] (SPARK-10175) Enhance spark doap file

2015-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10175.
---
   Resolution: Fixed
 Assignee: Luciano Resende  (was: Sean Owen)
Fix Version/s: 1.5.0

Fixed in SVN revision 1698445. I'll call this fixed for 1.5.0 even though it's 
not part of the project's source release per se.

> Enhance spark doap file
> ---
>
> Key: SPARK-10175
> URL: https://issues.apache.org/jira/browse/SPARK-10175
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Reporter: Luciano Resende
>Assignee: Luciano Resende
>Priority: Minor
> Fix For: 1.5.0
>
> Attachments: SPARK-10175
>
>
> The Spark doap file has broken links and is also missing entries for the issue 
> tracker and mailing lists. This affects the listing on projects.apache.org and 
> also on the main Apache website.






[jira] [Created] (SPARK-10349) OneVsRest use "when ... otherwise" not UDF to generate new label at binary reduction

2015-08-29 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-10349:
---

 Summary: OneVsRest use "when ... otherwise" not UDF to generate 
new label at binary reduction  
 Key: SPARK-10349
 URL: https://issues.apache.org/jira/browse/SPARK-10349
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


Currently OneVsRest uses a UDF to generate the new binary label during training.
Considering that SPARK-7321 has been merged, we can use "when ... otherwise".
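For illustration, here is a minimal sketch of the two approaches (spark-shell
style, 1.5-era API; the toy DataFrame, column names, and class index are
assumptions for the example, not code from any patch):
{code}
import org.apache.spark.sql.functions.{col, lit, udf, when}

// Toy stand-in for the training data.
val df = sqlContext.createDataFrame(Seq((0.0, 1.0), (1.0, 2.0), (2.0, 3.0)))
  .toDF("label", "feature")
val classIndex = 1.0 // the positive class of the current one-vs-rest round

// Current approach: a Scala UDF, which pays per-row (de)serialization cost.
val toBinary = udf { label: Double => if (label == classIndex) 1.0 else 0.0 }
val viaUdf = df.withColumn("binaryLabel", toBinary(col("label")))

// Proposed approach: when/otherwise is a native Column expression that the
// optimizer can see through, avoiding the UDF call overhead.
val viaWhen = df.withColumn("binaryLabel",
  when(col("label") === classIndex, lit(1.0)).otherwise(lit(0.0)))
{code}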






[jira] [Assigned] (SPARK-10349) OneVsRest use "when ... otherwise" not UDF to generate new label at binary reduction

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10349:


Assignee: (was: Apache Spark)

> OneVsRest use "when ... otherwise" not UDF to generate new label at binary 
> reduction  
> --
>
> Key: SPARK-10349
> URL: https://issues.apache.org/jira/browse/SPARK-10349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently OneVsRest uses a UDF to generate the new binary label during training.
> Considering that SPARK-7321 has been merged, we can use "when ... otherwise".






[jira] [Commented] (SPARK-10349) OneVsRest use "when ... otherwise" not UDF to generate new label at binary reduction

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721026#comment-14721026
 ] 

Apache Spark commented on SPARK-10349:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8519

> OneVsRest use "when ... otherwise" not UDF to generate new label at binary 
> reduction  
> --
>
> Key: SPARK-10349
> URL: https://issues.apache.org/jira/browse/SPARK-10349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently OneVsRest uses a UDF to generate the new binary label during training.
> Considering that SPARK-7321 has been merged, we can use "when ... otherwise".






[jira] [Assigned] (SPARK-10349) OneVsRest use "when ... otherwise" not UDF to generate new label at binary reduction

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10349:


Assignee: Apache Spark

> OneVsRest use "when ... otherwise" not UDF to generate new label at binary 
> reduction  
> --
>
> Key: SPARK-10349
> URL: https://issues.apache.org/jira/browse/SPARK-10349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Currently OneVsRest uses a UDF to generate the new binary label during training.
> Considering that SPARK-7321 has been merged, we can use "when ... otherwise".






[jira] [Updated] (SPARK-10349) OneVsRest use "when ... otherwise" not UDF to generate new label at binary reduction

2015-08-29 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-10349:

Description: 
Currently OneVsRest uses a UDF to generate the new binary label during training.
Considering that SPARK-7321 has been merged, we can use "when ... otherwise", 
which will be more efficient.

  was:
Currently OneVsRest uses a UDF to generate the new binary label during training.
Considering that SPARK-7321 has been merged, we can use "when ... otherwise".


> OneVsRest use "when ... otherwise" not UDF to generate new label at binary 
> reduction  
> --
>
> Key: SPARK-10349
> URL: https://issues.apache.org/jira/browse/SPARK-10349
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> Currently OneVsRest uses a UDF to generate the new binary label during training.
> Considering that SPARK-7321 has been merged, we can use "when ... otherwise", 
> which will be more efficient.






[jira] [Commented] (SPARK-5992) Locality Sensitive Hashing (LSH) for MLlib

2015-08-29 Thread Maruf Aytekin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721066#comment-14721066
 ] 

Maruf Aytekin commented on SPARK-5992:
--

I have developed a Spark implementation of LSH for Charikar's scheme for 
collections of vectors. It is published here: 
https://github.com/marufaytekin/lsh-spark. The details are documented in the 
Readme.md file. I'd really appreciate it if you could check it out and provide 
feedback.
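For context, a minimal self-contained sketch of Charikar's random-hyperplane
scheme (an illustration of the technique only, not code from the linked
repository):
{code}
import scala.util.Random

// Each random hyperplane contributes one signature bit: the sign of the
// dot product between the vector and the hyperplane's normal.
def simHash(v: Array[Double], hyperplanes: Array[Array[Double]]): String =
  hyperplanes.map { h =>
    val dot = v.zip(h).map { case (a, b) => a * b }.sum
    if (dot >= 0) '1' else '0'
  }.mkString

val rng  = new Random(42)
val dim  = 8   // vector dimensionality (illustrative)
val bits = 16  // signature length (illustrative)
val planes = Array.fill(bits, dim)(rng.nextGaussian())
val sig = simHash(Array.fill(dim)(rng.nextDouble()), planes)
// Vectors with small cosine distance agree on most signature bits, so equal
// sub-signatures can serve as hash-bucket keys for candidate pairs.
{code}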


> Locality Sensitive Hashing (LSH) for MLlib
> --
>
> Key: SPARK-5992
> URL: https://issues.apache.org/jira/browse/SPARK-5992
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> Locality Sensitive Hashing (LSH) would be very useful for ML.  It would be 
> great to discuss some possible algorithms here, choose an API, and make a PR 
> for an initial algorithm.






[jira] [Created] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-10350:
---

 Summary: Fix SQL Programming Guide
 Key: SPARK-10350
 URL: https://issues.apache.org/jira/browse/SPARK-10350
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.5.0
Reporter: Guoqiang Li
Priority: Minor









[jira] [Assigned] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10350:


Assignee: Apache Spark

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721074#comment-14721074
 ] 

Apache Spark commented on SPARK-10350:
--

User 'witgo' has created a pull request for this issue:
https://github.com/apache/spark/pull/8520

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Priority: Minor
>







[jira] [Assigned] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10350:


Assignee: (was: Apache Spark)

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Priority: Minor
>







[jira] [Updated] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10350:
--
Target Version/s:   (was: 1.5.0)

[~gq] this doesn't contain any explanation, and neither does the pull request. 
I think you're familiar with the process for creating JIRAs and PRs in Spark: 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  Could 
I please ask you to write a clear description of the change, or else close 
this? 

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Priority: Minor
>







[jira] [Updated] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-10350:

Description: 
[b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95]
 contains duplicate content: [[spark.sql.parquet.mergeSchema]]

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Priority: Minor
>
> [b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95]
>  contains duplicate content: [[spark.sql.parquet.mergeSchema]]






[jira] [Updated] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-10350:

Description: 
[b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95]
 contains duplicate content:  {{spark.sql.parquet.mergeSchema}}  (was: 
[b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95]
 contains duplicate content: [[spark.sql.parquet.mergeSchema]])

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Priority: Minor
>
> [b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95]
>  contains duplicate content:  {{spark.sql.parquet.mergeSchema}}






[jira] [Updated] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-10350:

Description: 
[b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95#diff-d8aa7a37d17a1227cba38c99f9f22511R1383]
 contains duplicate content:  {{spark.sql.parquet.mergeSchema}}  (was: 
[b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95]
 contains duplicate content:  {{spark.sql.parquet.mergeSchema}})

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Priority: Minor
>
> [b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95#diff-d8aa7a37d17a1227cba38c99f9f22511R1383]
>  contains duplicate content:  {{spark.sql.parquet.mergeSchema}}






[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has fewer fields than a file's schema, data reading will fail

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721187#comment-14721187
 ] 

Apache Spark commented on SPARK-10301:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/8515

> For struct type, if parquet's global schema has fewer fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> When parquet's global schema has fewer fields than the local schema of a 
> file, the data reading path will fail.
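A hedged repro sketch of the failure condition (spark-shell style; the paths,
case classes, and field names are illustrative assumptions, not taken from the
report):
{code}
import sqlContext.implicits._

case class Narrow(f1: Int)
case class Wide(f1: Int, f2: Int)

// Two Parquet outputs whose struct column "s" has different file-local schemas.
sc.parallelize(Seq(Tuple1(Narrow(1)))).toDF("s").write.parquet("/tmp/narrow")
sc.parallelize(Seq(Tuple1(Wide(1, 2)))).toDF("s").write.parquet("/tmp/wide")

// Use the narrower struct as the "global" schema: the second input's local
// schema then has more struct fields than the global one, as described above.
val globalSchema = sqlContext.read.parquet("/tmp/narrow").schema
sqlContext.read.schema(globalSchema).parquet("/tmp/narrow", "/tmp/wide").collect()
{code}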






[jira] [Updated] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2015-08-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10340:
-
Target Version/s: 1.5.0

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose keys start with 
> the common prefix, in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions all together, we can 
> significantly speed up input split calculation using S3 bulk listing. This 
> optimization is particularly useful for queries like {{select * from 
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog 
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] 
> from Qubole on this topic.
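As a rough illustration of the API shape (AWS SDK for Java v1; the bucket and
prefix values are made up):
{code}
import com.amazonaws.services.s3.AmazonS3Client
import com.amazonaws.services.s3.model.ListObjectsRequest
import scala.collection.JavaConverters._

val s3 = new AmazonS3Client() // default credential chain

// One request covers every partition path that shares the common prefix.
val request = new ListObjectsRequest()
  .withBucketName("my-bucket")         // illustrative
  .withPrefix("warehouse/nccp_log/")   // common prefix of all input paths
  .withMaxKeys(1000)                   // S3 caps each response at 1000 keys

val keys = scala.collection.mutable.ArrayBuffer[String]()
var listing = s3.listObjects(request)
keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
while (listing.isTruncated) {          // page through the remaining blocks
  listing = s3.listNextBatchOfObjects(listing)
  keys ++= listing.getObjectSummaries.asScala.map(_.getKey)
}
{code}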






[jira] [Updated] (SPARK-10340) Use S3 bulk listing for S3-backed Hive tables

2015-08-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10340:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> Use S3 bulk listing for S3-backed Hive tables
> -
>
> Key: SPARK-10340
> URL: https://issues.apache.org/jira/browse/SPARK-10340
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input 
> paths as a parameter and returns all the objects whose keys start with 
> the common prefix, in blocks of 1000.
> Since SPARK-9926 allows us to list multiple partitions all together, we can 
> significantly speed up input split calculation using S3 bulk listing. This 
> optimization is particularly useful for queries like {{select * from 
> partitioned_table limit 10}}.
> This is a common optimization for S3. For example, here is a [blog 
> post|http://www.qubole.com/blog/product/optimizing-hadoop-for-s3-part-1/] 
> from Qubole on this topic.






[jira] [Resolved] (SPARK-10350) Fix SQL Programming Guide

2015-08-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10350.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8520
[https://github.com/apache/spark/pull/8520]

> Fix SQL Programming Guide
> -
>
> Key: SPARK-10350
> URL: https://issues.apache.org/jira/browse/SPARK-10350
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Guoqiang Li
>Priority: Minor
> Fix For: 1.5.0
>
>
> [b93d99a|https://github.com/apache/spark/commit/b93d99ae21b8b3af1dd55775f77e5a9ddea48f95#diff-d8aa7a37d17a1227cba38c99f9f22511R1383]
>  contains duplicate content:  {{spark.sql.parquet.mergeSchema}}






[jira] [Updated] (SPARK-10170) Writing from data frame into db2 database using jdbc data source api fails with error for string, and boolean column types.

2015-08-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-10170:
-
Target Version/s: 1.6.0

> Writing from data frame into db2 database using jdbc data source api fails 
> with error for string, and boolean column types.
> ---
>
> Key: SPARK-10170
> URL: https://issues.apache.org/jira/browse/SPARK-10170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Suresh Thalamati
>
> Repro :
> -- start spark shell with classpath set to the db2 jdbc driver. 
> SPARK_CLASSPATH=~/myjars/db2jcc.jar ./spark-shell
>  
> // set connection properties 
> val properties = new java.util.Properties()
> properties.setProperty("user" , "user")
> properties.setProperty("password" , "password")
> // load the driver.
> Class.forName("com.ibm.db2.jcc.DB2Driver").newInstance
> // create data frame with a String type
> val empdf = sc.parallelize( Array((1,"John"), (2,"Mike"))).toDF("id", "name" )
> // write the data frame. This will fail with an error.
> empdf.write.jdbc("jdbc:db2://bdvs150.svl.ibm.com:6/SAMPLE:retrieveMessagesFromServerOnGetMessage=true;",
>  "emp_data", properties)
> Error :
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: TEXT
>   at com.ibm.db2.jcc.am.fd.a(fd.java:679)
>   at com.ibm.db2.jcc.am.fd.a(fd.java:60)
> ..
> // create data frame with String , and Boolean types 
> val empdf = sc.parallelize( Array((1,"true".toBoolean ), (2, 
> "false".toBoolean ))).toDF("id", "isManager")
> // write the data frame. This will fail with an error.
> empdf.write.jdbc("jdbc:db2://: 
> /SAMPLE:retrieveMessagesFromServerOnGetMessage=true;", "emp_data", properties)
> Error :
> com.ibm.db2.jcc.am.SqlSyntaxErrorException: TEXT
>   at com.ibm.db2.jcc.am.fd.a(fd.java:679)
>   at com.ibm.db2.jcc.am.fd.a(fd.java:60)
> The write is failing because, by default, the JDBC data source implementation 
> generates a table schema with unsupported data types: TEXT for String and 
> BIT(1) for Boolean. I think the String type should get mapped to CLOB/VARCHAR, 
> and the Boolean type should be mapped to CHAR(1) for the DB2 database.
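Until the default mappings change, one possible workaround is a custom dialect
registered through Spark's JdbcDialects developer API. The sketch below simply
encodes this report's suggested mappings (the VARCHAR length is an arbitrary
example):
{code}
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types._

object DB2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  // Replace the default TEXT / BIT(1) mappings with DB2-friendly types.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("VARCHAR(255)", Types.VARCHAR)) // or CLOB
    case BooleanType => Some(JdbcType("CHAR(1)", Types.CHAR))
    case _           => None // fall back to the built-in mappings
  }
}

JdbcDialects.registerDialect(DB2Dialect)
{code}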






[jira] [Resolved] (SPARK-10344) Add tests for extraStrategies

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-10344.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8516
[https://github.com/apache/spark/pull/8516]

> Add tests for extraStrategies
> -
>
> Key: SPARK-10344
> URL: https://issues.apache.org/jira/browse/SPARK-10344
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
> Fix For: 1.5.0
>
>







[jira] [Resolved] (SPARK-10226) Error occurred in SparkSQL when using !=

2015-08-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10226.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8420
[https://github.com/apache/spark/pull/8420]

> Error occurred in SparkSQL when using !=
> 
>
> Key: SPARK-10226
> URL: https://issues.apache.org/jira/browse/SPARK-10226
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: wangwei
> Fix For: 1.5.0
>
>
> DataSource:  
> src/main/resources/kv1.txt
> SQL: 
>   1. create table src(id string, name string);
>   2. load data local inpath 
> '${SparkHome}/examples/src/main/resources/kv1.txt' into table src;
>   3. select count(*) from src where id != '0';
> [ERROR] Could not expand event
> java.lang.IllegalArgumentException: != 0;: event not found
>   at jline.console.ConsoleReader.expandEvents(ConsoleReader.java:779)
>   at jline.console.ConsoleReader.finishBuffer(ConsoleReader.java:631)
>   at jline.console.ConsoleReader.accept(ConsoleReader.java:2019)
>   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2666)
>   at jline.console.ConsoleReader.readLine(ConsoleReader.java:2269)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:231)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:601)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:666)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:178)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:203)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:118)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
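The exception comes from jline's bash-style history ("event") expansion
treating the leading {{!}} specially in interactive input, before the query
ever reaches the SQL parser. As a hedged workaround sketch until the CLI is
fixed, the same predicate can be spelled with {{<>}}, or the query can be run
programmatically so jline is never involved:
{code}
// Equivalent predicate with <>, run through the API rather than the CLI.
val result = sqlContext.sql("select count(*) from src where id <> '0'")
result.show()
{code}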






[jira] [Commented] (SPARK-10330) Use SparkHadoopUtil TaskAttemptContext reflection methods in more places

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721275#comment-14721275
 ] 

Apache Spark commented on SPARK-10330:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8521

> Use SparkHadoopUtil TaskAttemptContext reflection methods in more places
> 
>
> Key: SPARK-10330
> URL: https://issues.apache.org/jira/browse/SPARK-10330
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>
> SparkHadoopUtil contains methods that use reflection to work around 
> TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We 
> should use these methods in more places.
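For readers unfamiliar with the problem: {{TaskAttemptContext}} is a class in
Hadoop 1.x but an interface in 2.x, so code compiled against one version can
fail at runtime against the other. A minimal sketch of the reflection pattern
(the helper name here is illustrative; the real versions live in
{{SparkHadoopUtil}}):
{code}
import org.apache.hadoop.conf.Configuration

// Resolve getConfiguration at runtime so the same bytecode works whether
// TaskAttemptContext is the Hadoop 1.x class or the Hadoop 2.x interface.
def getConfiguration(context: AnyRef): Configuration = {
  val method = context.getClass.getMethod("getConfiguration")
  method.invoke(context).asInstanceOf[Configuration]
}
{code}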






[jira] [Updated] (SPARK-9926) Parallelize file listing for partitioned Hive table

2015-08-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9926:
---
Assignee: Cheolsoo Park

> Parallelize file listing for partitioned Hive table
> ---
>
> Key: SPARK-9926
> URL: https://issues.apache.org/jira/browse/SPARK-9926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Cheolsoo Park
>Assignee: Cheolsoo Park
>
> In Spark SQL, short queries like {{select * from table limit 10}} run very 
> slowly against partitioned Hive tables because of file listing. In 
> particular, if a large number of partitions are scanned on storage like S3, 
> the queries run extremely slowly. Here are some example benchmarks in my 
> environment-
> * Parquet-backed Hive table
> * Partitioned by dateint and hour
> * Stored on S3
> ||# of partitions||# of files||runtime||query||
> |1|972|30 secs|select * from nccp_log where dateint=20150601 and hour=0 limit 10;|
> |24|13646|6 mins|select * from nccp_log where dateint=20150601 limit 10;|
> |240|136222|1 hour|select * from nccp_log where dateint>=20150601 and dateint<=20150610 limit 10;|
> The problem is that {{TableReader}} constructs a separate HadoopRDD per Hive 
> partition path and groups them into a UnionRDD. Then, all the input files are 
> listed sequentially. In other tools such as Hive and Pig, this can be solved 
> by setting 
> [mapreduce.input.fileinputformat.list-status.num-threads|https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml]
>  high. But in Spark, since each HadoopRDD lists only one partition path, 
> setting this property doesn't help.
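For reference, this is how that Hadoop property is set from a Spark job (a
sketch only; as the description notes, it does not help Spark here, because
each HadoopRDD lists a single path):
{code}
// FileInputFormat parallelizes listStatus across this many threads, but only
// within one input spec, so an RDD with a single path sees no benefit.
sc.hadoopConfiguration.setInt(
  "mapreduce.input.fileinputformat.list-status.num-threads", 32)
{code}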






[jira] [Created] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap memory

2015-08-29 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-10351:
-

 Summary: UnsafeRow.getUTF8String should handle off-heap memory
 Key: SPARK-10351
 URL: https://issues.apache.org/jira/browse/SPARK-10351
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Feynman Liang
Priority: Critical


{{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which does 
not handle off-heap memory correctly. 






[jira] [Commented] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap memory

2015-08-29 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721286#comment-14721286
 ] 

Feynman Liang commented on SPARK-10351:
---

I'm working on a PR to make my use case work.

[~rxin] is this a bug or actually intended behavior (and I'm just not 
interpreting correctly)?

> UnsafeRow.getUTF8String should handle off-heap memory
> -
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> does not handle off-heap memory correctly. 






[jira] [Updated] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap memory

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10351:
--
Description: {{UnsafeRow.getUTF8String}} delegates to 
{{UTF8String.fromAddress}} which returns {{null}} when passed a {{null}} base 
object, failing to handle off-heap memory correctly.   (was: 
{{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which does 
not handle off-heap memory correctly. )

> UnsafeRow.getUTF8String should handle off-heap memory
> -
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap memory correctly. 






[jira] [Updated] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap memory

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10351:
--
Description: 
{{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
returns {{null}} when passed a {{null}} base object, failing to handle off-heap 
memory correctly.

This will also cause a {{NullPointerException}} when {{getString}} is called 
with off-heap storage.

  was:{{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
returns {{null}} when passed a {{null}} base object, failing to handle off-heap 
memory correctly. 


> UnsafeRow.getUTF8String should handle off-heap memory
> -
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap memory correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.
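A hedged illustration of the failure mode described above (the three-argument
shape of {{fromAddress}} follows this description; the address value is a
stand-in):
{code}
import org.apache.spark.unsafe.types.UTF8String

// With on-heap rows the base object is the backing byte array; with off-heap
// rows the base object is null and the offset is an absolute address. Today
// the latter case yields null instead of the string bytes.
val offHeapAddress: Long = 0L // stand-in for a real off-heap address
val s = UTF8String.fromAddress(null, offHeapAddress, 3)
// s == null here, which later surfaces as the NullPointerException in callers
// such as getString.
{code}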






[jira] [Comment Edited] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap memory

2015-08-29 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721286#comment-14721286
 ] 

Feynman Liang edited comment on SPARK-10351 at 8/29/15 11:12 PM:
-

I'm working on a PR to fix this.

[~rxin] is this a bug or actually intended behavior (and I'm just not 
interpreting correctly)?


was (Author: fliang):
I'm working on a PR to make my use case work.

[~rxin] is this a bug or actually intended behavior (and I'm just not 
interpreting correctly)?

> UnsafeRow.getUTF8String should handle off-heap memory
> -
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap memory correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.






[jira] [Created] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-10352:
-

 Summary: BaseGenericInternalRow.getUTF8String should support 
java.lang.String
 Key: SPARK-10352
 URL: https://issues.apache.org/jira/browse/SPARK-10352
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Feynman Liang


Running the code:
{{
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
}}
generates the error:
{{[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***}}
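A hedged workaround sketch for the snippet above: convert the value up front so
the internal-row contract (StringType values are stored as {{UTF8String}})
holds. The class names are taken from the snippet; only
{{UTF8String.fromString}} is new here:
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// Wrapping the java.lang.String in UTF8String first avoids the cast failure.
val row = InternalRow.apply(UTF8String.fromString("abc"))
val unsafeRow = UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}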






[jira] [Commented] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721289#comment-14721289
 ] 

Feynman Liang commented on SPARK-10352:
---

Working on a PR.

[~rxin] can you confirm that this is a bug?

> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {{
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> }}
> generates the error:
> {{[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***}}






[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}

  was:
Running the code:
{code scala}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}






[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{/code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{/code}

  was:
Running the code:
{{code}}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{{/code}}
generates the error:
{{code}}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{{/code}}


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {/code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {/code}






[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{code scala}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}

  was:
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{/code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{/code}


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code scala}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}






[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{{code}}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{{/code}}
generates the error:
{{code}}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{{/code}}

  was:
Running the code:
{{
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
}}
generates the error:
{{[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***}}


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {{code}}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {{/code}}
> generates the error:
> {{code}}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {{/code}}






[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{code:scala}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}

  was:
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code:scala}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}






[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}

  was:
Running the code:
{code:scala}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}






[jira] [Assigned] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10352:


Assignee: Apache Spark

> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Assignee: Apache Spark
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}






[jira] [Assigned] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10352:


Assignee: (was: Apache Spark)

> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}






[jira] [Commented] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721291#comment-14721291
 ] 

Apache Spark commented on SPARK-10352:
--

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8522

> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}

Although `StringType` should in theory only have internal type `UTF8String`, we 
[are inconsistent with this 
constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
 and being more strict would [break existing 
code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
 

  was:
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}
> Although `StringType` should in theory only have internal type `UTF8String`, 
> we [are inconsistent with this 
> constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
>  and being more strict would [break existing 
> code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10334.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8515
[https://github.com/apache/spark/pull/8515]

> Partitioned table scan's query plan does not show Filter and Project on top 
> of the table scan
> -
>
> Key: SPARK-10334
> URL: https://issues.apache.org/jira/browse/SPARK-10334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.5.0
>
>
> {code}
> Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
> "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
> val df1 = 
> sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
> df1.selectExpr("hash(i)", "hash(j)").show
> df1.filter("hash(j) = 1").explain
> == Physical Plan ==
> Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
> {code}
> Looks like the reason is that we correctly apply the project and filter. 
> Then, we create an RDD for the result and manually create a PhysicalRDD, 
> so the Project and Filter on top of the original table scan disappear from 
> the physical plan.
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175
> We will not generate wrong results, but the query plan is confusing.
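
For comparison, the shape one would expect the physical plan to take is roughly 
the following (a sketch of the intended output only, not actual {{explain}} 
output):

{code}
== Physical Plan ==
Filter (hash(j#20) = 1)
 Project [j#20,i#21]
  Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
{code}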



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10339) When scanning a partitioned table having thousands of partitions, Driver has a very high memory pressure because of SQL metrics

2015-08-29 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-10339.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 8515
[https://github.com/apache/spark/pull/8515]

> When scanning a partitioned table having thousands of partitions, Driver has 
> a very high memory pressure because of SQL metrics
> ---
>
> Key: SPARK-10339
> URL: https://issues.apache.org/jira/browse/SPARK-10339
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Blocker
> Fix For: 1.5.0
>
>
> I have a local dataset with 5000 partitions stored in {{/tmp/partitioned}}. 
> When I run the following code, the free memory space in the driver's old gen 
> gradually decreases until there is pretty much no free space left in the 
> driver's old gen. Finally, all kinds of timeouts happen and the cluster dies.
> {code}
> val df = sqlContext.read.format("parquet").load("/tmp/partitioned")
> df.filter("a > -100").selectExpr("hash(a, b)").queryExecution.toRdd.foreach(_ 
> => Unit)
> {code}
> I did a quick test by deleting the SQL metrics from the project and filter 
> operators, and my job works fine.
> The reason is that for a partitioned table, when we scan it, the actual plan 
> is like
> {code}
>other operators
>|
>|
> /--|--\
>/   |   \
>   /|\
>  / | \
> project  project ... project
>   ||   |
> filter   filter  ... filter
>   ||   |
> part1part2   ... part n
> {code}
> We create SQL metrics for every filter and project, which causes extremely 
> high memory pressure on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10352) BaseGenericInternalRow.getUTF8String should support java.lang.String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Description: 
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}

Although {{StringType}} should in theory only have internal type 
{{UTF8String}}, we [are inconsistent with this 
constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
 and being more strict would [break existing 
code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
 

  was:
Running the code:
{code}
val inputString = "abc"
val row = InternalRow.apply(inputString)
val unsafeRow = 
UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}
generates the error:
{code}
[info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
org.apache.spark.unsafe.types.UTF8String
[info]   at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
***snip***
{code}

Although `StringType` should in theory only have internal type `UTF8String`, we 
[are inconsistent with this 
constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
 and being more strict would [break existing 
code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
 


> BaseGenericInternalRow.getUTF8String should support java.lang.String
> 
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}
> Although {{StringType}} should in theory only have internal type 
> {{UTF8String}}, we [are inconsistent with this 
> constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
>  and being more strict would [break existing 
> code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap memory

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10351:
--
Description: 
{{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
returns {{null}} when passed a {{null}} base object, failing to handle off-heap 
backed {{UnsafeRow}}s correctly.

This will also cause a {{NullPointerException}} when {{getString}} is called 
with off-heap storage.

  was:
{{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
returns {{null}} when passed a {{null}} base object, failing to handle off-heap 
memory correctly.

This will also cause a {{NullPointerException}} when {{getString}} is called 
with off-heap storage.


> UnsafeRow.getUTF8String should handle off-heap memory
> -
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.
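
A minimal sketch of the failure mode (assuming the {{UTF8String.fromAddress}} 
behavior described above; {{offHeapAddress}} is a hypothetical placeholder for 
an off-heap address):

{code}
import org.apache.spark.unsafe.types.UTF8String

val offHeapAddress: Long = ???  // hypothetical: address of row data in off-heap memory

// With a null base object (the off-heap case), fromAddress returns null, so a
// later call such as getString's UTF8String.toString throws a NullPointerException.
val s: UTF8String = UTF8String.fromAddress(null, offHeapAddress, 3)
assert(s == null)
{code}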



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10351) UnsafeRow.getString should handle off-heap backed UnsafeRow

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10351:
--
Summary: UnsafeRow.getString should handle off-heap backed UnsafeRow  (was: 
UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow)

> UnsafeRow.getString should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10351:
--
Summary: UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow  
(was: UnsafeRow.getUTF8String should handle off-heap memory)

> UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10351) UnsafeRow.getString should handle off-heap backed UnsafeRow

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10351:


Assignee: Apache Spark

> UnsafeRow.getString should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Assignee: Apache Spark
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10351) UnsafeRow.getString should handle off-heap backed UnsafeRow

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10351:


Assignee: (was: Apache Spark)

> UnsafeRow.getString should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10351) UnsafeRow.getString should handle off-heap backed UnsafeRow

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721302#comment-14721302
 ] 

Apache Spark commented on SPARK-10351:
--

User 'feynmanliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/8523

> UnsafeRow.getString should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10352) Replace internal usages of String with UTF8String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Summary: Replace internal usages of String with UTF8String  (was: 
BaseGenericInternalRow.getUTF8String should support java.lang.String)

> Replace internal usages of String with UTF8String
> -
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}
> Although {{StringType}} should in theory only have internal type 
> {{UTF8String}}, we [are inconsistent with this 
> constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
>  and being more strict would [break existing 
> code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10352) Replace SQLTestData internal usages of String with UTF8String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10352:
--
Summary: Replace SQLTestData internal usages of String with UTF8String  
(was: Replace internal usages of String with UTF8String)

> Replace SQLTestData internal usages of String with UTF8String
> -
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}
> Although {{StringType}} should in theory only have internal type 
> {{UTF8String}}, we [are inconsistent with this 
> constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
>  and being more strict would [break existing 
> code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10301:
-
Target Version/s: 1.6.0  (was: 1.5.0)

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> When parquet's global schema has fewer fields than the local schema of a 
> file, the data reading path will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10334) Partitioned table scan's query plan does not show Filter and Project on top of the table scan

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10334:
-
Target Version/s: 1.5.0  (was: 1.6.0, 1.5.1)

> Partitioned table scan's query plan does not show Filter and Project on top 
> of the table scan
> -
>
> Key: SPARK-10334
> URL: https://issues.apache.org/jira/browse/SPARK-10334
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 1.5.0
>
>
> {code}
> Seq(Tuple2(1, 1), Tuple2(2, 2)).toDF("i", 
> "j").write.format("parquet").partitionBy("i").save("/tmp/testFilter_partitioned")
> val df1 = 
> sqlContext.read.format("parquet").load("/tmp/testFilter_partitioned")
> df1.selectExpr("hash(i)", "hash(j)").show
> df1.filter("hash(j) = 1").explain
> == Physical Plan ==
> Scan ParquetRelation[file:/tmp/testFilter_partitioned][j#20,i#21]
> {code}
> Looks like the reason is that we correctly apply the project and filter. 
> Then, we create an RDD for the result and manually create a PhysicalRDD, 
> so the Project and Filter on top of the original table scan disappear from 
> the physical plan.
> See 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L138-L175
> We will not generate wrong results, but the query plan is confusing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-08-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721312#comment-14721312
 ] 

Yin Huai commented on SPARK-10301:
--

https://github.com/apache/spark/pull/8515 has been merged. It is not the fix 
for this issue, but it gives users a clear error message when the global schema 
has fewer struct fields than a local parquet file's schema (it asks users to 
enable schema merging). I am re-targeting this issue to 1.6 for the proper fix 
(https://github.com/apache/spark/pull/8509).
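
In the meantime, schema merging can be enabled per read, along these lines (a 
sketch using the Parquet data source's {{mergeSchema}} option):

{code}
// Merging per-file schemas reconciles the global schema with files that have more fields.
val df = sqlContext.read.option("mergeSchema", "true").parquet("/path/to/partitioned/table")
{code}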

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> When parquet's global schema has fewer fields than the local schema of a 
> file, the data reading path will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10301) For struct type, if parquet's global schema has less fields than a file's schema, data reading will fail

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10301:
-
Assignee: Cheng Lian  (was: Yin Huai)

> For struct type, if parquet's global schema has less fields than a file's 
> schema, data reading will fail
> 
>
> Key: SPARK-10301
> URL: https://issues.apache.org/jira/browse/SPARK-10301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Cheng Lian
>Priority: Critical
>
> When parquet's global schema has fewer fields than the local schema of a 
> file, the data reading path will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9514) Add EventHubsReceiver to support Spark Streaming using Azure EventHubs

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9514:

Fix Version/s: (was: 1.5.0)

> Add EventHubsReceiver to support Spark Streaming using Azure EventHubs
> --
>
> Key: SPARK-9514
> URL: https://issues.apache.org/jira/browse/SPARK-9514
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.4.1
>Reporter: shanyu zhao
> Attachments: SPARK-9514.patch
>
>
> We need to add an EventHubsReceiver implementation to support Spark Streaming 
> applications that receive data from Azure EventHubs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9976) create function do not work

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-9976:

Fix Version/s: (was: 1.4.2)
   (was: 1.5.0)

> create function do not work
> ---
>
> Key: SPARK-9976
> URL: https://issues.apache.org/jira/browse/SPARK-9976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
> Environment: spark 1.4.1 yarn 2.2.0
>Reporter: cen yuhai
>
> I use beeline to connect to the ThriftServer, but ADD JAR does not work, so I 
> use CREATE FUNCTION; see the link below.
> http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html
> I do as below:
> {code}
> create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 
> 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; 
> {code}
> It returns OK, and when I connect to the metastore I see records in the table 
> FUNCS.
> {code}
> select gdecodeorder(t1)  from tableX  limit 1;
> {code}
> It returns the error 'Couldn't find function default.gdecodeorder'.
> This is the exception:
> {code}
> 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException 
> as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: 
> java.lang.RuntimeException: Couldn't find function default.gdecodeorder
> 15/08/14 15:04:47 ERROR RetryingHMSHandler: 
> MetaException(message:NoSuchObjectException(message:Function 
> default.t_gdecodeorder does not exist))
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740)
> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
> at com.sun.proxy.$Proxy21.get_function(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721)
> at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
> at com.sun.proxy.$Proxy22.getFunction(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:242)
> at scala.collection.Iterator$$ano

[jira] [Commented] (SPARK-9976) create function do not work

2015-08-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721313#comment-14721313
 ] 

Yin Huai commented on SPARK-9976:
-

Can you try our 1.5 branch and see if ADD JAR in the thrift server works?

> create function do not work
> ---
>
> Key: SPARK-9976
> URL: https://issues.apache.org/jira/browse/SPARK-9976
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
> Environment: spark 1.4.1 yarn 2.2.0
>Reporter: cen yuhai
>
> I use beeline to connect to the ThriftServer, but ADD JAR does not work, so I 
> use CREATE FUNCTION; see the link below.
> http://www.cloudera.com/content/cloudera/en/documentation/core/v5-3-x/topics/cm_mc_hive_udf.html
> I do as below:
> {code}
> create function gdecodeorder as 'com.hive.udf.GOrderDecode' USING JAR 
> 'hdfs://mycluster/user/spark/lib/gorderdecode.jar'; 
> {code}
> It returns OK, and when I connect to the metastore I see records in the table 
> FUNCS.
> {code}
> select gdecodeorder(t1)  from tableX  limit 1;
> {code}
> It returns the error 'Couldn't find function default.gdecodeorder'.
> This is the exception:
> {code}
> 15/08/14 14:53:51 ERROR UserGroupInformation: PriviledgedActionException 
> as:xiaoju (auth:SIMPLE) cause:org.apache.hive.service.cli.HiveSQLException: 
> java.lang.RuntimeException: Couldn't find function default.gdecodeorder
> 15/08/14 15:04:47 ERROR RetryingHMSHandler: 
> MetaException(message:NoSuchObjectException(message:Function 
> default.t_gdecodeorder does not exist))
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newMetaException(HiveMetaStore.java:4613)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_function(HiveMetaStore.java:4740)
> at sun.reflect.GeneratedMethodAccessor57.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:105)
> at com.sun.proxy.$Proxy21.get_function(Unknown Source)
> at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getFunction(HiveMetaStoreClient.java:1721)
> at sun.reflect.GeneratedMethodAccessor56.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89)
> at com.sun.proxy.$Proxy22.getFunction(Unknown Source)
> at org.apache.hadoop.hive.ql.metadata.Hive.getFunction(Hive.java:2662)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfoFromMetastore(FunctionRegistry.java:546)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getQualifiedFunctionInfo(FunctionRegistry.java:579)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:645)
> at 
> org.apache.hadoop.hive.ql.exec.FunctionRegistry.getFunctionInfo(FunctionRegistry.java:652)
> at 
> org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUdfs.scala:54)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.org$apache$spark$sql$catalyst$analysis$OverrideFunctionRegistry$$super$lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$$anonfun$lookupFunction$2.apply(FunctionRegistry.scala:44)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.sql.catalyst.analysis.OverrideFunctionRegistry$class.lookupFunction(FunctionRegistry.scala:44)
> at 
> org.apache.spark.sql.hive.HiveContext$$anon$3.lookupFunction(HiveContext.scala:376)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:465)
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$13$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:463)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
> at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221)
> at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNod

[jira] [Updated] (SPARK-10110) StringIndexer lacks of parameter "handleInvalid".

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10110:
-
Fix Version/s: (was: 1.5.0)

> StringIndexer lacks of parameter "handleInvalid".
> -
>
> Key: SPARK-10110
> URL: https://issues.apache.org/jira/browse/SPARK-10110
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Kai Sasaki
>  Labels: ML
>
> Missing API for pyspark {{StringIndexer.handleInvalid}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10110) StringIndexer lacks of parameter "handleInvalid".

2015-08-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721314#comment-14721314
 ] 

Yin Huai commented on SPARK-10110:
--

I am removing the fix version since this field should not be set until the PR 
gets merged.

> StringIndexer lacks of parameter "handleInvalid".
> -
>
> Key: SPARK-10110
> URL: https://issues.apache.org/jira/browse/SPARK-10110
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Kai Sasaki
>  Labels: ML
>
> Missing API for pyspark {{StringIndexer.handleInvalid}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10352) Replace SQLTestData internal usages of String with UTF8String

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang closed SPARK-10352.
-
Resolution: Not A Problem

Caused by my code not respecting that {{InternalRow}} can only contain 
{{UTF8String}}, never {{java.lang.String}}.
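
For reference, a minimal sketch of the corrected usage (assuming a 1.5-era 
build): wrapping the {{String}} in a {{UTF8String}} before constructing the 
{{InternalRow}} avoids the {{ClassCastException}}:

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{DataType, StringType}
import org.apache.spark.unsafe.types.UTF8String

// InternalRow may only hold internal types, so convert the String first.
val row = InternalRow.apply(UTF8String.fromString("abc"))
val unsafeRow = UnsafeProjection.create(Array[DataType](StringType)).apply(row)
{code}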

> Replace SQLTestData internal usages of String with UTF8String
> -
>
> Key: SPARK-10352
> URL: https://issues.apache.org/jira/browse/SPARK-10352
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>
> Running the code:
> {code}
> val inputString = "abc"
> val row = InternalRow.apply(inputString)
> val unsafeRow = 
> UnsafeProjection.create(Array[DataType](StringType)).apply(row)
> {code}
> generates the error:
> {code}
> [info]   java.lang.ClassCastException: java.lang.String cannot be cast to 
> org.apache.spark.unsafe.types.UTF8String
> [info]   at 
> org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46)
> ***snip***
> {code}
> Although {{StringType}} should in theory only have internal type 
> {{UTF8String}}, we [are inconsistent with this 
> constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131]
>  and being more strict would [break existing 
> code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41]
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-08-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721316#comment-14721316
 ] 

Yin Huai commented on SPARK-1564:
-

https://github.com/apache/spark/pull/7169 has been merged and it is included in 
both 1.5.0-rc1 and 1.5.0-rc2. I am resolving this issue.

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10351) UnsafeRow.getString should handle off-heap backed UnsafeRow

2015-08-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721318#comment-14721318
 ] 

Reynold Xin commented on SPARK-10351:
-

getString is only used in debugging, I think?

> UnsafeRow.getString should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-1564.
-
Resolution: Fixed

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-08-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721319#comment-14721319
 ] 

Yin Huai commented on SPARK-1564:
-

[~andrewor14] it seems I cannot change the assignee to [~deron]. Can you assign 
it to him?

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Andrew Or
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10351) UnsafeRow.getString should handle off-heap backed UnsafeRow

2015-08-29 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721325#comment-14721325
 ] 

Feynman Liang commented on SPARK-10351:
---

Sorry, the fix is for {{getUTF8String}}. {{getString}} is the method which 
causes the {{NullPointerException}}. Updated title.

> UnsafeRow.getString should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10351) UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow

2015-08-29 Thread Feynman Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feynman Liang updated SPARK-10351:
--
Summary: UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow  
(was: UnsafeRow.getString should handle off-heap backed UnsafeRow)

> UnsafeRow.getUTF8String should handle off-heap backed UnsafeRow
> ---
>
> Key: SPARK-10351
> URL: https://issues.apache.org/jira/browse/SPARK-10351
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Feynman Liang
>Priority: Critical
>
> {{UnsafeRow.getUTF8String}} delegates to {{UTF8String.fromAddress}} which 
> returns {{null}} when passed a {{null}} base object, failing to handle 
> off-heap backed {{UnsafeRow}}s correctly.
> This will also cause a {{NullPointerException}} when {{getString}} is called 
> with off-heap storage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8684) Update R version in Spark EC2 AMI

2015-08-29 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721324#comment-14721324
 ] 

Yin Huai commented on SPARK-8684:
-

Should we resolve it?


> Update R version in Spark EC2 AMI
> -
>
> Key: SPARK-8684
> URL: https://issues.apache.org/jira/browse/SPARK-8684
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2, SparkR
>Reporter: Shivaram Venkataraman
>Priority: Minor
> Fix For: 1.5.0
>
>
> Right now the R version in the AMI is 3.1. However, a number of R libraries 
> need R version 3.2, and it would be good to update the R version on the AMI 
> when launching an EC2 cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9991) Create local limit operator

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9991.

   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.6.0

> Create local limit operator
> ---
>
> Key: SPARK-9991
> URL: https://issues.apache.org/jira/browse/SPARK-9991
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9986) Create a simple test framework for local operators

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9986.

   Resolution: Fixed
Fix Version/s: 1.6.0

> Create a simple test framework for local operators
> --
>
> Key: SPARK-9986
> URL: https://issues.apache.org/jira/browse/SPARK-9986
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
> It'd be great if we could just create local query plans and test the 
> correctness of their implementations directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9993) Create local union operator

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9993.

   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 1.6.0

> Create local union operator
> ---
>
> Key: SPARK-9993
> URL: https://issues.apache.org/jira/browse/SPARK-9993
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2015-08-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717079#comment-14717079
 ] 

Reynold Xin edited comment on SPARK-6817 at 8/30/15 1:14 AM:
-

Here are some suggestions on the proposed API. If the idea is to keep the API 
close to R's current primitives, we should avoid 
introducing too many new keywords. E.g., dapplyCollect can be expressed as 
collect(dapply(...)). Since collect already exists in Spark,
and R users are comfortable with the syntax as part of dplyr, we should reuse 
the keyword instead of introducing a new function dapplyCollect. 
Relying on existing syntax will reduce the learning curve for users. Was 
performance the primary intent to introduce dapplyCollect instead of
collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect, and express it using 
dapply? In R, the function "split" provides grouping 
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One 
should be able to implement "split" using GroupBy in Spark.
"gapply" can then be expressed in terms of dapply and split, and gapplyCollect 
will become collect(dapply(..split..)). 
Here is a simple example that uses split and lapply in R:

{code}
df<-data.frame(city=c("A","B","A","D"), age=c(10,12,23,5))
print(df)
s<-split(df$age, df$city)
lapply(s, mean)
{code}
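
For comparison, a rough Spark analogue of the same grouping in Scala 
(hypothetical data, mirroring the values above):

{code}
// Equivalent of R's split(df$age, df$city) followed by lapply(s, mean):
val df = sqlContext.createDataFrame(Seq(
  ("A", 10.0), ("B", 12.0), ("A", 23.0), ("D", 5.0)
)).toDF("city", "age")
df.groupBy("city").mean("age").show()
{code}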


was (Author: indrajit):
Here are some suggestions on the proposed API. If the idea is to keep the API 
close to R's current primitives, we should avoid 
introducing too many new keywords. E.g., dapplyCollect can be expressed as 
collect(dapply(...)). Since collect already exists in Spark,
and R users are comfortable with the syntax as part of dplyr, we should reuse 
the keyword instead of introducing a new function dapplyCollect. 
Relying on existing syntax will reduce the learning curve for users. Was 
performance the primary intent to introduce dapplyCollect instead of
collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect, and express it using 
dapply? In R, the function "split" provides grouping 
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One 
should be able to implement "split" using GroupBy in Spark.
"gapply" can then be expressed in terms of dapply and split, and gapplyCollect 
will become collect(dapply(..split..)). 
Here is a simple example that uses split and lapply in R:

df<-data.frame(city=c("A","B","A","D"), age=c(10,12,23,5))
print(df)
s<-split(df$age, df$city)
lapply(s, mean)

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9078) Use of non-standard LIMIT keyword in JDBC tableExists code

2015-08-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721328#comment-14721328
 ] 

Reynold Xin commented on SPARK-9078:


Please submit a pull request, [~tsuresh].

I think it is OK to ignore the option for now.


> Use of non-standard LIMIT keyword in JDBC tableExists code
> --
>
> Key: SPARK-9078
> URL: https://issues.apache.org/jira/browse/SPARK-9078
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.1, 1.4.0
>Reporter: Robert Beauchemin
>Priority: Minor
>
> tableExists in  
> spark/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcUtils.scala uses 
> non-standard SQL (specifically, the LIMIT keyword) to determine whether a 
> table exists in a JDBC data source. This will cause an exception in many/most 
> JDBC databases that don't support the LIMIT keyword. See 
> http://stackoverflow.com/questions/1528604/how-universal-is-the-limit-statement-in-sql
> To check for table existence or an exception, it could be recrafted around 
> "select 1 from $table where 0 = 1" which isn't the same (it returns an empty 
> resultset rather than the value '1'), but would support more data sources and 
> also support empty tables. Arguably ugly and possibly queries every row on 
> sources that don't support constant folding, but better than failing on JDBC 
> sources that don't support LIMIT. 
> Perhaps "supports LIMIT" could be a field in the JdbcDialect class for 
> databases that support keyword this to override. The ANSI standard is (OFFSET 
> and) FETCH. 
> The standard way to check for table existence would be to use 
> information_schema.tables which is a SQL standard but may not work for other 
> JDBC data sources that support SQL, but not the information_schema. The JDBC 
> DatabaseMetaData interface provides getSchemas()  that allows checking for 
> the information_schema in drivers that support it.
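
A minimal sketch of the suggested probe (a hypothetical helper written against 
plain JDBC, not the actual JdbcUtils code):

{code}
import java.sql.{Connection, SQLException}

// Returns true if the query parses and runs, i.e. the table exists; the
// WHERE 0 = 1 predicate keeps the result set empty on any SQL data source.
def tableExists(conn: Connection, table: String): Boolean = {
  try {
    val stmt = conn.prepareStatement(s"SELECT 1 FROM $table WHERE 0 = 1")
    try { stmt.executeQuery(); true } finally { stmt.close() }
  } catch {
    case _: SQLException => false
  }
}
{code}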



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10308) %in% is not exported in SparkR

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10308:

Fix Version/s: (was: 1.5.1)
   (was: 1.6.0)
   1.5.0

> %in% is not exported in SparkR
> --
>
> Key: SPARK-10308
> URL: https://issues.apache.org/jira/browse/SPARK-10308
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.0
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 1.5.0
>
>
> While the operator is defined in Column.R it is not exported in our NAMESPACE 
> file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10287) After processing a query using JSON data, Spark SQL continuously refreshes metadata of the table

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10287:

Fix Version/s: (was: 1.5.1)
   1.5.0

> After processing a query using JSON data, Spark SQL continuously refreshes 
> metadata of the table
> 
>
> Key: SPARK-10287
> URL: https://issues.apache.org/jira/browse/SPARK-10287
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>  Labels: releasenotes
> Fix For: 1.5.0
>
>
> I have a partitioned json table with 1824 partitions.
> {code}
> val df = sqlContext.read.format("json").load("aPartitionedJsonData")
> val columnStr = df.schema.map(_.name).mkString(",")
> println(s"columns: $columnStr")
> val hash = df
>   .selectExpr(s"hash($columnStr) as hashValue")
>   .groupBy()
>   .sum("hashValue")
>   .head()
>   .getLong(0)
> {code}
> It looks like for JSON we refresh metadata whenever buildScan is called. For a 
> partitioned table, buildScan is called for every partition, so this table will 
> be refreshed 1824 times.
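> Purely as an illustration of the fix pattern (the names here are invented, 
> not Spark's actual internals), the refresh needs to be hoisted out of the 
> per-partition path:
> {code}
> // Illustrative sketch only: refresh() inside buildScan() runs once per
> // partition (1824 times here); guarding it runs it once per query.
> class JsonRelationSketch {
>   private var refreshed = false
>
>   private def refresh(): Unit = {
>     // re-list input files and re-infer the schema
>   }
>
>   def buildScan(partitionId: Int): Unit = {
>     if (!refreshed) { refresh(); refreshed = true }  // not refresh() unconditionally
>     // ... scan this partition ...
>   }
> }
> {code}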



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10188) Pyspark CrossValidator with RMSE selects incorrect model

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10188:

Fix Version/s: (was: 1.5.1)
   1.5.0

> Pyspark CrossValidator with RMSE selects incorrect model
> 
>
> Key: SPARK-10188
> URL: https://issues.apache.org/jira/browse/SPARK-10188
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Noel Smith
>Assignee: Noel Smith
>Priority: Critical
> Fix For: 1.5.0
>
>
> Pyspark {{CrossValidator}} is giving incorrect results when selecting 
> estimators using RMSE as an evaluation metric.
> In the example below, it should be selecting the {{LinearRegression}} 
> estimator with zero regularization, as that gives the most accurate result, 
> but instead it selects the one with the largest regularization.
> Probably related to: SPARK-10097
> {code}
> from pyspark.ml.evaluation import RegressionEvaluator
> from pyspark.ml.regression import LinearRegression
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.feature import Binarizer
> from pyspark.mllib.linalg import Vectors
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
>
> # Label = 2 * feature
> train = sqlContext.createDataFrame([
>     (Vectors.dense([10.0]), 20.0),
>     (Vectors.dense([100.0]), 200.0),
>     (Vectors.dense([1000.0]), 2000.0)] * 10,
>     ["features", "label"])
> test = sqlContext.createDataFrame([
>     (Vectors.dense([1000.0]),)],
>     ["features"])
>
> # Expected prediction 2000.0
> print LinearRegression(regParam=0.0).fit(train).transform(test).collect()        # Predicts 2000.0 (perfect)
> print LinearRegression(regParam=100.0).fit(train).transform(test).collect()      # Predicts 1869.31
> print LinearRegression(regParam=1000000.0).fit(train).transform(test).collect()  # 741.08 (worst)
>
> # Cross-validation
> lr = LinearRegression()
> rmse_eval = RegressionEvaluator(metricName="rmse")
> grid = (ParamGridBuilder()
>     .addGrid(lr.regParam, [0.0, 100.0, 1000000.0])
>     .build())
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=rmse_eval)
> cv_model = cv.fit(train)
> cv_model.bestModel.transform(test).collect()  # Predicts 741.08 (i.e. worst model selected)
> {code}
> One workaround for users would be to add a wrapper around the selected 
> evaluator to invert the metric:
> {code}
> class InvertedEvaluator(Evaluator):
>     def __init__(self, evaluator):
>         super(InvertedEvaluator, self).__init__()
>         self.evaluator = evaluator
>
>     def _evaluate(self, dataset):
>         return -self.evaluator.evaluate(dataset)
>
> invertedEvaluator = InvertedEvaluator(RegressionEvaluator(metricName="rmse"))
> # then pass it to CrossValidator(..., evaluator=invertedEvaluator)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9671) ML 1.5 QA: Programming guide update and migration guide

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9671?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9671:
---
Fix Version/s: (was: 1.5.1)
   1.5.0

> ML 1.5 QA: Programming guide update and migration guide
> ---
>
> Key: SPARK-9671
> URL: https://issues.apache.org/jira/browse/SPARK-9671
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Xiangrui Meng
>Priority: Critical
> Fix For: 1.5.0
>
>
> Before the release, we need to update the MLlib Programming Guide.  Updates 
> will include:
> * Add migration guide subsection.
> ** Use the results of the QA audit JIRAs.
> * Check phrasing, especially in main sections (for outdated items such as "In 
> this release, ...")
> * Possibly reorganize parts of the Pipelines guide if needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10188) Pyspark CrossValidator with RMSE selects incorrect model

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10188:

Target Version/s: 1.5.0  (was: 1.5.1)

> Pyspark CrossValidator with RMSE selects incorrect model
> 
>
> Key: SPARK-10188
> URL: https://issues.apache.org/jira/browse/SPARK-10188
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Noel Smith
>Assignee: Noel Smith
>Priority: Critical
> Fix For: 1.5.0
>
>
> Pyspark {{CrossValidator}} is giving incorrect results when selecting 
> estimators using RMSE as an evaluation metric.
> In the example below, it should be selecting the {{LinearRegression}} 
> estimator with zero regularization, as that gives the most accurate result, 
> but instead it selects the one with the largest regularization.
> Probably related to: SPARK-10097
> {code}
> from pyspark.ml.evaluation import RegressionEvaluator
> from pyspark.ml.regression import LinearRegression
> from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
> from pyspark.ml.feature import Binarizer
> from pyspark.mllib.linalg import Vectors
> from pyspark.sql import SQLContext
> sqlContext = SQLContext(sc)
>
> # Label = 2 * feature
> train = sqlContext.createDataFrame([
>     (Vectors.dense([10.0]), 20.0),
>     (Vectors.dense([100.0]), 200.0),
>     (Vectors.dense([1000.0]), 2000.0)] * 10,
>     ["features", "label"])
> test = sqlContext.createDataFrame([
>     (Vectors.dense([1000.0]),)],
>     ["features"])
>
> # Expected prediction 2000.0
> print LinearRegression(regParam=0.0).fit(train).transform(test).collect()        # Predicts 2000.0 (perfect)
> print LinearRegression(regParam=100.0).fit(train).transform(test).collect()      # Predicts 1869.31
> print LinearRegression(regParam=1000000.0).fit(train).transform(test).collect()  # 741.08 (worst)
>
> # Cross-validation
> lr = LinearRegression()
> rmse_eval = RegressionEvaluator(metricName="rmse")
> grid = (ParamGridBuilder()
>     .addGrid(lr.regParam, [0.0, 100.0, 1000000.0])
>     .build())
> cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=rmse_eval)
> cv_model = cv.fit(train)
> cv_model.bestModel.transform(test).collect()  # Predicts 741.08 (i.e. worst model selected)
> {code}
> One workaround for users would be to add a wrapper around the selected 
> evaluator to invert the metric:
> {code}
> class InvertedEvaluator(Evaluator):
>     def __init__(self, evaluator):
>         super(InvertedEvaluator, self).__init__()
>         self.evaluator = evaluator
>
>     def _evaluate(self, dataset):
>         return -self.evaluator.evaluate(dataset)
>
> invertedEvaluator = InvertedEvaluator(RegressionEvaluator(metricName="rmse"))
> # then pass it to CrossValidator(..., evaluator=invertedEvaluator)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10219) Error when additional options provided as variable in write.df

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10219:

Fix Version/s: (was: 1.5.1)
   (was: 1.6.0)
   1.5.0

> Error when additional options provided as variable in write.df
> --
>
> Key: SPARK-10219
> URL: https://issues.apache.org/jira/browse/SPARK-10219
> Project: Spark
>  Issue Type: Bug
>  Components: R
>Affects Versions: 1.4.0
> Environment: SparkR shell
>Reporter: Samuel Alexander
>Assignee: Shivaram Venkataraman
>  Labels: spark-shell, sparkR
> Fix For: 1.5.0
>
>
> Opened a SparkR shell
> Created a df using 
> > df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
> Assigned a variable like below
> > mode <- "append"
> When write.df was called using the statement below, the mentioned error occurred
> > write.df(df, source="org.apache.spark.sql.parquet", path=par_path, option=mode)
> Error in writeType(con, type) : Unsupported type for serialization name
> Whereas when "append" is passed directly, i.e. not via the mode variable as 
> below, everything works fine
> > write.df(df, source="org.apache.spark.sql.parquet", path=par_path, option="append")
> Note: For parquet it is not necessary to have options. But we are using the Spark 
> Salesforce package (http://spark-packages.org/package/springml/spark-salesforce), 
> which requires additional options to be passed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10336) fitIntercept is a command line option but not set in the LR example program.

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10336:

Fix Version/s: (was: 1.5.1)
   1.5.0

> fitIntercept is a command line option but not set in the LR example program.
> 
>
> Key: SPARK-10336
> URL: https://issues.apache.org/jira/browse/SPARK-10336
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, ML
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Shuo Xiang
>Assignee: Shuo Xiang
> Fix For: 1.5.0
>
>
> the parsed parameter is not set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10328) na.omit has too restrictive generic in SparkR

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10328:

Fix Version/s: (was: 1.5.1)
   (was: 1.6.0)
   1.5.0

> na.omit has too restrictive generic in SparkR
> -
>
> Key: SPARK-10328
> URL: https://issues.apache.org/jira/browse/SPARK-10328
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Shivaram Venkataraman
> Fix For: 1.5.0
>
>
> It should match the S3 function definition



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10295) Dynamic allocation in Mesos does not release when RDDs are cached

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10295:

Fix Version/s: (was: 1.5.1)
   (was: 1.6.0)
   1.5.0

> Dynamic allocation in Mesos does not release when RDDs are cached
> -
>
> Key: SPARK-10295
> URL: https://issues.apache.org/jira/browse/SPARK-10295
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, Spark Core
>Affects Versions: 1.5.0
> Environment: Spark 1.5.0 RC1
> Centos 6
> java 7 oracle
>Reporter: Hans van den Bogert
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.5.0
>
>
> When running Spark in coarse-grained mode with the shuffle service and dynamic 
> allocation, the driver does not release executors if a dataset is cached.
> The console output OTOH shows:
> > 15/08/26 17:29:58 WARN SparkContext: Dynamic allocation currently does not 
> > support cached RDDs. Cached data for RDD 9 will be lost when executors are 
> > removed.
> However, after the default timeout of 1m, executors are not released. When I 
> perform the same initial setup, loading data, etc., but without caching, the 
> executors are released.
> Is this intended behaviour?
> If this is intended behaviour, the console warning is misleading. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9316) Add support for filtering using `[` (synonym for filter / select)

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9316:
---
Fix Version/s: (was: 1.5.1)
   (was: 1.6.0)
   1.5.0

> Add support for filtering using `[` (synonym for filter / select)
> -
>
> Key: SPARK-9316
> URL: https://issues.apache.org/jira/browse/SPARK-9316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>Assignee: Felix Cheung
> Fix For: 1.5.0
>
>
> Will help us support queries of the form 
> {code}
> air[air$UniqueCarrier %in% c("UA", "HA"), c(1,2,3,5:9)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8952) JsonFile() of SQLContext display improper warning message for a S3 path

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8952:
---
Fix Version/s: (was: 1.5.1)
   (was: 1.6.0)
   1.5.0

> JsonFile() of SQLContext display improper warning message for a S3 path
> ---
>
> Key: SPARK-8952
> URL: https://issues.apache.org/jira/browse/SPARK-8952
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
>Reporter: Sun Rui
>Assignee: Luciano Resende
> Fix For: 1.5.0
>
>
> This is an issue reported by Ben Spark.
> {quote}
> Spark 1.4 deployed on AWS EMR 
> "jsonFile" is working though with some warning message
> Warning message:
> In normalizePath(path) :
>   
> path[1]="s3://rea-consumer-data-dev/cbr/profiler/output/20150618/part-0": 
> No such file or directory
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9890) User guide for CountVectorizer

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-9890:
---
Fix Version/s: (was: 1.5.1)
   1.5.0

> User guide for CountVectorizer
> --
>
> Key: SPARK-9890
> URL: https://issues.apache.org/jira/browse/SPARK-9890
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Feynman Liang
>Assignee: yuhao yang
> Fix For: 1.5.0
>
>
> SPARK-8703 added a count vectorizer as an ML transformer. We should add an 
> accompanying user guide to {{ml-features}}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10315) remove document on spark.akka.failure-detector.threshold

2015-08-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-10315:

Fix Version/s: (was: 1.5.1)
   (was: 1.6.0)
   1.5.0

> remove document on spark.akka.failure-detector.threshold
> 
>
> Key: SPARK-10315
> URL: https://issues.apache.org/jira/browse/SPARK-10315
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Nan Zhu
>Assignee: Nan Zhu
>Priority: Minor
> Fix For: 1.5.0
>
>
> this parameter is no longer used, and there is a mistake in the current 
> document: it should be 'akka.remote.watch-failure-detector.threshold'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10304) Partition discovery does not throw an exception if the dir structure is invalid

2015-08-29 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-10304:
-
Target Version/s: 1.5.1,1.6.0  (was: 1.5.0)

> Partition discovery does not throw an exception if the dir structure is invalid
> -
>
> Key: SPARK-10304
> URL: https://issues.apache.org/jira/browse/SPARK-10304
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Zhan Zhang
>Priority: Critical
>
> I have a dir structure like {{/path/table1/partition_column=1/}}. When I try 
> to use {{load("/path/")}}, it works and I get a DF. When I query this DF, if 
> it is stored as ORC, there will be the following NPE. But if it is Parquet, 
> we can even return rows. We should complain to users about the dir structure 
> because {{table1}} does not meet our format.
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 26 in 
> stage 57.0 failed 4 times, most recent failure: Lost task 26.3 in stage 57.0 
> (TID 3504, 10.0.195.227): java.lang.NullPointerException
> at 
> org.apache.spark.sql.hive.HiveInspectors$class.unwrapperFor(HiveInspectors.scala:466)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.unwrapperFor(OrcRelation.scala:224)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1$$anonfun$9.apply(OrcRelation.scala:261)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:261)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject$1.apply(OrcRelation.scala:256)
>   at scala.Option.map(Option.scala:145)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan.org$apache$spark$sql$hive$orc$OrcTableScan$$fillObject(OrcRelation.scala:256)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:318)
>   at 
> org.apache.spark.sql.hive.orc.OrcTableScan$$anonfun$execute$3.apply(OrcRelation.scala:316)
>   at 
> org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.compute(HadoopRDD.scala:380)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9523) Receiver for Spark Streaming does not naturally support kryo serializer

2015-08-29 Thread John Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721354#comment-14721354
 ] 

John Chen commented on SPARK-9523:
--

The problem here is not about warn/error/etc; it's that if you want to do 
something special for transient attributes, you'll have to write your own code. 
However, for Java and Kryo serialization, the code you write is different.

For Java, you need to write that code in the readObject() and writeObject() 
methods, whereas for Kryo, you have to write it in another pair of methods: 
read() and write(). So if you want to support both Java and Kryo serialization 
with transient attributes and customized serialization operations, you need to 
write all 4 methods in your class.

For other DStream functions, you do not care about Kryo, as they seem to only 
support Java serialization: even if you set the KryoSerializer in SparkConf, 
the serialization is still done by Java. However, the Receiver in Spark 
Streaming will be serialized by Kryo if you configure it so, and the real issue 
here is that the Receiver and the other functions DO NOT act the same, which 
can be confusing for new developers.
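
To make the duplication concrete, here is a minimal sketch (class and field 
names are made up for illustration) of what supporting both paths currently 
requires:

{code}
import java.io.{ObjectInputStream, ObjectOutputStream}
import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

// A hypothetical receiver state with a transient handle that must be
// rebuilt after deserialization.
class MyReceiverState(var endpoint: String) extends Serializable with KryoSerializable {
  @transient private var connection: AnyRef = _

  // --- Java serialization path ---
  private def writeObject(out: ObjectOutputStream): Unit = out.defaultWriteObject()
  private def readObject(in: ObjectInputStream): Unit = {
    in.defaultReadObject()
    connection = reconnect()
  }

  // --- Kryo serialization path: the same logic, duplicated ---
  override def write(kryo: Kryo, output: Output): Unit = output.writeString(endpoint)
  override def read(kryo: Kryo, input: Input): Unit = {
    endpoint = input.readString()
    connection = reconnect()
  }

  private def reconnect(): AnyRef = new Object  // placeholder for real setup
}
{code}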

> Receiver for Spark Streaming does not naturally support kryo serializer
> ---
>
> Key: SPARK-9523
> URL: https://issues.apache.org/jira/browse/SPARK-9523
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.3.1
> Environment: Windows 7 local mode
>Reporter: John Chen
>Priority: Minor
>  Labels: kryo, serialization
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> In some cases, some attributes in a class are not serializable, but you 
> still want to use them after serialization of the whole object, so you have to 
> customize your serialization code. For example, you can declare those 
> attributes as transient, which makes them ignored during serialization, and 
> then you can reassign their values during deserialization.
> Now, if you're using Java serialization, you'll have to implement 
> Serializable and write that code in the readObject() and writeObject() 
> methods; and if you're using Kryo serialization, you'll have to implement 
> KryoSerializable and write that code in the read() and write() methods.
> In Spark and Spark Streaming, you can set Kryo as the serializer to speed 
> things up. However, the functions taken by RDD or DStream operations are still 
> serialized by Java serialization, which means you only need to write the 
> custom serialization code in the readObject() and writeObject() methods.
> But when it comes to Spark Streaming's Receiver, things are different. When 
> you wish to customize an InputDStream, you must extend the Receiver. However, 
> it turns out the Receiver will be serialized by Kryo if you set the Kryo 
> serializer in SparkConf, and will fall back to Java serialization if you 
> didn't.
> So here come the problems: if you want to change the serializer by 
> configuration and make sure the Receiver runs perfectly for both Java and 
> Kryo, you'll have to write all 4 of the methods above. First, it is redundant, 
> since you'll have to write serialization/deserialization code almost twice; 
> secondly, there's nothing in the doc or in the code to inform users to 
> implement the KryoSerializable interface. 
> Since all other function parameters are serialized by Java only, I suggest 
> you also make it so for the Receiver. It may be slower, but since the 
> serialization is only executed once per interval, it's bearable. More 
> importantly, it will cause less trouble.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1564) Add JavaScript into Javadoc to turn ::Experimental:: and such into badges

2015-08-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-1564:
-
Assignee: Deron Eriksson  (was: Andrew Or)

> Add JavaScript into Javadoc to turn ::Experimental:: and such into badges
> -
>
> Key: SPARK-1564
> URL: https://issues.apache.org/jira/browse/SPARK-1564
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Reporter: Matei Zaharia
>Assignee: Deron Eriksson
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-29 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-10353:
---

 Summary: MLlib BLAS gemm outputs wrong result when beta = 0.0 for 
transpose transpose matrix multiplication
 Key: SPARK-10353
 URL: https://issues.apache.org/jira/browse/SPARK-10353
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Burak Yavuz


Basically 
{code}
if (beta != 0.0) {
  f2jBLAS.dscal(C.values.length, beta, C.values, 1)
}
{code}
should be
{code}
if (beta != 1.0) {
  f2jBLAS.dscal(C.values.length, beta, C.values, 1)
}
{code}
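
For context, gemm computes C := alpha * A * B + beta * C, and the dscal call is 
the step that applies "beta * C". A small standalone sketch (not Spark's code) 
of why the guard must test beta != 1.0:

{code}
val C = Array(1.0, 2.0, 3.0)  // stale values left in the output buffer
val beta = 0.0
if (beta != 1.0) {            // correct guard: skip only the no-op case beta == 1.0
  for (i <- C.indices) C(i) *= beta  // stands in for f2jBLAS.dscal
}
// C is now (0.0, 0.0, 0.0), so accumulation starts clean. With the old guard
// (beta != 0.0) the scaling is skipped exactly when beta == 0.0, and the stale
// 1.0, 2.0, 3.0 would leak into the result.
{code}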



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-29 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-10353:

Affects Version/s: 1.5.0

> MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
> matrix multiplication
> --
>
> Key: SPARK-10353
> URL: https://issues.apache.org/jira/browse/SPARK-10353
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Burak Yavuz
>
> Basically 
> {code}
> if (beta != 0.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}
> should be
> {code}
> if (beta != 1.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10353:


Assignee: Apache Spark

> MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
> matrix multiplication
> --
>
> Key: SPARK-10353
> URL: https://issues.apache.org/jira/browse/SPARK-10353
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Basically 
> {code}
> if (beta != 0.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}
> should be
> {code}
> if (beta != 1.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10353:


Assignee: (was: Apache Spark)

> MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
> matrix multiplication
> --
>
> Key: SPARK-10353
> URL: https://issues.apache.org/jira/browse/SPARK-10353
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Burak Yavuz
>
> Basically 
> {code}
> if (beta != 0.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}
> should be
> {code}
> if (beta != 1.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10353) MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose matrix multiplication

2015-08-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14721392#comment-14721392
 ] 

Apache Spark commented on SPARK-10353:
--

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/8525

> MLlib BLAS gemm outputs wrong result when beta = 0.0 for transpose transpose 
> matrix multiplication
> --
>
> Key: SPARK-10353
> URL: https://issues.apache.org/jira/browse/SPARK-10353
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Burak Yavuz
>
> Basically 
> {code}
> if (beta != 0.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}
> should be
> {code}
> if (beta != 1.0) {
>   f2jBLAS.dscal(C.values.length, beta, C.values, 1)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10348) Improve Spark ML user guide

2015-08-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10348.
---
   Resolution: Fixed
Fix Version/s: 1.5.1

Issue resolved by pull request 8517
[https://github.com/apache/spark/pull/8517]

> Improve Spark ML user guide
> ---
>
> Key: SPARK-10348
> URL: https://issues.apache.org/jira/browse/SPARK-10348
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.5.1
>
>
> improve ml-guide:
> * replace `ML Dataset` by `DataFrame` to simplify the abstraction
> * remove links to Scala API doc in the main guide
> * change ML algorithms to pipeline components



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


