[jira] [Updated] (SPARK-8177) date/time function: year

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8177:
---
Description: 

{code}
year(timestamp time): int
{code}

Returns the year part of a date or a timestamp string: year("1970-01-01 
00:00:00") = 1970, year("1970-01-01") = 1970.


  was:
year(string|date|timestamp): int

Returns the year part of a date or a timestamp string: year("1970-01-01 
00:00:00") = 1970, year("1970-01-01") = 1970.

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF


> date/time function: year
> 
>
> Key: SPARK-8177
> URL: https://issues.apache.org/jira/browse/SPARK-8177
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> {code}
> year(timestamp time): int
> {code}
> Returns the year part of a date or a timestamp string: year("1970-01-01 
> 00:00:00") = 1970, year("1970-01-01") = 1970.






[jira] [Updated] (SPARK-8176) date/time function: to_date

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8176:
---
Description: 
parse a timestamp string and return the date portion
{code}
to_date(string timestamp): date
{code}

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01" (in some date format)

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



  was:
# parse a timestamp string and return the date portion
to_date(string timestamp): date

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01" (in some date format)

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF




> date/time function: to_date
> ---
>
> Key: SPARK-8176
> URL: https://issues.apache.org/jira/browse/SPARK-8176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> parse a timestamp string and return the date portion
> {code}
> to_date(string timestamp): date
> {code}
> Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
> "1970-01-01" (in some date format)
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
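
A minimal usage sketch of the to_date() behaviour described above, assuming a SQLContext named sqlContext (illustrative only):
{code}
// Hedged sketch: the date part of a timestamp string (assumes sqlContext exists)
sqlContext.sql("SELECT to_date('1970-01-01 00:00:00')").show()
// expected: 1970-01-01 (exact rendering depends on the date format)
{code}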






[jira] [Updated] (SPARK-8176) date/time function: to_date

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8176:
---
Description: 
# parse a timestamp string and return the date portion
to_date(string timestamp): date

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01" (in some date format)

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF



  was:
to_date(date|timestamp): date
to_date(string): string

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01".

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF




> date/time function: to_date
> ---
>
> Key: SPARK-8176
> URL: https://issues.apache.org/jira/browse/SPARK-8176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> # parse a timestamp string and return the date portion
> to_date(string timestamp): date
> Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
> "1970-01-01" (in some date format)
> See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF






[jira] [Updated] (SPARK-8176) date/time function: to_date

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8176:
---
Description: 
parse a timestamp string and return the date portion
{code}
to_date(string timestamp): date
{code}

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01" (in some date format)



  was:
parse a timestamp string and return the date portion
{code}
to_date(string timestamp): date
{code}

Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
"1970-01-01" (in some date format)

See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF




> date/time function: to_date
> ---
>
> Key: SPARK-8176
> URL: https://issues.apache.org/jira/browse/SPARK-8176
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> parse a timestamp string and return the date portion
> {code}
> to_date(string timestamp): date
> {code}
> Returns the date part of a timestamp string: to_date("1970-01-01 00:00:00") = 
> "1970-01-01" (in some date format)






[jira] [Updated] (SPARK-8753) Create an IntervalType data type

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8753:
---
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-8159

> Create an IntervalType data type
> 
>
> Key: SPARK-8753
> URL: https://issues.apache.org/jira/browse/SPARK-8753
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We should create an IntervalType data type that represents time intervals. 
> Internally, we can use a long value to store it, similar to Timestamp (i.e. 
> 100ns precision). This data type initially cannot be stored externally; it is 
> only used in expressions.
> 1. Add IntervalType data type.
> 2. Add parser support in our SQL expression, in the form of
> {code}
> INTERVAL [number] [unit] 
> {code}
> unit can be YEAR[S], MONTH[S], WEEK[S], DAY[S], HOUR[S], MINUTE[S], 
> SECOND[S], MILLISECOND[S], MICROSECOND[S], or NANOSECOND[S].
> 3. Add a check in the analyzer that throws an exception to prevent saving a 
> dataframe/table with IntervalType out to external systems.
> Related Hive ticket: https://issues.apache.org/jira/browse/HIVE-9792
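
A hedged sketch of the proposed INTERVAL literal syntax from item 2, assuming a SQLContext named sqlContext; the exact grammar and semantics are proposals, not merged behaviour:
{code}
// Illustrative only: examples of the proposed literal form
sqlContext.sql("SELECT INTERVAL 3 DAYS")
sqlContext.sql("SELECT INTERVAL 90 MINUTES")
{code}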






[jira] [Assigned] (SPARK-8810) Gaps in SQL UDF test coverage

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8810:
---

Assignee: (was: Apache Spark)

> Gaps in SQL UDF test coverage
> -
>
> Key: SPARK-8810
> URL: https://issues.apache.org/jira/browse/SPARK-8810
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: all
>Reporter: Spiro Michaylov
>  Labels: test
>
> SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in 
> combination.
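
An illustrative sketch of the untested combinations described above, assuming a SQLContext named sqlContext and a registered table called src (names are hypothetical):
{code}
// A UDF used in WHERE, GROUP BY, and HAVING together
sqlContext.udf.register("plusOne", (x: Int) => x + 1)
sqlContext.sql(
  "SELECT plusOne(key) AS k, COUNT(*) AS c FROM src " +
  "WHERE plusOne(key) > 10 GROUP BY plusOne(key) HAVING COUNT(*) > 1").show()
{code}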






[jira] [Commented] (SPARK-8810) Gaps in SQL UDF test coverage

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612871#comment-14612871
 ] 

Apache Spark commented on SPARK-8810:
-

User 'spirom' has created a pull request for this issue:
https://github.com/apache/spark/pull/7207

> Gaps in SQL UDF test coverage
> -
>
> Key: SPARK-8810
> URL: https://issues.apache.org/jira/browse/SPARK-8810
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: all
>Reporter: Spiro Michaylov
>  Labels: test
>
> SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in 
> combination.






[jira] [Assigned] (SPARK-8810) Gaps in SQL UDF test coverage

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8810:
---

Assignee: Apache Spark

> Gaps in SQL UDF test coverage
> -
>
> Key: SPARK-8810
> URL: https://issues.apache.org/jira/browse/SPARK-8810
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 1.2.0
> Environment: all
>Reporter: Spiro Michaylov
>Assignee: Apache Spark
>  Labels: test
>
> SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in 
> combination.






[jira] [Created] (SPARK-8810) Gaps in SQL UDF test coverage

2015-07-02 Thread Spiro Michaylov (JIRA)
Spiro Michaylov created SPARK-8810:
--

 Summary: Gaps in SQL UDF test coverage
 Key: SPARK-8810
 URL: https://issues.apache.org/jira/browse/SPARK-8810
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 1.2.0
 Environment: all
Reporter: Spiro Michaylov


SQL UDFs are untested in GROUP BY, WHERE and HAVING clauses, and in combination.






[jira] [Assigned] (SPARK-8809) Remove ConvertNaNs analyzer rule

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8809:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Remove ConvertNaNs analyzer rule
> 
>
> Key: SPARK-8809
> URL: https://issues.apache.org/jira/browse/SPARK-8809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Cast already handles "NaN" when casting from string to double/float. I don't 
> think this rule is necessary anymore.
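
A hedged sketch of the behaviour the ticket relies on, assuming a SQLContext named sqlContext: Cast itself already turns the string "NaN" into a NaN double/float, so the separate analyzer rule is redundant.
{code}
// Illustrative only
sqlContext.sql("SELECT CAST('NaN' AS DOUBLE), CAST('NaN' AS FLOAT)").show()
{code}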






[jira] [Assigned] (SPARK-8809) Remove ConvertNaNs analyzer rule

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8809:
---

Assignee: Apache Spark  (was: Reynold Xin)

> Remove ConvertNaNs analyzer rule
> 
>
> Key: SPARK-8809
> URL: https://issues.apache.org/jira/browse/SPARK-8809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Cast already handles "NaN" when casting from string to double/float. I don't 
> think this rule is necessary anymore.






[jira] [Commented] (SPARK-8809) Remove ConvertNaNs analyzer rule

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612856#comment-14612856
 ] 

Apache Spark commented on SPARK-8809:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7206

> Remove ConvertNaNs analyzer rule
> 
>
> Key: SPARK-8809
> URL: https://issues.apache.org/jira/browse/SPARK-8809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Cast already handles "NaN" when casting from string to double/float. I don't 
> think this rule is necessary anymore.






[jira] [Comment Edited] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-07-02 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612847#comment-14612847
 ] 

Feynman Liang edited comment on SPARK-5016 at 7/3/15 5:20 AM:
--

I did some [perf testing | 
https://gist.github.com/feynmanliang/70d79c23dffc828939ec] and it shows that 
distributing the Gaussians does yield a significant improvement in performance 
when the number of clusters and dimensionality of the data is sufficiently 
large (>30 dimensions, >10 clusters).

In particular, the "typical" use case of 40 dimensions and 10k clusters gains 
about 15 seconds in runtime when distributing the Gaussians.


was (Author: fliang):
I did some [perf 
testing](https://gist.github.com/feynmanliang/70d79c23dffc828939ec) and it 
shows that distributing the Gaussians does yield a significant improvement in 
performance when the number of clusters and dimensionality of the data is 
sufficiently large (>30 dimensions, >10 clusters).

In particular, the "typical" use case of 40 dimensions and 10k clusters gains 
about 15 seconds in runtime when distributing the Gaussians.

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---
>
> Key: SPARK-5016
> URL: https://issues.apache.org/jira/browse/SPARK-5016
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>  Labels: clustering
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.






[jira] [Commented] (SPARK-5016) GaussianMixtureEM should distribute matrix inverse for large numFeatures, k

2015-07-02 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612847#comment-14612847
 ] 

Feynman Liang commented on SPARK-5016:
--

I did some [perf 
testing](https://gist.github.com/feynmanliang/70d79c23dffc828939ec) and it 
shows that distributing the Gaussians does yield a significant improvement in 
performance when the number of clusters and dimensionality of the data is 
sufficiently large (>30 dimensions, >10 clusters).

In particular, the "typical" use case of 40 dimensions and 10k clusters gains 
about 15 seconds in runtime when distributing the Gaussians.

> GaussianMixtureEM should distribute matrix inverse for large numFeatures, k
> ---
>
> Key: SPARK-5016
> URL: https://issues.apache.org/jira/browse/SPARK-5016
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Joseph K. Bradley
>  Labels: clustering
>
> If numFeatures or k are large, GMM EM should distribute the matrix inverse 
> computation for Gaussian initialization.






[jira] [Resolved] (SPARK-8803) Crosstab element's can't contain null's and back ticks

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8803.

   Resolution: Fixed
 Assignee: Burak Yavuz
Fix Version/s: 1.5.0
   1.4.1

> Crosstab element's can't contain null's and back ticks
> --
>
> Key: SPARK-8803
> URL: https://issues.apache.org/jira/browse/SPARK-8803
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
> Fix For: 1.4.1, 1.5.0
>
>
> Having back ticks or nulls as elements causes problems. 
> Since elements become column names, we have to drop back ticks from the 
> elements, as back ticks are special characters.
> Having null throws exceptions; we could replace nulls with empty strings.
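
A hedged illustration of the sanitisation described above (not the actual patch): null elements become empty strings and back ticks are stripped before an element is used as a column name.
{code}
// Illustrative helper only; the name is hypothetical
def cleanCrosstabElement(element: Any): String =
  if (element == null) "" else element.toString.replace("`", "")
{code}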






[jira] [Assigned] (SPARK-8776) Increase the default MaxPermSize

2015-07-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reassigned SPARK-8776:
---

Assignee: Yin Huai

> Increase the default MaxPermSize
> 
>
> Key: SPARK-8776
> URL: https://issues.apache.org/jira/browse/SPARK-8776
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.4.1, 1.5.0
>
>
> Since 1.4.0, Spark SQL has isolated class loaders for separating Hive 
> dependencies for the metastore and execution, which increases the memory 
> consumption of PermGen. How about we increase the default size from 128m to 
> 256m? It seems the change we need to make is at 
> https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L139.
>  
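
A hedged sketch of raising PermGen from the application side while the launcher default stays at 128m (illustrative only; the driver option normally has to be set before the driver JVM starts, e.g. in spark-defaults.conf or via --conf on spark-submit):
{code}
import org.apache.spark.SparkConf

// These configuration keys exist; the values shown are example settings
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:MaxPermSize=256m")
  .set("spark.driver.extraJavaOptions", "-XX:MaxPermSize=256m")
{code}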






[jira] [Resolved] (SPARK-8776) Increase the default MaxPermSize

2015-07-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-8776.
-
   Resolution: Fixed
Fix Version/s: 1.5.0
   1.4.1

Issue resolved by pull request 7196
[https://github.com/apache/spark/pull/7196]

> Increase the default MaxPermSize
> 
>
> Key: SPARK-8776
> URL: https://issues.apache.org/jira/browse/SPARK-8776
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Yin Huai
> Fix For: 1.4.1, 1.5.0
>
>
> Since 1.4.0, Spark SQL has isolated class loaders for separating Hive 
> dependencies for the metastore and execution, which increases the memory 
> consumption of PermGen. How about we increase the default size from 128m to 
> 256m? It seems the change we need to make is at 
> https://github.com/apache/spark/blob/3c0156899dc1ec1f7dfe6d7c8af47fa6dc7d00bf/launcher/src/main/java/org/apache/spark/launcher/AbstractCommandBuilder.java#L139.
>  






[jira] [Created] (SPARK-8809) Remove ConvertNaNs analyzer rule

2015-07-02 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-8809:
--

 Summary: Remove ConvertNaNs analyzer rule
 Key: SPARK-8809
 URL: https://issues.apache.org/jira/browse/SPARK-8809
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


Cast already handles "NaN" when casting from string to double/float. I don't 
think this rule is necessary anymore.







[jira] [Resolved] (SPARK-8801) Support TypeCollection in ExpectsInputTypes

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8801.

   Resolution: Fixed
Fix Version/s: 1.5.0

> Support TypeCollection in ExpectsInputTypes
> ---
>
> Key: SPARK-8801
> URL: https://issues.apache.org/jira/browse/SPARK-8801
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.5.0
>
>
> Some functions support more than one input type for each parameter. For 
> example, length supports binary and string, and maybe array/struct in the 
> future.
> This ticket proposes a TypeCollection AbstractDataType that supports multiple 
> data types.
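
A hedged sketch of the proposal: an abstract input type that accepts any one of several concrete types. The names and shapes below are assumptions for illustration, not the merged API.
{code}
import org.apache.spark.sql.types.{BinaryType, DataType, StringType}

// Hypothetical shape of the proposed abstraction
case class TypeCollection(types: Seq[DataType]) {
  def acceptsType(other: DataType): Boolean = types.contains(other)
}

// e.g. a length expression could declare its input as either string or binary
val lengthInputType = TypeCollection(Seq(StringType, BinaryType))
{code}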






[jira] [Resolved] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-8501.
---
Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/7199

Backported to 1.4.1 by https://github.com/apache/spark/pull/7200

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> Not sure whether this should be considered a bug of the ORC version bundled 
> with Hive 0.13.1: for an ORC file containing zero rows, the schema written in 
> its footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without Hive metastore with 
> Spark SQL 1.4.0, it causes problem because currently the ORC data source just 
> picks a random part-file whichever comes the first for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.






[jira] [Updated] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8804:
---
Fix Version/s: 1.4.1

>  order of UTF8String is wrong if there is any non-ascii character in it
> ---
>
> Key: SPARK-8804
> URL: https://issues.apache.org/jira/browse/SPARK-8804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Blocker
> Fix For: 1.4.1
>
>
> We compare UTF8String byte by byte, but bytes in the JVM are signed; they 
> should be compared as unsigned.
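
A hedged illustration of the signed-byte pitfall described above (plain Scala, not the Spark code path):
{code}
val a: Byte = "é".getBytes("UTF-8")(0)  // 0xC3 stored as the signed JVM byte -61
val b: Byte = 'z'.toByte                // 0x7A, i.e. 122
println(a < b)                          // true under signed comparison
println((a & 0xFF) < (b & 0xFF))        // false: unsigned 0xC3 (195) > 0x7A (122), the correct order
{code}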






[jira] [Resolved] (SPARK-8213) math function: factorial

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8213.

   Resolution: Fixed
Fix Version/s: 1.5.0

(still missing Python)

> math function: factorial
> 
>
> Key: SPARK-8213
> URL: https://issues.apache.org/jira/browse/SPARK-8213
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: zhichao-li
> Fix For: 1.5.0
>
>
> factorial(INT a): long
> Returns the factorial of a (as of Hive 1.2.0). Valid a is [0..20].
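
A hypothetical usage sketch, assuming a SQLContext named sqlContext and that the SQL function is registered:
{code}
sqlContext.sql("SELECT factorial(5)").show()  // expected: 120
{code}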






[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-07-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612799#comment-14612799
 ] 

Reynold Xin commented on SPARK-8159:


I think it is ok to add them all at once. But it is also ok if there are pull 
requests that add a few of them at a time. Not a big deal. 

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.






[jira] [Issue Comment Deleted] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2015-07-02 Thread SuYan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SuYan updated SPARK-5594:
-
Comment: was deleted

(was: Do you write something like:
object XXX {
  val sc = new SparkContext()   // SparkContext kept as a field on the object

  def main(args: Array[String]): Unit = {
    val rdd = sc.parallelize(1 to 10)
    rdd.map { _ => someFunc() }
  }

  def someFunc(): Unit = {}
}

Our users hit the same exception because the SparkContext was made a static 
(object-level) variable instead of a local variable.)

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally) and have no idea what is causing 
> it or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all 
> my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
> at 
> org.apache.spa

[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2015-07-02 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612786#comment-14612786
 ] 

SuYan commented on SPARK-5594:
--

Do you write something like:
object XXX {
  val sc = new SparkContext()   // SparkContext kept as a field on the object

  def main(args: Array[String]): Unit = {
    val rdd = sc.parallelize(1 to 10)
    rdd.map { _ => someFunc() }
  }

  def someFunc(): Unit = {}
}

Our users hit the same exception because the SparkContext was made a static 
(object-level) variable instead of a local variable.
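
A hedged sketch of the pattern the comment recommends: keep the SparkContext as a local variable inside main rather than a field on the enclosing object (illustrative only; names are hypothetical):
{code}
import org.apache.spark.{SparkConf, SparkContext}

object XXX {
  def main(args: Array[String]): Unit = {
    // Local SparkContext: closures no longer capture the enclosing object just to reach sc
    val sc = new SparkContext(new SparkConf().setAppName("example"))
    val result = sc.parallelize(1 to 10).map(x => someFunc(x)).collect()
    println(result.mkString(","))
    sc.stop()
  }

  def someFunc(x: Int): Int = x + 1
}
{code}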

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally) and have no idea what is causing 
> it or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all 
> my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>

[jira] [Commented] (SPARK-5594) SparkException: Failed to get broadcast (TorrentBroadcast)

2015-07-02 Thread SuYan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612785#comment-14612785
 ] 

SuYan commented on SPARK-5594:
--

Do you write something like:
object XXX {
  val sc = new SparkContext()   // SparkContext kept as a field on the object

  def main(args: Array[String]): Unit = {
    val rdd = sc.parallelize(1 to 10)
    rdd.map { _ => someFunc() }
  }

  def someFunc(): Unit = {}
}

Our users hit the same exception because the SparkContext was made a static 
(object-level) variable instead of a local variable.

> SparkException: Failed to get broadcast (TorrentBroadcast)
> --
>
> Key: SPARK-5594
> URL: https://issues.apache.org/jira/browse/SPARK-5594
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0, 1.3.0
>Reporter: John Sandiford
>Priority: Critical
>
> I am uncertain whether this is a bug; however, I am getting the error below 
> when running on a cluster (it works locally) and have no idea what is causing 
> it or where to look for more information.
> Any help is appreciated.  Others appear to experience the same issue, but I 
> have not found any solutions online.
> Please note that this only happens with certain code and is repeatable; all 
> my other Spark jobs work fine.
> {noformat}
> ERROR TaskSetManager: Task 3 in stage 6.0 failed 4 times; aborting job
> Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
> to stage failure: Task 3 in stage 6.0 failed 4 times, most recent failure: 
> Lost task 3.3 in stage 6.0 (TID 24, ): java.io.IOException: 
> org.apache.spark.SparkException: Failed to get broadcast_6_piece0 of 
> broadcast_6
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
> at org.apache.spark.scheduler.Task.run(Task.scala:56)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: org.apache.spark.SparkException: Failed to get broadcast_6_piece0 
> of broadcast_6
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
> at scala.Option.getOrElse(Option.scala:120)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
> ... 11 more
> {noformat}
> Driver stacktrace:
> {noformat}
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
>

[jira] [Updated] (SPARK-6980) Akka timeout exceptions indicate which conf controls them

2015-07-02 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-6980:

Fix Version/s: (was: 1.6.0)
   1.5.0

> Akka timeout exceptions indicate which conf controls them
> -
>
> Key: SPARK-6980
> URL: https://issues.apache.org/jira/browse/SPARK-6980
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Bryan Cutler
>Priority: Minor
>  Labels: starter
> Fix For: 1.5.0
>
> Attachments: Spark-6980-Test.scala
>
>
> If you hit one of the akka timeouts, you just get an exception like
> {code}
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> {code}
> The exception doesn't indicate how to change the timeout, though there is 
> usually (always?) a corresponding setting in {{SparkConf}}.  It would be 
> nice if the exception included the relevant setting.
> I think this should be pretty easy to do -- we just need to create something 
> like a {{NamedTimeout}}.  It would have its own {{await}} method that catches 
> the akka timeout and throws its own exception.  We should change 
> {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a 
> {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a 
> better exception.
> Given the latest refactoring to the rpc layer, this needs to be done in both 
> {{AkkaUtils}} and {{AkkaRpcEndpoint}}.
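
A minimal sketch of the NamedTimeout idea described above; the names and shapes are assumptions for illustration, not the class that was eventually merged:
{code}
import java.util.concurrent.TimeoutException
import scala.concurrent.{Await, Awaitable}
import scala.concurrent.duration.FiniteDuration

class NamedTimeout(duration: FiniteDuration, confKey: String) {
  // Await the result and, on timeout, rethrow with the controlling conf key in the message
  def await[T](awaitable: Awaitable[T]): T =
    try {
      Await.result(awaitable, duration)
    } catch {
      case e: TimeoutException =>
        throw new TimeoutException(s"${e.getMessage}. This timeout is controlled by '$confKey'.")
    }
}
{code}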






[jira] [Updated] (SPARK-6980) Akka timeout exceptions indicate which conf controls them

2015-07-02 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-6980:

Assignee: Bryan Cutler  (was: Harsh Gupta)

> Akka timeout exceptions indicate which conf controls them
> -
>
> Key: SPARK-6980
> URL: https://issues.apache.org/jira/browse/SPARK-6980
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Bryan Cutler
>Priority: Minor
>  Labels: starter
> Fix For: 1.5.0
>
> Attachments: Spark-6980-Test.scala
>
>
> If you hit one of the akka timeouts, you just get an exception like
> {code}
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> {code}
> The exception doesn't indicate how to change the timeout, though there is 
> usually (always?) a corresponding setting in {{SparkConf}}.  It would be 
> nice if the exception included the relevant setting.
> I think this should be pretty easy to do -- we just need to create something 
> like a {{NamedTimeout}}.  It would have its own {{await}} method that catches 
> the akka timeout and throws its own exception.  We should change 
> {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a 
> {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a 
> better exception.
> Given the latest refactoring to the rpc layer, this needs to be done in both 
> {{AkkaUtils}} and {{AkkaRpcEndpoint}}.






[jira] [Created] (SPARK-8808) Fix assignments in SparkR

2015-07-02 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-8808:
--

 Summary: Fix assignments in SparkR
 Key: SPARK-8808
 URL: https://issues.apache.org/jira/browse/SPARK-8808
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Reporter: Yu Ishikawa


{noformat}
inst/tests/test_binary_function.R:79:12: style: Use <-, not =, for assignment.
  mockFile = c("Spark is pretty.", "Spark is awesome.")
{noformat}






[jira] [Resolved] (SPARK-6980) Akka timeout exceptions indicate which conf controls them

2015-07-02 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-6980.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 6205
[https://github.com/apache/spark/pull/6205]

> Akka timeout exceptions indicate which conf controls them
> -
>
> Key: SPARK-6980
> URL: https://issues.apache.org/jira/browse/SPARK-6980
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Harsh Gupta
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
> Attachments: Spark-6980-Test.scala
>
>
> If you hit one of the akka timeouts, you just get an exception like
> {code}
> java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
> {code}
> The exception doesn't indicate how to change the timeout, though there is 
> usually (always?) a corresponding setting in {{SparkConf}}.  It would be 
> nice if the exception included the relevant setting.
> I think this should be pretty easy to do -- we just need to create something 
> like a {{NamedTimeout}}.  It would have its own {{await}} method that catches 
> the akka timeout and throws its own exception.  We should change 
> {{RpcUtils.askTimeout}} and {{RpcUtils.lookupTimeout}} to always give a 
> {{NamedTimeout}}, so we can be sure that anytime we have a timeout, we get a 
> better exception.
> Given the latest refactoring to the rpc layer, this needs to be done in both 
> {{AkkaUtils}} and {{AkkaRpcEndpoint}}.






[jira] [Assigned] (SPARK-8069) Add support for cutoff to RandomForestClassifier

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8069:
---

Assignee: (was: Apache Spark)

> Add support for cutoff to RandomForestClassifier
> 
>
> Key: SPARK-8069
> URL: https://issues.apache.org/jira/browse/SPARK-8069
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> Consider adding support for cutoffs similar to 
> http://cran.r-project.org/web/packages/randomForest/randomForest.pdf 






[jira] [Commented] (SPARK-8069) Add support for cutoff to RandomForestClassifier

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612771#comment-14612771
 ] 

Apache Spark commented on SPARK-8069:
-

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/7205

> Add support for cutoff to RandomForestClassifier
> 
>
> Key: SPARK-8069
> URL: https://issues.apache.org/jira/browse/SPARK-8069
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Priority: Minor
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> Consider adding support for cutoffs similar to 
> http://cran.r-project.org/web/packages/randomForest/randomForest.pdf 






[jira] [Assigned] (SPARK-8069) Add support for cutoff to RandomForestClassifier

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8069:
---

Assignee: Apache Spark

> Add support for cutoff to RandomForestClassifier
> 
>
> Key: SPARK-8069
> URL: https://issues.apache.org/jira/browse/SPARK-8069
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: holdenk
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> Consider adding support for cutoffs similar to 
> http://cran.r-project.org/web/packages/randomForest/randomForest.pdf 






[jira] [Assigned] (SPARK-8549) Fix the line length of SparkR

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8549:
---

Assignee: Apache Spark

> Fix the line length of SparkR
> -
>
> Key: SPARK-8549
> URL: https://issues.apache.org/jira/browse/SPARK-8549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>Assignee: Apache Spark
>







[jira] [Assigned] (SPARK-8549) Fix the line length of SparkR

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8549:
---

Assignee: (was: Apache Spark)

> Fix the line length of SparkR
> -
>
> Key: SPARK-8549
> URL: https://issues.apache.org/jira/browse/SPARK-8549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>







[jira] [Commented] (SPARK-8549) Fix the line length of SparkR

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612763#comment-14612763
 ] 

Apache Spark commented on SPARK-8549:
-

User 'yu-iskw' has created a pull request for this issue:
https://github.com/apache/spark/pull/7204

> Fix the line length of SparkR
> -
>
> Key: SPARK-8549
> URL: https://issues.apache.org/jira/browse/SPARK-8549
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>







[jira] [Updated] (SPARK-8497) Graph Clique(Complete Connected Sub-graph) Discovery Algorithm

2015-07-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-8497:
-
Assignee: Fan Jiang

> Graph Clique(Complete Connected Sub-graph) Discovery Algorithm
> --
>
> Key: SPARK-8497
> URL: https://issues.apache.org/jira/browse/SPARK-8497
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX, ML, MLlib, Spark Core
>Reporter: Fan Jiang
>Assignee: Fan Jiang
>  Labels: features
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In recent years, the social network industry has had high demand for complete 
> connected sub-graph discovery, and so has telecom. Similar to the connection 
> graph from Twitter, the calls and other activities in the telecom world form a 
> huge social graph, and because of the nature of the communication medium it 
> reflects the strongest inter-person relationships, so graph-based analysis will 
> reveal tremendous value from telecom connections. 
> We need an algorithm in Spark to figure out ALL the strongest completely 
> connected sub-graphs (so-called cliques here) for EVERY person in the network, 
> which will be one of the starting points for understanding users' social 
> behaviour. 
> At Huawei, we have many real-world use cases that involve telecom social 
> graphs with tens of billions of edges and hundreds of millions of vertices, 
> and the number of cliques will also be in the tens of millions. The graph 
> changes quickly, which means we need to analyse the graph pattern very often 
> (one result per day/week over a moving time window that spans multiple months). 






[jira] [Updated] (SPARK-6487) Add sequential pattern mining algorithm to Spark MLlib

2015-07-02 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6487:
-
Shepherd: Xiangrui Meng
Target Version/s: 1.5.0

> Add sequential pattern mining algorithm to Spark MLlib
> --
>
> Key: SPARK-6487
> URL: https://issues.apache.org/jira/browse/SPARK-6487
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Zhang JiaJin
>
> [~mengxr] [~zhangyouhua]
> Sequential pattern mining is an important branch of pattern mining. In our 
> past work, we used sequence mining (mainly the PrefixSpan algorithm) to find 
> telecommunication signaling sequence patterns and achieved good results. But 
> once the data becomes too large, the running time is too long and may not even 
> meet the service requirements. We are ready to implement the PrefixSpan 
> algorithm in Spark and apply it to our subsequent work. 
> The related Paper: 
> PrefixSpan: 
> Pei, Jian, et al. "Mining sequential patterns by pattern-growth: The 
> prefixspan approach." Knowledge and Data Engineering, IEEE Transactions on 
> 16.11 (2004): 1424-1440.
> Parallel Algorithm: 
> Cong, Shengnan, Jiawei Han, and David Padua. "Parallel mining of closed 
> sequential patterns." Proceedings of the eleventh ACM SIGKDD international 
> conference on Knowledge discovery in data mining. ACM, 2005.
> Distributed Algorithm: 
> Wei, Yong-qing, Dong Liu, and Lin-shan Duan. "Distributed PrefixSpan 
> algorithm based on MapReduce." Information Technology in Medicine and 
> Education (ITME), 2012 International Symposium on. Vol. 2. IEEE, 2012.
> Pattern mining and sequential mining Knowledge: 
> Han, Jiawei, et al. "Frequent pattern mining: current status and future 
> directions." Data Mining and Knowledge Discovery 15.1 (2007): 55-86.






[jira] [Assigned] (SPARK-8572) Type coercion for ScalaUDFs

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8572:
---

Assignee: Apache Spark

> Type coercion for ScalaUDFs
> ---
>
> Key: SPARK-8572
> URL: https://issues.apache.org/jira/browse/SPARK-8572
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> Seems we do not do type coercion for ScalaUDFs. The following code will hit a 
> runtime exception.
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((x: Int) => x + 1)
> val df = sqlContext.range(1, 10).toDF("i").select(myUDF($"i"))
> df.explain(true)
> df.show
> {code}
> It is also good to check if we do type coercion for PythonUDFs.
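Until coercion is implemented, one workaround sketch (not the fix this ticket proposes) is to cast the column explicitly to the type the UDF expects:

{code}
import org.apache.spark.sql.functions._

val myUDF = udf((x: Int) => x + 1)

// range() produces a LongType column, so cast it to IntegerType before
// handing it to the Int => Int UDF; without the cast the call fails at runtime.
val df = sqlContext.range(1, 10).toDF("i")
  .select(myUDF(col("i").cast("int")))

df.show()
{code}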



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8572) Type coercion for ScalaUDFs

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612742#comment-14612742
 ] 

Apache Spark commented on SPARK-8572:
-

User 'piaozhexiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/7203

> Type coercion for ScalaUDFs
> ---
>
> Key: SPARK-8572
> URL: https://issues.apache.org/jira/browse/SPARK-8572
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Seems we do not do type coercion for ScalaUDFs. The following code will hit a 
> runtime exception.
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((x: Int) => x + 1)
> val df = sqlContext.range(1, 10).toDF("i").select(myUDF($"i"))
> df.explain(true)
> df.show
> {code}
> It is also good to check if we do type coercion for PythonUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8572) Type coercion for ScalaUDFs

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8572:
---

Assignee: (was: Apache Spark)

> Type coercion for ScalaUDFs
> ---
>
> Key: SPARK-8572
> URL: https://issues.apache.org/jira/browse/SPARK-8572
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Seems we do not do type coercion for ScalaUDFs. The following code will hit a 
> runtime exception.
> {code}
> import org.apache.spark.sql.functions._
> val myUDF = udf((x: Int) => x + 1)
> val df = sqlContext.range(1, 10).toDF("i").select(myUDF($"i"))
> df.explain(true)
> df.show
> {code}
> It is also good to check if we do type coercion for PythonUDFs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8806) run-tests scala style must fail if it does not adhere to Spark Code Style Guide

2015-07-02 Thread Rekha Joshi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rekha Joshi updated SPARK-8806:
---
Description: 
./dev/run-tests Scala style check must fail if the code does not adhere to the 
Spark Code Style Guide:
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

The Scala style check currently passes even if the code does not adhere to the 
style guide, so a passing check gives only a false illusion of correctness.

Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), 
similar to hadoop-format.xml, to avoid style issues?


  was:
./dev/run-tests Scala Style must fail if it does not adhere to Spark Code Style 
Guide
Spark Scala Style 
:https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

The scala style test passes even if it does not adhere to style guide. Now scala 
style pass check gives only false illusion of correctness.

Alternatively if we can have spark-format.xml for IDE (intellij/eclipse) similar 
to hadoop-format.xml to avoid style issues?



> run-tests scala style must fail if it does not adhere to Spark Code Style 
> Guide
> ---
>
> Key: SPARK-8806
> URL: https://issues.apache.org/jira/browse/SPARK-8806
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> ./dev/run-tests Scala style check must fail if the code does not adhere to the 
> Spark Code Style Guide:
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
> The Scala style check currently passes even if the code does not adhere to the 
> style guide, so a passing check gives only a false illusion of correctness.
> Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), 
> similar to hadoop-format.xml, to avoid style issues?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8807) Add between operator in SparkR

2015-07-02 Thread Yu Ishikawa (JIRA)
Yu Ishikawa created SPARK-8807:
--

 Summary: Add between operator in SparkR
 Key: SPARK-8807
 URL: https://issues.apache.org/jira/browse/SPARK-8807
 Project: Spark
  Issue Type: New Feature
  Components: SparkR
Reporter: Yu Ishikawa


Add between operator in SparkR

```
df$age between c(1, 2)
```
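For reference, the Scala DataFrame API already has Column.between (since 1.4), which the proposed SparkR operator would mirror; a minimal Scala sketch, assuming df is a DataFrame with an age column:

{code}
import org.apache.spark.sql.functions.col

// Keep rows whose age lies in the closed interval [1, 2].
val filtered = df.filter(col("age").between(1, 2))
filtered.show()
{code}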



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-07-02 Thread Cheng Hao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612728#comment-14612728
 ] 

Cheng Hao commented on SPARK-8159:
--

Would it be possible to add support for all of the expressions in a SINGLE PR for 
the Python API and another SINGLE PR for R, after we finish all of the expressions?

At least we would save Jenkins resources compared to adding them one by one.

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8806) run-tests scala style must fail if it does not adhere to Spark Code Style Guide

2015-07-02 Thread Rekha Joshi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612727#comment-14612727
 ] 

Rekha Joshi commented on SPARK-8806:


Looking into what is possible in the run-tests script. Thanks.

> run-tests scala style must fail if it does not adhere to Spark Code Style 
> Guide
> ---
>
> Key: SPARK-8806
> URL: https://issues.apache.org/jira/browse/SPARK-8806
> Project: Spark
>  Issue Type: Wish
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Rekha Joshi
>
> ./dev/run-tests Scala style check must fail if the code does not adhere to the 
> Spark Code Style Guide:
> https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
> The Scala style check currently passes even if the code does not adhere to the 
> style guide, so a passing check gives only a false illusion of correctness.
> Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), 
> similar to hadoop-format.xml, to avoid style issues?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8806) run-tests scala style must fail if it does not adhere to Spark Code Style Guide

2015-07-02 Thread Rekha Joshi (JIRA)
Rekha Joshi created SPARK-8806:
--

 Summary: run-tests scala style must fail if it does not adhere to 
Spark Code Style Guide
 Key: SPARK-8806
 URL: https://issues.apache.org/jira/browse/SPARK-8806
 Project: Spark
  Issue Type: Wish
  Components: Build
Affects Versions: 1.5.0
Reporter: Rekha Joshi


./dev/run-tests Scala style check must fail if the code does not adhere to the 
Spark Code Style Guide:
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide

The Scala style check currently passes even if the code does not adhere to the 
style guide, so a passing check gives only a false illusion of correctness.

Alternatively, could we have a spark-format.xml for IDEs (IntelliJ/Eclipse), 
similar to hadoop-format.xml, to avoid style issues?




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8782) GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8782.

   Resolution: Fixed
Fix Version/s: 1.5.0

> GenerateOrdering fails for NullType (i.e. ORDER BY NULL crashes)
> 
>
> Key: SPARK-8782
> URL: https://issues.apache.org/jira/browse/SPARK-8782
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.5.0
>
>
> Queries containing ORDER BY NULL currently result in a code generation 
> exception:
> {code}
>   public SpecificOrdering 
> generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
> return new SpecificOrdering(expr);
>   }
>   class SpecificOrdering extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseOrdering {
> private org.apache.spark.sql.catalyst.expressions.Expression[] 
> expressions = null;
> public 
> SpecificOrdering(org.apache.spark.sql.catalyst.expressions.Expression[] expr) 
> {
>   expressions = expr;
> }
> @Override
> public int compare(InternalRow a, InternalRow b) {
>   InternalRow i = null;  // Holds current row being evaluated.
>   
>   i = a;
>   final Object primitive1 = null;
>   i = b;
>   final Object primitive3 = null;
>   if (true && true) {
> // Nothing
>   } else if (true) {
> return -1;
>   } else if (true) {
> return 1;
>   } else {
> int comp = primitive1.compare(primitive3);
> if (comp != 0) {
>   return comp;
> }
>   }
>   
>   return 0;
> }
>   }
> org.codehaus.commons.compiler.CompileException: Line 29, Column 43: A method 
> named "compare" is not declared in any enclosing class nor any supertype, nor 
> through a static import
>   at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
> {code}
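A sketch of a query that should reach this code path (table and column names are illustrative):

{code}
val df = sqlContext.range(0, 10)   // trivial table with a single "id" column
df.registerTempTable("t")

// Sorting by a literal NULL makes the sort key NullType, which the generated
// ordering shown above cannot handle.
sqlContext.sql("SELECT * FROM t ORDER BY NULL").collect()
{code}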



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8159) Improve SQL/DataFrame expression coverage

2015-07-02 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612725#comment-14612725
 ] 

Yu Ishikawa commented on SPARK-8159:


[~rxin] How should we deal with the {{pyspark}} and {{SparkR}} versions? 
- Make another umbrella issue each for {{pyspark}} and {{SparkR}}
- Make sub-issues under each of those umbrella issues
- Reopen each issue for {{pyspark}} and {{SparkR}}

And which ones should we support in {{pyspark}} and {{SparkR}}? All?

> Improve SQL/DataFrame expression coverage
> -
>
> Key: SPARK-8159
> URL: https://issues.apache.org/jira/browse/SPARK-8159
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>
> This is an umbrella ticket to track new expressions we are adding to 
> SQL/DataFrame.
> For each new expression, we should:
> 1. Add a new Expression implementation in 
> org.apache.spark.sql.catalyst.expressions
> 2. If applicable, implement the code generated version (by implementing 
> genCode).
> 3. Add comprehensive unit tests (for all the data types the expressions 
> support).
> 4. If applicable, add a new function for DataFrame in 
> org.apache.spark.sql.functions, and python/pyspark/sql/functions.py for 
> Python.
> For date/time functions, put them in expressions/datetime.scala, and create a 
> DateTimeFunctionSuite.scala for testing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8768) SparkSubmitSuite fails on Hadoop 1.x builds due to java.lang.VerifyError in Akka Protobuf

2015-07-02 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612724#comment-14612724
 ] 

Josh Rosen commented on SPARK-8768:
---

[~zsxwing], I don't think so: the master Maven build uses build/mvn, which, as 
far as I know, should now be downloading the newer Maven version that is 
supposed to work.

> SparkSubmitSuite fails on Hadoop 1.x builds due to java.lang.VerifyError in 
> Akka Protobuf
> -
>
> Key: SPARK-8768
> URL: https://issues.apache.org/jira/browse/SPARK-8768
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> The end-to-end SparkSubmitSuite tests ("launch simple application with 
> spark-submit", "include jars passed in through --jars", and "include jars 
> passed in through --packages") are currently failing for the pre-YARN Hadoop 
> builds.
> I managed to reproduce one of the Jenkins failures locally:
> {code}
> build/mvn -Phadoop-1 -Dhadoop.version=1.2.1 -Phive -Phive-thriftserver 
> -Pkinesis-asl test -DwildcardSuites=org.apache.spark.deploy.SparkSubmitSuite 
> -Dtest=none
> {code}
> Here's the output from unit-tests.log:
> {code}
> = TEST OUTPUT FOR o.a.s.deploy.SparkSubmitSuite: 'launch simple 
> application with spark-submit' =
> 15/07/01 13:39:58.964 redirect stderr for command ./bin/spark-submit INFO 
> Utils: SLF4J: Class path contains multiple SLF4J bindings.
> 15/07/01 13:39:58.964 redirect stderr for command ./bin/spark-submit INFO 
> Utils: SLF4J: Found binding in 
> [jar:file:/Users/joshrosen/Documents/spark-2/assembly/target/scala-2.10/spark-assembly-1.5.0-SNAPSHOT-hadoop1.2.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> 15/07/01 13:39:58.965 redirect stderr for command ./bin/spark-submit INFO 
> Utils: SLF4J: Found binding in 
> [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.10/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> 15/07/01 13:39:58.965 redirect stderr for command ./bin/spark-submit INFO 
> Utils: SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> 15/07/01 13:39:58.965 redirect stderr for command ./bin/spark-submit INFO 
> Utils: SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 15/07/01 13:39:58.966 redirect stderr for command ./bin/spark-submit INFO 
> Utils: 15/07/01 13:39:58 INFO SparkContext: Running Spark version 
> 1.5.0-SNAPSHOT
> 15/07/01 13:39:59.334 redirect stderr for command ./bin/spark-submit INFO 
> Utils: 15/07/01 13:39:59 INFO SecurityManager: Changing view acls to: 
> joshrosen
> 15/07/01 13:39:59.335 redirect stderr for command ./bin/spark-submit INFO 
> Utils: 15/07/01 13:39:59 INFO SecurityManager: Changing modify acls to: 
> joshrosen
> 15/07/01 13:39:59.335 redirect stderr for command ./bin/spark-submit INFO 
> Utils: 15/07/01 13:39:59 INFO SecurityManager: SecurityManager: 
> authentication disabled; ui acls disabled; users with view permissions: 
> Set(joshrosen); users with modify permissions: Set(joshrosen)
> 15/07/01 13:39:59.898 redirect stderr for command ./bin/spark-submit INFO 
> Utils: 15/07/01 13:39:59 INFO Slf4jLogger: Slf4jLogger started
> 15/07/01 13:39:59.934 redirect stderr for command ./bin/spark-submit INFO 
> Utils: 15/07/01 13:39:59 INFO Remoting: Starting remoting
> 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO 
> Utils: 15/07/01 13:40:00 ERROR ActorSystemImpl: Uncaught fatal error from 
> thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down 
> ActorSystem [sparkDriver]
> 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO 
> Utils: java.lang.VerifyError: class 
> akka.remote.WireFormats$AkkaControlMessage overrides final method 
> getUnknownFields.()Lcom/google/protobuf/UnknownFieldSet;
> 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO 
> Utils:at java.lang.ClassLoader.defineClass1(Native Method)
> 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO 
> Utils:at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
> 15/07/01 13:40:00.009 redirect stderr for command ./bin/spark-submit INFO 
> Utils:at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> 15/07/01 13:40:00.010 redirect stderr for command ./bin/spark-submit INFO 
> Utils:at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
> 15/07/01 13:40:00.010 redirect stderr for command ./bin/spark-submit INFO 
> Utils:at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
> 15/07/01 13:40:00.010 redirect stderr for command ./bin/spark-submit INFO 
> Utils:at java.net.URLClassLoader$1.run(URLClassLoader.java

[jira] [Commented] (SPARK-5682) Add encrypted shuffle in spark

2015-07-02 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612719#comment-14612719
 ] 

liyunzhang_intel commented on SPARK-5682:
-

[~hujiayin]:
{quote}
 the AES solution is a bit heavy to encode/decode the live steaming data.
{quote}
  Is there any other solution to encode/decode the live streaming data? Please 
share your suggestion with us.

> Add encrypted shuffle in spark
> --
>
> Key: SPARK-5682
> URL: https://issues.apache.org/jira/browse/SPARK-5682
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Reporter: liyunzhang_intel
> Attachments: Design Document of Encrypted Spark 
> Shuffle_20150209.docx, Design Document of Encrypted Spark 
> Shuffle_20150318.docx, Design Document of Encrypted Spark 
> Shuffle_20150402.docx, Design Document of Encrypted Spark 
> Shuffle_20150506.docx
>
>
> Encrypted shuffle is enabled in Hadoop 2.6, which makes the shuffle data path 
> safer. This feature is necessary in Spark as well. AES is a specification for 
> the encryption of electronic data; there are 5 common modes in AES, and CTR is 
> one of them. We use two codecs, JceAesCtrCryptoCodec and 
> OpensslAesCtrCryptoCodec, to enable Spark encrypted shuffle; the same codecs 
> are used in Hadoop encrypted shuffle. JceAesCtrCryptoCodec uses the encryption 
> algorithms the JDK provides, while OpensslAesCtrCryptoCodec uses the encryption 
> algorithms OpenSSL provides. 
> Because UGI credential info is used in the process of encrypted shuffle, we 
> first enable encrypted shuffle on the Spark-on-YARN framework.
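To illustrate the JCE-based codec approach described above, here is a minimal, self-contained AES/CTR sketch using javax.crypto; it is not the proposed Spark code, and key/IV handling is deliberately simplified for illustration.

{code}
import java.io.ByteArrayOutputStream
import javax.crypto.{Cipher, CipherOutputStream, KeyGenerator}
import javax.crypto.spec.IvParameterSpec

// Generate a random 128-bit AES key (real code would derive keys from credentials).
val keyGen = KeyGenerator.getInstance("AES")
keyGen.init(128)
val key = keyGen.generateKey()

// CTR mode turns AES into a stream cipher, which suits shuffle byte streams.
val iv = new Array[Byte](16)   // all-zero IV for illustration only; never reuse an IV with the same key
val cipher = Cipher.getInstance("AES/CTR/NoPadding")
cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv))

val buffer = new ByteArrayOutputStream()
val encrypted = new CipherOutputStream(buffer, cipher)
encrypted.write("shuffle block bytes".getBytes("UTF-8"))
encrypted.close()

println("ciphertext length = " + buffer.size() + " bytes")
{code}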



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7263) Add new shuffle manager which stores shuffle blocks in Parquet

2015-07-02 Thread Matt Massie (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612686#comment-14612686
 ] 

Matt Massie commented on SPARK-7263:


I just pushed a large update to my github account. I'll have a PR submitted to 
the Spark project very soon.

https://github.com/apache/spark/compare/master...massie:parquet-shuffle

> Add new shuffle manager which stores shuffle blocks in Parquet
> --
>
> Key: SPARK-7263
> URL: https://issues.apache.org/jira/browse/SPARK-7263
> Project: Spark
>  Issue Type: New Feature
>  Components: Block Manager
>Reporter: Matt Massie
>
> I have a working prototype of this feature that can be viewed at
> https://github.com/apache/spark/compare/master...massie:parquet-shuffle?expand=1
> Setting the "spark.shuffle.manager" to "parquet" enables this shuffle manager.
> The dictionary support that Parquet provides appreciably reduces the amount of
> memory that objects use; however, once Parquet data is shuffled, all the
> dictionary information is lost and the column-oriented data is written to 
> shuffle
> blocks in a record-oriented fashion. This shuffle manager addresses this issue
> by reading and writing all shuffle blocks in the Parquet format.
> If shuffle objects are Avro records, then the Avro $SCHEMA is converted to a 
> Parquet schema and used directly; otherwise, the Parquet schema is generated 
> via reflection.
> Currently, the only non-Avro keys supported are primitive types. The reflection
> code can be improved (or replaced) to support complex records.
> The ParquetShufflePair class allows the shuffle key and value to be stored in
> Parquet blocks as a single record with a single schema.
> This commit adds the following new Spark configuration options:
> "spark.shuffle.parquet.compression" - sets the Parquet compression codec
> "spark.shuffle.parquet.blocksize" - sets the Parquet block size
> "spark.shuffle.parquet.pagesize" - set the Parquet page size
> "spark.shuffle.parquet.enabledictionary" - turns dictionary encoding on/off
> Parquet does not (and has no plans to) support a streaming API. Metadata 
> sections
> are scattered through a Parquet file making a streaming API difficult. As 
> such,
> the ShuffleBlockFetcherIterator has been modified to fetch the entire contents
> of map outputs into temporary blocks before loading the data into the reducer.
> Interesting future asides:
> o There is no need to define a data serializer (although Spark requires it)
> o Parquet supports predicate pushdown and projection, which could be used
>   between shuffle stages to improve performance in the future
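The configuration options listed above would be set like any other Spark configuration; a sketch (the keys come from the prototype branch and are not recognized by stock Spark, and the values shown are only examples):

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parquet-shuffle-demo")
  .set("spark.shuffle.manager", "parquet")
  .set("spark.shuffle.parquet.compression", "snappy")                  // example codec
  .set("spark.shuffle.parquet.blocksize", (64 * 1024 * 1024).toString) // example block size
  .set("spark.shuffle.parquet.pagesize", (1024 * 1024).toString)       // example page size
  .set("spark.shuffle.parquet.enabledictionary", "true")

val sc = new SparkContext(conf)
{code}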



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5581) When writing sorted map output file, avoid open / close between each partition

2015-07-02 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612676#comment-14612676
 ] 

Matt Cheah commented on SPARK-5581:
---

I'd be interested in taking something like this on =). [~joshrosen] it sounds 
like there are still some open questions though; can I write up a PR taking 
your comments into consideration and we can iterate from there?

> When writing sorted map output file, avoid open / close between each partition
> --
>
> Key: SPARK-5581
> URL: https://issues.apache.org/jira/browse/SPARK-5581
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.3.0
>Reporter: Sandy Ryza
>
> {code}
>   // Bypassing merge-sort; get an iterator by partition and just write 
> everything directly.
>   for ((id, elements) <- this.partitionedIterator) {
> if (elements.hasNext) {
>   val writer = blockManager.getDiskWriter(
> blockId, outputFile, ser, fileBufferSize, 
> context.taskMetrics.shuffleWriteMetrics.get)
>   for (elem <- elements) {
> writer.write(elem)
>   }
>   writer.commitAndClose()
>   val segment = writer.fileSegment()
>   lengths(id) = segment.length
> }
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-8784) Add python API for hex/unhex

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-8784:


Reopened due to build breaking.


> Add python API for hex/unhex
> 
>
> Key: SPARK-8784
> URL: https://issues.apache.org/jira/browse/SPARK-8784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7483) [MLLib] Using Kryo with FPGrowth fails with an exception

2015-07-02 Thread S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612673#comment-14612673
 ] 

S commented on SPARK-7483:
--

This does NOT work.

Registering the classes stops it from crashing, but produces a bug in the 
FP-Growth algorithm.

Specifically, the frequency counts for itemsets are wrong.

:(

> [MLLib] Using Kryo with FPGrowth fails with an exception
> 
>
> Key: SPARK-7483
> URL: https://issues.apache.org/jira/browse/SPARK-7483
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.3.1
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> When using FPGrowth algorithm with KryoSerializer - Spark fails with
> {code}
> Job aborted due to stage failure: Task 0 in stage 9.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 9.0 (TID 16, localhost): 
> com.esotericsoftware.kryo.KryoException: java.lang.IllegalArgumentException: 
> Can not set final scala.collection.mutable.ListBuffer field 
> org.apache.spark.mllib.fpm.FPTree$Summary.nodes to 
> scala.collection.mutable.ArrayBuffer
> Serialization trace:
> nodes (org.apache.spark.mllib.fpm.FPTree$Summary)
> org$apache$spark$mllib$fpm$FPTree$$summaries 
> (org.apache.spark.mllib.fpm.FPTree)
> {code}
> This can be easily reproduced in spark codebase by setting 
> {code}
> conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
> {code} and running FPGrowthSuite.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8801) Support TypeCollection in ExpectsInputTypes

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8801:
---

Assignee: Reynold Xin  (was: Apache Spark)

> Support TypeCollection in ExpectsInputTypes
> ---
>
> Key: SPARK-8801
> URL: https://issues.apache.org/jira/browse/SPARK-8801
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Some functions support more than one input type for each parameter. For 
> example, length supports binary and string, and maybe array/struct in the 
> future.
> This ticket proposes a TypeCollection AbstractDataType that supports multiple 
> data types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8801) Support TypeCollection in ExpectsInputTypes

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8801:
---

Assignee: Apache Spark  (was: Reynold Xin)

> Support TypeCollection in ExpectsInputTypes
> ---
>
> Key: SPARK-8801
> URL: https://issues.apache.org/jira/browse/SPARK-8801
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>
> Some functions support more than one input type for each parameter. For 
> example, length supports binary and string, and maybe array/struct in the 
> future.
> This ticket proposes a TypeCollection AbstractDataType that supports multiple 
> data types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8801) Support TypeCollection in ExpectsInputTypes

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612671#comment-14612671
 ] 

Apache Spark commented on SPARK-8801:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7202

> Support TypeCollection in ExpectsInputTypes
> ---
>
> Key: SPARK-8801
> URL: https://issues.apache.org/jira/browse/SPARK-8801
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Some functions support more than one input type for each parameter. For 
> example, length supports binary and string, and maybe array/struct in the 
> future.
> This ticket proposes a TypeCollection AbstractDataType that supports multiple 
> data types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8803) Crosstab element's can't contain null's and back ticks

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8803:
---

Assignee: (was: Apache Spark)

> Crosstab element's can't contain null's and back ticks
> --
>
> Key: SPARK-8803
> URL: https://issues.apache.org/jira/browse/SPARK-8803
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>
> Having back ticks or nulls as elements causes problems. 
> Since elements become column names, we have to drop back ticks from the 
> elements, as back ticks are special characters in column names.
> Having nulls throws exceptions; we could replace them with empty strings.
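For context, a minimal sketch of the API in question; the example data and column names are made up, and sqlContext is the usual shell SQLContext. The distinct values of the second column become column names of the result, which is why back ticks or nulls in the data cause trouble:

{code}
val df = sqlContext.createDataFrame(Seq(
  ("a", "x"), ("a", "y"), ("b", "y")
)).toDF("key", "value")

// Each distinct "value" ("x", "y") becomes a column of the resulting DataFrame.
val ct = df.stat.crosstab("key", "value")
ct.show()
{code}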



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8803) Crosstab element's can't contain null's and back ticks

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8803:
---

Assignee: Apache Spark

> Crosstab element's can't contain null's and back ticks
> --
>
> Key: SPARK-8803
> URL: https://issues.apache.org/jira/browse/SPARK-8803
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>Assignee: Apache Spark
>
> Having back ticks or nulls as elements causes problems. 
> Since elements become column names, we have to drop back ticks from the 
> elements, as back ticks are special characters in column names.
> Having nulls throws exceptions; we could replace them with empty strings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8803) Crosstab element's can't contain null's and back ticks

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612662#comment-14612662
 ] 

Apache Spark commented on SPARK-8803:
-

User 'brkyvz' has created a pull request for this issue:
https://github.com/apache/spark/pull/7201

> Crosstab element's can't contain null's and back ticks
> --
>
> Key: SPARK-8803
> URL: https://issues.apache.org/jira/browse/SPARK-8803
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Burak Yavuz
>
> Having back ticks or nulls as elements causes problems. 
> Since elements become column names, we have to drop back ticks from the 
> elements, as back ticks are special characters in column names.
> Having nulls throws exceptions; we could replace them with empty strings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7768) Make user-defined type (UDT) API public

2015-07-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612658#comment-14612658
 ] 

Joseph K. Bradley commented on SPARK-7768:
--

Making VectorUDT public blocks on this issue, but it should probably be in a 
separate PR.

> Make user-defined type (UDT) API public
> ---
>
> Key: SPARK-7768
> URL: https://issues.apache.org/jira/browse/SPARK-7768
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Xiangrui Meng
>Priority: Critical
>
> As the demand for UDTs increases beyond sparse/dense vectors in MLlib, it 
> would be nice to make the UDT API public in 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8733) ML RDD.unpersist calls should use blocking = false

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-8733.

Resolution: Won't Fix

Per discussion on the linked PR, closing this for now.

> ML RDD.unpersist calls should use blocking = false
> --
>
> Key: SPARK-8733
> URL: https://issues.apache.org/jira/browse/SPARK-8733
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
> Attachments: Screen Shot 2015-06-30 at 10.51.44 AM.png
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> MLlib uses unpersist in many places, but is not consistent about blocking vs 
> not.  We should check through all of MLlib and change calls to use blocking = 
> false, unless there is a real need to block.  I have run into issues with 
> futures timing out because of unpersist() calls, when there was no real need 
> for the ML method to fail.
> See attached screenshot.  Training succeeded, but the final unpersist during 
> cleanup failed.
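The proposed change amounts to passing blocking = false at the unpersist call sites; a minimal sketch, assuming the shell SparkContext sc:

{code}
val training = sc.parallelize(1 to 1000000).cache()
training.count()   // materialize the cached blocks

// Non-blocking cleanup: returns immediately instead of waiting for every
// executor to confirm removal, so a slow or lost executor cannot make the
// surrounding ML method fail.
training.unpersist(blocking = false)
{code}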



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7104) Support model save/load in Python's Word2Vec

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7104.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 6821
[https://github.com/apache/spark/pull/6821]

> Support model save/load in Python's Word2Vec
> 
>
> Key: SPARK-7104
> URL: https://issues.apache.org/jira/browse/SPARK-7104
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7104) Support model save/load in Python's Word2Vec

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7104:
-
Assignee: Yu Ishikawa

> Support model save/load in Python's Word2Vec
> 
>
> Key: SPARK-7104
> URL: https://issues.apache.org/jira/browse/SPARK-7104
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>Priority: Minor
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8805) Spark shell not working

2015-07-02 Thread Perinkulam I Ganesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perinkulam I Ganesh updated SPARK-8805:
---
Description: 
I am using Git Bash on Windows. Installed OpenJDK 1.8.0_45 and Spark 1.4.0.

I am able to build Spark and install it. But whenever I execute spark-shell it 
gives me the following error:

$ spark-shell
/c/.../spark/bin/spark-class: line 76: conditional binary operator expected






  was:
I am using Git Bash.  Installed OpenJDK 1.8.0_45 and Spark 1.4.0

I am able to build Spark and install it. But whenever I execute spark-shell it 
gives me the following error:

$ spark-shell
/c/.../spark/bin/spark-class: line 76: conditional binary operator expected







> Spark shell not working
> ---
>
> Key: SPARK-8805
> URL: https://issues.apache.org/jira/browse/SPARK-8805
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Spark Core, Windows
>Reporter: Perinkulam I Ganesh
>
> I am using Git Bash on Windows. Installed OpenJDK 1.8.0_45 and Spark 1.4.0.
> I am able to build Spark and install it. But whenever I execute spark-shell 
> it gives me the following error:
> $ spark-shell
> /c/.../spark/bin/spark-class: line 76: conditional binary operator expected



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8805) Spark shell not working

2015-07-02 Thread Perinkulam I Ganesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perinkulam I Ganesh updated SPARK-8805:
---
Description: 
I am using Git Bash.  Installed OpenJDK 1.8.0_45 and Spark 1.4.0

I am able to build Spark and install it. But whenever I execute spark-shell it 
gives me the following error:

$ spark-shell
/c/.../spark/bin/spark-class: line 76: conditional binary operator expected






  was:
I am using Git Bash.  Installed OpenJDK 1.8.0_45 and Spark 1.4.0

I am able to build Spark and install it. But whenever I execute spark-shell it 
gives me the following error:






> Spark shell not working
> ---
>
> Key: SPARK-8805
> URL: https://issues.apache.org/jira/browse/SPARK-8805
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Spark Core, Windows
>Reporter: Perinkulam I Ganesh
>
> I am using Git Bash.  Installed OpenJDK 1.8.0_45 and Spark 1.4.0
> I am able to build Spark and install it. But whenever I execute spark-shell 
> it gives me the following error:
> $ spark-shell
> /c/.../spark/bin/spark-class: line 76: conditional binary operator expected



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612646#comment-14612646
 ] 

Cheng Lian commented on SPARK-8501:
---

Exactly. Please see my PR description here 
https://github.com/apache/spark/pull/7199

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> Not sure whether this should be considered as a bug of ORC bundled with Hive 
> 0.13.1: for an ORC file containing zero rows, the schema written in its 
> footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without a Hive metastore with 
> Spark SQL 1.4.0, it causes problems because currently the ORC data source just 
> picks an arbitrary part-file, whichever comes first, for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.
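To see the problem from the Spark SQL side, one can point the ORC data source at that directory directly; a sketch, assuming sqlContext is a HiveContext (the ORC data source lives in the Hive module in 1.4):

{code}
// Reading the raw directory (no metastore) triggers schema discovery, which may
// pick the empty part-file and yield a zero-field schema.
val orcDf = sqlContext.read.format("orc").load("/user/hive/warehouse_hive13/bar")
orcDf.printSchema()   // prints an empty struct if the empty ORC file was picked
{code}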



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8805) Spark shell not working

2015-07-02 Thread Perinkulam I Ganesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perinkulam I Ganesh updated SPARK-8805:
---
Description: 
I am using Git Bash.  Installed OpenJDK 1.8.0_45 and Spark 1.4.0

I am able to build Spark and install it. But whenever I execute spark-shell it 
gives me the following error:





> Spark shell not working
> ---
>
> Key: SPARK-8805
> URL: https://issues.apache.org/jira/browse/SPARK-8805
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Spark Core, Windows
>Reporter: Perinkulam I Ganesh
>
> I am using Git Bash.  Installed OpenJDK 1.8.0_45 and Spark 1.4.0
> I am able to build Spark and install it. But whenever I execute spark-shell 
> it gives me the following error:



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612643#comment-14612643
 ] 

Zhan Zhang commented on SPARK-8501:
---

Because in Spark we will not create the ORC file if there are no records. This 
only happens with ORC files created by Hive, right? 

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> Not sure whether this should be considered as a bug of ORC bundled with Hive 
> 0.13.1: for an ORC file containing zero rows, the schema written in its 
> footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without a Hive metastore with 
> Spark SQL 1.4.0, it causes problems because currently the ORC data source just 
> picks an arbitrary part-file, whichever comes first, for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8805) Spark shell not working

2015-07-02 Thread Perinkulam I Ganesh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Perinkulam I Ganesh updated SPARK-8805:
---
Summary: Spark shell not working  (was: I installed Git Bash)

> Spark shell not working
> ---
>
> Key: SPARK-8805
> URL: https://issues.apache.org/jira/browse/SPARK-8805
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Spark Core, Windows
>Reporter: Perinkulam I Ganesh
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8805) I installed Git Bash

2015-07-02 Thread Perinkulam I Ganesh (JIRA)
Perinkulam I Ganesh created SPARK-8805:
--

 Summary: I installed Git Bash
 Key: SPARK-8805
 URL: https://issues.apache.org/jira/browse/SPARK-8805
 Project: Spark
  Issue Type: Brainstorming
  Components: Spark Core, Windows
Reporter: Perinkulam I Ganesh






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612637#comment-14612637
 ] 

Reynold Xin commented on SPARK-8685:


The problem is that a Python Row doesn't allow duplicate field names, because under 
the hood it is stored as a dict.

> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):
> {code}
> [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
> name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> The workaround for now is this rather clunky piece of code:
> {code}
> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
> 'name2').withColumnRenamed('country', 'country2')
> df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
> 'left_outer').collect()
> {code}
> Also, {{.show()}} works
> {code}
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
> 'left_outer').show()
> +---+---+-+--+---+-+
> |age|country| name|colour|country| name|
> +---+---+-+--+---+-+
> |  3|ire|carol| green|ire|carol|
> |  2|jpn|alice|  null|   null| null|
> |  1|usa|  bob|   red|usa|  bob|
> +---+---+-+--+---+-+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8784) Add python API for hex/unhex

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-8784.

   Resolution: Fixed
Fix Version/s: 1.5.0

> Add python API for hex/unhex
> 
>
> Key: SPARK-8784
> URL: https://issues.apache.org/jira/browse/SPARK-8784
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.5.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8685:

Description: 
I have the following code:

{code}
from pyspark import SQLContext

d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
{'name':'alice', 'country': 'jpn', 'age': 2}, 
{'name':'carol', 'country': 'ire', 'age': 3}]

d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
{'name':'carol', 'country': 'ire', 'colour':'green'}]

r1 = sc.parallelize(d1)
r2 = sc.parallelize(d2)

sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
'left_outer').collect()
{code}

When I run it I get the following (notice that in the first row all join keys are 
taken from the right side and so are blanked out):

{code}
[Row(age=2, country=None, name=None, colour=None, country=None, name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

I would expect to get (though ideally without duplicate columns):
{code}
[Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

The workaround for now is this rather clunky piece of code:
{code}
df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
'name2').withColumnRenamed('country', 'country2')
df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
'left_outer').collect()
{code}

Also, {{.show()}} works
{code}
sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
'left_outer').show()
+---+---+-+--+---+-+
|age|country| name|colour|country| name|
+---+---+-+--+---+-+
|  3|ire|carol| green|ire|carol|
|  2|jpn|alice|  null|   null| null|
|  1|usa|  bob|   red|usa|  bob|
+---+---+-+--+---+-+
{code}

  was:
I have the following code:

{code}
from pyspark import SQLContext

d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
{'name':'alice', 'country': 'jpn', 'age': 2}, 
{'name':'carol', 'country': 'ire', 'age': 3}]

d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
{'name':'carol', 'country': 'ire', 'colour':'green'}]

r1 = sc.parallelize(d1)
r2 = sc.parallelize(d2)

sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
'left_outer').collect()
{code}

When I run it I get the following (notice that in the first row all join keys are 
taken from the right side and so are blanked out):

{code}
[Row(age=2, country=None, name=None, colour=None, country=None, name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

I would expect to get (though ideally without duplicate columns):
{code}
[Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

The workaround for now is this rather clunky piece of code:
{code}
df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
'name2').withColumnRenamed('country', 'country2')
df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
'left_outer').collect()
{code}


> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
> 'left_outer').collect()
> {code}

[jira] [Created] (SPARK-8804) order of UTF8String is wrong if there is any non-ascii character in it

2015-07-02 Thread Davies Liu (JIRA)
Davies Liu created SPARK-8804:
-

 Summary:  order of UTF8String is wrong if there is any non-ascii 
character in it
 Key: SPARK-8804
 URL: https://issues.apache.org/jira/browse/SPARK-8804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


We compare UTF8String values byte by byte, but bytes in the JVM are signed; they 
should be compared as unsigned.
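
A minimal Python sketch of the effect (illustrative only, not Spark's comparator): any 
byte above 0x7F flips sign when read as a signed JVM byte, so a non-ASCII string can be 
mis-ordered relative to an ASCII one.
{code}
# "é" encodes to the UTF-8 bytes [0xC3, 0xA9]; "z" encodes to [0x7A]
e_acute = bytearray(u"\u00e9".encode("utf-8"))
z = bytearray(u"z".encode("utf-8"))

def as_signed(b):
    # reinterpret an unsigned byte (0..255) as a signed JVM byte (-128..127)
    return b - 256 if b > 127 else b

print(list(e_acute), list(z))            # unsigned: [195, 169] vs [122] -> "é" sorts after "z" (correct)
print([as_signed(b) for b in e_acute])   # signed: [-61, -87] -> "é" would wrongly sort before "z" (122)
{code}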



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8685:

Fix Version/s: (was: 1.4.1)
   (was: 1.5.0)

> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row, all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):
> {code}
> [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
> name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> The workaround for now is this rather clunky piece of code:
> {code}
> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
> 'name2').withColumnRenamed('country', 'country2')
> df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
> 'left_outer').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8685:
---
Priority: Major  (was: Critical)

> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
> Fix For: 1.4.1, 1.5.0
>
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row, all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):
> {code}
> [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
> name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> The workaround for now is this rather clunky piece of code:
> {code}
> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
> 'name2').withColumnRenamed('country', 'country2')
> df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
> 'left_outer').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8685:
---
Target Version/s: 1.5.0  (was: 1.5.0, 1.4.2)

> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row, all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):
> {code}
> [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
> name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> The workaround for now is this rather clunky piece of code:
> {code}
> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
> 'name2').withColumnRenamed('country', 'country2')
> df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
> 'left_outer').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8803) Crosstab element's can't contain null's and back ticks

2015-07-02 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-8803:
--

 Summary: Crosstab element's can't contain null's and back ticks
 Key: SPARK-8803
 URL: https://issues.apache.org/jira/browse/SPARK-8803
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Burak Yavuz


Having back ticks or nulls as elements causes problems.

Since the elements become column names of the result, back ticks have to be dropped 
from the element values, because they are special characters in column names.

Null elements throw exceptions; we could replace them with empty strings.
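
A hypothetical PySpark repro sketch (assumes an existing SparkContext {{sc}}; the exact 
failure mode is as described above):
{code}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(
    [(1, 'a'), (2, None), (3, 'has `ticks`')],
    ['key', 'value'])

# each distinct value of `value` becomes a column name of the crosstab result,
# so None and back-tick elements end up embedded in column names
df.stat.crosstab('key', 'value').show()
{code}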



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-8685:

Description: 
I have the following code:

{code}
from pyspark import SQLContext

d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
{'name':'alice', 'country': 'jpn', 'age': 2}, 
{'name':'carol', 'country': 'ire', 'age': 3}]

d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
{'name':'carol', 'country': 'ire', 'colour':'green'}]

r1 = sc.parallelize(d1)
r2 = sc.parallelize(d2)

sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
'left_outer').collect()
{code}

When I run it I get the following (notice that in the first row, all join keys are 
taken from the right side and so are blanked out):

{code}
[Row(age=2, country=None, name=None, colour=None, country=None, name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

I would expect to get (though ideally without duplicate columns):
{code}
[Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

The workaround for now is this rather clunky piece of code:
{code}
df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
'name2').withColumnRenamed('country', 'country2')
df1.join(df2, (df1.name == df2.name2) & (df1.country == df2.country2), 
'left_outer').collect()
{code}

  was:
I have the following code:

{code}
from pyspark import SQLContext

d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
{'name':'alice', 'country': 'jpn', 'age': 2}, 
{'name':'carol', 'country': 'ire', 'age': 3}]

d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
{'name':'carol', 'country': 'ire', 'colour':'green'}]

r1 = sc.parallelize(d1)
r2 = sc.parallelize(d2)

sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, df1.name == df2.name and df1.country == df2.country, 
'left_outer').collect()
{code}

When I run it I get the following (notice that in the first row, all join keys are 
taken from the right side and so are blanked out):

{code}
[Row(age=2, country=None, name=None, colour=None, country=None, name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

I would expect to get (though ideally without duplicate columns):
{code}
[Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
name=u'bob'),
Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
name=u'alice')]
{code}

The workaround for now is this rather clunky piece of code:
{code}
df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
'name2').withColumnRenamed('country', 'country2')
df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, 
'left_outer').collect()
{code}


> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row, all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):

[jira] [Reopened] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai reopened SPARK-8685:
-

> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, df1.name == df2.name and df1.country == df2.country, 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row, all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):
> {code}
> [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
> name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> The workaround for now is this rather clunky piece of code:
> {code}
> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
> 'name2').withColumnRenamed('country', 'country2')
> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, 
> 'left_outer').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-8501:
--
Target Version/s: 1.4.1, 1.5.0  (was: 1.5.0, 1.4.2)

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> Not sure whether this should be considered as a bug of ORC bundled with Hive 
> 0.13.1: for an ORC file containing zero rows, the schema written in its 
> footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without Hive metastore with 
> Spark SQL 1.4.0, it causes problem because currently the ORC data source just 
> picks a random part-file whichever comes the first for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.
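
A hypothetical sketch of the Spark SQL side that hits this (assumes a {{HiveContext}} 
bound to {{sqlContext}} and the warehouse path from the repro above):
{code}
# the ORC data source infers the schema from one part-file; if that file has
# zero rows its footer schema is struct<>, and the DataFrame schema comes back empty
df = sqlContext.read.format("orc").load("/user/hive/warehouse_hive13/bar")
df.printSchema()
{code}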



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-8685.
--
   Resolution: Duplicate
 Assignee: Davies Liu
Fix Version/s: 1.5.0
   1.4.1

This is already fixed. 

> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, df1.name == df2.name and df1.country == df2.country, 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row, all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):
> {code}
> [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
> name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> The workaround for now is this rather clunky piece of code:
> {code}
> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
> 'name2').withColumnRenamed('country', 'country2')
> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, 
> 'left_outer').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8573) For PySpark's DataFrame API, we need to throw exceptions when users try to use and/or/not

2015-07-02 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-8573:
---
Fix Version/s: 1.5.0
   1.4.1

> For PySpark's DataFrame API, we need to throw exceptions when users try to 
> use and/or/not
> -
>
> Key: SPARK-8573
> URL: https://issues.apache.org/jira/browse/SPARK-8573
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.3.0
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
> Fix For: 1.4.1, 1.5.0
>
>
> In PySpark's DataFrame API, we have
> {code}
> # `and`, `or`, `not` cannot be overloaded in Python,
> # so use bitwise operators as boolean operators
> __and__ = _bin_op('and')
> __or__ = _bin_op('or')
> __invert__ = _func_op('not')
> __rand__ = _bin_op("and")
> __ror__ = _bin_op("or")
> {code}
> Right now, users can still use operators like {{and}}, which can cause very 
> confusing behavior. We need to throw an error when users try to use them and 
> tell them the right way to do it.
> For example, 
> {code}
> df = sqlContext.range(1, 10)
> df.id > 5 or df.id < 10
> Out[30]: Column<(id > 5)>
> df.id > 5 and df.id < 10
> Out[31]: Column<(id < 10)>
> {code}
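
For reference, a minimal sketch of the intended spelling (bitwise operators on 
{{Column}}s, parenthesized because {{&}}, {{|}} and {{~}} bind tighter than the 
comparisons):
{code}
df = sqlContext.range(1, 10)

df.filter((df.id > 5) & (df.id < 10)).collect()   # logical AND
df.filter((df.id > 5) | (df.id < 3)).collect()    # logical OR
df.filter(~(df.id > 5)).collect()                 # logical NOT
{code}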



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8685) dataframe left joins are not working as expected in pyspark

2015-07-02 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612626#comment-14612626
 ] 

Michael Armbrust commented on SPARK-8685:
-

I think the problem here is that you are using {{and}}, but instead should 
write {{(df1.name == df2.name) & (df1.country == df2.country)}}
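
A minimal sketch of why {{and}} silently misbehaves here (using the {{df1}}/{{df2}} from 
the quoted code below; illustrative only):
{code}
# `and` cannot be overloaded: Python checks the truthiness of the first Column
# (default object truthiness) and returns one operand unchanged
cond = (df1.name == df2.name) and (df1.country == df2.country)
print(cond)   # only the second comparison survives, so the join matches on country alone

# the overloadable bitwise operator keeps both comparisons in one Column
cond = (df1.name == df2.name) & (df1.country == df2.country)
{code}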

> dataframe left joins are not working as expected in pyspark
> ---
>
> Key: SPARK-8685
> URL: https://issues.apache.org/jira/browse/SPARK-8685
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
> Environment: ubuntu 14.04
>Reporter: axel dahl
>Priority: Critical
>
> I have the following code:
> {code}
> from pyspark import SQLContext
> d1 = [{'name':'bob', 'country': 'usa', 'age': 1},
> {'name':'alice', 'country': 'jpn', 'age': 2}, 
> {'name':'carol', 'country': 'ire', 'age': 3}]
> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
> {'name':'carol', 'country': 'ire', 'colour':'green'}]
> r1 = sc.parallelize(d1)
> r2 = sc.parallelize(d2)
> sqlContext = SQLContext(sc)
> df1 = sqlContext.createDataFrame(d1)
> df2 = sqlContext.createDataFrame(d2)
> df1.join(df2, df1.name == df2.name and df1.country == df2.country, 
> 'left_outer').collect()
> {code}
> When I run it I get the following (notice that in the first row, all join keys 
> are taken from the right side and so are blanked out):
> {code}
> [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> I would expect to get (though ideally without duplicate columns):
> {code}
> [Row(age=2, country=u'ire', name=u'alice', colour=None, country=None, 
> name=None),
> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', 
> name=u'bob'),
> Row(age=3, country=u'ire', name=u'carol', colour=u'green', country=u'ire', 
> name=u'alice')]
> {code}
> The workaround for now is this rather clunky piece of code:
> {code}
> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', 
> 'name2').withColumnRenamed('country', 'country2')
> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, 
> 'left_outer').collect()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612624#comment-14612624
 ] 

Apache Spark commented on SPARK-8501:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/7200

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> Not sure whether this should be considered as a bug of ORC bundled with Hive 
> 0.13.1: for an ORC file containing zero rows, the schema written in its 
> footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without Hive metastore with 
> Spark SQL 1.4.0, it causes problem because currently the ORC data source just 
> picks a random part-file whichever comes the first for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-07-02 Thread Justin Uang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612608#comment-14612608
 ] 

Justin Uang commented on SPARK-8632:


Haven't gotten around to it yet. I'll let you know when I can find time to work 
on it!

Right now, I'm thinking about creating a separate code path for SQL UDFs, since 
I realized that the current two-thread design is necessary because the RDD 
interface is Iterator -> Iterator: any kind of synchronous batching won't work 
with RDDs that change the length of the output iterator.

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 
> 500-column table, and I wanted to use a Python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html
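
A hypothetical sketch of the costly pattern being described (assumes a {{SQLContext}} 
bound to {{sqlContext}}; the UDF touches a single column, yet the whole child RDD was 
being cached):
{code}
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

plus_one = udf(lambda x: x + 1, LongType())

# a wide DataFrame with 500 columns, all derived from a single id column
wide_df = sqlContext.range(0, 1000).selectExpr(*["id AS c%d" % i for i in range(500)])

# the Python UDF only needs c0, but evaluation caches every column of the child
wide_df.withColumn("c0_plus_one", plus_one(wide_df["c0"])).count()
{code}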



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8632:

Target Version/s: 1.5.0

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 
> 500-column table, and I wanted to use a Python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8632) Poor Python UDF performance because of RDD caching

2015-07-02 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-8632:

Shepherd: Davies Liu

> Poor Python UDF performance because of RDD caching
> --
>
> Key: SPARK-8632
> URL: https://issues.apache.org/jira/browse/SPARK-8632
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0
>Reporter: Justin Uang
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 
> 500-column table, and I wanted to use a Python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8501:
---

Assignee: Cheng Lian  (was: Apache Spark)

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> Not sure whether this should be considered as a bug of ORC bundled with Hive 
> 0.13.1: for an ORC file containing zero rows, the schema written in its 
> footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without Hive metastore with 
> Spark SQL 1.4.0, it causes problem because currently the ORC data source just 
> picks a random part-file whichever comes the first for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612599#comment-14612599
 ] 

Apache Spark commented on SPARK-8501:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/7199

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> Not sure whether this should be considered as a bug of ORC bundled with Hive 
> 0.13.1: for an ORC file containing zero rows, the schema written in its 
> footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without Hive metastore with 
> Spark SQL 1.4.0, it causes problem because currently the ORC data source just 
> picks a random part-file whichever comes the first for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3382) GradientDescent convergence tolerance

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3382:
-
Assignee: Kai Sasaki

> GradientDescent convergence tolerance
> -
>
> Key: SPARK-3382
> URL: https://issues.apache.org/jira/browse/SPARK-3382
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Kai Sasaki
>Priority: Minor
> Fix For: 1.5.0
>
>
> GradientDescent should support a convergence tolerance setting.  In general, 
> for optimization, convergence tolerance should be preferred over a limit on 
> the number of iterations since it is a somewhat data-adaptive or 
> data-specific convergence criterion.
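
A minimal sketch of one common convergence-tolerance criterion (illustrative only; the 
exact rule adopted in MLlib may differ): stop when the change in the weight vector 
between consecutive iterations is small relative to its magnitude.
{code}
import numpy as np

def converged(prev_weights, weights, tol=1e-3):
    # relative change of the solution between two consecutive iterations
    diff = np.linalg.norm(weights - prev_weights)
    return diff < tol * max(np.linalg.norm(prev_weights), 1.0)
{code}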



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-8501) ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery

2015-07-02 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-8501:
---

Assignee: Apache Spark  (was: Cheng Lian)

> ORC data source may give empty schema if an ORC file containing zero rows is 
> picked for schema discovery
> 
>
> Key: SPARK-8501
> URL: https://issues.apache.org/jira/browse/SPARK-8501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
> Environment: Hive 0.13.1
>Reporter: Cheng Lian
>Assignee: Apache Spark
>Priority: Critical
>
> Not sure whether this should be considered as a bug of ORC bundled with Hive 
> 0.13.1: for an ORC file containing zero rows, the schema written in its 
> footer contains zero fields (e.g. {{struct<>}}).
> To reproduce this issue, let's first produce an empty ORC file.  Copy data 
> file {{sql/hive/src/test/resources/data/files/kv1.txt}} in Spark code repo to 
> {{/tmp/kv1.txt}} (I just picked a random simple test data file), then run the 
> following lines in Hive 0.13.1 CLI:
> {noformat}
> $ hive
> hive> CREATE TABLE foo(key INT, value STRING);
> hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
> hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
> {noformat}
> Now inspect the empty ORC file we just wrote:
> {noformat}
> $ hive --orcfiledump /user/hive/warehouse_hive13/bar/00_0
> Structure for /user/hive/warehouse_hive13/bar/00_0
> 15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from 
> /user/hive/warehouse_hive13/bar/00_0 with {include: null, offset: 0, 
> length: 9223372036854775807}
> Rows: 0
> Compression: ZLIB
> Compression size: 262144
> Type: struct<>
> Stripe Statistics:
> File Statistics:
>   Column 0: count: 0
> Stripes:
> {noformat}
> Notice the {{struct<>}} part.
> This "feature" is OK for Hive, which has a central metastore to save table 
> schema.  But for users who read raw data files without Hive metastore with 
> Spark SQL 1.4.0, it causes problem because currently the ORC data source just 
> picks a random part-file whichever comes the first for schema discovery.
> Expected behavior can be:
> # Try all files one by one until we find a part-file with non-empty schema.
> # Throws {{AnalysisException}} if no such part-file can be found.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3382) GradientDescent convergence tolerance

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-3382.
--
   Resolution: Fixed
Fix Version/s: 1.5.0

Issue resolved by pull request 3636
[https://github.com/apache/spark/pull/3636]

> GradientDescent convergence tolerance
> -
>
> Key: SPARK-3382
> URL: https://issues.apache.org/jira/browse/SPARK-3382
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.5.0
>
>
> GradientDescent should support a convergence tolerance setting.  In general, 
> for optimization, convergence tolerance should be preferred over a limit on 
> the number of iterations since it is a somewhat data-adaptive or 
> data-specific convergence criterion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8746) Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)

2015-07-02 Thread Christian Kadner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612584#comment-14612584
 ] 

Christian Kadner commented on SPARK-8746:
-

Thank you Sean!

> Need to update download link for Hive 0.13.1 jars (HiveComparisonTest)
> --
>
> Key: SPARK-8746
> URL: https://issues.apache.org/jira/browse/SPARK-8746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0
>Reporter: Christian Kadner
>Assignee: Christian Kadner
>Priority: Trivial
>  Labels: documentation, test
> Fix For: 1.4.1, 1.5.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The Spark SQL documentation (https://github.com/apache/spark/tree/master/sql) 
> describes how to generate golden answer files for new hive comparison test 
> cases. However the download link for the Hive 0.13.1 jars points to 
> https://hive.apache.org/downloads.html but none of the linked mirror sites 
> still has the 0.13.1 version.
> We need to update the link to 
> https://archive.apache.org/dist/hive/hive-0.13.1/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8792) Add Python API for PCA transformer

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8792:
-
Target Version/s: 1.5.0

> Add Python API for PCA transformer
> --
>
> Key: SPARK-8792
> URL: https://issues.apache.org/jira/browse/SPARK-8792
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>
> Add Python API for PCA transformer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8780) Move Python doctest code example from models to algorithms

2015-07-02 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612566#comment-14612566
 ] 

Joseph K. Bradley commented on SPARK-8780:
--

Changing priority to minor for now.  This would be nice, but there is a lot of 
other stuff to do first.

> Move Python doctest code example from models to algorithms
> --
>
> Key: SPARK-8780
> URL: https://issues.apache.org/jira/browse/SPARK-8780
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> Almost all doctest code examples live in the model classes of PySpark MLlib.
> Since users usually start with algorithms rather than models, we should move 
> the examples from the models to the algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8780) Move Python doctest code example from models to algorithms

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8780:
-
Priority: Minor  (was: Major)

> Move Python doctest code example from models to algorithms
> --
>
> Key: SPARK-8780
> URL: https://issues.apache.org/jira/browse/SPARK-8780
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>Priority: Minor
>
> Almost all doctest code examples live in the model classes of PySpark MLlib.
> Since users usually start with algorithms rather than models, we should move 
> the examples from the models to the algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8780) Move Python doctest code example from models to algorithms

2015-07-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8780:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-6173

> Move Python doctest code example from models to algorithms
> --
>
> Key: SPARK-8780
> URL: https://issues.apache.org/jira/browse/SPARK-8780
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Affects Versions: 1.5.0
>Reporter: Yanbo Liang
>
> Almost all doctest code examples live in the model classes of PySpark MLlib.
> Since users usually start with algorithms rather than models, we should move 
> the examples from the models to the algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8802) Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException

2015-07-02 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14612547#comment-14612547
 ] 

Apache Spark commented on SPARK-8802:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/7198

> Decimal.apply(BigDecimal).toBigDecimal may throw NumberFormatException
> --
>
> Key: SPARK-8802
> URL: https://issues.apache.org/jira/browse/SPARK-8802
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Minor
>
> There exist certain BigDecimals that can be converted into Spark SQL's 
> Decimal class but which produce Decimals that cannot be converted back to 
> BigDecimal without throwing NumberFormatException.
> For instance:
> {code}
> val x = BigDecimal(BigInt("18889465931478580854784"), -2147483648)
> assert(Decimal(x).toBigDecimal === x)
> {code}
> will fail with an exception:
> {code}
> java.lang.NumberFormatException
>   at java.math.BigDecimal.<init>(BigDecimal.java:511)
>   at java.math.BigDecimal.<init>(BigDecimal.java:757)
>   at scala.math.BigDecimal$.apply(BigDecimal.scala:119)
>   at scala.math.BigDecimal.apply(BigDecimal.scala:324)
>   at org.apache.spark.sql.types.Decimal.toBigDecimal(Decimal.scala:142)
>   at 
> org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply$mcV$sp(DecimalSuite.scala:62)
>   at 
> org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60)
>   at 
> org.apache.spark.sql.types.decimal.DecimalSuite$$anonfun$2.apply(DecimalSuite.scala:60)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



  1   2   3   >