[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-12 Thread Takuya Ueshin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162646#comment-16162646
 ] 

Takuya Ueshin commented on SPARK-21190:
---

[~icexelloss] Thank you for your suggestion. I agree that we should unify the 
PRs into a single PR and collaborate there.
As [~cloud_fan] mentioned in [~bryanc]'s PR, I'll send some PRs against 
[~bryanc]'s branch, one per functionality, so we can discuss each separately.

> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply to chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead compared with the row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis-time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose that initially the number 
> of output rows be required to match the number of input rows. In the future, 
> we can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
>  
> 3. How do we handle null values, given that Pandas doesn't have a native 
> concept of nulls?
>  
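> For illustration, a minimal pandas example of the null issue in point 3 (not 
> part of the proposed API): a null in an integer column becomes NaN and the 
> column is upcast to float.
> {code}
> import pandas as pd
>
> s = pd.Series([1, 2, None])        # a null in an otherwise-integer column
> print(s.dtype)                     # float64: ints upcast, the null becomes NaN
> print(s.isnull().tolist())         # [False, False, True]
> {code}
>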
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: A Pandas DataFrame.
>   """
>   input['c'] = input['a'] + input['b']
>   input['d'] = input['a'] - input['b']
>   return input
>  
> (spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df))
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: A numpy array.
>   """
>   return input['a'] + input['b']
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more concerned about getting proper feedback for API design. The 
> implementation should be pretty straightforward and is not a huge concern at 
> this point. We can leverage the same implementation for faster toPandas 
> (using Arrow).
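>  
> A minimal sketch of the Arrow-backed {{toPandas}} path (the config name 
> assumes what the related Arrow work targets for Spark 2.3; illustration only):
> {code}
> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
> pdf = spark.range(1000).selectExpr("id a", "id / 2 b").toPandas()  # columnar transfer via Arrow
> {code}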
>  
>  
> *Optional Rejected Designs*
> See above.
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-21979) Improve QueryPlanConstraints framework

2017-09-12 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-21979:
--

 Summary: Improve QueryPlanConstraints framework
 Key: SPARK-21979
 URL: https://issues.apache.org/jira/browse/SPARK-21979
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Gengliang Wang
Priority: Critical


Improve the QueryPlanConstraints framework to make it robust and simple.
In apache/spark#15319, constraints for expressions like a = f(b, c) are resolved.
However, for expressions like

a = f(b, c) && c = g(a, b)

the current QueryPlanConstraints framework will produce non-converging 
constraints.
Essentially, the problem is caused by having both the name and the child of an 
alias in the same constraint set: we infer constraints and push them down as 
predicates in filters, later these predicates are propagated as constraints 
again, and so on.
Using only the alias names resolves the problem. The size of the constraint set 
is reduced without losing any information, because we can always recover the 
inferred constraints on the children of aliases when pushing down filters.

Also, the EqualNullSafe constraint between an alias name and its child, added 
when propagating aliases, is meaningless:

allConstraints += EqualNullSafe(e, a.toAttribute)

It just produces redundant constraints.
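
For illustration only (not from this JIRA): a PySpark filter with the predicate 
shape described above, a = f(b, c) && c = g(a, b), built on made-up columns. It 
only reproduces the shape of the condition, not the reported non-convergence; an 
active SparkSession {{spark}} is assumed.

{code}
from pyspark.sql import functions as F

t = spark.range(10).selectExpr("id AS a", "id AS b", "id AS c")
shaped = t.filter((F.col("a") == F.col("b") + F.col("c")) &   # a = f(b, c)
                  (F.col("c") == F.col("a") - F.col("b")))    # c = g(a, b)
# Predicates added by constraint inference (e.g. isnotnull) show up as extra
# Filter conditions in the optimized plan.
shaped.explain(True)
{code}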



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21979) Improve QueryPlanConstraints framework

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21979:


Assignee: (was: Apache Spark)

> Improve QueryPlanConstraints framework
> --
>
> Key: SPARK-21979
> URL: https://issues.apache.org/jira/browse/SPARK-21979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Critical
>
> Improve the QueryPlanConstraints framework to make it robust and simple.
> In apache/spark#15319, constraints for expressions like a = f(b, c) are 
> resolved.
> However, for expressions like
> a = f(b, c) && c = g(a, b)
> the current QueryPlanConstraints framework will produce non-converging 
> constraints.
> Essentially, the problem is caused by having both the name and the child of 
> an alias in the same constraint set: we infer constraints and push them down 
> as predicates in filters, later these predicates are propagated as 
> constraints again, and so on.
> Using only the alias names resolves the problem. The size of the constraint 
> set is reduced without losing any information, because we can always recover 
> the inferred constraints on the children of aliases when pushing down filters.
> Also, the EqualNullSafe constraint between an alias name and its child, added 
> when propagating aliases, is meaningless:
> allConstraints += EqualNullSafe(e, a.toAttribute)
> It just produces redundant constraints.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21979) Improve QueryPlanConstraints framework

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21979:


Assignee: Apache Spark

> Improve QueryPlanConstraints framework
> --
>
> Key: SPARK-21979
> URL: https://issues.apache.org/jira/browse/SPARK-21979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Critical
>
> Improve the QueryPlanConstraints framework to make it robust and simple.
> In apache/spark#15319, constraints for expressions like a = f(b, c) are 
> resolved.
> However, for expressions like
> a = f(b, c) && c = g(a, b)
> the current QueryPlanConstraints framework will produce non-converging 
> constraints.
> Essentially, the problem is caused by having both the name and the child of 
> an alias in the same constraint set: we infer constraints and push them down 
> as predicates in filters, later these predicates are propagated as 
> constraints again, and so on.
> Using only the alias names resolves the problem. The size of the constraint 
> set is reduced without losing any information, because we can always recover 
> the inferred constraints on the children of aliases when pushing down filters.
> Also, the EqualNullSafe constraint between an alias name and its child, added 
> when propagating aliases, is meaningless:
> allConstraints += EqualNullSafe(e, a.toAttribute)
> It just produces redundant constraints.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21979) Improve QueryPlanConstraints framework

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162708#comment-16162708
 ] 

Apache Spark commented on SPARK-21979:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/19201

> Improve QueryPlanConstraints framework
> --
>
> Key: SPARK-21979
> URL: https://issues.apache.org/jira/browse/SPARK-21979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Priority: Critical
>
> Improve the QueryPlanConstraints framework to make it robust and simple.
> In apache/spark#15319, constraints for expressions like a = f(b, c) are 
> resolved.
> However, for expressions like
> a = f(b, c) && c = g(a, b)
> the current QueryPlanConstraints framework will produce non-converging 
> constraints.
> Essentially, the problem is caused by having both the name and the child of 
> an alias in the same constraint set: we infer constraints and push them down 
> as predicates in filters, later these predicates are propagated as 
> constraints again, and so on.
> Using only the alias names resolves the problem. The size of the constraint 
> set is reduced without losing any information, because we can always recover 
> the inferred constraints on the children of aliases when pushing down filters.
> Also, the EqualNullSafe constraint between an alias name and its child, added 
> when propagating aliases, is meaningless:
> allConstraints += EqualNullSafe(e, a.toAttribute)
> It just produces redundant constraints.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21980) References in grouping functions should be indexed with resolver

2017-09-12 Thread Feng Zhu (JIRA)
Feng Zhu created SPARK-21980:


 Summary: References in grouping functions should be indexed with 
resolver
 Key: SPARK-21980
 URL: https://issues.apache.org/jira/browse/SPARK-21980
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0, 2.1.1, 2.1.0
Reporter: Feng Zhu


In our spark-2.1 cluster, when users submit queries like

{code:sql}
select a, grouping(b), sum(c) from table group by a, b with cube
{code}

It works well. However, when the query is 

{code:sql}
select a, grouping(B), sum(c) from table group by a, b with cube
{code}

We will get the exception:

{code:java}
org.apache.spark.sql.AnalysisException: Column of grouping (B#11) can't be 
found in grouping columns a#10,b#11
{code}

The root cause is replaceGroupingFunc's incorrect logic in the 
ResolveGroupingAnalytics rule.
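
A hedged Python sketch of what resolver-aware matching means here (the real fix 
lives in Catalyst/Scala; the function and variable names below are made up for 
illustration):

{code}
# Compare column names with a resolver (case-insensitive by default) instead of
# plain equality, so grouping(B) can be matched against grouping column "b".
def index_with_resolver(grouping_cols, col,
                        resolver=lambda x, y: x.lower() == y.lower()):
    for i, g in enumerate(grouping_cols):
        if resolver(g, col):
            return i
    raise ValueError("Column of grouping (%s) can't be found in grouping columns %s"
                     % (col, ",".join(grouping_cols)))

print(index_with_resolver(["a", "b"], "B"))   # 1; a plain ["a", "b"].index("B") raises
{code}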



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21980) References in grouping functions should be indexed with resolver

2017-09-12 Thread Feng Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Zhu updated SPARK-21980:
-
Description: 
In our spark-2.1 cluster, when users submit queries like

{code:sql}
select a, grouping(b), sum(c) from table group by a, b with cube
{code}

It works well. However, when the query is 

{code:sql}
select a, grouping(B), sum(c) from table group by a, b with cube
{code}

We will get the exception:

{code:java}
org.apache.spark.sql.AnalysisException: Column of grouping (B#11) can't be 
found in grouping columns a#10,b#11
{code}

The root cause is replaceGroupingFunc's incorrect logic in the 
ResolveGroupingAnalytics rule: it indexes the column without using the resolver.

  was:
In our spark-2.1 cluster, when users sumbit queries like

{code:sql}
select a, grouping(b), sum(c) from table group by a, b with cube
{code}

It works well. However, when the query is 

{code:sql}
select a, grouping(B), sum(c) from table group by a, b with cube
{code}

We will get the exception:

{code:java}
org.apache.spark.sql.AnalysisException: Column of grouping (B#11) can't be 
found in grouping columns a#10,b#11
{code}

The root cause is the replaceGroupingFunc's  incorrect logic in 
ResolveGroupingAnalytics
 rule


> References in grouping functions should be indexed with resolver
> 
>
> Key: SPARK-21980
> URL: https://issues.apache.org/jira/browse/SPARK-21980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>
> In our spark-2.1 cluster, when users submit queries like
> {code:sql}
> select a, grouping(b), sum(c) from table group by a, b with cube
> {code}
> It works well. However, when the query is 
> {code:sql}
> select a, grouping(B), sum(c) from table group by a, b with cube
> {code}
> We will get the exception:
> {code:java}
> org.apache.spark.sql.AnalysisException: Column of grouping (B#11) can't be 
> found in grouping columns a#10,b#11
> {code}
> The root cause is replaceGroupingFunc's incorrect logic in the 
> ResolveGroupingAnalytics rule: it indexes the column without using the 
> resolver.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21976) Fix wrong doc about Mean Absolute Error

2017-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21976.
---
   Resolution: Fixed
Fix Version/s: 2.1.2
   2.3.0
   2.2.1

Issue resolved by pull request 19190
[https://github.com/apache/spark/pull/19190]

> Fix wrong doc about Mean Absolute Error
> ---
>
> Key: SPARK-21976
> URL: https://issues.apache.org/jira/browse/SPARK-21976
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 2.2.0
>Reporter: Favio Vázquez
>Priority: Trivial
> Fix For: 2.2.1, 2.3.0, 2.1.2
>
>
> Fix the wrong doc for MAE on the web page. 
> Even though the code for the MAE is correct:
> {code}
> @Since("1.2.0")
>   def meanAbsoluteError: Double = {
> summary.normL1(1) / summary.count
>   }
> {code}
> In the documentation the division by n is missing. 
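> For reference, the standard definition is MAE = (1/n) * sum(|y_i - yhat_i|), 
> i.e. the L1 norm of the residuals divided by the number of samples n, which is 
> what {{summary.normL1(1) / summary.count}} computes above.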



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21976) Fix wrong doc about Mean Absolute Error

2017-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-21976:
-

Assignee: Favio Vázquez
Priority: Minor  (was: Trivial)

> Fix wrong doc about Mean Absolute Error
> ---
>
> Key: SPARK-21976
> URL: https://issues.apache.org/jira/browse/SPARK-21976
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Affects Versions: 2.2.0
>Reporter: Favio Vázquez
>Assignee: Favio Vázquez
>Priority: Minor
> Fix For: 2.1.2, 2.2.1, 2.3.0
>
>
> Fix the wrong doc for MAE on the web page. 
> Even though the code for the MAE is correct:
> {code}
> @Since("1.2.0")
>   def meanAbsoluteError: Double = {
> summary.normL1(1) / summary.count
>   }
> {code}
> In the documentation the division by n is missing. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21980) References in grouping functions should be indexed with resolver

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21980:


Assignee: Apache Spark

> References in grouping functions should be indexed with resolver
> 
>
> Key: SPARK-21980
> URL: https://issues.apache.org/jira/browse/SPARK-21980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>Assignee: Apache Spark
>
> In our spark-2.1 cluster, when users submit queries like
> {code:sql}
> select a, grouping(b), sum(c) from table group by a, b with cube
> {code}
> It works well. However, when the query is 
> {code:sql}
> select a, grouping(B), sum(c) from table group by a, b with cube
> {code}
> We will get the exception:
> {code:java}
> org.apache.spark.sql.AnalysisException: Column of grouping (B#11) can't be 
> found in grouping columns a#10,b#11
> {code}
> The root cause is replaceGroupingFunc's incorrect logic in the 
> ResolveGroupingAnalytics rule: it indexes the column without using the 
> resolver.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21980) References in grouping functions should be indexed with resolver

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21980:


Assignee: (was: Apache Spark)

> References in grouping functions should be indexed with resolver
> 
>
> Key: SPARK-21980
> URL: https://issues.apache.org/jira/browse/SPARK-21980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>
> In our spark-2.1 cluster, when users submit queries like
> {code:sql}
> select a, grouping(b), sum(c) from table group by a, b with cube
> {code}
> It works well. However, when the query is 
> {code:sql}
> select a, grouping(B), sum(c) from table group by a, b with cube
> {code}
> We will get the exception:
> {code:java}
> org.apache.spark.sql.AnalysisException: Column of grouping (B#11) can't be 
> found in grouping columns a#10,b#11
> {code}
> The root cause is replaceGroupingFunc's incorrect logic in the 
> ResolveGroupingAnalytics rule: it indexes the column without using the 
> resolver.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21980) References in grouping functions should be indexed with resolver

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162732#comment-16162732
 ] 

Apache Spark commented on SPARK-21980:
--

User 'DonnyZone' has created a pull request for this issue:
https://github.com/apache/spark/pull/19202

> References in grouping functions should be indexed with resolver
> 
>
> Key: SPARK-21980
> URL: https://issues.apache.org/jira/browse/SPARK-21980
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Feng Zhu
>
> In our spark-2.1 cluster, when users submit queries like
> {code:sql}
> select a, grouping(b), sum(c) from table group by a, b with cube
> {code}
> It works well. However, when the query is 
> {code:sql}
> select a, grouping(B), sum(c) from table group by a, b with cube
> {code}
> We will get the exception:
> {code:java}
> org.apache.spark.sql.AnalysisException: Column of grouping (B#11) can't be 
> found in grouping columns a#10,b#11
> {code}
> The root cause is replaceGroupingFunc's incorrect logic in the 
> ResolveGroupingAnalytics rule: it indexes the column without using the 
> resolver.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21978) schemaInference option not to convert strings with leading zeros to int/long

2017-09-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162749#comment-16162749
 ] 

Hyukjin Kwon commented on SPARK-21978:
--

Not sure. It sounds like a rather niche use case. As a workaround, we could just 
disable {{inferSchema}}, or first get only the inferred schema, manually adjust 
it, and then set it explicitly. For example:

{code}
schema = spark.read.csv("...", inferSchema=True).schema
# Update `schema`
spark.read.schema(schema).csv("...").show()
{code}
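
A slightly fuller sketch of that workaround (the column name {{zip}} and the 
reader options are assumptions for illustration):

{code}
from pyspark.sql.types import StringType, StructField, StructType

inferred = spark.read.csv("...", header=True, inferSchema=True).schema
fixed = StructType([
    StructField(f.name, StringType(), f.nullable) if f.name == "zip" else f
    for f in inferred.fields
])
spark.read.schema(fixed).csv("...", header=True).show()
{code}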

Do you maybe have a reference that supports this idea, for example, 
{{read.csv}} in R or other CSV parsing libraries?

> schemaInference option not to convert strings with leading zeros to int/long 
> -
>
> Key: SPARK-21978
> URL: https://issues.apache.org/jira/browse/SPARK-21978
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ruslan Dautkhanov
>  Labels: csv, csvparser, easy-fix, inference, ramp-up, schema
>
> It would be great to have an option in Spark's schema inference to *not* 
> convert a column that has leading zeros to an int/long datatype. Think zip 
> codes, for example.
> {code}
> df = (sqlc.read.format('csv')
>   .option('inferSchema', True)
>   .option('header', True)
>   .option('delimiter', '|')
>   .option('leadingZeros', 'KEEP')   # this is the new 
> proposed option
>   .option('mode', 'FAILFAST')
>   .load('csvfile_withzipcodes_to_ingest.csv')
> )
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14516) Clustering evaluator

2017-09-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-14516.
-
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21981:
---

 Summary: Python API for ClusteringEvaluator
 Key: SPARK-21981
 URL: https://issues.apache.org/jira/browse/SPARK-21981
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Yanbo Liang


We have implemented {{ClusteringEvaluator}} in SPARK-14516; we should expose 
the corresponding API in PySpark.
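
A hedged sketch of what the PySpark usage might look like, assuming the wrapper 
mirrors the Scala {{ClusteringEvaluator}} added in SPARK-14516 (module path and 
parameter names are assumptions, and an active SparkSession {{spark}} is assumed):

{code}
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
from pyspark.ml.linalg import Vectors

data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),)],
    ["features"])

predictions = KMeans(k=2, seed=1).fit(data).transform(data)
evaluator = ClusteringEvaluator(predictionCol="prediction", featuresCol="features")
print(evaluator.evaluate(predictions))   # silhouette score
{code}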



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162759#comment-16162759
 ] 

Yanbo Liang commented on SPARK-21981:
-

[~mgaido] Would you like to work on this?

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162764#comment-16162764
 ] 

Marco Gaido commented on SPARK-21981:
-

[~yanboliang] yes, thanks. I will post a PR asap, thank you.

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21942.
---
Resolution: Won't Fix

There's a good, useful discussion here. I think "Won't Fix" is the correct 
outcome, but it was useful to examine the situation.

> DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> 
>
> Key: SPARK-21942
> URL: https://issues.apache.org/jira/browse/SPARK-21942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Shestopalyuk
>Priority: Minor
>  Labels: storage
>
> _DiskBlockManager_ has a notion of "scratch" local folder(s), which can be 
> configured via the _spark.local.dir_ option and which defaults to the system's 
> _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the 
> _YY_ part is a hash-based subdirectory used to spread files evenly.
> Function _DiskBlockManager.getFile_ expects the top level directories 
> (_blockmgr-XXX..._) to always exist (they get created once, when the spark 
> context is first created), otherwise it would fail with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, there can be different 
> strategies of automatically removing files from it, depending on the OS:
> * at boot time
> * on a regular basis (e.g. once per day via a system cron job)
> * based on the file age
> The symptom is that after the process (in our case, a service) using Spark has 
> been running for a while (a few days), it may not be able to load files 
> anymore, since the top-level scratch directories are no longer there and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from people arbitrarily removing files 
> manually.
> We have both the fact that _/tmp_ is the default in the Spark config and the 
> fact that the system has the right to tamper with its contents, and will very 
> likely do so after some period of time.
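> A hedged mitigation sketch: point _spark.local.dir_ away from _/tmp_ so the 
> OS's periodic tmp cleanup does not delete the block manager's scratch 
> directories (the path below is hypothetical):
> {code}
> from pyspark.sql import SparkSession
>
> spark = (SparkSession.builder
>          .config("spark.local.dir", "/var/lib/myservice/spark-scratch")
>          .getOrCreate())
> {code}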



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162814#comment-16162814
 ] 

Apache Spark commented on SPARK-21981:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19204

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21981:


Assignee: (was: Apache Spark)

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21981:


Assignee: Apache Spark

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21982) Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US

2017-09-12 Thread German Schiavon Matteo (JIRA)
German Schiavon Matteo created SPARK-21982:
--

 Summary: Set Locale to US in order to pass UtilsSuite when your 
jvm Locale is not US
 Key: SPARK-21982
 URL: https://issues.apache.org/jira/browse/SPARK-21982
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 2.2.0
Reporter: German Schiavon Matteo
Priority: Minor


In UtilsSuite there is a test case where the Locale is not set explicitly, which 
makes the test fail when your JVM's default Locale is not US.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21982) Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US

2017-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21982:
--
Target Version/s:   (was: 2.2.0)
  Labels:   (was: test)

It'd be great to fix up any call to {{.format(...)}} where the output will 
matter to tests or APIs. There are unfortunately thousands of calls like this, 
so we can't examine and change them all. If no other tests fail on your non-US 
locale, that's a pretty good indication this is the only issue.

> Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US
> ---
>
> Key: SPARK-21982
> URL: https://issues.apache.org/jira/browse/SPARK-21982
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> In UtilsSuite there is a test case where the Locale is not set explicitly, 
> which makes the test fail when your JVM's default Locale is not US.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21982) Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162840#comment-16162840
 ] 

Apache Spark commented on SPARK-21982:
--

User 'Gschiavon' has created a pull request for this issue:
https://github.com/apache/spark/pull/19205

> Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US
> ---
>
> Key: SPARK-21982
> URL: https://issues.apache.org/jira/browse/SPARK-21982
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> In UtilsSuite there is a test case where the Locale is not set explicitly, 
> which makes the test fail when your JVM's default Locale is not US.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21982) Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21982:


Assignee: (was: Apache Spark)

> Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US
> ---
>
> Key: SPARK-21982
> URL: https://issues.apache.org/jira/browse/SPARK-21982
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Priority: Minor
>
> In UtilsSuite there is a test case where the Locale is not set explicitly, 
> which makes the test fail when your JVM's default Locale is not US.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21982) Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21982:


Assignee: Apache Spark

> Set Locale to US in order to pass UtilsSuite when your jvm Locale is not US
> ---
>
> Key: SPARK-21982
> URL: https://issues.apache.org/jira/browse/SPARK-21982
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: German Schiavon Matteo
>Assignee: Apache Spark
>Priority: Minor
>
> In UtilsSuite there is a test case where the Locale is not set explicitly, 
> which makes the test fail when your JVM's default Locale is not US.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21983) Fix ANTLR 4.7 deprecations

2017-09-12 Thread Herman van Hovell (JIRA)
Herman van Hovell created SPARK-21983:
-

 Summary: Fix ANTLR 4.7 deprecations
 Key: SPARK-21983
 URL: https://issues.apache.org/jira/browse/SPARK-21983
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Herman van Hovell
Priority: Minor


We are currently using a couple of deprecated ANTLR APIs:
# BufferedTokenStream.reset
# ANTLRNoCaseStringStream
# ParserRuleContext.addChild

We should fix these and replace them with non-deprecated alternatives.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2017-09-12 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17997:
-
Issue Type: New Feature  (was: Sub-task)
Parent: (was: SPARK-16026)

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>
> This is for computing ndv's (numbers of distinct values) for bins in 
> equi-height histograms. A bin consists of two endpoints, which form an 
> interval of values, and the ndv in that interval. For computing histogram 
> statistics, after getting the endpoints, we need an agg function to count 
> distinct values in each interval.
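> For illustration only, a PySpark sketch of the idea using existing primitives 
> rather than the proposed aggregate (the endpoints, data, and an active 
> SparkSession {{spark}} are assumptions):
> {code}
> from pyspark.sql import functions as F
>
> endpoints = [0, 25, 50, 75, 100]                       # hypothetical equi-height endpoints
> df = spark.range(0, 100).withColumn("v", (F.col("id") * 7) % 100)
>
> bucket = F.when(F.col("v") <= endpoints[1], 0)         # assign each value to its interval
> for i in range(1, len(endpoints) - 1):
>     bucket = bucket.when(F.col("v") <= endpoints[i + 1], i)
>
> (df.withColumn("bin", bucket)
>    .groupBy("bin")
>    .agg(F.countDistinct("v").alias("ndv"))             # distinct values per interval
>    .orderBy("bin")
>    .show())
> {code}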



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2017-09-12 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17997:
-
Affects Version/s: (was: 2.1.0)
   2.3.0

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17997) Aggregation function for counting distinct values for multiple intervals

2017-09-12 Thread Zhenhua Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-17997:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-21975

> Aggregation function for counting distinct values for multiple intervals
> 
>
> Key: SPARK-17997
> URL: https://issues.apache.org/jira/browse/SPARK-17997
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Zhenhua Wang
>
> This is for computing ndv's for bins in equi-height histograms. A bin 
> consists of two endpoints which form an interval of values and the ndv in 
> that interval. For computing histogram statistics, after getting the 
> endpoints, we need an agg function to count distinct values in each interval.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21984) Use histogram stats in join estimation

2017-09-12 Thread Zhenhua Wang (JIRA)
Zhenhua Wang created SPARK-21984:


 Summary: Use histogram stats in join estimation
 Key: SPARK-21984
 URL: https://issues.apache.org/jira/browse/SPARK-21984
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.0
Reporter: Zhenhua Wang






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file

2017-09-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21610.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19199
[https://github.com/apache/spark/pull/19199]

> Corrupt records are not handled properly when creating a dataframe from a file
> --
>
> Key: SPARK-21610
> URL: https://issues.apache.org/jira/browse/SPARK-21610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: macOs Sierra 10.12.5
>Reporter: dmtran
> Fix For: 2.3.0
>
>
> Consider a jsonl file with 3 records. The third record has a value of type 
> string, instead of int.
> {code}
> echo '{"field": 1}
> {"field": 2}
> {"field": "3"}' >/tmp/sample.json
> {code}
> Create a dataframe from this file, with a schema that contains 
> "_corrupt_record" so that corrupt records are kept.
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType()
>   .add("field", ByteType)
>   .add("_corrupt_record", StringType)
> val file = "/tmp/sample.json"
> val dfFromFile = spark.read.schema(schema).json(file)
> {code}
> Run the following lines from a spark-shell:
> {code}
> scala> dfFromFile.show(false)
> +-+---+
> |field|_corrupt_record|
> +-+---+
> |1|null   |
> |2|null   |
> |null |{"field": "3"} |
> +-+---+
> scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
> res1: Long = 0
> scala> dfFromFile.filter($"_corrupt_record".isNull).count()
> res2: Long = 3
> {code}
> The expected result is 1 corrupt record and 2 valid records, but the actual 
> one is 0 corrupt record and 3 valid records.
> The bug is not reproduced if we create a dataframe from a RDD:
> {code}
> scala> val rdd = sc.textFile(file)
> rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] 
> at textFile at <console>:28
> scala> val dfFromRdd = spark.read.schema(schema).json(rdd)
> dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: 
> string]
> scala> dfFromRdd.show(false)
> +-+---+
> |field|_corrupt_record|
> +-+---+
> |1|null   |
> |2|null   |
> |null |{"field": "3"} |
> +-+---+
> scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count()
> res5: Long = 1
> scala> dfFromRdd.filter($"_corrupt_record".isNull).count()
> res6: Long = 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21610) Corrupt records are not handled properly when creating a dataframe from a file

2017-09-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21610:


Assignee: Jen-Ming Chung

> Corrupt records are not handled properly when creating a dataframe from a file
> --
>
> Key: SPARK-21610
> URL: https://issues.apache.org/jira/browse/SPARK-21610
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.2.0
> Environment: macOs Sierra 10.12.5
>Reporter: dmtran
>Assignee: Jen-Ming Chung
> Fix For: 2.3.0
>
>
> Consider a jsonl file with 3 records. The third record has a value of type 
> string, instead of int.
> {code}
> echo '{"field": 1}
> {"field": 2}
> {"field": "3"}' >/tmp/sample.json
> {code}
> Create a dataframe from this file, with a schema that contains 
> "_corrupt_record" so that corrupt records are kept.
> {code}
> import org.apache.spark.sql.types._
> val schema = new StructType()
>   .add("field", ByteType)
>   .add("_corrupt_record", StringType)
> val file = "/tmp/sample.json"
> val dfFromFile = spark.read.schema(schema).json(file)
> {code}
> Run the following lines from a spark-shell:
> {code}
> scala> dfFromFile.show(false)
> +-+---+
> |field|_corrupt_record|
> +-+---+
> |1|null   |
> |2|null   |
> |null |{"field": "3"} |
> +-+---+
> scala> dfFromFile.filter($"_corrupt_record".isNotNull).count()
> res1: Long = 0
> scala> dfFromFile.filter($"_corrupt_record".isNull).count()
> res2: Long = 3
> {code}
> The expected result is 1 corrupt record and 2 valid records, but the actual 
> one is 0 corrupt record and 3 valid records.
> The bug is not reproduced if we create a dataframe from a RDD:
> {code}
> scala> val rdd = sc.textFile(file)
> rdd: org.apache.spark.rdd.RDD[String] = /tmp/sample.json MapPartitionsRDD[92] 
> at textFile at <console>:28
> scala> val dfFromRdd = spark.read.schema(schema).json(rdd)
> dfFromRdd: org.apache.spark.sql.DataFrame = [field: tinyint, _corrupt_record: 
> string]
> scala> dfFromRdd.show(false)
> +-+---+
> |field|_corrupt_record|
> +-+---+
> |1|null   |
> |2|null   |
> |null |{"field": "3"} |
> +-+---+
> scala> dfFromRdd.filter($"_corrupt_record".isNotNull).count()
> res5: Long = 1
> scala> dfFromRdd.filter($"_corrupt_record".isNull).count()
> res6: Long = 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21809) Change Stage Page to use datatables to support sorting columns and searching

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21809:


Assignee: Apache Spark

> Change Stage Page to use datatables to support sorting columns and searching
> 
>
> Key: SPARK-21809
> URL: https://issues.apache.org/jira/browse/SPARK-21809
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Nuochen Lyu
>Assignee: Apache Spark
>Priority: Minor
>
> Support column sort and search for the Stage page using jQuery DataTables and 
> the REST API. Before this commit, the Stage page was generated as hard-coded 
> HTML and could not support search; also, sorting was disabled if any 
> application had more than one attempt. Supporting search and sort (over all 
> applications rather than the 20 entries on the current page) in any case will 
> greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21809) Change Stage Page to use datatables to support sorting columns and searching

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21809:


Assignee: (was: Apache Spark)

> Change Stage Page to use datatables to support sorting columns and searching
> 
>
> Key: SPARK-21809
> URL: https://issues.apache.org/jira/browse/SPARK-21809
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Nuochen Lyu
>Priority: Minor
>
> Support column sort and search for the Stage page using jQuery DataTables and 
> the REST API. Before this commit, the Stage page was generated as hard-coded 
> HTML and could not support search; also, sorting was disabled if any 
> application had more than one attempt. Supporting search and sort (over all 
> applications rather than the 20 entries on the current page) in any case will 
> greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21809) Change Stage Page to use datatables to support sorting columns and searching

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163044#comment-16163044
 ] 

Apache Spark commented on SPARK-21809:
--

User 'pgandhi999' has created a pull request for this issue:
https://github.com/apache/spark/pull/19207

> Change Stage Page to use datatables to support sorting columns and searching
> 
>
> Key: SPARK-21809
> URL: https://issues.apache.org/jira/browse/SPARK-21809
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Nuochen Lyu
>Priority: Minor
>
> Support column sort and search for the Stage page using jQuery DataTables and 
> the REST API. Before this commit, the Stage page was generated as hard-coded 
> HTML and could not support search; also, sorting was disabled if any 
> application had more than one attempt. Supporting search and sort (over all 
> applications rather than the 20 entries on the current page) in any case will 
> greatly improve the user experience.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-09-12 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163106#comment-16163106
 ] 

Seth Hendrickson commented on SPARK-19634:
--

Is there a plan for moving the linear algorithms that use the summarizer to 
this new implementation? 

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.3.0
>
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21087) CrossValidator, TrainValidationSplit should preserve all models after fitting: Scala

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163136#comment-16163136
 ] 

Apache Spark commented on SPARK-21087:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19208

> CrossValidator, TrainValidationSplit should preserve all models after 
> fitting: Scala
> 
>
> Key: SPARK-21087
> URL: https://issues.apache.org/jira/browse/SPARK-21087
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> See parent JIRA



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-09-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17642.
-
   Resolution: Fixed
 Assignee: Zhenhua Wang
Fix Version/s: 2.3.0

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command to show column 
> information.
> Show column statistics if EXTENDED or FORMATTED is specified.
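> A usage sketch via {{spark.sql}} (table, column, and data source names are 
> hypothetical):
> {code}
> spark.sql("CREATE TABLE t (key INT, value STRING) USING parquet")
> spark.sql("ANALYZE TABLE t COMPUTE STATISTICS FOR COLUMNS key")
> spark.sql("DESC FORMATTED t key").show(truncate=False)
> {code}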



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:

{{
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
}}

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:

{{>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]}}

But in Spark >=2.1.0, it fails:

{{>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {{
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> }}
> But in Spark >=2.1.0, it fails:
> {{
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> }}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:
> {{
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> }}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
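The {{TypeError}} above comes from calling {{len()}} on an {{itertools.izip}} object, which, like Python 3's built-in {{zip}}, is an iterator without {{__len__}}. Below is a minimal standalone sketch, not Spark's actual patch, of one way such a sanity check could be guarded; the helper name is made up:

{code:python}
# Standalone sketch of the failure mode and a possible guard. In Python 2 the
# inner batch can be an itertools.izip object; Python 3's built-in zip() is the
# same kind of length-less iterator.
key_batch = zip("abc", "def")   # e.g. an already-zipped inner pair batch
val_batch = ["x", "y", "z"]

def check_and_pair(key_batch, val_batch):
    # Materialize any batch that does not know its own length, so the sanity
    # check can compare sizes instead of raising TypeError.
    if not hasattr(key_batch, "__len__"):
        key_batch = list(key_batch)
    if not hasattr(val_batch, "__len__"):
        val_batch = list(val_batch)
    if len(key_batch) != len(val_batch):
        raise ValueError("Can not deserialize PairRDD with different number of "
                         "items in batches: (%d, %d)"
                         % (len(key_batch), len(val_batch)))
    return zip(key_batch, val_batch)

print(list(check_and_pair(key_batch, val_batch)))
# [(('a', 'd'), 'x'), (('b', 'e'), 'y'), (('c', 'f'), 'z')]
{code}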



[jira] [Created] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)
Stuart created SPARK-21985:
--

 Summary: PySpark PairDeserializer is broken for double-zipped RDDs
 Key: SPARK-21985
 URL: https://issues.apache.org/jira/browse/SPARK-21985
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0, 2.1.1, 2.1.0
Reporter: Stuart


PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:

{{>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]}}

But in Spark >=2.1.0, it fails:

{{>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:python}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:

```
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
```

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {code:python}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails:
> {{
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> }}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:
> {{
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> }}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:python}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails:
> {{
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> }}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:
> {{
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> }}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:

```
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
```

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:

{{
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
}}

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> ```
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> ```
> But in Spark >=2.1.0, it fails:
> {{
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> }}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:
> {{
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> }}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{{
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
}}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{{
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
}}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails:
> {code:}
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> {code}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:
> {code:}
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> {code}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails:
> {code:}
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> {code}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:
> {code:}
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> {code}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails, (regardless of Python 2 vs 3):

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails:

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails, (regardless of Python 2 vs 3):
> {code:}
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> {code}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:
> {code:}
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> {code}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails, (regardless of Python 2 vs 3):

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):
> {code:}
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
> line 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File 
> "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py",
>  line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> {code}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:
> {code:}
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> {code}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (SPARK-21985) PySpark PairDeserializer is broken for double-zipped RDDs

2017-09-12 Thread Stuart (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stuart updated SPARK-21985:
---
Description: 
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", line 810, 
in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?

  was:
PySpark fails to deserialize double-zipped RDDs.  For example, the following 
example used to work in Spark 2.0.2:


{code:}
>>> a = sc.parallelize('aaa')
>>> b = sc.parallelize('bbb')
>>> c = sc.parallelize('ccc')
>>> a_bc = a.zip( b.zip(c) )
>>> a_bc.collect()
[('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
{code}

But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):

{code:}
>>> a_bc.collect()
Traceback (most recent call last):
  File "", line 1, in 
  File "/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", 
line 810, in collect
return list(_load_from_socket(port, self._jrdd_deserializer))
  File 
"/magnetic/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
line 329, in _load_stream_without_unbatching
if len(key_batch) != len(val_batch):
TypeError: object of type 'itertools.izip' has no len()
{code}

As you can see, the error seems to be caused by [a check in the 
PairDeserializer 
class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:

{code:}
if len(key_batch) != len(val_batch):
raise ValueError("Can not deserialize PairRDD with different number of 
items"
 " in batches: (%d, %d)" % (len(key_batch), len(val_batch)))
{code}

If that check is removed, then the example above works without error.  Can the 
check simply be removed?


> PySpark PairDeserializer is broken for double-zipped RDDs
> -
>
> Key: SPARK-21985
> URL: https://issues.apache.org/jira/browse/SPARK-21985
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Stuart
>  Labels: bug
>
> PySpark fails to deserialize double-zipped RDDs.  For example, the following 
> example used to work in Spark 2.0.2:
> {code:}
> >>> a = sc.parallelize('aaa')
> >>> b = sc.parallelize('bbb')
> >>> c = sc.parallelize('ccc')
> >>> a_bc = a.zip( b.zip(c) )
> >>> a_bc.collect()
> [('a', ('b', 'c')), ('a', ('b', 'c')), ('a', ('b', 'c'))]
> {code}
> But in Spark >=2.1.0, it fails (regardless of Python 2 vs 3):
> {code:}
> >>> a_bc.collect()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/rdd.py", line 
> 810, in collect
> return list(_load_from_socket(port, self._jrdd_deserializer))
>   File "/workspace/spark-2.2.0-bin-hadoop2.7/python/pyspark/serializers.py", 
> line 329, in _load_stream_without_unbatching
> if len(key_batch) != len(val_batch):
> TypeError: object of type 'itertools.izip' has no len()
> {code}
> As you can see, the error seems to be caused by [a check in the 
> PairDeserializer 
> class|https://github.com/apache/spark/blob/d03aebbe6508ba441dc87f9546f27aeb27553d77/python/pyspark/serializers.py#L346-L348]:
> {code:}
> if len(key_batch) != len(val_batch):
> raise ValueError("Can not deserialize PairRDD with different number of 
> items"
>  " in batches: (%d, %d)" % (len(key_batch), 
> len(val_batch)))
> {code}
> If that check is removed, then the example above works without error.  Can 
> the check simply be removed?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (SPARK-21027) Parallel One vs. Rest Classifier

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21027.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 19110
[https://github.com/apache/spark/pull/19110]

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
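To illustrate the kind of parallelism being asked for above, here is a rough sketch of the threading pattern only, not the pyspark.ml API that eventually shipped; the sample-data path, pool size, and column names are assumptions. Each per-class binary model is fitted from a thread pool so the underlying Spark jobs are submitted concurrently:

{code:python}
from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ovr-threading-sketch").getOrCreate()
train = spark.read.format("libsvm").load(
    "data/mllib/sample_multiclass_classification_data.txt")  # assumed path

num_classes = train.select("label").distinct().count()

def fit_one(cls):
    # One-vs-rest relabeling: 1.0 for the current class, 0.0 for everything else.
    binary = train.withColumn("label", (F.col("label") == cls).cast("double"))
    return cls, LogisticRegression(maxIter=10).fit(binary)

with ThreadPoolExecutor(max_workers=4) as pool:   # jobs overlap on the cluster
    models = dict(pool.map(fit_one, [float(c) for c in range(num_classes)]))
{code}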



[jira] [Assigned] (SPARK-21027) Parallel One vs. Rest Classifier

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21027:
-

Assignee: Ajay Saini

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup as determined by 
> experiments but is not currently tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-18608:
-

Assignee: zhengruifeng

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
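The Scala REPL session above translates to PySpark as well. The sketch below assumes Spark 2.1+ (where {{DataFrame.storageLevel}} is available); the helper name is made up, and it only illustrates the guard the ticket recommends rather than Spark's internal code:

{code:python}
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("double-cache-sketch").getOrCreate()
df = spark.range(10).toDF("num")

def is_materialized(level):
    # True if the storage level uses memory, disk, or off-heap storage.
    return level.useMemory or level.useDisk or level.useOffHeap

print(is_materialized(df.storageLevel))           # False: nothing cached yet
df.persist(StorageLevel.MEMORY_AND_DISK)
print(is_materialized(df.storageLevel))           # True: the DataFrame is cached
print(is_materialized(df.rdd.getStorageLevel()))  # False: df.rdd is a fresh RDD

# The check an algorithm could use before caching its input internally:
handle_persistence = not is_materialized(df.storageLevel)
if handle_persistence:
    df.persist(StorageLevel.MEMORY_AND_DISK)
{code}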



[jira] [Updated] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-18608:
--
Target Version/s: 2.2.1, 2.3.0

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18608) Spark ML algorithms that check RDD cache level for internal caching double-cache data

2017-09-12 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-18608.
---
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

Issue resolved by pull request 19197
[https://github.com/apache/spark/pull/19197]

> Spark ML algorithms that check RDD cache level for internal caching 
> double-cache data
> -
>
> Key: SPARK-18608
> URL: https://issues.apache.org/jira/browse/SPARK-18608
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Nick Pentreath
>Assignee: zhengruifeng
> Fix For: 2.2.1, 2.3.0
>
>
> Some algorithms in Spark ML (e.g. {{LogisticRegression}}, 
> {{LinearRegression}}, and I believe now {{KMeans}}) handle persistence 
> internally. They check whether the input dataset is cached, and if not they 
> cache it for performance.
> However, the check is done using {{dataset.rdd.getStorageLevel == NONE}}. 
> This will actually always be true, since even if the dataset itself is 
> cached, the RDD returned by {{dataset.rdd}} will not be cached.
> Hence if the input dataset is cached, the data will end up being cached 
> twice, which is wasteful.
> To see this:
> {code}
> scala> import org.apache.spark.storage.StorageLevel
> import org.apache.spark.storage.StorageLevel
> scala> val df = spark.range(10).toDF("num")
> df: org.apache.spark.sql.DataFrame = [num: bigint]
> scala> df.storageLevel == StorageLevel.NONE
> res0: Boolean = true
> scala> df.persist
> res1: df.type = [num: bigint]
> scala> df.storageLevel == StorageLevel.MEMORY_AND_DISK
> res2: Boolean = true
> scala> df.rdd.getStorageLevel == StorageLevel.MEMORY_AND_DISK
> res3: Boolean = false
> scala> df.rdd.getStorageLevel == StorageLevel.NONE
> res4: Boolean = true
> {code}
> Before SPARK-16063, there was no way to check the storage level of the input 
> {{DataSet}}, but now we can, so the checks should be migrated to use 
> {{dataset.storageLevel}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2017-09-12 Thread Barry Becker (JIRA)
Barry Becker created SPARK-21986:


 Summary: QuantileDiscretizer picks wrong split point for data with 
lots of 0's
 Key: SPARK-21986
 URL: https://issues.apache.org/jira/browse/SPARK-21986
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.1.1
Reporter: Barry Becker


I have some simple test cases to help illustrate (see below).
I discovered this with data that had 96,000 rows, but can reproduce with much 
smaller data that has roughly the same distribution of values.

If I have data like
  Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)

and ask for 3 buckets, then it does the right thing and yields splits of 
Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)

However, if I add just one more zero, such that I have data like
 Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
then it will do the wrong thing and give splits of 
  Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity)

I'm not bothered that it gave fewer buckets than asked for (that is to be 
expected), but I am bothered that it picked 0.0 instead of 40 as the one split 
point.
As a result, I end up with one bucket containing all of the data and a second 
containing none of it.
Am I interpreting something wrong?
Here are my 2 test cases in scala:
{code}
class QuantileDiscretizerSuite extends FunSuite {

  test("Quantile discretizer on data with lots of 0") {
verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
  Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
  }

  test("Quantile discretizer on data with one less 0") {
verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
  Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
  }
  
  def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
val theData: Seq[(Int, Double)] = data.map {
  case x: Int => (x, 0.0)
  case _ => (0, 0.0)
}

val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", 
"unused")

val qb = new QuantileDiscretizer()
  .setInputCol("rawCol")
  .setOutputCol("binnedColumn")
  .setRelativeError(0.0)
  .setNumBuckets(3)
  .fit(df)

assertResult(expectedSplits) {qb.getSplits}
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
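A minimal PySpark transliteration of the reporter's Scala test above (column name and bucket settings follow the report; the helper is made up, and the actual splits may vary by Spark version):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.feature import QuantileDiscretizer

spark = SparkSession.builder.appName("quantile-splits-sketch").getOrCreate()

def splits_for(values):
    # Build a one-column DataFrame and return the splits the fitted Bucketizer chose.
    df = spark.createDataFrame([(float(v),) for v in values], ["rawCol"])
    model = QuantileDiscretizer(numBuckets=3, relativeError=0.0,
                                inputCol="rawCol",
                                outputCol="binnedColumn").fit(df)
    return model.getSplits()

print(splits_for([0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0]))     # expected: [-inf, 0.0, 40.0, inf]
print(splits_for([0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0]))  # reported: [-inf, 0.0, inf]
{code}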



[jira] [Resolved] (SPARK-21979) Improve QueryPlanConstraints framework

2017-09-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21979.
-
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 2.3.0

> Improve QueryPlanConstraints framework
> --
>
> Key: SPARK-21979
> URL: https://issues.apache.org/jira/browse/SPARK-21979
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Critical
> Fix For: 2.3.0
>
>
> Improve the QueryPlanConstraints framework to make it robust and simple.
> In apache/spark#15319, constraints for expressions like a = f(b, c) are 
> resolved.
> However, for expressions like
> a = f(b, c) && c = g(a, b)
> the current QueryPlanConstraints framework produces non-converging 
> constraints.
> Essentially, the problem is caused by having both the name and the child of 
> aliases in the same constraint set: we infer constraints and push them down 
> as predicates in filters, and later on these predicates are propagated as 
> constraints again, and so on.
> Using only the alias names resolves these problems. The size of the 
> constraint set is reduced without losing any information; we can always get 
> the inferred constraints on the children of aliases when pushing down filters.
> Also, the EqualNullSafe constraint between the alias name and its child that 
> is added when propagating aliases,
> allConstraints += EqualNullSafe(e, a.toAttribute)
> is meaningless; it just produces redundant constraints.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18267) Distribute PySpark via Python Package Index (pypi)

2017-09-12 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-18267.
-
   Resolution: Fixed
 Assignee: holdenk
Fix Version/s: 2.2.0

> Distribute PySpark via Python Package Index (pypi)
> --
>
> Key: SPARK-18267
> URL: https://issues.apache.org/jira/browse/SPARK-18267
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Project Infra, PySpark
>Reporter: Reynold Xin
>Assignee: holdenk
> Fix For: 2.2.0
>
>
> The goal is to distribute PySpark via pypi, so users can simply run Spark on 
> a single node via "pip install pyspark" (or "pip install apache-spark").



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-18128) Add support for publishing to PyPI

2017-09-12 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-18128.
-
   Resolution: Fixed
 Assignee: holdenk
Fix Version/s: 2.2.0

We got the package name registered on PyPi :)

> Add support for publishing to PyPI
> --
>
> Key: SPARK-18128
> URL: https://issues.apache.org/jira/browse/SPARK-18128
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: holdenk
>Assignee: holdenk
> Fix For: 2.2.0
>
>
> After SPARK-1267 is done we should add support for publishing to PyPI similar 
> to how we publish to maven central.
> Note: one of the open questions is what to do about package name since 
> someone has registered the package name PySpark on PyPI - we could use 
> ApachePySpark, or we could try to find who registered PySpark and get 
> them to transfer it to us (since they haven't published anything, so that may 
> be fine?)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs

2017-09-12 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21987:
--

 Summary: Spark 2.3 cannot read 2.2 event logs
 Key: SPARK-21987
 URL: https://issues.apache.org/jira/browse/SPARK-21987
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Blocker


Reported by [~jincheng] in a comment in SPARK-18085:

{noformat}
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized 
field "metadata" (class org.apache.spark.sql.execution.SparkPlanInfo), not 
marked as ignorable (4 known properties: "simpleString", "nodeName", 
"children", "metrics"])
 at [Source: 
{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
 at 
NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
 
Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
 Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
array\nRepartition 200, true\n+- LogicalRDD [uid#327L, gids#328]\n\n== 
Optimized Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
gids#328]\n\n== Physical Plan ==\nExchange RoundRobinPartitioning(200)\n+- Scan 
ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
 
RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
 
ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
 of output 
rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
 size total (min, med, 
max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
1, column: 1622] (through reference chain: 
org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
at 
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
{noformat}

This was caused by SPARK-17701 (which at this moment is still open even though 
the patch has been committed).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-09-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163585#comment-16163585
 ] 

Marcelo Vanzin commented on SPARK-18085:


I filed SPARK-21987 for the event log issue (lest we forget to do it).

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17701) Refactor DataSourceScanExec so its sameResult call does not compare strings

2017-09-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17701.
-
   Resolution: Fixed
 Assignee: Wenchen Fan
Fix Version/s: 2.3.0

> Refactor DataSourceScanExec so its sameResult call does not compare strings
> ---
>
> Key: SPARK-17701
> URL: https://issues.apache.org/jira/browse/SPARK-17701
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Eric Liang
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>
> Currently, RowDataSourceScanExec and FileSourceScanExec rely on a "metadata" 
> string map to implement equality comparison, since the RDDs they depend on 
> cannot be directly compared. This has resulted in a number of correctness 
> bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818.
> To make these comparisons less brittle, we should refactor these classes to 
> compare constructor parameters directly instead of relying on the metadata 
> map.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs

2017-09-12 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163622#comment-16163622
 ] 

Xiao Li commented on SPARK-21987:
-

Thanks for reporting this! We need to ensure that Spark 2.3 can still process 
2.2 event logs, and revert the changes in SparkPlanGraph.
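
For reference, a standalone sketch of one way to make JSON deserialization tolerate 
fields the reader does not know about. This is generic Jackson usage, not the actual 
fix in Spark's event-log code, and the PlanInfoLike class is made up for illustration:

{code}
// Generic Jackson example, not Spark code: configure the mapper so that unknown
// properties such as the new "metadata" field are ignored instead of failing.
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class PlanInfoLike(nodeName: String, simpleString: String) // hypothetical, simplified

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

val json = """{"nodeName":"Exchange","simpleString":"Exchange","metadata":{}}"""
val parsed = mapper.readValue(json, classOf[PlanInfoLike])
println(parsed) // PlanInfoLike(Exchange,Exchange)
{code}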

> Spark 2.3 cannot read 2.2 event logs
> 
>
> Key: SPARK-21987
> URL: https://issues.apache.org/jira/browse/SPARK-21987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> Reported by [~jincheng] in a comment in SPARK-18085:
> {noformat}
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
> Unrecognized field "metadata" (class 
> org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 
> known properties: "simpleString", "nodeName", "children", "metrics"])
>  at [Source: 
> {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
>  at 
> NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
>  
> Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
>  Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
> array\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- 
> LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange 
> RoundRobinPartitioning(200)\n+- Scan 
> ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
>  
> RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
>  
> ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
>  of output 
> rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
>  size total (min, med, 
> max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
> 1, column: 1622] (through reference chain: 
> org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
>   at 
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
> {noformat}
> This was caused by SPARK-17701 (which at this moment is still open even 
> though the patch has been committed).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2017-09-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21986:
--
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

It's an approximate algorithm, and this is a tiny amount of data. I think it's 
at best a potential improvement, if it's doing slightly the wrong thing in a 
corner case.

However, is this wrong? You're asking for the 33%/66%-tiles. In both cases, at 
least 66% of the values are <= 0. I suppose it finds 40 in the first case 
because it's a bit approximate, but in the second case it's far off.
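
For reference, the underlying quantile computation can be queried directly; a minimal 
sketch for the 12-value example, assuming a SparkSession named {{spark}} with 
{{import spark.implicits._}} in scope:

{code}
// Asking for 3 buckets corresponds to the 1/3 and 2/3 quantiles. With 9 of the 12
// values equal to 0, both requested quantiles are 0.0, which is why only one finite
// split point comes back.
val data = Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0).map(_.toDouble)
val df = data.toDF("rawCol")

val quantiles = df.stat.approxQuantile("rawCol", Array(1.0 / 3, 2.0 / 3), 0.0)
println(quantiles.mkString(", ")) // expected: 0.0, 0.0
{code}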

> QuantileDiscretizer picks wrong split point for data with lots of 0's
> -
>
> Key: SPARK-21986
> URL: https://issues.apache.org/jira/browse/SPARK-21986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Barry Becker
>Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much 
> smaller data that has roughly the same distribution of values.
> If I have data like
>   Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of 
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
>  Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of 
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
> I'm not bothered that it gave fewer buckets than asked for (that is to be 
> expected), but I am bothered that it picked 0.0 instead of 40 as the one 
> split point.
> The way it did it, now I have 1 bucket with all the data, and a second with 
> none of the data.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code}
> class QuantileDiscretizerSuite extends FunSuite {
>   test("Quantile discretizer on data with lots of 0") {
> verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>   test("Quantile discretizer on data with one less 0") {
> verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>   
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
> val theData: Seq[(Int, Double)] = data.map {
>   case x: Int => (x, 0.0)
>   case _ => (0, 0.0)
> }
> val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", 
> "unused")
> val qb = new QuantileDiscretizer()
>   .setInputCol("rawCol")
>   .setOutputCol("binnedColumn")
>   .setRelativeError(0.0)
>   .setNumBuckets(3)
>   .fit(df)
> assertResult(expectedSplits) {qb.getSplits}
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21988) Add default stats to StreamingExecutionRelation

2017-09-12 Thread Jose Torres (JIRA)
Jose Torres created SPARK-21988:
---

 Summary: Add default stats to StreamingExecutionRelation
 Key: SPARK-21988
 URL: https://issues.apache.org/jira/browse/SPARK-21988
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: Jose Torres


StreamingExecutionRelation currently doesn't implement stats.

This makes some sense, but unfortunately the LeafNode contract requires that 
nodes which survive analysis implement stats, and StreamingExecutionRelation 
can indeed survive analysis when running explain() on a streaming dataframe.

This value won't ever be used during execution, because 
StreamingExecutionRelation does *not* survive analysis on the execution path; 
it's replaced with each batch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163695#comment-16163695
 ] 

Apache Spark commented on SPARK-18838:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/19211

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event-processing delay in the driver's 
> `ListenerBus` for large jobs with many tasks. Many critical components of the 
> scheduler, like `ExecutorAllocationManager` and `HeartbeatReceiver`, depend on 
> `ListenerBus` events, and this delay might hurt job performance significantly 
> or even fail the job. For example, a significant delay in receiving 
> `SparkListenerTaskStart` might cause `ExecutorAllocationManager` to mistakenly 
> remove an executor which is not idle.
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the listeners for each event and processes each event 
> synchronously: 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs. 
> Also, if one of the listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow listener 
> can cause events to be dropped from the event queue, which might be fatal to 
> the job.
> To solve the above problems, we propose to get rid of the shared event queue 
> and the single-threaded event processor. Instead, each listener will have its 
> own dedicated single-threaded executor service. Whenever an event is posted, 
> it will be submitted to the executor service of every listener. The 
> single-threaded executor service guarantees in-order processing of events per 
> listener. The queue used for the executor service will be bounded to guarantee 
> we do not grow memory indefinitely. The downside of this approach is that a 
> separate event queue per listener will increase the driver memory footprint.
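
A minimal sketch of the per-listener dispatch idea described above (not the actual 
patch; the Listener trait and the default capacity are made up for illustration):

{code}
// Sketch only: each listener gets its own bounded, single-threaded executor, so a slow
// listener delays only its own queue and, at worst, drops only its own events.
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

trait Listener { def onEvent(event: Any): Unit }

class PerListenerBus(listeners: Seq[Listener], queueCapacity: Int = 10000) {

  // One single-threaded executor with a bounded queue per listener.
  private val executors = listeners.map { l =>
    l -> new ThreadPoolExecutor(
      1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueCapacity),
      new ThreadPoolExecutor.DiscardPolicy()) // drop the event if this listener's queue is full
  }.toMap

  def post(event: Any): Unit = executors.foreach { case (listener, executor) =>
    executor.execute(new Runnable { def run(): Unit = listener.onEvent(event) })
  }

  def stop(): Unit = executors.values.foreach(_.shutdown())
}
{code}

In-order processing per listener falls out of using a single worker thread, and the 
bounded queue caps the extra driver memory each listener can consume.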



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21867) Support async spilling in UnsafeShuffleWriter

2017-09-12 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163705#comment-16163705
 ] 

Sital Kedia commented on SPARK-21867:
-

[~rxin] - You are right, it is very tricky to get right. I'm working on the PR 
now and will have it in a few days.

> Support async spilling in UnsafeShuffleWriter
> -
>
> Key: SPARK-21867
> URL: https://issues.apache.org/jira/browse/SPARK-21867
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Sital Kedia
>Priority: Minor
>
> Currently, Spark tasks are single-threaded, but we see that we could greatly 
> improve the performance of jobs if we could multi-thread some parts of them. 
> For example, profiling our map tasks, which read a large amount of data from 
> HDFS and spill to disk, we see that we are blocked on HDFS reads and spilling 
> the majority of the time. Since both of these operations are IO-intensive, the 
> average CPU consumption during the map phase is significantly low. In theory, 
> HDFS reads and spilling could be done in parallel if we had additional memory 
> to store data read from HDFS while we are spilling the last batch read.
> Let's say we have 1 GB of shuffle memory available per task. Currently, in the 
> case of a map task, it reads from HDFS and the records are stored in the 
> available memory buffer. Once we hit the memory limit and there is no more 
> space to store the records, we sort and spill the content to disk. While we 
> are spilling to disk, since we do not have any available memory, we cannot 
> read from HDFS concurrently.
> Here we propose supporting async spilling for UnsafeShuffleWriter, so that we 
> can keep reading from HDFS while sorting and spilling happen asynchronously. 
> Let's say the total 1 GB of shuffle memory is split into two regions - an 
> active region and a spilling region - each of size 500 MB. We start by reading 
> from HDFS and filling the active region. Once we hit the limit of the active 
> region, we issue an asynchronous spill while flipping the active and spilling 
> regions. While the spill is happening asynchronously, we still have 500 MB of 
> memory available to read data from HDFS. This way we can amortize the high 
> disk/network IO cost of spilling.
> We made a prototype hack to implement this feature and saw our map tasks run 
> as much as 40% faster.
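
A rough sketch of the double-buffering idea (illustration only, with records abstracted 
to Longs and the spill callback left abstract; this is not the prototype mentioned 
above):

{code}
// Sketch only: while one region spills on a background thread, the other keeps
// accepting records read from HDFS.
import java.util.concurrent.Executors
import scala.collection.mutable.ArrayBuffer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

class DoubleBufferedWriter(regionCapacity: Int)(spill: Seq[Long] => Unit) {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

  private var active = new ArrayBuffer[Long](regionCapacity)  // region being filled
  private var inFlight: Future[Unit] = Future.successful(())  // previous async spill

  def insert(record: Long): Unit = {
    active += record
    if (active.length >= regionCapacity) {
      Await.result(inFlight, Duration.Inf)          // allow at most one spill in flight
      val full = active                             // flip the regions
      active = new ArrayBuffer[Long](regionCapacity)
      inFlight = Future(spill(full))                // spill the full region asynchronously
    }
  }

  def close(): Unit = {
    Await.result(inFlight, Duration.Inf)
    if (active.nonEmpty) spill(active)
  }
}
{code}

A real implementation would wire {{spill}} to the existing sort-and-write path; the 
point is only that reads can continue into the active region while the full one is 
being written out.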



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21988) Add default stats to StreamingExecutionRelation

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21988:


Assignee: (was: Apache Spark)

> Add default stats to StreamingExecutionRelation
> ---
>
> Key: SPARK-21988
> URL: https://issues.apache.org/jira/browse/SPARK-21988
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>
> StreamingExecutionRelation currently doesn't implement stats.
> This makes some sense, but unfortunately the LeafNode contract requires that 
> nodes which survive analysis implement stats, and StreamingExecutionRelation 
> can indeed survive analysis when running explain() on a streaming dataframe.
> This value won't ever be used during execution, because 
> StreamingExecutionRelation does *not* survive analysis on the execution path; 
> it's replaced with each batch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21988) Add default stats to StreamingExecutionRelation

2017-09-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21988:


Assignee: Apache Spark

> Add default stats to StreamingExecutionRelation
> ---
>
> Key: SPARK-21988
> URL: https://issues.apache.org/jira/browse/SPARK-21988
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Apache Spark
>
> StreamingExecutionRelation currently doesn't implement stats.
> This makes some sense, but unfortunately the LeafNode contract requires that 
> nodes which survive analysis implement stats, and StreamingExecutionRelation 
> can indeed survive analysis when running explain() on a streaming dataframe.
> This value won't ever be used during execution, because 
> StreamingExecutionRelation does *not* survive analysis on the execution path; 
> it's replaced with each batch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21988) Add default stats to StreamingExecutionRelation

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163714#comment-16163714
 ] 

Apache Spark commented on SPARK-21988:
--

User 'joseph-torres' has created a pull request for this issue:
https://github.com/apache/spark/pull/19212

> Add default stats to StreamingExecutionRelation
> ---
>
> Key: SPARK-21988
> URL: https://issues.apache.org/jira/browse/SPARK-21988
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>
> StreamingExecutionRelation currently doesn't implement stats.
> This makes some sense, but unfortunately the LeafNode contract requires that 
> nodes which survive analysis implement stats, and StreamingExecutionRelation 
> can indeed survive analysis when running explain() on a streaming dataframe.
> This value won't ever be used during execution, because 
> StreamingExecutionRelation does *not* survive analysis on the execution path; 
> it's replaced with each batch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15705) Spark won't read ORC schema from metastore for partitioned tables

2017-09-12 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163747#comment-16163747
 ] 

Dongjoon Hyun commented on SPARK-15705:
---

Hi, all.

I'm tracking this bug. It seems to have been fixed since 2.1.1.

{code}
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.table("default.test").printSchema
root
 |-- _col0: long (nullable = true)
 |-- _col1: string (nullable = true)
 |-- state: string (nullable = true)

scala> sc.version
res3: String = 2.0.2
{code}

{code}
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sc.version
res3: String = 2.1.1
{code}

{code}
scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)


scala> sql("set spark.sql.hive.convertMetastoreOrc=true")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.table("default.test").printSchema
root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- state: string (nullable = true)

scala> sc.version
res3: String = 2.2.0
{code}

> Spark won't read ORC schema from metastore for partitioned tables
> -
>
> Key: SPARK-15705
> URL: https://issues.apache.org/jira/browse/SPARK-15705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
> Environment: HDP 2.3.4 (Hive 1.2.1, Hadoop 2.7.1)
>Reporter: Nic Eggert
>Assignee: Yin Huai
>Priority: Critical
> Fix For: 2.0.0
>
>
> Spark does not seem to read the schema from the Hive metastore for 
> partitioned tables stored as ORC files. It appears to read the schema from 
> the files themselves, which, if they were created with Hive, does not match 
> the metastore schema (at least not before before Hive 2.0, see HIVE-4243). To 
> reproduce:
> In Hive:
> {code}
> hive> create table default.test (id BIGINT, name STRING) partitioned by 
> (state STRING) stored as orc;
> hive> insert into table default.test partition (state="CA") values (1, 
> "mike"), (2, "steve"), (3, "bill");
> {code}
> In Spark
> {code}
> scala> spark.table("default.test").printSchema
> {code}
> Expected result: Spark should preserve the column names that were defined in 
> Hive.
> Actual Result:
> {code}
> root
>  |-- _col0: long (nullable = true)
>  |-- _col1: string (nullable = true)
>  |-- state: string (nullable = true)
> {code}
> Possibly related to SPARK-14959?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2017-09-12 Thread Barry Becker (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163768#comment-16163768
 ] 

Barry Becker commented on SPARK-21986:
--

But wait, the dataset I discovered the problem with was not a tiny amount of 
data (it was 96,000 values).
Also, I did not think the result was supposed to be approximate if the relative 
error was set to 0 (as I have done).
It seems way off in the first case: the first bucket contains 0 percent of the 
data.


> QuantileDiscretizer picks wrong split point for data with lots of 0's
> -
>
> Key: SPARK-21986
> URL: https://issues.apache.org/jira/browse/SPARK-21986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Barry Becker
>Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much 
> smaller data that has roughly the same distribution of values.
> If I have data like
>   Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of 
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
>  Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of 
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
> I'm not bothered that it gave fewer buckets than asked for (that is to be 
> expected), but I am bothered that it picked 0.0 instead of 40 as the one 
> split point.
> The way it did it, now I have 1 bucket with all the data, and a second with 
> none of the data.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code}
> class QuantileDiscretizerSuite extends FunSuite {
>   test("Quantile discretizer on data with lots of 0") {
> verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>   test("Quantile discretizer on data with one less 0") {
> verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>   
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
> val theData: Seq[(Int, Double)] = data.map {
>   case x: Int => (x, 0.0)
>   case _ => (0, 0.0)
> }
> val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", 
> "unused")
> val qb = new QuantileDiscretizer()
>   .setInputCol("rawCol")
>   .setOutputCol("binnedColumn")
>   .setRelativeError(0.0)
>   .setNumBuckets(3)
>   .fit(df)
> assertResult(expectedSplits) {qb.getSplits}
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21986) QuantileDiscretizer picks wrong split point for data with lots of 0's

2017-09-12 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163778#comment-16163778
 ] 

Sean Owen commented on SPARK-21986:
---

Yeah, I shouldn't say it's specific to a small data set size. The issue is more 
about the distribution of the input. I'm assuming your real input is like this 
example.
Let me also put aside the approximate nature, as that may be irrelevant.

The process is returning the quantiles, not necessarily picking bucket ends 
that yield evenly distributed data, although in many cases those are the same 
thing. The special property here is that one value spans several, even all, of 
the quantiles you requested. That is, why should 40 be the 66% quantile in your 
second example? It lies at about the 83% quantile if I'm counting correctly, 
and there's at least one 0 that's obviously closer.

> QuantileDiscretizer picks wrong split point for data with lots of 0's
> -
>
> Key: SPARK-21986
> URL: https://issues.apache.org/jira/browse/SPARK-21986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Barry Becker
>Priority: Minor
>
> I have some simple test cases to help illustrate (see below).
> I discovered this with data that had 96,000 rows, but can reproduce with much 
> smaller data that has roughly the same distribution of values.
> If I have data like
>   Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> and ask for 3 buckets, then it does the right thing and yields splits of 
> Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity)
> However, if I add just one more zero, such that I have data like
>  Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0)
> then it will do the wrong thing and give splits of 
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
> I'm not bothered that it gave fewer buckets than asked for (that is to be 
> expected), but I am bothered that it picked 0.0 instead of 40 as the one 
> split point.
> The way it did it, now I have 1 bucket with all the data, and a second with 
> none of the data.
> Am I interpreting something wrong?
> Here are my 2 test cases in scala:
> {code}
> class QuantileDiscretizerSuite extends FunSuite {
>   test("Quantile discretizer on data with lots of 0") {
> verify(Seq(0, 0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))
>   }
>   test("Quantile discretizer on data with one less 0") {
> verify(Seq(0, 0, 0, 0, 0, 40, 0, 0, 45, 46, 0),
>   Seq(Double.NegativeInfinity, 0.0, 40.0, Double.PositiveInfinity))
>   }
>   
>   def verify(data: Seq[Int], expectedSplits: Seq[Double]): Unit = {
> val theData: Seq[(Int, Double)] = data.map {
>   case x: Int => (x, 0.0)
>   case _ => (0, 0.0)
> }
> val df = SPARK_SESSION.sqlContext.createDataFrame(theData).toDF("rawCol", 
> "unused")
> val qb = new QuantileDiscretizer()
>   .setInputCol("rawCol")
>   .setOutputCol("binnedColumn")
>   .setRelativeError(0.0)
>   .setNumBuckets(3)
>   .fit(df)
> assertResult(expectedSplits) {qb.getSplits}
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21513) SQL to_json should support all column types

2017-09-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21513.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18875
[https://github.com/apache/spark/pull/18875]

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>  Labels: Starter
> Fix For: 2.3.0
>
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].
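
For illustration, a small sketch of the limitation and the pre-2.3 workaround of 
wrapping the column in a struct, assuming a SparkSession named {{spark}} with 
{{import spark.implicits._}} in scope:

{code}
// Sketch only: "tags" is a map column, so to_json($"tags") fails analysis before this
// change; wrapping it in a struct first has always been accepted.
import org.apache.spark.sql.functions.{struct, to_json}

val df = Seq((1, Map("env" -> "prod"))).toDF("id", "tags")

// df.select(to_json($"tags"))  // AnalysisException prior to this improvement
df.select(to_json(struct($"tags")).as("json")).show(false) // {"tags":{"env":"prod"}}
{code}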



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-12 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163976#comment-16163976
 ] 

Joseph K. Bradley commented on SPARK-21866:
---

1. For the namespace, here are my thoughts:

I don't feel too strongly about this, but I'd vote for putting it under 
{{org.apache.spark.ml.image}}.
Pros:
* The image package will be in the spark-ml sub-project, and this fits that 
structure.
* This will avoid polluting the o.a.s namespace, and we do not yet have any 
other data types listed under o.a.s.
Cons:
* Images are more general than ML.  We might want to move the image package out 
of spark-ml eventually.

2. For the SQL data source, HUGE +1 for making a data source

I'm glad it's mentioned in the SPIP, but I would really like to see it 
prioritized.  There's no need to make a dependency between SQL and ML by adding 
options to the image data source reader; data sources support optional 
arguments.  E.g., the CSV data source has option "delimiter" but that is wholly 
contained within the data source; it doesn't affect other data sources.  Is 
there an option needed by the image data source which will force us to abuse 
the data source API?
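
As a concrete illustration of an option that stays wholly inside one data source (the 
file path here is hypothetical, and `spark` is assumed to be a SparkSession):

{code}
// The "delimiter" option is interpreted only by the CSV source; other sources never
// see it. An image data source could carry its own options the same way.
val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ";")
  .load("/path/to/data.csv") // hypothetical path

df.printSchema()
{code}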

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to

[jira] [Commented] (SPARK-21513) SQL to_json should support all column types

2017-09-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16163982#comment-16163982
 ] 

Hyukjin Kwon commented on SPARK-21513:
--

Hi [~jerryshao], when you are available, would you mind setting the contributor 
role for the JIRA ID 'goldmedal' and assigning this issue to him?

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>  Labels: Starter
> Fix For: 2.3.0
>
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21513) SQL to_json should support all column types

2017-09-12 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao reassigned SPARK-21513:
---

Assignee: Jia-Xuan Liu

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>Assignee: Jia-Xuan Liu
>  Labels: Starter
> Fix For: 2.3.0
>
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21513) SQL to_json should support all column types

2017-09-12 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164000#comment-16164000
 ] 

Saisai Shao commented on SPARK-21513:
-

[~goldmedal] is it your correct JIRA name?

[~hyukjin.kwon] I added you to the admin group; I think you can handle it 
yourself now.

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>Assignee: Jia-Xuan Liu
>  Labels: Starter
> Fix For: 2.3.0
>
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21513) SQL to_json should support all column types

2017-09-12 Thread Jia-Xuan Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164005#comment-16164005
 ] 

Jia-Xuan Liu commented on SPARK-21513:
--

[~jerryshao] Yes, thanks a lot. :)

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>Assignee: Jia-Xuan Liu
>  Labels: Starter
> Fix For: 2.3.0
>
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21513) SQL to_json should support all column types

2017-09-12 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164007#comment-16164007
 ] 

Hyukjin Kwon commented on SPARK-21513:
--

Thanks a lot too :).

> SQL to_json should support all column types
> ---
>
> Key: SPARK-21513
> URL: https://issues.apache.org/jira/browse/SPARK-21513
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Aaron Davidson
>Assignee: Jia-Xuan Liu
>  Labels: Starter
> Fix For: 2.3.0
>
>
> The built-in SQL UDF "to_json" currently supports serializing StructType 
> columns, as well as Arrays of StructType columns. If you attempt to use it on 
> a different type, for example a map, you get an error like this:
> {code}
> AnalysisException: cannot resolve 'structstojson(`tags`)' due to data type 
> mismatch: Input type map must be a struct or array of 
> structs.;;
> {code}
> This limitation seems arbitrary; if I were to go through the effort of 
> enclosing my map in a struct, it would be serializable. Same thing with any 
> other non-struct type.
> Therefore the desired improvement is to allow to_json to operate directly on 
> any column type. The associated code is 
> [here|https://github.com/apache/spark/blob/86174ea89b39a300caaba6baffac70f3dc702788/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L653].



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17642) Support DESC FORMATTED TABLE COLUMN command to show column-level statistics

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164017#comment-16164017
 ] 

Apache Spark commented on SPARK-17642:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19213

> Support DESC FORMATTED TABLE COLUMN command to show column-level statistics
> ---
>
> Key: SPARK-17642
> URL: https://issues.apache.org/jira/browse/SPARK-17642
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
> Fix For: 2.3.0
>
>
> Support DESC (EXTENDED | FORMATTED) ? TABLE COLUMN command to show column 
> information.
> Show column statistics if EXTENDED or FORMATTED is specified.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread taiho choi (JIRA)
taiho choi created SPARK-21989:
--

 Summary: createDataset and the schema of encoder class
 Key: SPARK-21989
 URL: https://issues.apache.org/jira/browse/SPARK-21989
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: taiho choi


Hello.

public class SampleData implements Serializable {
    public String str;
}

ArrayList<String> arr = new ArrayList<>();
arr.add("{\"str\": \"everyone\"}");
arr.add("{\"str\": \"Hello\"}");

JavaRDD<SampleData> data2 = sc.parallelize(arr).map(v -> {return new 
Gson().fromJson(v, SampleData.class);});
Dataset<SampleData> df = sqc.createDataset(data2.rdd(), 
Encoders.bean(SampleData.class));
df.printSchema();

The expected result of printSchema is the str field of the SampleData class.
The actual result is the following:
root

And if I call df.show(), it displays the following:
++
||
++
||
||
++
What I expected is that "Hello" and "everyone" would be displayed.





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21027) Parallel One vs. Rest Classifier

2017-09-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164120#comment-16164120
 ] 

Apache Spark commented on SPARK-21027:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/19214

> Parallel One vs. Rest Classifier
> 
>
> Key: SPARK-21027
> URL: https://issues.apache.org/jira/browse/SPARK-21027
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Ajay Saini
>Assignee: Ajay Saini
> Fix For: 2.3.0
>
>
> Currently, the Scala implementation of OneVsRest allows the user to run a 
> parallel implementation in which each class is evaluated in a different 
> thread. This implementation allows up to a 2X speedup, as determined by 
> experiments, but is currently not tunable. Furthermore, the Python 
> implementation of OneVsRest does not parallelize at all. It would be useful 
> to add a parallel, tunable implementation of OneVsRest to the Python library 
> in order to speed up the algorithm.
>  A ticket for the Scala implementation of this classifier is here: 
> https://issues.apache.org/jira/browse/SPARK-21028
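
A minimal sketch of the per-class parallel training idea (not the Spark patch; 
{{fitBinary}} is a hypothetical stand-in for training one rest-vs-one model):

{code}
// Sketch only: fit one binary problem per class label in parallel on a bounded pool.
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

def fitOneVsRest[M](numClasses: Int, parallelism: Int)(fitBinary: Int => M): Seq[M] = {
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(parallelism))
  // One future per class label; the pool size bounds how many models train concurrently.
  val futures = (0 until numClasses).map(label => Future(fitBinary(label)))
  futures.map(f => Await.result(f, Duration.Inf))
  // (A real implementation would also shut the pool down.)
}
{code}

The {{parallelism}} parameter is the tunable knob referred to above.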



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:55 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:java}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList arr= new ArrayList();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> expected result of printSchema is str field of sampleData class.
> actual result is following.
> root
> and if i call df.show() it displays like following.
> ++
> ||
> ++
> ||
> ||
> ++
> what i expected is , "hello", "everyone" will be displayed.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152
 ] 

Jen-Ming Chung commented on SPARK-21989:


Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}
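
As a side note, the empty schema in the Java version is likely because 
{{Encoders.bean}} derives its schema from JavaBean getters/setters rather than 
public fields, so {{SampleData}} with only a public {{str}} field exposes no 
columns. A minimal Scala sketch under that assumption (the {{SampleBean}} name 
is illustrative):

{code:scala}
// Hedged sketch: give the class JavaBean-style accessors so Encoders.bean
// can discover the "str" column; @BeanProperty generates getStr/setStr.
import scala.beans.BeanProperty
import org.apache.spark.sql.Encoders

class SampleBean extends Serializable {
  @BeanProperty var str: String = _
}

val enc = Encoders.bean(classOf[SampleBean])
enc.schema.printTreeString()
// expected:
// root
//  |-- str: string (nullable = true)
{code}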

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList<String> arr = new ArrayList<>();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD<SampleData> data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset<SampleData> df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> The expected result of printSchema() is the str field of the SampleData class.
> The actual result is the following:
> root
> And if I call df.show(), it displays the following:
> ++
> ||
> ++
> ||
> ||
> ++
> What I expected is that "Hello" and "everyone" would be displayed.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:56 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:java}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList<String> arr = new ArrayList<>();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD<SampleData> data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset<SampleData> df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> The expected result of printSchema() is the str field of the SampleData class.
> The actual result is the following:
> root
> And if I call df.show(), it displays the following:
> ++
> ||
> ++
> ||
> ||
> ++
> What I expected is that "Hello" and "everyone" would be displayed.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 4:58 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:language=Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList<String> arr = new ArrayList<>();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD<SampleData> data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset<SampleData> df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> The expected result of printSchema() is the str field of the SampleData class.
> The actual result is the following:
> root
> And if I call df.show(), it displays the following:
> ++
> ||
> ++
> ||
> ||
> ++
> What I expected is that "Hello" and "everyone" would be displayed.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 5:27 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:language=Scala}
  case class SampleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SampleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:language=Scala}
  case class SimpleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SimpleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SimpleData](v, classOf[SimpleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList<String> arr = new ArrayList<>();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD<SampleData> data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset<SampleData> df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> The expected result of printSchema() is the str field of the SampleData class.
> The actual result is the following:
> root
> And if I call df.show(), it displays the following:
> ++
> ||
> ++
> ||
> ||
> ++
> What I expected is that "Hello" and "everyone" would be displayed.






[jira] [Commented] (SPARK-21987) Spark 2.3 cannot read 2.2 event logs

2017-09-12 Thread jincheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164184#comment-16164184
 ] 

jincheng commented on SPARK-21987:
--

In fact, it only occurs when using SQL.

> Spark 2.3 cannot read 2.2 event logs
> 
>
> Key: SPARK-21987
> URL: https://issues.apache.org/jira/browse/SPARK-21987
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marcelo Vanzin
>Priority: Blocker
>
> Reported by [~jincheng] in a comment in SPARK-18085:
> {noformat}
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: 
> Unrecognized field "metadata" (class 
> org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 
> known properties: "simpleString", "nodeName", "children", "metrics"])
>  at [Source: 
> {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
>  at 
> NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
>  
> Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
>  Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
> array\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
> gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- 
> LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange 
> RoundRobinPartitioning(200)\n+- Scan 
> ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
>  
> RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
>  
> ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
>  of output 
> rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
>  size total (min, med, 
> max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
> 1, column: 1622] (through reference chain: 
> org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
>   at 
> com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
> {noformat}
> This was caused by SPARK-17701 (which at this moment is still open even 
> though the patch has been committed).
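
Not the project's actual fix, but a sketch of the Jackson behavior involved: the 
reader fails because the newer class no longer declares the {{metadata}} field 
present in 2.2 logs, and by default unknown JSON fields raise 
UnrecognizedPropertyException. Relaxing {{FAIL_ON_UNKNOWN_PROPERTIES}} is one 
way a deserializer can stay compatible with older event logs (the class and 
JSON below are illustrative):

{code:scala}
// Illustrative only: a trimmed-down plan-info class without the old
// "metadata" field still parses a 2.2-era JSON fragment once unknown
// properties are tolerated.
import com.fasterxml.jackson.databind.{DeserializationFeature, ObjectMapper}
import com.fasterxml.jackson.module.scala.DefaultScalaModule

case class PlanInfoLite(nodeName: String, simpleString: String)

val mapper = new ObjectMapper()
mapper.registerModule(DefaultScalaModule)
mapper.configure(DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES, false)

val oldJson = """{"nodeName":"Exchange","simpleString":"Exchange","metadata":{}}"""
val info = mapper.readValue(oldJson, classOf[PlanInfoLite])   // parses without error
{code}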






[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-09-12 Thread jincheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164187#comment-16164187
 ] 

jincheng commented on SPARK-18085:
--

Well done! In fact, shs-ng/HEAD performs better than shs-ng/m9, especially in 
terms of page rendering.

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to solving them.






[jira] [Comment Edited] (SPARK-21989) createDataset and the schema of encoder class

2017-09-12 Thread Jen-Ming Chung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164152#comment-16164152
 ] 

Jen-Ming Chung edited comment on SPARK-21989 at 9/13/17 6:16 AM:
-

Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:language=Scala}
  case class SampleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SampleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}


was (Author: jmchung):
Hi [~client.test],
I wrote the above code in Scala and ran it on Spark 2.2.0; it shows the schema 
and content you expected.

{code:language=Scala}
  case class SampleData(str: String)
  ...
  import spark.implicits._
  val arr = Seq("{\"str\": \"everyone\"}", "{\"str\": \"Hello\"}")
  val rdd: RDD[SampleData] =
spark
  .sparkContext
  .parallelize(arr)
  .map(v => new Gson().fromJson[SampleData](v, classOf[SampleData]))
  val ds = spark.createDataset(rdd)
  ds.printSchema()
  root
|-- str: string (nullable = true)

  ds.show(false)
  ++
  |str |
  ++
  |everyone|
  |Hello   |
  ++
{code}

> createDataset and the schema of encoder class
> -
>
> Key: SPARK-21989
> URL: https://issues.apache.org/jira/browse/SPARK-21989
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: taiho choi
>
> Hello.
> public class SampleData implements Serializable {
> public String str;
> }
> ArrayList<String> arr = new ArrayList<>();
> arr.add("{\"str\": \"everyone\"}");
> arr.add("{\"str\": \"Hello\"}");
> JavaRDD<SampleData> data2 = sc.parallelize(arr).map(v -> {return new 
> Gson().fromJson(v, SampleData.class);});
> Dataset<SampleData> df = sqc.createDataset(data2.rdd(), 
> Encoders.bean(SampleData.class));
> df.printSchema();
> The expected result of printSchema() is the str field of the SampleData class.
> The actual result is the following:
> root
> And if I call df.show(), it displays the following:
> ++
> ||
> ++
> ||
> ||
> ++
> What I expected is that "Hello" and "everyone" would be displayed.


