[jira] [Commented] (SPARK-12838) fix a bug in PythonRDD.scala

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103012#comment-15103012
 ] 

Apache Spark commented on SPARK-12838:
--

User 'zhagnlu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10785

> fix a bug  in PythonRDD.scala 
> --
>
> Key: SPARK-12838
> URL: https://issues.apache.org/jira/browse/SPARK-12838
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: zhanglu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12856) speed up hashCode of unsafe array

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12856:


Assignee: Apache Spark

> speed up hashCode of unsafe array
> -
>
> Key: SPARK-12856
> URL: https://issues.apache.org/jira/browse/SPARK-12856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12856) speed up hashCode of unsafe array

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12856:


Assignee: (was: Apache Spark)

> speed up hashCode of unsafe array
> -
>
> Key: SPARK-12856
> URL: https://issues.apache.org/jira/browse/SPARK-12856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12856) speed up hashCode of unsafe array

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15103011#comment-15103011
 ] 

Apache Spark commented on SPARK-12856:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10784

> speed up hashCode of unsafe array
> -
>
> Key: SPARK-12856
> URL: https://issues.apache.org/jira/browse/SPARK-12856
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12838) fix a bug in PythonRDD.scala

2016-01-15 Thread zhanglu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhanglu updated SPARK-12838:

Component/s: Spark Core
Summary: fix a bug  in PythonRDD.scala   (was: fix a problem in 
PythonRDD.scala )

> fix a bug  in PythonRDD.scala 
> --
>
> Key: SPARK-12838
> URL: https://issues.apache.org/jira/browse/SPARK-12838
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: zhanglu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12856) speed up hashCode of unsafe array

2016-01-15 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-12856:
---

 Summary: speed up hashCode of unsafe array
 Key: SPARK-12856
 URL: https://issues.apache.org/jira/browse/SPARK-12856
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12841) UnresolvedException with cast

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102996#comment-15102996
 ] 

Apache Spark commented on SPARK-12841:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10781

> UnresolvedException with cast
> -
>
> Key: SPARK-12841
> URL: https://issues.apache.org/jira/browse/SPARK-12841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Blocker
>
> {code}
> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
> df1.where(df1.col("single").cast("string").equalTo("1"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12841) UnresolvedException with cast

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12841:


Assignee: Apache Spark  (was: Wenchen Fan)

> UnresolvedException with cast
> -
>
> Key: SPARK-12841
> URL: https://issues.apache.org/jira/browse/SPARK-12841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Apache Spark
>Priority: Blocker
>
> {code}
> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
> df1.where(df1.col("single").cast("string").equalTo("1"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12841) UnresolvedException with cast

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12841:


Assignee: Wenchen Fan  (was: Apache Spark)

> UnresolvedException with cast
> -
>
> Key: SPARK-12841
> URL: https://issues.apache.org/jira/browse/SPARK-12841
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Michael Armbrust
>Assignee: Wenchen Fan
>Priority: Blocker
>
> {code}
> val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
> df1.where(df1.col("single").cast("string").equalTo("1"))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes

2016-01-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12840.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10777
[https://github.com/apache/spark/pull/10777]

> Support passing arbitrary objects (not just expressions) into code generated 
> classes
> 
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 2.0.0
>
>
> As of now, our code generator only allows passing Expression objects into the 
> generated class as arguments. In order to support whole-stage codegen (e.g. 
> for broadcast joins), the generated classes need to accept other types of 
> objects such as hash tables.
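A minimal, self-contained sketch of the idea described above (the class names HashedRelation and GeneratedJoiner are illustrative, not Spark's actual codegen API): the "generated" class receives opaque runtime state through a generic references array instead of being limited to Expression arguments.

{code}
// Conceptual sketch only: a "generated" class that takes arbitrary runtime
// objects (e.g. a broadcast hash table) through an untyped references array.
object CodegenReferencesSketch {
  // Stand-in for something like a broadcast hash table.
  final class HashedRelation(val table: Map[Int, String])

  // What a generated class could look like: all non-expression state comes in
  // through Array[AnyRef] and is cast back to its real type inside.
  final class GeneratedJoiner(references: Array[AnyRef]) {
    private val relation = references(0).asInstanceOf[HashedRelation]
    def lookup(key: Int): Option[String] = relation.table.get(key)
  }

  def main(args: Array[String]): Unit = {
    val relation = new HashedRelation(Map(1 -> "a", 2 -> "b"))
    val joiner = new GeneratedJoiner(Array[AnyRef](relation))
    println(joiner.lookup(2)) // Some(b)
  }
}
{code}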



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12855) Remove parser pluggability

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12855:
---

 Summary: Remove parser pluggability
 Key: SPARK-12855
 URL: https://issues.apache.org/jira/browse/SPARK-12855
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


The number of applications that are using this feature is small (as far as I 
know it came down from two to one as of Jan 2016). No other database systems 
support this feature, and it actually encourages 3rd party projects to not 
contribute their improvements back to Spark. We should just remove this 
functionality to simplify our own code base.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102920#comment-15102920
 ] 

Steve Loughran commented on SPARK-12807:


One thing to think about here is ramping up a notch and shading all the 
downstream dependencies in the YARN shuffle JAR. 

This is a JAR designed to be used in a specific place, the classpath. It now 
includes: netty, leveldb, some bits of com.google (in 1.6), and some 
javax.annotation.

What it also has, for extra fun, is a leveldb jni.so in native, as well as a 
netty one. This is going to be a problem; unless you can somehow isolate and 
shade that, this shuffle JAR is going to force a specific leveldb version on 
every bit of code picking up this JAR.
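As a rough illustration of the relocation being discussed (Spark's real build uses Maven, so this sbt-assembly fragment is only a hedged sketch and the target package names are invented), shading rewrites the bundled packages so they cannot clash with the versions already on the NodeManager classpath. Native libraries such as the leveldb jni .so cannot be relocated this way, which is exactly the problem raised above.

{code}
// Hypothetical build.sbt fragment (sbt-assembly); illustrative only.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("io.netty.**"          -> "org.sparkproject.netty.@1").inAll,
  ShadeRule.rename("com.google.common.**" -> "org.sparkproject.guava.@1").inAll,
  ShadeRule.rename("org.iq80.leveldb.**"  -> "org.sparkproject.leveldb.@1").inAll
)
{code}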

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you 
> see a stack trace in the NM logs, indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102907#comment-15102907
 ] 

Apache Spark commented on SPARK-12807:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/10782

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you 
> see a stack trace in the NM logs, indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12644) Basic support for vectorize/batch Parquet decoding

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12644.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Basic support for vectorize/batch Parquet decoding
> --
>
> Key: SPARK-12644
> URL: https://issues.apache.org/jira/browse/SPARK-12644
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
> Fix For: 2.0.0
>
>
> The parquet encodings are largely designed to decode faster in batches, 
> column by column. This can speed up the decoding considerably.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12644) Vectorize/Batch decode parquet

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12644:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-12854

> Vectorize/Batch decode parquet
> --
>
> Key: SPARK-12644
> URL: https://issues.apache.org/jira/browse/SPARK-12644
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
>
> The parquet encodings are largely designed to decode faster in batches, 
> column by column. This can speed up the decoding considerably.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12644) Basic support for vectorize/batch Parquet decoding

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12644:

Summary: Basic support for vectorize/batch Parquet decoding  (was: 
Vectorize/Batch decode parquet)

> Basic support for vectorize/batch Parquet decoding
> --
>
> Key: SPARK-12644
> URL: https://issues.apache.org/jira/browse/SPARK-12644
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Nong Li
>Assignee: Nong Li
>
> The parquet encodings are largely designed to decode faster in batches, 
> column by column. This can speed up the decoding considerably.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12854) Vectorize Parquet reader

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12854:
---

 Summary: Vectorize Parquet reader
 Key: SPARK-12854
 URL: https://issues.apache.org/jira/browse/SPARK-12854
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


The parquet encodings are largely designed to decode faster in batches, column 
by column. This can speed up the decoding considerably.
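A toy sketch of why batched, column-at-a-time decoding helps (this is not Spark's actual Parquet reader; the dictionary-encoded column model below is simplified for illustration): decoding a whole column in a tight loop avoids per-value call overhead and is much friendlier to the JIT and the CPU cache than pulling one value per row.

{code}
object BatchDecodeSketch {
  // Dictionary-encoded column: a small dictionary plus an array of indices.
  final case class DictColumn(dictionary: Array[String], ids: Array[Int])

  // Row-at-a-time style: one call (and its overhead) per value.
  def decodeOne(col: DictColumn, row: Int): String = col.dictionary(col.ids(row))

  // Batched style: decode the whole column into a pre-allocated vector.
  def decodeBatch(col: DictColumn): Array[String] = {
    val out = new Array[String](col.ids.length)
    var i = 0
    while (i < col.ids.length) {
      out(i) = col.dictionary(col.ids(i))
      i += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    val col = DictColumn(Array("a", "b"), Array(0, 1, 1, 0))
    println(decodeBatch(col).mkString(","))  // a,b,b,a
  }
}
{code}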




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12853) Update query planner to use only bucketed reads if it is useful

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12853:
---

 Summary: Update query planner to use only bucketed reads if it is 
useful
 Key: SPARK-12853
 URL: https://issues.apache.org/jira/browse/SPARK-12853
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12852) Support create table DDL with bucketing

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12852:
---

 Summary: Support create table DDL with bucketing
 Key: SPARK-12852
 URL: https://issues.apache.org/jira/browse/SPARK-12852
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12851) Add the ability to understand tables bucketed by Hive

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12851:
---

 Summary: Add the ability to understand tables bucketed by Hive
 Key: SPARK-12851
 URL: https://issues.apache.org/jira/browse/SPARK-12851
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


We added bucketing functionality, but we currently do not understand the 
bucketing properties if a table was generated by Hive. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102883#comment-15102883
 ] 

Steve Loughran commented on SPARK-12807:


There's a PR to shade in trunk; I'm going to do a 1.6 PR too, which should be 
identical (initially for ease of testing that the 1.6 branch is fixed)

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you 
> see a stack trace in the NM logs, indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12704) we may repartition a relation even it's not needed

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12704.
---
Resolution: Later

Closing as later. We will revisit this when the time comes.


> we may repartition a relation even it's not needed
> --
>
> Key: SPARK-12704
> URL: https://issues.apache.org/jira/browse/SPARK-12704
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> The implementation of {{HashPartitioning.compatibleWith}} has been 
> sub-optimal for a while. Think of the following case:
> if {{table_a}} is hash partitioned by int column `i`, and {{table_b}} is also 
> partitioned by int column `i`, logically these 2 partitionings are 
> compatible. However, {{HashPartitioning.compatibleWith}} will return false 
> for this case because the {{AttributeReference}}s of column `i` in these 2 
> tables have different expr ids.
> With this wrong result of {{HashPartitioning.compatibleWith}}, we will go 
> into [this 
> branch|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L390]
>  and may add an unnecessary shuffle.
> This won't impact correctness if the join keys are exactly the same as the 
> hash partitioning keys, as there's still an opportunity to not partition that 
> child in that branch: 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/Exchange.scala#L428
> However, if the join keys are a super-set of the hash partitioning keys, for 
> example, {{table_a}} and {{table_b}} are both hash partitioned by column `i` 
> and we want to join them using columns `i, j`, logically we don't need a 
> shuffle, but in fact the 2 tables, which start out partitioned only by `i`, 
> will redundantly be repartitioned by `i, j`.
> A quick fix is to just set the expr id of {{AttributeReference}} to 0 before 
> we call {{this.semanticEquals(o)}} in {{HashPartitioning.compatibleWith}}, 
> but for the long term, I think we need a better design than the 
> `compatibleWith`, `guarantees`, and `satisfies` mechanism, as it's quite 
> complex.
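A toy model of the quick fix mentioned in the description (AttrRef and HashPartitioning below are simplified stand-ins, not the real Catalyst classes): canonicalise the expr ids before comparing, so two partitionings over the same-named column with different expr ids are treated as compatible.

{code}
object CompatibleWithSketch {
  final case class AttrRef(name: String, exprId: Long)

  final case class HashPartitioning(expressions: Seq[AttrRef], numPartitions: Int) {
    // "Quick fix": zero out expr ids before the semantic comparison.
    private def canonical: HashPartitioning =
      copy(expressions = expressions.map(_.copy(exprId = 0L)))
    def compatibleWith(other: HashPartitioning): Boolean =
      this.canonical == other.canonical
  }

  def main(args: Array[String]): Unit = {
    val a = HashPartitioning(Seq(AttrRef("i", exprId = 1L)), numPartitions = 200)
    val b = HashPartitioning(Seq(AttrRef("i", exprId = 7L)), numPartitions = 200)
    println(a.compatibleWith(b)) // true: no extra shuffle would be needed
  }
}
{code}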



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12848) Parse number as decimal

2016-01-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102881#comment-15102881
 ] 

Reynold Xin commented on SPARK-12848:
-

We discussed this more offline. Let's just switch to decimal.


> Parse number as decimal
> ---
>
> Key: SPARK-12848
> URL: https://issues.apache.org/jira/browse/SPARK-12848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the Hive parser will parse 1.23 as a double; when it's used with 
> decimal columns, the decimal is turned into a double and the precision is 
> lost.
> We should follow what most databases do and parse 1.23 as a decimal; it will 
> be converted into a double when it is used with a double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12850) Support bucket pruning (predicate pushdown for bucketed tables)

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12850:
---

 Summary: Support bucket pruning (predicate pushdown for bucketed 
tables)
 Key: SPARK-12850
 URL: https://issues.apache.org/jira/browse/SPARK-12850
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


We now support bucketing. One optimization opportunity is to push some 
predicates into the scan to skip scanning files that definitely won't match the 
values.
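A small sketch of the pruning idea (a hypothetical helper, not Spark's planner code): for an equality predicate on the bucketing column, only the bucket whose id equals hash(value) % numBuckets can contain matching rows, so the other bucket files can be skipped. Spark's real implementation would use its own hash function; `hashCode` below is purely for illustration.

{code}
object BucketPruningSketch {
  def bucketIdFor(value: Any, numBuckets: Int): Int = {
    val h = value.hashCode % numBuckets
    if (h < 0) h + numBuckets else h          // keep the id non-negative
  }

  // Given one file per bucket, keep only the file that can match the value.
  def prune(bucketFiles: IndexedSeq[String], value: Any): Seq[String] =
    Seq(bucketFiles(bucketIdFor(value, bucketFiles.length)))

  def main(args: Array[String]): Unit = {
    val files = (0 until 4).map(i => s"bucket-$i.parquet")
    println(prune(files, 42))   // only one of the four bucket files is read
  }
}
{code}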





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12849) Bucketing improvements follow-up

2016-01-15 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12849:
---

 Summary: Bucketing improvements follow-up
 Key: SPARK-12849
 URL: https://issues.apache.org/jira/browse/SPARK-12849
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


This is a follow-up ticket for SPARK-12538 to improve bucketing support.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12394) Support writing out pre-hash-partitioned data and exploit that in join optimizations to avoid shuffle (i.e. bucketing in Hive)

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12394.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Support writing out pre-hash-partitioned data and exploit that in join 
> optimizations to avoid shuffle (i.e. bucketing in Hive)
> --
>
> Key: SPARK-12394
> URL: https://issues.apache.org/jira/browse/SPARK-12394
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Nong Li
> Fix For: 2.0.0
>
> Attachments: BucketedTables.pdf
>
>
> In many cases users know ahead of time the columns that they will be joining 
> or aggregating on. Ideally they should be able to leverage this information 
> and pre-shuffle the data so that subsequent queries do not require a shuffle. 
> Hive supports this functionality by allowing the user to define buckets, 
> which are hash partitions of the data based on some key.
>  - Allow the user to specify a set of columns when caching or writing out data
>  - Allow the user to specify some parallelism
>  - Shuffle the data when writing / caching such that it is distributed by 
> these columns
>  - When planning/executing a query, use this distribution to avoid another 
> shuffle when reading, assuming the join or aggregation is compatible with the 
> columns specified
>  - Should work with existing save modes: append, overwrite, etc.
>  - Should work at least with all Hadoop FS data sources
>  - Should work with any data source when caching
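A hedged usage sketch of the write-side API these bullets describe, assuming the Spark 2.0-era DataFrameWriter with bucketBy/sortBy; the column and table names are invented for the example.

{code}
import org.apache.spark.sql.SparkSession

object BucketedWriteSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bucketed-write").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("user_id", "value")
    df.write
      .bucketBy(8, "user_id")   // pre-hash-partition the output into 8 buckets
      .sortBy("user_id")        // keep each bucket sorted
      .mode("overwrite")        // works with the usual save modes
      .saveAsTable("events_bucketed")

    spark.stop()
  }
}
{code}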



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5292) optimize join for table that are already sharded/support for hive bucket

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5292:
---
Assignee: Wenchen Fan

> optimize join for table that are already sharded/support for hive bucket
> 
>
> Key: SPARK-5292
> URL: https://issues.apache.org/jira/browse/SPARK-5292
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: gagan taneja
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Currently joins do not consider the locality of the data and perform the 
> shuffle anyway.
> If the user takes the responsibility of distributing the data based on some 
> hash, or shards the data, the Spark join should be able to leverage the 
> sharding to optimize the join calculation/eliminate the shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12538) bucketed table support

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12538.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> bucketed table support
> --
>
> Key: SPARK-12538
> URL: https://issues.apache.org/jira/browse/SPARK-12538
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> cc [~nongli] , please attach the design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12649) support reading bucketed table

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12649.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> support reading bucketed table
> --
>
> Key: SPARK-12649
> URL: https://issues.apache.org/jira/browse/SPARK-12649
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5292) optimize join for table that are already sharded/support for hive bucket

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5292:
---
Fix Version/s: 2.0.0

> optimize join for table that are already sharded/support for hive bucket
> 
>
> Key: SPARK-5292
> URL: https://issues.apache.org/jira/browse/SPARK-5292
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: gagan taneja
> Fix For: 2.0.0
>
>
> Currently joins do not consider the locality of the data and perform the 
> shuffle anyway.
> If the user takes the responsibility of distributing the data based on some 
> hash, or shards the data, the Spark join should be able to leverage the 
> sharding to optimize the join calculation/eliminate the shuffle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11512) Bucket Join

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11512:

Fix Version/s: 2.0.0

> Bucket Join
> ---
>
> Key: SPARK-11512
> URL: https://issues.apache.org/jira/browse/SPARK-11512
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>
> Sort merge join on two datasets on the file system that have already been 
> partitioned the same way, with the same number of partitions and sorted 
> within each partition, so we don't need to sort them again while joining on 
> the sorted/partitioned keys.
> This functionality exists in
> - Hive (hive.optimize.bucketmapjoin.sortedmerge)
> - Pig (USING 'merge')
> - MapReduce (CompositeInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12842) Add Hadoop 2.7 build profile

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12842.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Add Hadoop 2.7 build profile
> 
>
> Key: SPARK-12842
> URL: https://issues.apache.org/jira/browse/SPARK-12842
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.0
>
>
> We should add a Hadoop 2.7 build profile so that we can automate tests 
> against Hadoop 2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12783:


Assignee: Apache Spark  (was: Wenchen Fan)

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Apache Spark
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;
>   - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> name: arguments, type: interface scala.collection.Seq)
>   - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> staticinvoke(class 
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
> scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  
> [Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType

[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102817#comment-15102817
 ] 

Apache Spark commented on SPARK-12783:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10781

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;
>   - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> name: arguments, type: interface scala.collection.Seq)
>   - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> staticinvoke(class 
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
> scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),a

[jira] [Assigned] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12783:


Assignee: Wenchen Fan  (was: Apache Spark)

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;
>   - field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> name: arguments, type: interface scala.collection.Seq)
>   - object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
> staticinvoke(class 
> org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
> scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  
> [Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType,

[jira] [Commented] (SPARK-12848) Parse number as decimal

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102812#comment-15102812
 ] 

Herman van Hovell commented on SPARK-12848:
---

[~Davies] We discussed the regression in the PR 
(https://github.com/apache/spark/pull/10745). I removed the functionality you 
currently ask for today 
(https://github.com/hvanhovell/spark/commit/7e31ee8a8ac36a600e0965ceefd297c33ffe0edc).
 We can revert this, the only thing is that we need to disable some Hive tests 
(which expect a Double).

> Parse number as decimal
> ---
>
> Key: SPARK-12848
> URL: https://issues.apache.org/jira/browse/SPARK-12848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the Hive parser will parse 1.23 as a double; when it's used with 
> decimal columns, the decimal is turned into a double and the precision is 
> lost.
> We should follow what most databases do and parse 1.23 as a decimal; it will 
> be converted into a double when it is used with a double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib

2016-01-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102802#comment-15102802
 ] 

Davies Liu commented on SPARK-11219:


It's nice to have; it's useful when you use the online help in the console. 

> Make Parameter Description Format Consistent in PySpark.MLlib
> -
>
> Key: SPARK-11219
> URL: https://issues.apache.org/jira/browse/SPARK-11219
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> There are several different formats for describing params in PySpark.MLlib, 
> making it unclear what the preferred way to document is, i.e. vertical 
> alignment vs single line.
> This is to agree on a format and make it consistent across PySpark.MLlib.
> Following the discussion in SPARK-10560, using 2 lines with an indentation is 
> both readable and doesn't lead to changing many lines when adding/removing 
> parameters. If the parameter uses a default value, put this in parentheses 
> on a new line under the description.
> Example:
> {noformat}
> :param stepSize:
>   Step size for each iteration of gradient descent.
>   (default: 0.1)
> :param numIterations:
>   Number of iterations run for each batch of data.
>   (default: 50)
> {noformat}
> h2. Current State of Parameter Description Formatting
> h4. Classification
>   * LogisticRegressionModel - single line descriptions, fix indentations
>   * LogisticRegressionWithSGD - vertical alignment, sporadic default values
>   * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values
>   * SVMModel - single line
>   * SVMWithSGD - vertical alignment, sporadic default values
>   * NaiveBayesModel - single line
>   * NaiveBayes - single line
> h4. Clustering
>   * KMeansModel - missing param description
>   * KMeans - missing param description and defaults
>   * GaussianMixture - vertical align, incorrect default formatting
>   * PowerIterationClustering - single line with wrapped indentation, missing 
> defaults
>   * StreamingKMeansModel - single line wrapped
>   * StreamingKMeans - single line wrapped, missing defaults
>   * LDAModel - single line
>   * LDA - vertical align, missing some defaults
> h4. FPM  
>   * FPGrowth - single line
>   * PrefixSpan - single line, defaults values in backticks
> h4. Recommendation
>   * ALS - does not have param descriptions
> h4. Regression
>   * LabeledPoint - single line
>   * LinearModel - single line
>   * LinearRegressionWithSGD - vertical alignment
>   * RidgeRegressionWithSGD - vertical align
>   * IsotonicRegressionModel - single line
>   * IsotonicRegression - single line, missing default
> h4. Tree
>   * DecisionTree - single line with vertical indentation, missing defaults
>   * RandomForest - single line with wrapped indent, missing some defaults
>   * GradientBoostedTrees - single line with wrapped indent
> NOTE
> This issue will just focus on model/algorithm descriptions, which are the 
> largest source of inconsistent formatting
> evaluation.py, feature.py, random.py, utils.py - these supporting classes 
> have param descriptions as single line, but are consistent so don't need to 
> be changed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12848) Parse number as decimal

2016-01-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102797#comment-15102797
 ] 

Davies Liu commented on SPARK-12848:


[~hvanhovell] The `BD` tag only works in Hive; other databases (MySQL, 
PostgreSQL, Impala, etc.) do not need this tag for decimals to work correctly. 
The reason I created this JIRA as a subtask is that the previous SQL parser 
could handle this, but the new parser can't (kind of a regression).

> Parse number as decimal
> ---
>
> Key: SPARK-12848
> URL: https://issues.apache.org/jira/browse/SPARK-12848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the Hive parser will parse 1.23 as a double; when it's used with 
> decimal columns, the decimal is turned into a double and the precision is 
> lost.
> We should follow what most databases do and parse 1.23 as a decimal; it will 
> be converted into a double when it is used with a double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12848) Parse number as decimal

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102763#comment-15102763
 ] 

Herman van Hovell commented on SPARK-12848:
---

Assuming that we are talking about literals here, it is quite easy to change 
the parse defaults for that.

The way it is currently done is that when we find a decimal number, {{1.23}} 
for example, we will convert it into a Double (always). When a user needs a 
Decimal, he (or she) can use a BigDecimal literal for this by tagging the 
number with {{BD}}.

[~davies] I might not be getting the point you are making, but I think we have 
covered this by using BigDecimal literals. Could you provide an example 
otherwise?
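A small sketch of the literal behaviour described in this comment, assuming the {{BD}} suffix works as stated (Spark 1.6-era SQLContext API; the app name and column aliases are invented for illustration):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DecimalLiteralSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("literals").setMaster("local[1]"))
    val sqlContext = new SQLContext(sc)

    // Plain numeric literal: parsed as a double today.
    sqlContext.sql("select 1.23 as d").printSchema()
    // Tagging the literal with BD asks for a BigDecimal/decimal instead.
    sqlContext.sql("select 1.23BD as bd").printSchema()

    sc.stop()
  }
}
{code}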

> Parse number as decimal
> ---
>
> Key: SPARK-12848
> URL: https://issues.apache.org/jira/browse/SPARK-12848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> Right now, the Hive parser will parse 1.23 as a double; when it's used with 
> decimal columns, the decimal is turned into a double and the precision is 
> lost.
> We should follow what most databases do and parse 1.23 as a decimal; it will 
> be converted into a double when it is used with a double.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12683) SQL timestamp is wrong when accessed as Python datetime

2016-01-15 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102749#comment-15102749
 ] 

Jason C Lee commented on SPARK-12683:
-

Looks like collect() eventually calls py4j's collectToPython, which then 
returns the port for the socket that contains the wrong answer. I am not all 
that familiar with how py4j works; any py4j expert is welcome here!

> SQL timestamp is wrong when accessed as Python datetime
> ---
>
> Key: SPARK-12683
> URL: https://issues.apache.org/jira/browse/SPARK-12683
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.1, 1.5.2, 1.6.0
> Environment: Windows 7 Pro x64
> Python 3.4.3
> py4j 0.9
>Reporter: Gerhard Fiedler
> Attachments: spark_bug_date.py
>
>
> When accessing SQL timestamp data through {{.show()}}, it looks correct, but 
> when accessing it (as Python {{datetime}}) through {{.collect()}}, it is 
> wrong.
> {code}
> from datetime import datetime
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> if __name__ == "__main__":
>     spark_context = SparkContext(appName='SparkBugTimestampHour')
>     sql_context = SQLContext(spark_context)
>     sql_text = """select cast('2100-09-09 12:11:10.09' as timestamp) as ts"""
>     data_frame = sql_context.sql(sql_text)
>     data_frame.show(truncate=False)
>     # Result from .show() (as expected, looks correct):
>     # +----------------------+
>     # |ts                    |
>     # +----------------------+
>     # |2100-09-09 12:11:10.09|
>     # +----------------------+
>     rows = data_frame.collect()
>     row = rows[0]
>     ts = row[0]
>     print('ts={ts}'.format(ts=ts))
>     # Expected result from this print statement:
>     # ts=2100-09-09 12:11:10.09
>     #
>     # Actual, wrong result (note the hours being 18 instead of 12):
>     # ts=2100-09-09 18:11:10.09
>     #
>     # This error seems to be dependent on some characteristic of the system.
>     # We couldn't reproduce this on all of our systems, but it is not clear
>     # what the differences are. One difference is the processor: it failed
>     # on Intel Xeon E5-2687W v2.
>     assert isinstance(ts, datetime)
>     assert ts.year == 2100 and ts.month == 9 and ts.day == 9
>     assert ts.minute == 11 and ts.second == 10 and ts.microsecond == 9
>     if ts.hour != 12:
>         print('hour is not correct; should be 12, is actually '
>               '{hour}'.format(hour=ts.hour))
>     spark_context.stop()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102740#comment-15102740
 ] 

Shixiong Zhu commented on SPARK-12847:
--

Ah, I think this one should be a sub-task. Let me change it.

> Remove StreamingListenerBus and post all Streaming events to the same thread 
> as Spark events
> 
>
> Key: SPARK-12847
> URL: https://issues.apache.org/jira/browse/SPARK-12847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SparkListener.onOtherEvent was added in 
> https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
> SQL-specific events instead of creating a new, separate listener bus.
> Streaming can also use a similar approach to eliminate the 
> StreamingListenerBus. Right now, the nondeterministic message order across 
> the two listener buses is really tricky when someone implements both 
> SparkListener and StreamingListener. If we can use only one listener bus in 
> Spark, the nondeterministic message order will be eliminated and we can also 
> remove a lot of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12847:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-12140

> Remove StreamingListenerBus and post all Streaming events to the same thread 
> as Spark events
> 
>
> Key: SPARK-12847
> URL: https://issues.apache.org/jira/browse/SPARK-12847
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SparkListener.onOtherEvent was added in 
> https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
> SQL-specific events instead of creating a new, separate listener bus.
> Streaming can also use the same approach to eliminate the 
> StreamingListenerBus. Right now, the nondeterministic message order across 
> the two listener buses is really tricky when someone implements both 
> SparkListener and StreamingListener. If we use only one listener bus in 
> Spark, the nondeterministic message order is eliminated and we can also 
> remove a lot of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12848) Parse number as decimal

2016-01-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12848:
--

 Summary: Parse number as decimal
 Key: SPARK-12848
 URL: https://issues.apache.org/jira/browse/SPARK-12848
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Davies Liu


Right now, the Hive parser parses 1.23 as a double, so when it is used with decimal 
columns the decimal gets turned into a double and precision is lost.

We should follow what most databases do and parse 1.23 as a decimal; it will still be 
converted into a double when it is used together with doubles.
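As a rough illustration of the precision loss (a sketch only; it assumes a 1.6-era 
SQLContext bound to {{sqlContext}}, and the column name is made up):

{code}
// The fractional literal 1.1 is currently parsed as a DoubleType, so the
// decimal operand is widened to double and the extra digits are silently lost.
val df = sqlContext.sql(
  "SELECT CAST('1.23456789012345678901' AS DECIMAL(38, 20)) * 1.1 AS product")
df.printSchema()   // product comes back as double, not decimal, under the current parser
df.show(false)
{code}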



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102729#comment-15102729
 ] 

Apache Spark commented on SPARK-12807:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/10780

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102730#comment-15102730
 ] 

Marcelo Vanzin commented on SPARK-12847:


Kinda the same as SPARK-12140.

> Remove StreamingListenerBus and post all Streaming events to the same thread 
> as Spark events
> 
>
> Key: SPARK-12847
> URL: https://issues.apache.org/jira/browse/SPARK-12847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SparkListener.onOtherEvent was added in 
> https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
> SQL-specific events instead of creating a new, separate listener bus.
> Streaming can also use the same approach to eliminate the 
> StreamingListenerBus. Right now, the nondeterministic message order across 
> the two listener buses is really tricky when someone implements both 
> SparkListener and StreamingListener. If we use only one listener bus in 
> Spark, the nondeterministic message order is eliminated and we can also 
> remove a lot of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12807:


Assignee: Apache Spark

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Assignee: Apache Spark
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12847:


Assignee: Apache Spark  (was: Shixiong Zhu)

> Remove StreamingListenerBus and post all Streaming events to the same thread 
> as Spark events
> 
>
> Key: SPARK-12847
> URL: https://issues.apache.org/jira/browse/SPARK-12847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>
> SparkListener.onOtherEvent was added in 
> https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
> SQL-specific events instead of creating a new, separate listener bus.
> Streaming can also use the same approach to eliminate the 
> StreamingListenerBus. Right now, the nondeterministic message order across 
> the two listener buses is really tricky when someone implements both 
> SparkListener and StreamingListener. If we use only one listener bus in 
> Spark, the nondeterministic message order is eliminated and we can also 
> remove a lot of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12807:


Assignee: (was: Apache Spark)

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102728#comment-15102728
 ] 

Apache Spark commented on SPARK-12847:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10779

> Remove StreamingListenerBus and post all Streaming events to the same thread 
> as Spark events
> 
>
> Key: SPARK-12847
> URL: https://issues.apache.org/jira/browse/SPARK-12847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SparkListener.onOtherEvent was added in 
> https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
> SQL-specific events instead of creating a new, separate listener bus.
> Streaming can also use the same approach to eliminate the 
> StreamingListenerBus. Right now, the nondeterministic message order across 
> the two listener buses is really tricky when someone implements both 
> SparkListener and StreamingListener. If we use only one listener bus in 
> Spark, the nondeterministic message order is eliminated and we can also 
> remove a lot of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12847:


Assignee: Shixiong Zhu  (was: Apache Spark)

> Remove StreamingListenerBus and post all Streaming events to the same thread 
> as Spark events
> 
>
> Key: SPARK-12847
> URL: https://issues.apache.org/jira/browse/SPARK-12847
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> SparkListener.onOtherEvent was added in 
> https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
> SQL-specific events instead of creating a new, separate listener bus.
> Streaming can also use the same approach to eliminate the 
> StreamingListenerBus. Right now, the nondeterministic message order across 
> the two listener buses is really tricky when someone implements both 
> SparkListener and StreamingListener. If we use only one listener bus in 
> Spark, the nondeterministic message order is eliminated and we can also 
> remove a lot of code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12833) Initial import of databricks/spark-csv

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102726#comment-15102726
 ] 

Apache Spark commented on SPARK-12833:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10778

> Initial import of databricks/spark-csv
> --
>
> Key: SPARK-12833
> URL: https://issues.apache.org/jira/browse/SPARK-12833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12847) Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events

2016-01-15 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12847:


 Summary: Remove StreamingListenerBus and post all Streaming events 
to the same thread as Spark events
 Key: SPARK-12847
 URL: https://issues.apache.org/jira/browse/SPARK-12847
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Streaming
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu


SparkListener.onOtherEvent was added in 
https://github.com/apache/spark/pull/10061. SQLListener uses it to dispatch 
SQL-specific events instead of creating a new, separate listener bus.

Streaming can also use the same approach to eliminate the StreamingListenerBus. 
Right now, the nondeterministic message order across the two listener buses is 
really tricky when someone implements both SparkListener and StreamingListener. 
If we use only one listener bus in Spark, the nondeterministic message order is 
eliminated and we can also remove a lot of code.
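As a rough sketch of what the pattern looks like on the listener side (hedged: the 
event class below is hypothetical and only stands in for whatever events streaming 
would define; it assumes a build that already includes SparkListener.onOtherEvent 
from the PR above):

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

// Hypothetical custom event; streaming would define its own events like this
// and post them on the single Spark listener bus.
case class HypotheticalBatchCompleted(batchTime: Long) extends SparkListenerEvent

// The listener picks its events out of onOtherEvent instead of needing a
// dedicated StreamingListenerBus.
class HypotheticalStreamingListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case HypotheticalBatchCompleted(t) => println(s"batch completed at $t")
    case _ => // not ours; ignore
  }
}

// Registration is the usual sc.addSparkListener(new HypotheticalStreamingListener()).
{code}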



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA

2016-01-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11925.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 9908
[https://github.com/apache/spark/pull/9908]

> Add PySpark missing methods for ml.feature during Spark 1.6 QA
> --
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add PySpark missing methods and params for ml.feature
> * RegexTokenizer should support setting toLowercase.
> * MinMaxScalerModel should support output originalMin and originalMax.
> * PCAModel should support output pc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5159) Thrift server does not respect hive.server2.enable.doAs=true

2016-01-15 Thread Luciano Resende (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102702#comment-15102702
 ] 

Luciano Resende commented on SPARK-5159:


[~ilovesoup] As I mentioned before, most if not all of your changes have already been 
applied via SPARK-6910.

@All, I understand there is a bigger issue here regarding data that is stored outside 
of Hive, but I would treat that as a separate epic for Spark Data Security. For this 
current issue, I would like us to concentrate on the remaining problem related to 
doAs when Kerberos is enabled.

> Thrift server does not respect hive.server2.enable.doAs=true
> 
>
> Key: SPARK-5159
> URL: https://issues.apache.org/jira/browse/SPARK-5159
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.2.0
>Reporter: Andrew Ray
> Attachments: spark_thrift_server_log.txt
>
>
> I'm currently testing the spark sql thrift server on a kerberos secured 
> cluster in YARN mode. Currently any user can access any table regardless of 
> HDFS permissions as all data is read as the hive user. In HiveServer2 the 
> property hive.server2.enable.doAs=true causes all access to be done as the 
> submitting user. We should do the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-01-15 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12846:
---
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11806

> Follow up SPARK-12707, Update documentation and other related code
> --
>
> Key: SPARK-12846
> URL: https://issues.apache.org/jira/browse/SPARK-12846
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Jeff Zhang
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12846) Follow up SPARK-12707, Update documentation and other related code

2016-01-15 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12846:
--

 Summary: Follow up SPARK-12707, Update documentation and other 
related code
 Key: SPARK-12846
 URL: https://issues.apache.org/jira/browse/SPARK-12846
 Project: Spark
  Issue Type: Improvement
Reporter: Jeff Zhang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12840:

Description: As of now, our code generator only allows passing Expression 
objects into the generated class as arguments. In order to support whole-stage 
codegen (e.g. for broadcast joins), the generated classes need to accept other 
types of objects such as hash tables.  (was: Right now, we only support 
expression.)

> Support passing arbitrary objects (not just expressions) into code generated 
> classes
> 
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> As of now, our code generator only allows passing Expression objects into the 
> generated class as arguments. In order to support whole-stage codegen (e.g. 
> for broadcast joins), the generated classes need to accept other types of 
> objects such as hash tables.
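Conceptually, the generated class would receive such objects through an opaque 
references array. The snippet below is a hand-written stand-in for that idea, not 
Spark's actual generated code or codegen API:

{code}
// Hand-written illustration only: a "generated" class that takes non-Expression
// objects (here a broadcast-style hash table) via a references array.
class IllustrativeGeneratedJoin(references: Array[AnyRef]) {
  // by convention in this sketch, reference 0 is the hash table built on the driver
  private val hashTable =
    references(0).asInstanceOf[java.util.HashMap[Long, String]]

  def lookup(key: Long): String = hashTable.get(key)
}

val table = new java.util.HashMap[Long, String]()
table.put(1L, "row-1")
val join = new IllustrativeGeneratedJoin(Array[AnyRef](table))
println(join.lookup(1L))   // prints row-1
{code}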



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12575) Grammar parity with existing SQL parser

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12575.
-
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.0.0

> Grammar parity with existing SQL parser
> ---
>
> Key: SPARK-12575
> URL: https://issues.apache.org/jira/browse/SPARK-12575
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Herman van Hovell
> Fix For: 2.0.0
>
>
> The new parser should be compatible with our existing SQL parser built using 
> Scala parser combinator. One thing that is different is how we parse time 
> intervals. There might be more.
> Once we reach parity, we should just switch and remove the old SQL parser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102662#comment-15102662
 ] 

Steve Loughran commented on SPARK-12807:


Work on YARN isolation will address this in Hadoop 2.8+, but that does nothing for 
earlier Hadoop versions. Shading will.
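(Spark's build is Maven-based, so the real change would live in the maven-shade-plugin 
configuration; purely as a sketch of the relocation idea, an sbt-assembly style rule 
would look like the following, with the relocated package name being a placeholder 
rather than whatever the eventual patch uses.)

{code}
// build.sbt fragment (sbt-assembly syntax), illustration only: relocate Jackson
// classes inside the shuffle JAR so they cannot clash with the NM's Jackson 2.2.3.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.fasterxml.jackson.**" -> "org.sparkproject.jackson.@1").inAll
)
{code}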


> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102654#comment-15102654
 ] 

Sean Owen commented on SPARK-12807:
---

I see, it's only the shuffle and only 1.6, and only happens to affect the 
shuffle service on YARN. Spark has otherwise been using later Jackson for a 
while. Shading is indeed probably the best thing for all of Spark's usages.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10538) java.lang.NegativeArraySizeException during join

2016-01-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102653#comment-15102653
 ] 

Davies Liu commented on SPARK-10538:


@mayxine The problem you posted is not related to this JIRA. It could be that 
rdd1.partitions.length * rdd2.partitions.length overflows if the numbers of 
partitions of the two RDDs are too large.
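(The product is computed with Int arithmetic, so it can wrap around to a negative 
value, which is what then surfaces as a NegativeArraySizeException; a quick 
illustration of the arithmetic:)

{code}
// rdd1.partitions.length * rdd2.partitions.length is an Int multiplication;
// with enough partitions on both sides it silently wraps to a negative number.
val p1 = 60000
val p2 = 60000
println(p1 * p2)           // -694967296: overflowed Int
println(p1.toLong * p2)    // 3600000000: the intended value
{code}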

> java.lang.NegativeArraySizeException during join
> 
>
> Key: SPARK-10538
> URL: https://issues.apache.org/jira/browse/SPARK-10538
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>Assignee: Davies Liu
> Attachments: java.lang.NegativeArraySizeException.png, 
> screenshot-1.png
>
>
> Hi,
> I've got a problem during joining tables in PySpark. (in my example 20 of 
> them)
> I can observe that during calculation of first partition (on one of 
> consecutive joins) there is a big shuffle read size (294.7 MB / 146 records) 
> vs on others partitions (approx. 272.5 KB / 113 record)
> I can also observe that just before the crash python process going up to few 
> gb of RAM.
> After some time there is an exception:
> {code}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:90)
>   at 
> org.apache.spark.sql.execution.TungstenProject$$anonfun$3$$anonfun$apply$3.apply(basicOperators.scala:88)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:119)
>   at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> I'm running this on 2 nodes cluster (12 cores, 64 GB RAM)
> Config:
> {code}
> spark.driver.memory  10g
> spark.executor.extraJavaOptions -XX:-UseGCOverheadLimit -XX:+UseParallelGC 
> -Dfile.encoding=UTF8
> spark.executor.memory   60g
> spark.storage.memoryFraction0.05
> spark.shuffle.memoryFraction0.75
> spark.driver.maxResultSize  10g  
> spark.cores.max 24
> spark.kryoserializer.buffer.max 1g
> spark.default.parallelism   200
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12840) Support passing arbitrary objects (not just expressions) into code generated classes

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-12840:

Summary: Support passing arbitrary objects (not just expressions) into code 
generated classes  (was: Support pass any object into codegen as reference)

> Support passing arbitrary objects (not just expressions) into code generated 
> classes
> 
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102647#comment-15102647
 ] 

Steve Loughran commented on SPARK-12807:


FWIW, I'm working on shading Jackson in the shuffle JAR.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102645#comment-15102645
 ] 

Steve Loughran commented on SPARK-12807:


The problem is that there are no guarantees the Spark-provided versions are 
backwards compatible with the older version. If they come first, the NM itself may fail.


> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12840) Support pass any object into codegen as reference

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102629#comment-15102629
 ] 

Apache Spark commented on SPARK-12840:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10777

> Support pass any object into codegen as reference
> -
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12840) Support pass any object into codegen as reference

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12840:


Assignee: Apache Spark  (was: Davies Liu)

> Support pass any object into codegen as reference
> -
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12840) Support pass any object into codegen as reference

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12840:


Assignee: Davies Liu  (was: Apache Spark)

> Support pass any object into codegen as reference
> -
>
> Key: SPARK-12840
> URL: https://issues.apache.org/jira/browse/SPARK-12840
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> Right now, we only support expression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12149) Executor UI improvement suggestions - Color UI

2016-01-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-12149:
--
Assignee: Alex Bozarth

> Executor UI improvement suggestions - Color UI
> --
>
> Key: SPARK-12149
> URL: https://issues.apache.org/jira/browse/SPARK-12149
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
>
> Splitting off the Color UI portion of the parent UI improvements task, 
> description copied below:
> Fill some of the cells with color in order to make it easier to absorb the 
> info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed)).
> If dark blue, then write the value in white (same for the RED and GREEN above).
> Merging another idea from SPARK-2132: 
> Color GC time red when over a percentage of task time



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12716) Executor UI improvement suggestions - Totals

2016-01-15 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-12716.
---
   Resolution: Fixed
 Assignee: Alex Bozarth
Fix Version/s: 2.0.0

> Executor UI improvement suggestions - Totals
> 
>
> Key: SPARK-12716
> URL: https://issues.apache.org/jira/browse/SPARK-12716
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
> Fix For: 2.0.0
>
>
> Splitting off the Totals portion of the parent UI improvements task, 
> description copied below:
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
> ...
> Report the TOTALS in each column (do this at the TOP so no need to scroll to 
> the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102532#comment-15102532
 ] 

Herman van Hovell commented on SPARK-12835:
---

Thanks for that.

The {{df.groupby(key).agg(avg_diff)}} call is problematic. The lag window function 
doesn't have any partitioning defined, so it will move all data to a single 
thread on a single node. The {{diff}} value can also be based on dates with 
different keys.
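In Scala terms, the partitioned variant would look roughly like the following (a 
sketch only: the {{key}} column is hypothetical since the minimal example has no such 
column, and it addresses the single-partition concern rather than the 
StackOverflowError itself):

{code}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{avg, col, datediff, lag}

// With partitionBy, lag() no longer pulls every row into one partition.
val w = Window.partitionBy(col("key")).orderBy(col("ts"))
val diff = datediff(col("ts"), lag(col("ts"), 1).over(w))

// Assuming a DataFrame `df` with `key` and `ts` columns (hypothetical here):
// df.groupBy(col("key")).agg(avg(diff)).show()
{code}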

> StackOverflowError when aggregating over column from window function
> 
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Kalle Jepsen
>
> I am encountering a StackoverflowError with a very long traceback, when I try 
> to directly aggregate on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a 
> Dataframe column by using a window-function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
> [datetime(2014,1,1)],
> [datetime(2014,2,1)],
> [datetime(2014,3,1)],
> [datetime(2014,3,6)],
> [datetime(2014,8,23)],
> [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
> ts,
> functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing {code}
> df.select(avg_diff).show()
> {code} throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on 
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12030:
---
Attachment: (was: spark.jpg)

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Nong Li
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
>
> I have following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here magic begins - I counted distinct id1 from joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12030:
---
Attachment: (was: t2.tar.gz)

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Nong Li
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
>
> I have following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here magic begins - I counted distinct id1 from joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12030) Incorrect results when aggregate joined data

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12030:
---
Attachment: (was: t1.tar.gz)

> Incorrect results when aggregate joined data
> 
>
> Key: SPARK-12030
> URL: https://issues.apache.org/jira/browse/SPARK-12030
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>Assignee: Nong Li
>Priority: Blocker
> Fix For: 1.5.3, 1.6.0
>
> Attachments: spark.jpg, t2.tar.gz
>
>
> I have following issue.
> I created 2 dataframes from JDBC (MySQL) and joined them (t1 has fk1 to t2)
> {code}
> t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t1, id1, 0, size1, 200).cache()
> t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", t2).cache()
> joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
> {code}
> Important: both tables are cached, so results should be the same on every 
> query.
> Then I did some counts:
> {code}
> t1.count() -> 5900729
> t1.registerTempTable("t1")
> sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
> t2.count() -> 54298
> joined.count() -> 5900729
> {code}
> And here magic begins - I counted distinct id1 from joined table
> {code}
> joined.registerTempTable("joined")
> sqlCtx.sql("select distinct(id1) from joined").count()
> {code}
> Results vary *(they are different on every run)* between 5899000 and 
> 590 but are never equal to 5900729.
> In addition, I did more queries:
> {code}
> sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 
> 1").collect() 
> {code}
> This gives some results, but this query returns *1*
> {code}
> len(sqlCtx.sql("select * from joined where id1 = result").collect())
> {code}
> What's wrong ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12845) During join Spark should pushdown predicates to both tables

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12845:
---
Description: 
I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.

  was:
I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query, which makes no 
sense:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.


> During join Spark should pushdown predicates to both tables
> ---
>
> Key: SPARK-12845
> URL: https://issues.apache.org/jira/browse/SPARK-12845
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> I have the following issue.
> I'm joining two tables with a where condition:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234
> {code}
> In this query the predicate is only pushed down to t1.
> To get predicates on both tables I have to run the following query:
> {code}
> select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = 
> 1234
> {code}
> Spark should present the same behaviour for both queries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12843) Spark should avoid scanning all partitions when limit is set

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12843:
---
Issue Type: Bug  (was: Improvement)

> Spark should avoid scanning all partitions when limit is set
> 
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when enough data is available at the 
> beginning of the scan.
> This behaviour should be avoided and the scan should stop once enough data has 
> been collected.
> Is it related to: [SPARK-9850] ?
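As a possible stop-gap (a sketch, not a statement about what the eventual fix will do), 
{{DataFrame.take}} fetches rows by scanning partitions incrementally, so it can avoid 
touching every partition when the first ones already hold enough rows:

{code}
// Workaround sketch, assuming the table is registered as "table" on a SQLContext
// named sqlCtx: take(n) starts with a few partitions and only scans more if it
// still needs rows.
val firstHundred = sqlCtx.sql("select * from table").take(100)
firstHundred.foreach(println)
{code}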



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12845) During join Spark should pushdown predicates to both tables

2016-01-15 Thread JIRA
Maciej Bryński created SPARK-12845:
--

 Summary: During join Spark should pushdown predicates to both 
tables
 Key: SPARK-12845
 URL: https://issues.apache.org/jira/browse/SPARK-12845
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Maciej Bryński


I have the following issue.
I'm joining two tables with a where condition:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234
{code}
In this query the predicate is only pushed down to t1.
To get predicates on both tables I have to run the following query, which makes no 
sense:
{code}
select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234 and t2.id2 = 1234
{code}

Spark should present the same behaviour for both queries.
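One way to check which side actually receives the predicate is to compare the plans 
(a sketch only; it assumes t1 and t2 are registered as temp tables on a SQLContext 
named sqlCtx, matching the examples above):

{code}
// The Filter / PushedFilters entries in the plans show where the predicate lands.
sqlCtx.sql(
  "select * from t1 join t2 on t1.id1 = t2.id2 where t1.id1 = 1234").explain(true)
sqlCtx.sql(
  "select * from t1 join t2 on t1.id1 = t2.id2 " +
  "where t1.id1 = 1234 and t2.id2 = 1234").explain(true)
{code}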



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12844) Spark documentation should be more precise about the algebraic properties of functions in various transformations

2016-01-15 Thread Jimmy Lin (JIRA)
Jimmy Lin created SPARK-12844:
-

 Summary: Spark documentation should be more precise about the 
algebraic properties of functions in various transformations
 Key: SPARK-12844
 URL: https://issues.apache.org/jira/browse/SPARK-12844
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Reporter: Jimmy Lin
Priority: Minor


Spark documentation should be more precise about the algebraic properties of 
functions in various transformations. The way the current documentation is 
written is potentially confusing. For example, in Spark 1.6, the scaladoc for 
reduce in RDD says:

> Reduces the elements of this RDD using the specified commutative and 
> associative binary operator.

This is precise and accurate. In the documentation of reduceByKey in 
PairRDDFunctions, on the other hand, it says:

> Merge the values for each key using an associative reduce function.

To be more precise, this function must also be commutative in order for the 
computation to be correct. Writing commutative for reduce and not reduceByKey 
gives the false impression that the function in the latter does not need to be 
commutative.

The same applies to aggregateByKey. To be precise, both seqOp and combOp need 
to be associative (mentioned) AND commutative (not mentioned) in order for the 
computation to be correct. It would be desirable to fix these inconsistencies 
throughout the documentation.
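A small sketch of why the distinction matters (string concatenation is associative 
but not commutative, so reduceByKey gives no guarantee about the order in which 
per-partition results are merged):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// reduceByKey merges per-partition results in whatever order they arrive, so an
// associative-but-not-commutative function can produce different results per run.
val sc = new SparkContext(new SparkConf().setAppName("commutativityDemo").setMaster("local[4]"))
val pairs = sc.parallelize(Seq(("k", "a"), ("k", "b"), ("k", "c"), ("k", "d")), 4)
println(pairs.reduceByKey(_ + _).collect().mkString(","))   // e.g. (k,abcd) on one run, (k,cdab) on another
sc.stop()
{code}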





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function

2016-01-15 Thread Kalle Jepsen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102491#comment-15102491
 ] 

Kalle Jepsen commented on SPARK-12835:
--

The [traceback|http://pastebin.com/pRRCAben] really is ridiculously long.

In my actual application I would have the window partitioned and the 
aggregation done with {{df.groupby(key).agg(avg_diff)}}. Would that still be 
problematic with regard to performance? The error is the same there though, 
which is why I've chosen the more concise minimal example above.

> StackOverflowError when aggregating over column from window function
> 
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Kalle Jepsen
>
> I am encountering a StackoverflowError with a very long traceback, when I try 
> to directly aggregate on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a 
> Dataframe column by using a window-function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
> [datetime(2014,1,1)],
> [datetime(2014,2,1)],
> [datetime(2014,3,1)],
> [datetime(2014,3,6)],
> [datetime(2014,8,23)],
> [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
> ts,
> functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing {code}
> df.select(avg_diff).show()
> {code} throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on 
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Muthu Jayakumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102482#comment-15102482
 ] 

Muthu Jayakumar commented on SPARK-12783:
-

Hello Kevin,

Here is what I am seeing...

from shell:
{code}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.0
      /_/

Using Scala version 2.11.7 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_60)
Type in expressions to have them evaluated.
Type :help for more information.

scala> case class MyMap(map: Map[String, String])
defined class MyMap

scala> :paste
// Entering paste mode (ctrl-D to finish)

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(Map(a->b))
  }

  def toStr: String = {
a
  }
}

// Exiting paste mode, now interpreting.

defined class TestCaseClass

scala> TestCaseClass("a", "nn")
res4: TestCaseClass = TestCaseClass(a,nn)

scala>   import sqlContext.implicits._
import sqlContext.implicits._

scala> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", 
"data1"), TestCaseClass("2015-05-01", "data2"))).toDF()
org.apache.spark.sql.AnalysisException: Unable to generate an encoder for inner 
class `TestCaseClass` without access to the scope that this class was defined 
in. Try moving this class out of its parent class.;
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:264)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$$anonfun$2.applyOrElse(ExpressionEncoder.scala:260)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:243)
  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:53)
  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:242)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:233)
  at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolve(ExpressionEncoder.scala:260)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:78)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:89)
  at org.apache.spark.sql.SQLContext.createDataset(SQLContext.scala:507)
  ... 52 elided
{code}

I do remember seeing the above error stack if the case class was defined 
inside the scope of an object (for example, if defined inside MyApp like in the 
example below, as it then becomes an inner class).
From code, I added an explicit import and eventually changed to use fully 
qualified class names like below...

{code}
import scala.collection.{Map => ImMap}

case class MyMap(map: ImMap[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(ImMap(a->b))
  }

  def toStr: String = {
a
  }
}

object MyApp extends App { 
 //Get handle to contexts...
 import sqlContext.implicits._
  val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
TestCaseClass("2015-05-01", "data2"))).toDF()
  df1.as[TestCaseClass].map(_.toStr).show() //works fine
  df1.as[TestCaseClass].map(_.toMyMap).show() //error
}

{code}

and

{code}
case class MyMap(map: scala.collection.Map[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
MyMap(scala.collection.Map(a->b))
  }

  def toStr: String = {
a
  }
}

object MyApp extends App { 
 //Get handle to contexts...
 import sqlContext.implicits._
  val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
TestCaseClass("2015-05-01", "data2"))).toDF()
  df1.as[TestCaseClass].map(_.toStr).show() //works fine
  df1.as[TestCaseClass].map(_.toMyMap).show() //error
}

{code}

Please advise on what I may be missing. I misread the earlier comment and tried 
to use the immutable map incorrectly :(.

> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
> MyMap(Map(a->b))
>   }
>   def toStr: String = {
> a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caus

[jira] [Assigned] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10985:


Assignee: (was: Apache Spark)

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Priority: Minor
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}
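
To make the refactoring concrete, here is a hedged sketch with placeholder types (not Spark's real BlockManager or TaskContext code): instead of returning the evicted blocks to every caller, the put path looks up the running task and records them there.

{code}
// Hedged sketch only: BlockId, BlockStatus and the thread-local "task context"
// below are stand-ins, not Spark's real classes. It contrasts returning
// ArrayBuffer[(BlockId, BlockStatus)] from a put with recording the evicted
// blocks on the current task via a TaskContext.get-style lookup.
import scala.collection.mutable.ArrayBuffer

case class BlockId(name: String)
case class BlockStatus(memSize: Long)

class TaskContextSketch {
  val updatedBlocks = ArrayBuffer.empty[(BlockId, BlockStatus)]
}

object TaskContextSketch {
  private val local = new ThreadLocal[TaskContextSketch]
  def get(): TaskContextSketch = local.get()
  def set(tc: TaskContextSketch): Unit = local.set(tc)
}

object BlockManagerSketch {
  // Before: putBlock(...): ArrayBuffer[(BlockId, BlockStatus)] forced every caller
  // to thread the evicted blocks back up to the task.
  // After: the put path reports evictions to the current task directly.
  def putBlock(id: BlockId, evictedDuringPut: Seq[(BlockId, BlockStatus)]): Unit = {
    Option(TaskContextSketch.get()).foreach(_.updatedBlocks ++= evictedDuringPut)
  }
}
{code}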



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102477#comment-15102477
 ] 

Apache Spark commented on SPARK-10985:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10776

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Priority: Minor
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12701) Logging FileAppender should use join to ensure thread is finished

2016-01-15 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-12701:
--
Fix Version/s: 1.6.1

> Logging FileAppender should use join to ensure thread is finished
> -
>
> Key: SPARK-12701
> URL: https://issues.apache.org/jira/browse/SPARK-12701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> Currently, FileAppender for logging uses wait/notifyAll to signal that the 
> writing thread has finished.  While I was trying to write a regression test 
> for a fix of SPARK-9844, the writing thread was not able to fully complete 
> before the process was shut down, despite calling 
> {{FileAppender.awaitTermination}}.  Using join ensures the thread completes 
> and would simplify things a little more.
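
A minimal sketch of the join-based approach (a placeholder class under assumed semantics, not the actual FileAppender):

{code}
// Hedged sketch, not Spark's FileAppender: the writer runs on its own thread and
// awaitTermination() uses Thread.join() instead of wait/notifyAll, so it only
// returns once the writing thread has really finished.
class JoinBasedAppender(writeAll: () => Unit) {
  private val writingThread = new Thread("appender-writer") {
    override def run(): Unit = writeAll()
  }
  writingThread.setDaemon(true)
  writingThread.start()

  /** Blocks until the writing thread has completed. */
  def awaitTermination(): Unit = writingThread.join()
}
{code}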



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10985) Avoid passing evicted blocks throughout BlockManager / CacheManager

2016-01-15 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10985:


Assignee: Apache Spark

> Avoid passing evicted blocks throughout BlockManager / CacheManager
> ---
>
> Key: SPARK-10985
> URL: https://issues.apache.org/jira/browse/SPARK-10985
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Reporter: Andrew Or
>Assignee: Apache Spark
>Priority: Minor
>
> This is a minor refactoring task.
> Currently when we attempt to put a block in, we get back an array buffer of 
> blocks that are dropped in the process. We do this to propagate these blocks 
> back to our TaskContext, which will add them to its TaskMetrics so we can see 
> them in the SparkUI storage tab properly.
> Now that we have TaskContext.get, we can just use that to propagate this 
> information. This simplifies a lot of the signatures and gets rid of weird 
> return types like the following everywhere:
> {code}
> ArrayBuffer[(BlockId, BlockStatus)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12624) When schema is specified, we should treat undeclared fields as null (in Python)

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102473#comment-15102473
 ] 

Maciej Bryński edited comment on SPARK-12624 at 1/15/16 9:17 PM:
-

[~davies]
Isn't it related to my comment here:
https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15068627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15068627


was (Author: maver1ck):
[~davies]
Isn't related to my comment here:
https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15074733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15074733

> When schema is specified, we should treat undeclared fields as null (in 
> Python)
> ---
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See https://github.com/apache/spark/pull/10564
> Basically that test case should pass without the above fix and just assume b 
> is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12624) When schema is specified, we should treat undeclared fields as null (in Python)

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102473#comment-15102473
 ] 

Maciej Bryński commented on SPARK-12624:


[~davies]
Isn't it related to my comment here:
https://issues.apache.org/jira/browse/SPARK-11437?focusedCommentId=15074733&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15074733

> When schema is specified, we should treat undeclared fields as null (in 
> Python)
> ---
>
> Key: SPARK-12624
> URL: https://issues.apache.org/jira/browse/SPARK-12624
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> See https://github.com/apache/spark/pull/10564
> Basically that test case should pass without the above fix and just assume b 
> is null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread Muthu Jayakumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102351#comment-15102351
 ] 

Muthu Jayakumar edited comment on SPARK-12783 at 1/15/16 9:09 PM:
--

I tried the following, but got a similar error...

{code}
case class MyMap(map: scala.collection.immutable.Map[String, String])

case class TestCaseClass(a: String, b: String){
  def toMyMap: MyMap = {
    MyMap(Map(a->b))
  }

  def toStr: String = {
    a
  }
}

//main thread...
val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"),
  TestCaseClass("2015-05-01", "data2"))).toDF()
df1.as[TestCaseClass].map(_.toStr).show()                   //works fine
df1.as[TestCaseClass].map(_.toMyMap).show()                 //error
df1.as[TestCaseClass].map(each => each.a -> each.b).show()  //works fine
{code}

{quote}
Serialization stack:
- object not serializable (class: 
scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
package lang)
- field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
class scala.reflect.internal.Symbols$Symbol)
- object (class scala.reflect.internal.Types$UniqueThisType, 
java.lang.type)
- field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
- field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
type: class scala.reflect.internal.Types$Type)
- object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
- field (class: 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
type: class scala.reflect.api.Types$TypeApi)
- object (class 
org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
- field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
name: function, type: interface scala.Function1)
- object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),- 
field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
- field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
targetObject, type: class org.apache.spark.sql.catalyst.expressions.Expression)
- object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)))
- writeObject data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, 
scala.collection.immutable.List$SerializationProxy@2660f093)
- writeReplace data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.$colon$colon, 
List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)), 
invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;
- field (class: org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
name: arguments, type: interface scala.collection.Seq)
- object (class org.apache.spark.sql.catalyst.expressions.StaticInvoke, 
staticinvoke(class 
org.apache.spark.sql.catalyst.util.ArrayBasedMapData$,ObjectType(interface 
scala.collection.Map),toScalaMap,invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 
[Ljava.lang.Object;)),invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
 field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
"collector.MyMap"),valueArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
 [Ljava.lang.Object;)),true))
- writeObject data (class: 
scala.collection.immutable.List$SerializationProxy)
- object (class scala.collection.immutable.List$SerializationProxy, 
scala.collection.immutable.List$SerializationProxy@72af5ac7)
- writeReplace data (class: 
scala.collection.immutable.List$SerializationProxy)
- object

[jira] [Updated] (SPARK-12843) Spark should avoid scanning all partitions when limit is set

2016-01-15 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-12843?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-12843:
---
Description: 
SQL Query:
{code}
select * from table limit 100
{code}
forces Spark to scan all partitions even when enough data is available at the 
beginning of the scan.

This behaviour should be avoided and the scan should stop once enough data has 
been collected.

Is it related to [SPARK-9850]?

  was:
SQL Query:
{code}
select * from table limit 100
{code}
force Spark to scan all partition even when data are available on the beginning 
of scan.

Is it related to: [SPARK-9850] ?


> Spark should avoid scanning all partitions when limit is set
> 
>
> Key: SPARK-12843
> URL: https://issues.apache.org/jira/browse/SPARK-12843
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Maciej Bryński
>
> SQL Query:
> {code}
> select * from table limit 100
> {code}
> forces Spark to scan all partitions even when enough data is available at the 
> beginning of the scan.
> This behaviour should be avoided and the scan should stop once enough data has 
> been collected.
> Is it related to [SPARK-9850]?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12843) Spark should avoid scanning all partitions when limit is set

2016-01-15 Thread JIRA
Maciej Bryński created SPARK-12843:
--

 Summary: Spark should avoid scanning all partitions when limit is 
set
 Key: SPARK-12843
 URL: https://issues.apache.org/jira/browse/SPARK-12843
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.6.0
Reporter: Maciej Bryński


SQL Query:
{code}
select * from table limit 100
{code}
forces Spark to scan all partitions even when enough data is available at the 
beginning of the scan.

Is it related to [SPARK-9850]?
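
A minimal way to observe the behaviour described above, assuming a table named "events" is registered with sqlContext (the name is illustrative); both forms express the same bounded result:

{code}
// Hedged illustration: "events" is an assumed registered table name.
// Both queries only need 100 rows, yet per the report the scan currently
// touches every partition before the limit takes effect.
val viaSql = sqlContext.sql("SELECT * FROM events LIMIT 100")
val viaApi = sqlContext.table("events").limit(100)
viaSql.show()
viaApi.show()
{code}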



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102441#comment-15102441
 ] 

Maciej Bryński edited comment on SPARK-12807 at 1/15/16 8:43 PM:
-

I'm asking if it's possible.

About running the Spark shuffle service: did you miss the link to 
https://issues.apache.org/jira/browse/SPARK-9439 ?
The problem started with Spark 1.6.0, because it's the first version of Spark 
where the shuffle service has a Jackson dependency.



was (Author: maver1ck):
I'm asking if it's possible.

About running Spark shuffle. Did you miss link to: 
https://issues.apache.org/jira/browse/SPARK-9439 ?
Problem started with Spark 1.6.0, because it's first version of Spark where 
Spark Shuffle has Jackson dependency


> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)
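
For context, a hedged sketch of the setup under which the mismatch is reported (these are standard Spark configuration keys; the values are illustrative). Dynamic allocation requires the external shuffle service, which on YARN runs inside the NodeManager and therefore shares its classpath, including its Jackson version.

{code}
// Hedged sketch: standard configuration keys for dynamic allocation with the
// external (YARN) shuffle service; the application name and values are illustrative.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("dynamic-allocation-example")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // served by the NodeManager aux service on YARN
val sc = new SparkContext(conf)
{code}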



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102441#comment-15102441
 ] 

Maciej Bryński commented on SPARK-12807:


I'm asking if it's possible.

About running the Spark shuffle service: did you miss the link to 
https://issues.apache.org/jira/browse/SPARK-9439 ?
The problem started with Spark 1.6.0, because it's the first version of Spark 
where the shuffle service has a Jackson dependency.


> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102430#comment-15102430
 ] 

Sean Owen commented on SPARK-12807:
---

Are you asking if it's possible, a possible explanation, a workaround?
I'm still not sure why it's a problem (now). For example, people seem to be 
running the Spark shuffle service just fine with recent Hadoop.

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12842) Add Hadoop 2.7 build profile

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102424#comment-15102424
 ] 

Apache Spark commented on SPARK-12842:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/10775

> Add Hadoop 2.7 build profile
> 
>
> Key: SPARK-12842
> URL: https://issues.apache.org/jira/browse/SPARK-12842
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> We should add a Hadoop 2.7 build profile so that we can automate tests 
> against Hadoop 2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12842) Add Hadoop 2.7 build profile

2016-01-15 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12842:
--

 Summary: Add Hadoop 2.7 build profile
 Key: SPARK-12842
 URL: https://issues.apache.org/jira/browse/SPARK-12842
 Project: Spark
  Issue Type: Bug
  Components: Build
Reporter: Josh Rosen
Assignee: Josh Rosen


We should add a Hadoop 2.7 build profile so that we can automate tests against 
Hadoop 2.7.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12807) Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3

2016-01-15 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-12807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102416#comment-15102416
 ] 

Maciej Bryński commented on SPARK-12807:


Sean,
Maybe it's possible to compile the YARN shuffle module with a different version 
of Jackson than the one used by Spark Core?

> Spark External Shuffle not working in Hadoop clusters with Jackson 2.2.3
> 
>
> Key: SPARK-12807
> URL: https://issues.apache.org/jira/browse/SPARK-12807
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, YARN
>Affects Versions: 1.6.0
> Environment: A Hadoop cluster with Jackson 2.2.3, spark running with 
> dynamic allocation enabled
>Reporter: Steve Loughran
>Priority: Critical
>
> When you try to use dynamic allocation on a Hadoop 2.6-based cluster, you see 
> a stack trace in the NM logs indicating a Jackson 2.x version mismatch.
> (reported on the spark dev list)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12833) Initial import of databricks/spark-csv

2016-01-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102407#comment-15102407
 ] 

Apache Spark commented on SPARK-12833:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/10774

> Initial import of databricks/spark-csv
> --
>
> Key: SPARK-12833
> URL: https://issues.apache.org/jira/browse/SPARK-12833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12841) UnresolvedException with cast

2016-01-15 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-12841:


 Summary: UnresolvedException with cast
 Key: SPARK-12841
 URL: https://issues.apache.org/jira/browse/SPARK-12841
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Michael Armbrust
Assignee: Wenchen Fan
Priority: Blocker


{code}
val df1 = sc.makeRDD(1 to 5).map(i => (i, i * 2)).toDF("single", "double")
df1.where(df1.col("single").cast("string").equalTo("1"))
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12667) Remove block manager's internal "external block store" API

2016-01-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-12667.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10752
[https://github.com/apache/spark/pull/10752]

> Remove block manager's internal "external block store" API
> --
>
> Key: SPARK-12667
> URL: https://issues.apache.org/jira/browse/SPARK-12667
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12835) StackOverflowError when aggregating over column from window function

2016-01-15 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102386#comment-15102386
 ] 

Herman van Hovell commented on SPARK-12835:
---

I can reproduce your problem with the following Scala code:
{noformat}
import java.sql.Date

import org.apache.spark.sql.expressions.Window

val df = Seq(
    (Date.valueOf("2014-01-01")),
    (Date.valueOf("2014-02-01")),
    (Date.valueOf("2014-03-01")),
    (Date.valueOf("2014-03-06")),
    (Date.valueOf("2014-08-23")),
    (Date.valueOf("2014-10-01"))).
  map(Tuple1.apply).
  toDF("ts")

// This doesn't work:
df.select(avg(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))))).show

// This does work:
df.select(datediff($"ts", lag($"ts", 1).over(Window.orderBy($"ts"))).as("diff"))
  .select(avg($"diff"))
  .show
{noformat}

It seems there is a small bug in the analyzer.

> StackOverflowError when aggregating over column from window function
> 
>
> Key: SPARK-12835
> URL: https://issues.apache.org/jira/browse/SPARK-12835
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.6.0
>Reporter: Kalle Jepsen
>
> I am encountering a StackOverflowError with a very long traceback when I try 
> to aggregate directly on a column created by a window function.
> E.g. I am trying to determine the average timespan between dates in a 
> DataFrame column by using a window function:
> {code}
> from pyspark import SparkContext
> from pyspark.sql import HiveContext, Window, functions
> from datetime import datetime
> sc = SparkContext()
> sq = HiveContext(sc)
> data = [
>     [datetime(2014,1,1)],
>     [datetime(2014,2,1)],
>     [datetime(2014,3,1)],
>     [datetime(2014,3,6)],
>     [datetime(2014,8,23)],
>     [datetime(2014,10,1)],
> ]
> df = sq.createDataFrame(data, schema=['ts'])
> ts = functions.col('ts')
>
> w = Window.orderBy(ts)
> diff = functions.datediff(
>     ts,
>     functions.lag(ts, count=1).over(w)
> )
> avg_diff = functions.avg(diff)
> {code}
> While {{df.select(diff.alias('diff')).show()}} correctly renders as
> {noformat}
> +----+
> |diff|
> +----+
> |null|
> |  31|
> |  28|
> |   5|
> | 170|
> |  39|
> +----+
> {noformat}
> doing {code}
> df.select(avg_diff).show()
> {code} throws a {{java.lang.StackOverflowError}}.
> When I say
> {code}
> df2 = df.select(diff.alias('diff'))
> df2.select(functions.avg('diff'))
> {code}
> however, there's no error.
> Am I wrong to assume that the above should work?
> I've already described the same in [this question on 
> stackoverflow.com|http://stackoverflow.com/questions/34793999/averaging-over-window-function-leads-to-stackoverflowerror].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12833) Initial import of databricks/spark-csv

2016-01-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12833.
-
   Resolution: Fixed
Fix Version/s: 2.0.0

> Initial import of databricks/spark-csv
> --
>
> Key: SPARK-12833
> URL: https://issues.apache.org/jira/browse/SPARK-12833
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Hossein Falaki
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12783) Dataset map serialization error

2016-01-15 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102361#comment-15102361
 ] 

kevin yu commented on SPARK-12783:
--

Hello Muthu: do the import first; it seems to work.
scala> import scala.collection.Map
import scala.collection.Map



scala> case class MyMap(map: Map[String, String]) 
defined class MyMap

scala> 

scala> case class TestCaseClass(a: String, b: String)  {
 |   def toMyMap: MyMap = {
 | MyMap(Map(a->b))
 |   }
 | 
 |   def toStr: String = {
 | a
 |   }
 | }
defined class TestCaseClass

scala> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", 
"data1"), TestCaseClass("2015-05-01", "data2"))).toDF()
df1: org.apache.spark.sql.DataFrame = [a: string, b: string]

scala> df1.as[TestCaseClass].map(_.toMyMap).show() 
+--------------------+
|                 map|
+--------------------+
|Map(2015-05-01 ->...|
|Map(2015-05-01 ->...|
+--------------------+


> Dataset map serialization error
> ---
>
> Key: SPARK-12783
> URL: https://issues.apache.org/jira/browse/SPARK-12783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Muthu Jayakumar
>Assignee: Wenchen Fan
>Priority: Critical
>
> When Dataset API is used to map to another case class, an error is thrown.
> {code}
> case class MyMap(map: Map[String, String])
> case class TestCaseClass(a: String, b: String){
>   def toMyMap: MyMap = {
>     MyMap(Map(a->b))
>   }
>   def toStr: String = {
>     a
>   }
> }
> //Main method section below
> import sqlContext.implicits._
> val df1 = sqlContext.createDataset(Seq(TestCaseClass("2015-05-01", "data1"), 
> TestCaseClass("2015-05-01", "data2"))).toDF()
> df1.as[TestCaseClass].map(_.toStr).show() //works fine
> df1.as[TestCaseClass].map(_.toMyMap).show() //fails
> {code}
> Error message:
> {quote}
> Caused by: java.io.NotSerializableException: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1
> Serialization stack:
>   - object not serializable (class: 
> scala.reflect.runtime.SynchronizedSymbols$SynchronizedSymbol$$anon$1, value: 
> package lang)
>   - field (class: scala.reflect.internal.Types$ThisType, name: sym, type: 
> class scala.reflect.internal.Symbols$Symbol)
>   - object (class scala.reflect.internal.Types$UniqueThisType, 
> java.lang.type)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: pre, type: 
> class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$ClassNoArgsTypeRef, String)
>   - field (class: scala.reflect.internal.Types$TypeRef, name: normalized, 
> type: class scala.reflect.internal.Types$Type)
>   - object (class scala.reflect.internal.Types$AliasNoArgsTypeRef, String)
>   - field (class: 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, name: keyType$1, 
> type: class scala.reflect.api.Types$TypeApi)
>   - object (class 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$6, )
>   - field (class: org.apache.spark.sql.catalyst.expressions.MapObjects, 
> name: function, type: interface scala.Function1)
>   - object (class org.apache.spark.sql.catalyst.expressions.MapObjects, 
> mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType))
>   - field (class: org.apache.spark.sql.catalyst.expressions.Invoke, name: 
> targetObject, type: class 
> org.apache.spark.sql.catalyst.expressions.Expression)
>   - object (class org.apache.spark.sql.catalyst.expressions.Invoke, 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)))
>   - writeObject data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.List$SerializationProxy, 
> scala.collection.immutable.List$SerializationProxy@4c7e3aab)
>   - writeReplace data (class: 
> scala.collection.immutable.List$SerializationProxy)
>   - object (class scala.collection.immutable.$colon$colon, 
> List(invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scala.collection.immutable.Map", name: "map"),- root class: 
> "collector.MyMap"),keyArray,ArrayType(StringType,true)),StringType),array,ObjectType(class
>  [Ljava.lang.Object;)), 
> invoke(mapobjects(,invoke(upcast('map,MapType(StringType,StringType,true),-
>  field (class: "scal
