[jira] [Updated] (SPARK-6575) Add configuration to disable schema merging while converting metastore Parquet tables
[ https://issues.apache.org/jira/browse/SPARK-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated SPARK-6575:
----------------------------
    Priority: Blocker  (was: Major)

> Add configuration to disable schema merging while converting metastore Parquet tables
> -------------------------------------------------------------------------------------
>
> Key: SPARK-6575
> URL: https://issues.apache.org/jira/browse/SPARK-6575
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Priority: Blocker
>
> Consider a metastore Parquet table that
> # doesn't have schema evolution issues
> # has lots of data files and/or partitions
>
> In this case, driver-side schema merging can be both slow and unnecessary. It would
> be good to have a configuration that lets the user disable schema merging when
> converting such a metastore Parquet table.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
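The cost the issue describes can be sketched in plain Python: driver-side merging reads one footer per file/partition and folds each file's schema into a running union, which is wasted work when every file already agrees. This is an illustrative sketch only; the helper names (`merge_two`, `merge_schemas`) and the dict-based schema representation are hypothetical, not Spark APIs.

```python
# Illustrative sketch of why driver-side schema merging is O(#files):
# every data file's footer contributes a schema that must be folded into
# a running union. Helper names here are hypothetical, not Spark APIs.

def merge_two(a, b):
    """Union two schemas, represented as {field_name: type} dicts."""
    merged = dict(a)
    for name, typ in b.items():
        if name in merged and merged[name] != typ:
            raise ValueError(f"incompatible types for {name!r}: {merged[name]} vs {typ}")
        merged[name] = typ
    return merged

def merge_schemas(file_schemas, merging_enabled=True):
    """With merging disabled, trust a single schema (e.g. the metastore's)."""
    if not merging_enabled:
        return dict(file_schemas[0])
    result = {}
    for schema in file_schemas:  # one footer read per file/partition
        result = merge_two(result, schema)
    return result

# A table with no schema evolution: merging buys nothing over trusting one file.
footers = [{"id": "int", "price": "double"}] * 10_000
assert merge_schemas(footers) == merge_schemas(footers, merging_enabled=False)
```

With a configuration flag like the one proposed, the fast path on the `merging_enabled=False` branch skips the per-file work entirely.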
[jira] [Commented] (SPARK-5564) Support sparse LDA solutions
[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388128#comment-14388128 ]

Debasish Das commented on SPARK-5564:
-------------------------------------
[~sparks] We are trying to access the EC2 dataset, but it gives an error:

{code}
[ec2-user@ip-172-31-38-56 ~]$ aws s3 ls s3://files.sparks.requester.pays/enwiki_category_text/

A client error (AccessDenied) occurred when calling the ListObjects operation: Access Denied
{code}

Could you please take a look and check whether it is still available for use?

> Support sparse LDA solutions
> ----------------------------
>
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors'
> concentration parameters be > 1.0. It should support values > 0.0, which
> should encourage sparser topics (phi) and document-topic distributions
> (theta).
>
> For EM, this will require adding a projection to the M-step, as in: Vorontsov
> and Potapenko. "Tutorial on Probabilistic Topic Modeling: Additive
> Regularization for Stochastic Matrix Factorization." 2014.
[jira] [Assigned] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-4550:
-----------------------------------
    Assignee: Apache Spark  (was: Sandy Ryza)

> In sort-based shuffle, store map outputs in serialized form
> -----------------------------------------------------------
>
> Key: SPARK-4550
> URL: https://issues.apache.org/jira/browse/SPARK-4550
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Affects Versions: 1.2.0
> Reporter: Sandy Ryza
> Assignee: Apache Spark
> Priority: Critical
> Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala
>
> One drawback of sort-based shuffle compared to hash-based shuffle is that
> it ends up storing many more Java objects in memory. If Spark could store
> map outputs in serialized form, it could:
> * spill less often, because the serialized form is more compact
> * reduce GC pressure
>
> This will only work when the serialized representations of objects are
> independent of each other and occupy contiguous segments of memory. E.g.
> when Kryo reference tracking is left on, objects may contain pointers to
> objects farther back in the stream, which means that the sort can't relocate
> objects without corrupting them.
[jira] [Assigned] (SPARK-4550) In sort-based shuffle, store map outputs in serialized form
[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-4550:
-----------------------------------
    Assignee: Sandy Ryza  (was: Apache Spark)
[jira] [Commented] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388107#comment-14388107 ]

Apache Spark commented on SPARK-6627:
-------------------------------------
User 'pwendell' has created a pull request for this issue:
https://github.com/apache/spark/pull/5286

> Clean up of shuffle code and interfaces
> ---------------------------------------
>
> Key: SPARK-6627
> URL: https://issues.apache.org/jira/browse/SPARK-6627
> Project: Spark
> Issue Type: Improvement
> Components: Shuffle, Spark Core
> Reporter: Patrick Wendell
> Assignee: Patrick Wendell
> Priority: Critical
>
> The shuffle code in Spark is somewhat messy and could use some interface
> clean-up, especially with some larger changes outstanding. This is a
> catch-all for what may be some small improvements in a few different PRs.
[jira] [Assigned] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6627:
-----------------------------------
    Assignee: Apache Spark  (was: Patrick Wendell)
[jira] [Assigned] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6627:
-----------------------------------
    Assignee: Patrick Wendell  (was: Apache Spark)
[jira] [Created] (SPARK-6627) Clean up of shuffle code and interfaces
Patrick Wendell created SPARK-6627:
-----------------------------------

    Summary: Clean up of shuffle code and interfaces
        Key: SPARK-6627
        URL: https://issues.apache.org/jira/browse/SPARK-6627
    Project: Spark
 Issue Type: Improvement
 Components: Shuffle, Spark Core
   Reporter: Patrick Wendell
   Assignee: Patrick Wendell
   Priority: Critical

The shuffle code in Spark is somewhat messy and could use some interface clean-up, especially with some larger changes outstanding. This is a catch-all for what may be some small improvements in a few different PRs.
[jira] [Commented] (SPARK-4514) SparkContext localProperties does not inherit property updates across thread reuse
[ https://issues.apache.org/jira/browse/SPARK-4514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388095#comment-14388095 ]

Josh Rosen commented on SPARK-4514:
-----------------------------------
I don't know that there's a good way to fix this for all of the arbitrary ways in which users might create or re-use threads. The inheritance behavior is somewhat more understandable in cases where users explicitly create child threads. Although our documentation doesn't explicitly promise that properties will be inherited, I think users may have come to rely on this behavior, so I don't think we can remove it at this point.

We can certainly fix it for the AsyncRDDActions case, though, because we can manually thread the properties through in the constructor. This pain could probably have been avoided if the original design had used something like Scala's {{DynamicVariable}}, where you're forced to explicitly consider the scope / lifecycle of the thread-local property.

I'm going to try to fix this for the AsyncRDDActions case and will try to improve the documentation to warn about this pitfall in the more general cases involving arbitrary user code. Let me know if you can spot another solution which won't break existing user code that relies on property inheritance in the non-thread-reuse cases.

> SparkContext localProperties does not inherit property updates across thread reuse
> ----------------------------------------------------------------------------------
>
> Key: SPARK-4514
> URL: https://issues.apache.org/jira/browse/SPARK-4514
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Reporter: Erik Erlandson
> Assignee: Josh Rosen
> Priority: Critical
>
> The current job group id of a Spark context is stored in the
> {{localProperties}} member value. This data structure is designed to be
> thread local, and its settings are not preserved when {{ComplexFutureAction}}
> instantiates a new {{Future}}.
>
> One consequence of this is that {{takeAsync()}} does not behave in the same
> way as other async actions, e.g. {{countAsync()}}. For example, this test
> (if copied into StatusTrackerSuite.scala) will fail, because
> {{"my-job-group2"}} is not propagated to the Future which actually
> instantiates the job:
> {code:java}
> test("getJobIdsForGroup() with takeAsync()") {
>   sc = new SparkContext("local", "test", new SparkConf(false))
>   sc.setJobGroup("my-job-group2", "description")
>   sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq.empty)
>
>   val firstJobFuture = sc.parallelize(1 to 1000, 1).takeAsync(1)
>   val firstJobId = eventually(timeout(10 seconds)) {
>     firstJobFuture.jobIds.head
>   }
>   eventually(timeout(10 seconds)) {
>     sc.statusTracker.getJobIdsForGroup("my-job-group2") should be (Seq(firstJobId))
>   }
> }
> {code}
> It also impacts the current PR for SPARK-1021, which involves additional uses
> of {{ComplexFutureAction}}.
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael Armbrust updated SPARK-6625:
------------------------------------
    Target Version/s: 1.3.1

> Add common string filters to data sources
> -----------------------------------------
>
> Key: SPARK-6625
> URL: https://issues.apache.org/jira/browse/SPARK-6625
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> Filters such as startsWith, endsWith, contains will be very useful for data
> sources that provide search functionality, e.g. Succinct, Elastic Search,
> Solr.
[jira] [Created] (SPARK-6626) TwitterUtils.createStream documentation error
Jayson Sunshine created SPARK-6626:
-----------------------------------

    Summary: TwitterUtils.createStream documentation error
        Key: SPARK-6626
        URL: https://issues.apache.org/jira/browse/SPARK-6626
    Project: Spark
 Issue Type: Documentation
 Components: Documentation
Affects Versions: 1.3.0
   Reporter: Jayson Sunshine
   Priority: Minor

At http://spark.apache.org/docs/1.3.0/streaming-programming-guide.html#input-dstreams-and-receivers, under 'Advanced Sources', the documentation gives the following call for Scala:

{code:java}
TwitterUtils.createStream(ssc)
{code}

However, the single-parameter overload of this method appears to require a jssc object, not an ssc object: http://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/twitter/TwitterUtils.html

To make the above call work, one must instead also provide an Option argument, for example:

{code:java}
TwitterUtils.createStream(ssc, None)
{code}
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6625:
-------------------------------
    Assignee: Reynold Xin
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6625:
-------------------------------
    Description:
Filters such as startsWith, endsWith, contains will be very useful for data sources that provide search functionality, e.g. Succinct, Elastic Search, Solr.

  (was: Filters such as StartsWith, EndsWith, Contains will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.)
[jira] [Assigned] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6625:
-----------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388039#comment-14388039 ]

Apache Spark commented on SPARK-6625:
-------------------------------------
User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5285
[jira] [Assigned] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6625:
-----------------------------------
    Assignee: Apache Spark
[jira] [Updated] (SPARK-6625) Add common string filters to data sources
[ https://issues.apache.org/jira/browse/SPARK-6625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-6625:
-------------------------------
    Description:
Filters such as StartsWith, EndsWith, Contains will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.

  (was: Filters such as StartsWith, EndsWith, Like (with regex) will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.)
[jira] [Assigned] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
[ https://issues.apache.org/jira/browse/SPARK-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6623:
-----------------------------------
    Assignee: (was: Apache Spark)

> Alias DataFrame.na.fill/drop in Python
> --------------------------------------
>
> Key: SPARK-6623
> URL: https://issues.apache.org/jira/browse/SPARK-6623
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
>
> To be more consistent with Scala.
[jira] [Commented] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
[ https://issues.apache.org/jira/browse/SPARK-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388026#comment-14388026 ]

Apache Spark commented on SPARK-6623:
-------------------------------------
User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5284
[jira] [Assigned] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
[ https://issues.apache.org/jira/browse/SPARK-6623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-6623:
-----------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-6258) Python MLlib API missing items: Clustering
[ https://issues.apache.org/jira/browse/SPARK-6258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388003#comment-14388003 ]

Hrishikesh commented on SPARK-6258:
-----------------------------------
[~josephkb] Thank you for your response and valuable suggestions! I will send the PR as soon as possible.

> Python MLlib API missing items: Clustering
> ------------------------------------------
>
> Key: SPARK-6258
> URL: https://issues.apache.org/jira/browse/SPARK-6258
> Project: Spark
> Issue Type: Sub-task
> Components: MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
>
> This JIRA lists items missing in the Python API for this sub-package of MLlib.
> The list may be incomplete, so please check again when sending a PR to add
> these features to the Python API.
>
> Also, please check for major disparities in documentation; some parts of
> the Python API are less well documented than their Scala counterparts. Some
> items may be listed in the umbrella JIRA linked to this task.
>
> KMeans
> * setEpsilon
> * setInitializationSteps
>
> KMeansModel
> * computeCost
> * k
>
> GaussianMixture
> * setInitialModel
>
> GaussianMixtureModel
> * k
>
> Completely missing items, which should be fixed in separate JIRAs (which have
> been created and linked to the umbrella JIRA):
> * LDA
> * PowerIterationClustering
> * StreamingKMeans
[jira] [Commented] (SPARK-5124) Standardize internal RPC interface
[ https://issues.apache.org/jira/browse/SPARK-5124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388002#comment-14388002 ]

Apache Spark commented on SPARK-5124:
-------------------------------------
User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/5283

> Standardize internal RPC interface
> ----------------------------------
>
> Key: SPARK-5124
> URL: https://issues.apache.org/jira/browse/SPARK-5124
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Reporter: Reynold Xin
> Assignee: Shixiong Zhu
> Fix For: 1.4.0
> Attachments: Pluggable RPC - draft 1.pdf, Pluggable RPC - draft 2.pdf
>
> In Spark we use Akka as the RPC layer. It would be great if we could
> standardize the internal RPC interface to facilitate testing. This would also
> provide the foundation for trying other RPC implementations in the future.
[jira] [Commented] (SPARK-6612) Python KMeans parity
[ https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388000#comment-14388000 ]

Hrishikesh commented on SPARK-6612:
-----------------------------------
Please assign this ticket to me.

> Python KMeans parity
> --------------------
>
> Key: SPARK-6612
> URL: https://issues.apache.org/jira/browse/SPARK-6612
> Project: Spark
> Issue Type: Improvement
> Components: MLlib, PySpark
> Affects Versions: 1.3.0
> Reporter: Joseph K. Bradley
> Priority: Minor
>
> This is a subtask of [SPARK-6258] for the Python API of KMeans. These items
> are missing:
>
> KMeans
> * setEpsilon
> * setInitializationSteps
>
> KMeansModel
> * computeCost
> * k
[jira] [Assigned] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-3454:
-----------------------------------
    Assignee: Imran Rashid  (was: Apache Spark)

> Expose JSON representation of data shown in WebUI
> -------------------------------------------------
>
> Key: SPARK-3454
> URL: https://issues.apache.org/jira/browse/SPARK-3454
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 1.1.0
> Reporter: Kousuke Saruta
> Assignee: Imran Rashid
> Attachments: sparkmonitoringjsondesign.pdf
>
> If the WebUI supported extracting data in JSON format, it would be helpful for
> users who want to analyse stage / task / executor information.
> Fortunately, the WebUI has a renderJson method, so we can implement the method
> in each subclass.
[jira] [Assigned] (SPARK-3454) Expose JSON representation of data shown in WebUI
[ https://issues.apache.org/jira/browse/SPARK-3454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-3454:
-----------------------------------
    Assignee: Apache Spark  (was: Imran Rashid)
[jira] [Created] (SPARK-6625) Add common string filters to data sources
Reynold Xin created SPARK-6625:
-------------------------------

    Summary: Add common string filters to data sources
        Key: SPARK-6625
        URL: https://issues.apache.org/jira/browse/SPARK-6625
    Project: Spark
 Issue Type: Sub-task
 Components: SQL
   Reporter: Reynold Xin

Filters such as StartsWith, EndsWith, Like (with regex) will be very useful for search-like data sources such as Succinct, Elastic Search, Solr, etc.
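Filters pushed to a data source are essentially predicates over row attributes. The sketch below shows what such string filters might look like as plain predicate objects; the class names follow the issue text, but the `eval` method and dict-based rows are hypothetical illustrations, not Spark's actual sources API.

```python
# Hedged sketch of source-level string filters as plain predicates.
# Class names follow the issue text; eval() and dict rows are illustrative.
from dataclasses import dataclass

@dataclass
class StringStartsWith:
    attribute: str
    value: str
    def eval(self, row):
        return row[self.attribute].startswith(self.value)

@dataclass
class StringEndsWith:
    attribute: str
    value: str
    def eval(self, row):
        return row[self.attribute].endswith(self.value)

@dataclass
class StringContains:
    attribute: str
    value: str
    def eval(self, row):
        return self.value in row[self.attribute]

rows = [{"name": "spark-sql"}, {"name": "solr"}, {"name": "succinct"}]
assert [r["name"] for r in rows if StringContains("name", "sql").eval(r)] == ["spark-sql"]
assert [r["name"] for r in rows if StringStartsWith("name", "s").eval(r)] == ["spark-sql", "solr", "succinct"]
```

A search-capable source (Succinct, Elasticsearch, Solr) would translate such predicates into its native query language instead of evaluating them row by row.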
[jira] [Created] (SPARK-6624) Convert filters into CNF for data sources
Reynold Xin created SPARK-6624:
-------------------------------

    Summary: Convert filters into CNF for data sources
        Key: SPARK-6624
        URL: https://issues.apache.org/jira/browse/SPARK-6624
    Project: Spark
 Issue Type: Sub-task
   Reporter: Reynold Xin

We should turn filters into conjunctive normal form (CNF) before we pass them to data sources. Otherwise, filters are not very useful if there is a single filter with a bunch of ORs.

Note that we already try to do some of this in BooleanSimplification, but I think we should just formalize it to use CNF.
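The core of a CNF rewrite is distributing OR over AND until every AND sits at the top of the tree, so each conjunct can be pushed to the source independently. Here is a minimal sketch over tuple-encoded expressions; it assumes negations are already at the leaves (NNF) and is illustrative, not Spark's optimizer code.

```python
# Minimal CNF rewrite by distributing OR over AND.
# Expressions: ('and', l, r) | ('or', l, r) | leaf (any non-tuple value).
# Assumes negations are already pushed to the leaves (NNF).

def to_cnf(expr):
    if not isinstance(expr, tuple):
        return expr
    op, l, r = expr
    l, r = to_cnf(l), to_cnf(r)
    if op == 'and':
        return ('and', l, r)
    # op == 'or': distribute over any AND child so ANDs bubble to the top.
    if isinstance(l, tuple) and l[0] == 'and':
        return ('and', to_cnf(('or', l[1], r)), to_cnf(('or', l[2], r)))
    if isinstance(r, tuple) and r[0] == 'and':
        return ('and', to_cnf(('or', l, r[1])), to_cnf(('or', l, r[2])))
    return ('or', l, r)

# (a AND b) OR c  ->  (a OR c) AND (b OR c): two independently pushable conjuncts.
assert to_cnf(('or', ('and', 'a', 'b'), 'c')) == ('and', ('or', 'a', 'c'), ('or', 'b', 'c'))
```

Note the usual caveat: naive distribution can blow up exponentially on deeply nested disjunctions, which is one reason to cap or formalize the rewrite rather than apply it blindly.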
[jira] [Created] (SPARK-6623) Alias DataFrame.na.fill/drop in Python
Reynold Xin created SPARK-6623:
-------------------------------

    Summary: Alias DataFrame.na.fill/drop in Python
        Key: SPARK-6623
        URL: https://issues.apache.org/jira/browse/SPARK-6623
    Project: Spark
 Issue Type: Sub-task
   Reporter: Reynold Xin

To be more consistent with Scala.
[jira] [Updated] (SPARK-6622) Spark SQL cannot communicate with Hive meta store
[ https://issues.apache.org/jira/browse/SPARK-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak Kumar V updated SPARK-6622:
----------------------------------
    Description:

I have multiple tables (among them dw_bid) that were created through Apache Hive. I have data in Avro on HDFS that I want to join with the dw_bid table; this join needs to be done using Spark SQL. Spark SQL is unable to communicate with the Apache Hive metastore and fails with this exception:

{code}
org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app).
Original Exception: ------
java.sql.SQLException: No suitable driver found for jdbc:mysql://hostname.vip.company.com:3306/HDB
	at java.sql.DriverManager.getConnection(DriverManager.java:596)
{code}

Spark Submit Command:

{code}
./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2
{code}

MySQL Java Connector versions tried:
* mysql-connector-java-5.0.8-bin.jar (picked from the Apache Hive installation lib folder)
* mysql-connector-java-5.1.34.jar
* mysql-connector-java-5.1.35.jar

Spark version: 1.3.0 - Prebuilt for Hadoop 2.4.x (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz)

{code}
$ hive --version
Hive 0.13.0.2.1.3.6-2
Subversion git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 -r 87da9430050fb9cc429d79d95626d26ea382b96c
{code}
[jira] [Updated] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register
[ https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6603: --- Issue Type: Sub-task (was: Improvement) Parent: SPARK-6116 > SQLContext.registerFunction -> SQLContext.udf.register > -- > > Key: SPARK-6603 > URL: https://issues.apache.org/jira/browse/SPARK-6603 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.3.1, 1.4.0 > > > We didn't change the Python implementation to use that. Maybe the best > strategy is to deprecate SQLContext.registerFunction, and just add > SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
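The deprecation strategy proposed in the ticket can be sketched in plain Python (class and method names mirror the ticket, but this is illustrative, not PySpark's actual implementation):

```python
import warnings

class UDFRegistration:
    """Holds registered user-defined functions."""
    def __init__(self):
        self.funcs = {}

    def register(self, name, f):
        self.funcs[name] = f
        return f

class SQLContext:
    """Keep registerFunction as a deprecated alias that forwards to
    udf.register, per the strategy suggested above."""
    def __init__(self):
        self.udf = UDFRegistration()

    def registerFunction(self, name, f):
        warnings.warn("registerFunction is deprecated; use udf.register",
                      DeprecationWarning)
        return self.udf.register(name, f)
```

Existing callers keep working but get a deprecation warning, while new code registers through `udf.register` directly.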
[jira] [Updated] (SPARK-6622) Spark SQL cannot communicate with Hive meta store
[ https://issues.apache.org/jira/browse/SPARK-6622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Deepak Kumar V updated SPARK-6622: -- Attachment: exception.txt Full stack trace > Spark SQL cannot communicate with Hive meta store > - > > Key: SPARK-6622 > URL: https://issues.apache.org/jira/browse/SPARK-6622 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.3.0 >Reporter: Deepak Kumar V > Labels: Hive > Attachments: exception.txt > > > I have multiple tables (among them is dw_bid) that are created through Apache > Hive. I have data in avro on HDFS that i want to join with dw_bid table, > this join needs to be done using Spark SQL. > Spark SQL is unable to communicate with Apache Hive Meta store and fails with > exception > org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test > connection to the given database. JDBC url = > jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. > Terminating connection pool (set lazyInit to true if you expect to start your > database after your app). Original Exception: -- > java.sql.SQLException: No suitable driver found for > jdbc:mysql://hostname.vip. 
company.com:3306/HDB > at java.sql.DriverManager.getConnection(DriverManager.java:596) > Spark Submit Command > ./bin/spark-submit -v --master yarn-cluster --driver-class-path > /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar > --jars > /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml > --num-executors 1 --driver-memory 4g --driver-java-options > "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue > hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp > spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 > input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro > subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 > MySQL Java Conector Versions tried > mysql-connector-java-5.0.8-bin.jar (Picked from Apache Hive installation lib > folder) > mysql-connector-java-5.1.34.jar > mysql-connector-java-5.1.35.jar > Spark Version: 1.3.0 - Prebuilt for Hadoop 2.4.x > (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz) > $ hive --version > Hive 0.13.0.2.1.3.6-2 > Subversion > git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 > -r 87da9430050fb9cc429d79d95626d26ea382b96c > $ > Code: > package com.ebay.ep.poc.spark.reporting.process.service > import com.ebay.ep.poc.spark.reporting.process.util.DateUtil._ > import org.apache.spark.SparkConf > import org.apache.spark.SparkContext > import org.apache.spark.SparkContext._ 
> import collection.mutable.HashMap > import com.databricks.spark.avro._ > class HadoopSuccessEvents2Service extends ReportingService { > override def execute(arguments: HashMap[String, String], sc: SparkContext) { > val detail = "reporting.detail." + arguments.get("subcommand").get > val startDate = arguments.get("startDate").get > val endDate = arguments.get("endDate").get > val input = arguments.get("input").get > val output = arguments.get("output").get > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val successDetail_S1 = sqlContext.avroFile(input) > successDetail_S1.registerTempTable("sojsuccessevents1") > > println("show tables") > sqlContext.sql("show tables") > println("show tables") > sqlContext.sql("CREATE TABLE `sojsuccessevents2_spark`( `guid` string > COMMENT 'from deserializer', `sessionkey` bigint COMMENT 'from deserializer', > `sessionstartdate` string COMMENT 'from deserializer', `sojdatadate` string > COMMENT 'from deserializer', `seqnum` int COMMENT 'from deserializer', > `eventtimestamp` string COMMENT 'from deserializer', `siteid` int COMMENT > 'from deserializer', `successeventtype` string COMMENT 'from deserializer', > `sourcetype` string COMMENT 'from deserializer', `itemid` bigint COMMENT > 'from deserializer', `shopcartid` bigint COMMENT 'from deserializer', > `transactionid` bigint COMMENT 'from deserializer', `offerid` bigint COMMENT > 'from deserializer', `userid` bigint COMMENT 'from deserializer', > `priorpage1seqnum` int COMMENT 'f
[jira] [Created] (SPARK-6622) Spark SQL cannot communicate with Hive meta store
Deepak Kumar V created SPARK-6622: - Summary: Spark SQL cannot communicate with Hive meta store Key: SPARK-6622 URL: https://issues.apache.org/jira/browse/SPARK-6622 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 1.3.0 Reporter: Deepak Kumar V I have multiple tables (among them is dw_bid) that are created through Apache Hive. I have data in avro on HDFS that i want to join with dw_bid table, this join needs to be done using Spark SQL. Spark SQL is unable to communicate with Apache Hive Meta store and fails with exception org.datanucleus.exceptions.NucleusDataStoreException: Unable to open a test connection to the given database. JDBC url = jdbc:mysql://hostname.vip.company.com:3306/HDB, username = hiveuser. Terminating connection pool (set lazyInit to true if you expect to start your database after your app). Original Exception: -- java.sql.SQLException: No suitable driver found for jdbc:mysql://hostname.vip. company.com:3306/HDB at java.sql.DriverManager.getConnection(DriverManager.java:596) Spark Submit Command ./bin/spark-submit -v --master yarn-cluster --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar --jars /apache/hadoop/lib/hadoop-lzo-0.6.0.jar,/home/dvasthimal/spark1.3/mysql-connector-java-5.1.35-bin.jar,/home/dvasthimal/spark1.3/spark-avro_2.10-1.0.0.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-api-jdo-3.2.6.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-core-3.2.10.jar,/home/dvasthimal/spark1.3/spark-1.3.0-bin-hadoop2.4/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/conf/hive-site.xml --num-executors 1 --driver-memory 4g --driver-java-options "-XX:MaxPermSize=2G" --executor-memory 2g --executor-cores 1 --queue hdmi-express --class com.ebay.ep.poc.spark.reporting.SparkApp spark_reporting-1.0-SNAPSHOT.jar startDate=2015-02-16 endDate=2015-02-16 
input=/user/dvasthimal/epdatasets/successdetail1/part-r-0.avro subcommand=successevents2 output=/user/dvasthimal/epdatasets/successdetail2 MySQL Java Conector Versions tried mysql-connector-java-5.0.8-bin.jar (Picked from Apache Hive installation lib folder) mysql-connector-java-5.1.34.jar mysql-connector-java-5.1.35.jar Spark Version: 1.3.0 - Prebuilt for Hadoop 2.4.x (http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz) $ hive --version Hive 0.13.0.2.1.3.6-2 Subversion git://ip-10-0-0-90.ec2.internal/grid/0/jenkins/workspace/BIGTOP-HDP_RPM_REPO-HDP-2.1.3.6-centos6/bigtop/build/hive/rpm/BUILD/hive-0.13.0.2.1.3.6 -r 87da9430050fb9cc429d79d95626d26ea382b96c $ Code: package com.ebay.ep.poc.spark.reporting.process.service import com.ebay.ep.poc.spark.reporting.process.util.DateUtil._ import org.apache.spark.SparkConf import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import collection.mutable.HashMap import com.databricks.spark.avro._ class HadoopSuccessEvents2Service extends ReportingService { override def execute(arguments: HashMap[String, String], sc: SparkContext) { val detail = "reporting.detail." 
+ arguments.get("subcommand").get val startDate = arguments.get("startDate").get val endDate = arguments.get("endDate").get val input = arguments.get("input").get val output = arguments.get("output").get val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val successDetail_S1 = sqlContext.avroFile(input) successDetail_S1.registerTempTable("sojsuccessevents1") println("show tables") sqlContext.sql("show tables") println("show tables") sqlContext.sql("CREATE TABLE `sojsuccessevents2_spark`( `guid` string COMMENT 'from deserializer', `sessionkey` bigint COMMENT 'from deserializer', `sessionstartdate` string COMMENT 'from deserializer', `sojdatadate` string COMMENT 'from deserializer', `seqnum` int COMMENT 'from deserializer', `eventtimestamp` string COMMENT 'from deserializer', `siteid` int COMMENT 'from deserializer', `successeventtype` string COMMENT 'from deserializer', `sourcetype` string COMMENT 'from deserializer', `itemid` bigint COMMENT 'from deserializer', `shopcartid` bigint COMMENT 'from deserializer', `transactionid` bigint COMMENT 'from deserializer', `offerid` bigint COMMENT 'from deserializer', `userid` bigint COMMENT 'from deserializer', `priorpage1seqnum` int COMMENT 'from deserializer', `priorpage1pageid` int COMMENT 'from deserializer', `exclwmsearchattemptseqnum` int COMMENT 'from deserializer', `exclpriorsearchpageid` int COMMENT 'from deserializer', `exclpriorsearchseqnum` int COMMENT 'from deserializer', `exclpriorsearchcategory` int COMMENT 'from deserializer', `exclpriorsearchl1` int COMMENT 'from deserializer', `exclpriorsearchl2` int COMMENT 'fro
[jira] [Commented] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387933#comment-14387933 ] Apache Spark commented on SPARK-6562: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/5282 > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6562: --- Assignee: (was: Apache Spark) > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6562: --- Assignee: Apache Spark > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6562) DataFrame.na.replace value support
[ https://issues.apache.org/jira/browse/SPARK-6562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-6562: --- Summary: DataFrame.na.replace value support (was: DataFrame.replace value support) > DataFrame.na.replace value support > -- > > Key: SPARK-6562 > URL: https://issues.apache.org/jira/browse/SPARK-6562 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Support replacing a set of values with another set of values (i.e. map join), > similar to Pandas' replace. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.replace.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
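The requested semantics — replacing one set of values with another, like a map join — can be sketched over plain tuples (illustrative only; the real API lives on `DataFrame.na`):

```python
def na_replace(rows, mapping):
    """Replace each cell that appears as a key in `mapping` with the mapped
    value, leaving all other cells untouched."""
    return [tuple(mapping.get(v, v) for v in row) for row in rows]

rows = [("UNKNOWN", 1), ("US", 2)]
print(na_replace(rows, {"UNKNOWN": None}))  # [(None, 1), ('US', 2)]
```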
[jira] [Assigned] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6618: --- Assignee: Apache Spark (was: Yin Huai) > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6618: --- Assignee: Yin Huai (was: Apache Spark) > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387924#comment-14387924 ] Apache Spark commented on SPARK-6618: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5281 > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
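The fine-grained locking idea can be sketched in plain Python (this is not Spark's actual `HiveMetastoreCatalog` code; names are illustrative): do the expensive resolution outside the lock, and hold the lock only for short cache reads and writes.

```python
import threading

class TableCache:
    """Cache of resolved relations with a fine-grained lock: the expensive
    resolve step runs without the lock held."""
    def __init__(self, resolve):
        self._resolve = resolve          # expensive, side-effect-free resolution
        self._cache = {}
        self._lock = threading.Lock()

    def lookup(self, name):
        with self._lock:                 # short critical section: cache read
            if name in self._cache:
                return self._cache[name]
        relation = self._resolve(name)   # expensive work, no lock held
        with self._lock:                 # short critical section: cache write
            return self._cache.setdefault(name, relation)

cache = TableCache(lambda name: ("relation", name))
print(cache.lookup("t"))  # ('relation', 't')
```

`setdefault` makes concurrent racers converge on one cached value; the cost is that two threads may resolve the same name once each, which is the usual trade for not serializing all lookups behind one coarse lock.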
[jira] [Assigned] (SPARK-6555) Override equals and hashCode in MetastoreRelation
[ https://issues.apache.org/jira/browse/SPARK-6555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian reassigned SPARK-6555: - Assignee: Cheng Lian > Override equals and hashCode in MetastoreRelation > - > > Key: SPARK-6555 > URL: https://issues.apache.org/jira/browse/SPARK-6555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.0.2, 1.1.1, 1.2.1, 1.3.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > > This is a follow-up of SPARK-6450. > As explained in [this > comment|https://issues.apache.org/jira/browse/SPARK-6450?focusedCommentId=14379499&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14379499] > of SPARK-6450, we resorted to a more surgical fix due to the upcoming 1.3.1 > release. But overriding {{equals}} and {{hashCode}} is the proper fix to that > problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6573) expect pandas null values as numpy.nan (not only as None)
[ https://issues.apache.org/jira/browse/SPARK-6573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387921#comment-14387921 ] Reynold Xin commented on SPARK-6573: Are numpy.nan turned into Double.NaN in the JVM? If yes, maybe we should consider all NaN numbers as null in the JVM. > expect pandas null values as numpy.nan (not only as None) > - > > Key: SPARK-6573 > URL: https://issues.apache.org/jira/browse/SPARK-6573 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.3.0 >Reporter: Fabian Boehnlein > > In pandas it is common to use numpy.nan as the null value, for missing data > or whatever. > http://pandas.pydata.org/pandas-docs/dev/gotchas.html#nan-integer-na-values-and-na-type-promotions > http://stackoverflow.com/questions/17534106/what-is-the-difference-between-nan-and-none > http://pandas.pydata.org/pandas-docs/dev/missing_data.html#filling-missing-values-fillna > createDataFrame however only works with None as null values, parsing them as > None in the RDD. > I suggest to add support for np.nan values in pandas DataFrames. 
> current stack trace when calling createDataFrame with object-type columns with > np.nan values (which are floats) > {code} > TypeError Traceback (most recent call last) > in () > > 1 sqldf = sqlCtx.createDataFrame(df_, schema=schema) > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in > createDataFrame(self, data, schema, samplingRatio) > 339 schema = self._inferSchema(data.map(lambda r: > row_cls(*r)), samplingRatio) > 340 > --> 341 return self.applySchema(data, schema) > 342 > 343 def registerDataFrameAsTable(self, rdd, tableName): > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/context.py in > applySchema(self, rdd, schema) > 246 > 247 for row in rows: > --> 248 _verify_type(row, schema) > 249 > 250 # convert python objects to sql data > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in > _verify_type(obj, dataType) >1064 "length of fields (%d)" % (len(obj), > len(dataType.fields))) >1065 for v, f in zip(obj, dataType.fields): > -> 1066 _verify_type(v, f.dataType) >1067 >1068 _cached_cls = weakref.WeakValueDictionary() > /opt/spark/spark-1.3.0-bin-hadoop2.4/python/pyspark/sql/types.py in > _verify_type(obj, dataType) >1048 if type(obj) not in _acceptable_types[_type]: >1049 raise TypeError("%s can not accept object in type %s" > -> 1050 % (dataType, type(obj))) >1051 >1052 if isinstance(dataType, ArrayType): > TypeError: StringType can not accept object in type {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
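Until such support lands, the ticket's problem can be worked around by normalizing `np.nan` cells to `None` before calling `createDataFrame`. A minimal pure-Python sketch (using `math.isnan`, which matches `numpy.nan` since it is a plain float; the function name is illustrative):

```python
import math

def nan_to_none(rows):
    """Replace float NaN cells with None so they can be parsed as SQL nulls."""
    def fix(v):
        return None if isinstance(v, float) and math.isnan(v) else v
    return [tuple(fix(v) for v in row) for row in rows]

rows = [("a", 1.0), ("b", float("nan"))]
print(nan_to_none(rows))  # [('a', 1.0), ('b', None)]
```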
[jira] [Resolved] (SPARK-6119) DataFrame.dropna support
[ https://issues.apache.org/jira/browse/SPARK-6119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6119. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Assignee: Reynold Xin > DataFrame.dropna support > > > Key: SPARK-6119 > URL: https://issues.apache.org/jira/browse/SPARK-6119 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: DataFrame > Fix For: 1.3.1, 1.4.0 > > > Support dropping rows with null values (dropna). Similar to Pandas' dropna > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.dropna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
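The dropna semantics referenced in the ticket can be sketched over plain tuples (illustrative, not the DataFrame implementation): `how='any'` drops a row containing any null, `how='all'` drops only rows that are entirely null.

```python
def dropna(rows, how="any"):
    """Drop rows containing nulls, mirroring Pandas' DataFrame.dropna
    semantics for plain tuples."""
    if how == "any":
        return [r for r in rows if all(v is not None for v in r)]
    return [r for r in rows if any(v is not None for v in r)]

rows = [(1, 2), (1, None), (None, None)]
print(dropna(rows))              # [(1, 2)]
print(dropna(rows, how="all"))   # [(1, 2), (1, None)]
```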
[jira] [Resolved] (SPARK-6563) DataFrame.fillna
[ https://issues.apache.org/jira/browse/SPARK-6563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6563. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Assignee: Reynold Xin > DataFrame.fillna > > > Key: SPARK-6563 > URL: https://issues.apache.org/jira/browse/SPARK-6563 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 1.3.1, 1.4.0 > > > Support replacing all null value for a column (or all columns) with a fixed > value. > Similar to Pandas' fillna. > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.fillna.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
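The fillna behavior described above, sketched over plain tuples (illustrative only):

```python
def fillna(rows, value):
    """Replace every null cell with a fixed value, as in Pandas' fillna."""
    return [tuple(value if v is None else v for v in row) for row in rows]

print(fillna([(1, None), (None, 2)], 0))  # [(1, 0), (0, 2)]
```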
[jira] [Assigned] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
[ https://issues.apache.org/jira/browse/SPARK-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6621: --- Assignee: Apache Spark > Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should > call onStop > -- > > Key: SPARK-6621 > URL: https://issues.apache.org/jira/browse/SPARK-6621 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
[ https://issues.apache.org/jira/browse/SPARK-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387911#comment-14387911 ] Apache Spark commented on SPARK-6621: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/5280 > Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should > call onStop > -- > > Key: SPARK-6621 > URL: https://issues.apache.org/jira/browse/SPARK-6621 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
[ https://issues.apache.org/jira/browse/SPARK-6621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6621: --- Assignee: (was: Apache Spark) > Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should > call onStop > -- > > Key: SPARK-6621 > URL: https://issues.apache.org/jira/browse/SPARK-6621 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5456) Decimal Type comparison issue
[ https://issues.apache.org/jira/browse/SPARK-5456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387904#comment-14387904 ] Kuldeep commented on SPARK-5456: [~karthikg01] 1) Switch to the Hive context. I am not trying to deride the plain SQL context, but the Hive context is just better tested and has a well-defined syntax borrowed from Hive. 2) Even in the Hive context I have faced problems with BigDecimals, so like your workaround I also convert BigDecimals to a double (not an int). For all practical purposes it is more than enough; I have not seen many data sources with those types. An RDBMS maps the `NUMERIC` type to BigDecimal in JDBC, but you can always work around this by having a simple map transformation before you register the data with the SQL context. 2 cents. > Decimal Type comparison issue > - > > Key: SPARK-5456 > URL: https://issues.apache.org/jira/browse/SPARK-5456 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Kuldeep > > Not quite able to figure this out but here is a JUnit test to reproduce this, > in JavaAPISuite.java > {code:title=DecimalBug.java} > @Test > public void decimalQueryTest() { > List<Row> decimalTable = new ArrayList<Row>(); > decimalTable.add(RowFactory.create(new BigDecimal("1"), new > BigDecimal("2"))); > decimalTable.add(RowFactory.create(new BigDecimal("3"), new > BigDecimal("4"))); > JavaRDD<Row> rows = sc.parallelize(decimalTable); > List<StructField> fields = new ArrayList<StructField>(7); > fields.add(DataTypes.createStructField("a", > DataTypes.createDecimalType(), true)); > fields.add(DataTypes.createStructField("b", > DataTypes.createDecimalType(), true)); > sqlContext.applySchema(rows.rdd(), > DataTypes.createStructType(fields)).registerTempTable("foo"); > Assert.assertEquals(sqlContext.sql("select * from foo where a > > 0").collectAsList(), decimalTable); > } > {code} > Fails with > java.lang.ClassCastException: java.math.BigDecimal cannot be cast to > org.apache.spark.sql.types.Decimal -- This 
message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
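The workaround described in the comment — mapping decimal columns to doubles before registering the data with the SQL context — looks roughly like this in plain Python (the function name is illustrative; in the JIRA's Java/Scala setting it would be a map over the RDD's rows):

```python
from decimal import Decimal

def decimals_to_double(row):
    """Map Decimal cells to floats so comparisons no longer hit the
    BigDecimal-vs-internal-Decimal cast problem described above."""
    return tuple(float(v) if isinstance(v, Decimal) else v for v in row)

print(decimals_to_double((Decimal("1"), Decimal("2"))))  # (1.0, 2.0)
```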
[jira] [Created] (SPARK-6621) Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop
Shixiong Zhu created SPARK-6621: --- Summary: Calling EventLoop.stop in EventLoop.onReceive and EventLoop.onError should call onStop Key: SPARK-6621 URL: https://issues.apache.org/jira/browse/SPARK-6621 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
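The contract at stake can be sketched in plain Python (this is not Spark's Scala `EventLoop`; names are illustrative): `stop()` must still run the `onStop` hook even when it is invoked from inside `onReceive` or `onError`.

```python
class EventLoop:
    """Minimal sketch: stop() runs the on_stop hook exactly once, including
    when called re-entrantly from inside the receive handler."""
    def __init__(self):
        self._stopped = False
        self.on_stop_called = False

    def on_receive(self, event):
        if event == "shutdown":
            self.stop()       # stop() called from inside the handler

    def on_stop(self):
        self.on_stop_called = True

    def stop(self):
        if not self._stopped:  # idempotent: run cleanup only once
            self._stopped = True
            self.on_stop()

loop = EventLoop()
loop.on_receive("shutdown")
print(loop.on_stop_called)  # True
```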
[jira] [Commented] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387898#comment-14387898 ] Apache Spark commented on SPARK-6620: - User 'vlyubin' has created a pull request for this issue: https://github.com/apache/spark/pull/5279 > Speed up toDF() and rdd() functions by constructing converters in > ScalaReflection > - > > Key: SPARK-6620 > URL: https://issues.apache.org/jira/browse/SPARK-6620 > Project: Spark > Issue Type: Improvement >Reporter: Volodymyr Lyubinets > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6620: --- Assignee: (was: Apache Spark) > Speed up toDF() and rdd() functions by constructing converters in > ScalaReflection > - > > Key: SPARK-6620 > URL: https://issues.apache.org/jira/browse/SPARK-6620 > Project: Spark > Issue Type: Improvement >Reporter: Volodymyr Lyubinets > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
[ https://issues.apache.org/jira/browse/SPARK-6620?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6620: --- Assignee: Apache Spark > Speed up toDF() and rdd() functions by constructing converters in > ScalaReflection > - > > Key: SPARK-6620 > URL: https://issues.apache.org/jira/browse/SPARK-6620 > Project: Spark > Issue Type: Improvement >Reporter: Volodymyr Lyubinets >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6620) Speed up toDF() and rdd() functions by constructing converters in ScalaReflection
Volodymyr Lyubinets created SPARK-6620: -- Summary: Speed up toDF() and rdd() functions by constructing converters in ScalaReflection Key: SPARK-6620 URL: https://issues.apache.org/jira/browse/SPARK-6620 Project: Spark Issue Type: Improvement Reporter: Volodymyr Lyubinets -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6606) Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd object.
[ https://issues.apache.org/jira/browse/SPARK-6606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan closed SPARK-6606. Resolution: Duplicate Duplicate of SPARK-5360, see https://github.com/apache/spark/pull/4145 > Accumulator deserialized twice because the NarrowCoGroupSplitDep contains rdd > object. > - > > Key: SPARK-6606 > URL: https://issues.apache.org/jira/browse/SPARK-6606 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: SuYan > > 1. With code like the example below, the accumulator is found to be deserialized twice. > first: > {code} > task = ser.deserialize[Task[Any]](taskBytes, > Thread.currentThread.getContextClassLoader) > {code} > second: > {code} > val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])]( > ByteBuffer.wrap(taskBinary.value), > Thread.currentThread.getContextClassLoader) > {code} > The first deserialization is not what is expected, > because a ResultTask or ShuffleMapTask has a partition object. > In the class > {code} > CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: > Partitioner) > {code}, the CoGroupPartition may contain a CoGroupSplitDep: > {code} > NarrowCoGroupSplitDep( > rdd: RDD[_], > splitIndex: Int, > var split: Partition > ) extends CoGroupSplitDep { > {code} > That *NarrowCoGroupSplitDep* pulls in the rdd object, which results > in the first deserialization. 
> example: > {code} >val acc1 = sc.accumulator(0, "test1") > val acc2 = sc.accumulator(0, "test2") > val rdd1 = sc.parallelize((1 to 10).toSeq, 3) > val rdd2 = sc.parallelize((1 to 10).toSeq, 3) > val combine1 = rdd1.map { case a => (a, 1)}.combineByKey(a => { > acc1 += 1 > a > }, (a: Int, b: Int) => { > a + b > }, > (a: Int, b: Int) => { > a + b > }, new HashPartitioner(3), mapSideCombine = false) > val combine2 = rdd2.map { case a => (a, 1)}.combineByKey( > a => { > acc2 += 1 > a > }, > (a: Int, b: Int) => { > a + b > }, > (a: Int, b: Int) => { > a + b > }, new HashPartitioner(3), mapSideCombine = false) > combine1.cogroup(combine2, new HashPartitioner(3)).count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
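The root cause described above is generic to Java serialization: a small object that holds a reference to a large one drags the large object (and everything it references) into the stream, so deserializing the partition also deserializes the enclosing RDD a second time. A minimal, Spark-free sketch of that effect (all class names here are hypothetical stand-ins, not Spark classes):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Stand-in for an RDD: counts how many times it is deserialized.
class FakeRdd implements Serializable {
    static int deserializations = 0;
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        deserializations++;
    }
}

// Stand-in for NarrowCoGroupSplitDep: a "split" that references its RDD.
class SplitDep implements Serializable {
    final FakeRdd rdd;   // this field drags the whole RDD into the stream
    SplitDep(FakeRdd rdd) { this.rdd = rdd; }
}

class SerDemo {
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) { oos.writeObject(o); }
        return bos.toByteArray();
    }
    static Object deserialize(byte[] b) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(b))) {
            return ois.readObject();
        }
    }
}
```

Serializing the "task binary" and the partition separately therefore yields two independent copies of the RDD, and deserializing both runs the RDD's deserialization hooks twice.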
[jira] [Commented] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387847#comment-14387847 ] Apache Spark commented on SPARK-5371: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/5278 > Failure to analyze query with UNION ALL and double aggregation > -- > > Key: SPARK-5371 > URL: https://issues.apache.org/jira/browse/SPARK-5371 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: David Ross >Assignee: Michael Armbrust >Priority: Critical > > This SQL session: > {code} > DROP TABLE > test1; > DROP TABLE > test2; > CREATE TABLE > test1 > ( > c11 INT, > c12 INT, > c13 INT, > c14 INT > ); > CREATE TABLE > test2 > ( > c21 INT, > c22 INT, > c23 INT, > c24 INT > ); > SELECT > MIN(t3.c_1), > MIN(t3.c_2), > MIN(t3.c_3), > MIN(t3.c_4) > FROM > ( > SELECT > SUM(t1.c11) c_1, > NULL c_2, > NULL c_3, > NULL c_4 > FROM > test1 t1 > UNION ALL > SELECT > NULL c_1, > SUM(t2.c22) c_2, > SUM(t2.c23) c_3, > SUM(t2.c24) c_4 > FROM > test2 t2 ) t3; > {code} > Produces this error: > {code} > 15/01/23 00:25:21 INFO thriftserver.SparkExecuteStatementOperation: Running > query 'SELECT > MIN(t3.c_1), > MIN(t3.c_2), > MIN(t3.c_3), > MIN(t3.c_4) > FROM > ( > SELECT > SUM(t1.c11) c_1, > NULL c_2, > NULL c_3, > NULL c_4 > FROM > test1 t1 > UNION ALL > SELECT > NULL c_1, > SUM(t2.c22) c_2, > SUM(t2.c23) c_3, > SUM(t2.c24) c_4 > FROM > test2 t2 ) t3' > 15/01/23 00:25:21 INFO parse.ParseDriver: Parsing command: SELECT > MIN(t3.c_1), > MIN(t3.c_2), > MIN(t3.c_3), > MIN(t3.c_4) > FROM > ( > SELECT > SUM(t1.c11) c_1, > NULL c_2, > NULL c_3, > NULL c_4 > FROM > test1 t1 > UNION ALL > SELECT > NULL c_1, > SUM(t2.c22) c_2, > SUM(t2.c23) c_3, > SUM(t2.c24) c_4 > FROM > test2 t2 ) t3 > 15/01/23 00:25:21 INFO parse.ParseDriver: Parse Completed > 15/01/23 00:25:21 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query: > 
java.util.NoSuchElementException: key not found: c_2#23488 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.default(AttributeMap.scala:29) > at scala.collection.MapLike$class.apply(MapLike.scala:141) > at > org.apache.spark.sql.catalyst.expressions.AttributeMap.apply(AttributeMap.scala:29) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:77) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$1.applyOrElse(Optimizer.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:135) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$.pushToRight(Optimizer.scala:76) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1$$anonfun$applyOrElse$6.apply(Optimizer.scala:98) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:98) > at > org.apache.spark.sql.catalyst.optimizer.UnionPushdown$$anonfun$apply$1.applyOrElse(Optimizer.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:144) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$ano
[jira] [Assigned] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5371: --- Assignee: Michael Armbrust (was: Apache Spark)
[jira] [Assigned] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5371: --- Assignee: Apache Spark (was: Michael Armbrust)
[jira] [Updated] (SPARK-5371) Failure to analyze query with UNION ALL and double aggregation
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5371: Summary: Failure to analyze query with UNION ALL and double aggregation (was: SparkSQL Fails to analyze Query with UNION ALL in subquery)
[jira] [Assigned] (SPARK-5371) SparkSQL Fails to parse Query with UNION ALL in subquery
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-5371: --- Assignee: Michael Armbrust
[jira] [Updated] (SPARK-5371) SparkSQL Fails to parse Query with UNION ALL in subquery
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5371: Priority: Critical (was: Major) Target Version/s: 1.3.1 Affects Version/s: 1.2.0 1.3.0
[jira] [Updated] (SPARK-5371) SparkSQL Fails to analyze Query with UNION ALL in subquery
[ https://issues.apache.org/jira/browse/SPARK-5371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-5371: Summary: SparkSQL Fails to analyze Query with UNION ALL in subquery (was: SparkSQL Fails to parse Query with UNION ALL in subquery)
[jira] [Resolved] (SPARK-6605) Same transformation in DStream leads to different result
[ https://issues.apache.org/jira/browse/SPARK-6605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus resolved SPARK-6605. - Resolution: Won't Fix {{reduceByKeyAndWindow}} has two implementations that produce different results when an empty window occurs, but we consider this a difference, not a problem. If a user wants to remove the empty keys produced by {{ReducedWindowedDStream}}, they can apply a {{filter}} function to remove them. > Same transformation in DStream leads to different result > > > Key: SPARK-6605 > URL: https://issues.apache.org/jira/browse/SPARK-6605 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: SaintBacchus > Fix For: 1.4.0 > > > The transformation *reduceByKeyAndWindow* has two implementations: one uses > *WindowedDStream* and the other uses *ReducedWindowedDStream*. > The results are always the same, except when an empty window occurs. > In a wordcount example, if a period of time (longer than the window duration) has no > data coming in, the first *reduceByKeyAndWindow* has no elements inside, but the > second has many elements with zero values inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
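The difference described above can be reproduced without Spark: a full recompute over an empty window yields no keys at all, while an incremental reduction (add the entering batch, subtract the leaving batch) keeps previously-seen keys with zero counts. A hedged sketch of the two strategies (a simplified model of the idea, not the actual DStream implementations):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WindowCount {
    // "WindowedDStream" style: recompute counts from the batches in the window.
    static Map<String, Integer> recompute(List<List<String>> windowBatches) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> batch : windowBatches)
            for (String w : batch) counts.merge(w, 1, Integer::sum);
        return counts;
    }

    // "ReducedWindowedDStream" style: previous counts + entering batch - leaving batch.
    static Map<String, Integer> incremental(Map<String, Integer> prev,
                                            List<String> entering, List<String> leaving) {
        Map<String, Integer> counts = new HashMap<>(prev);
        for (String w : entering) counts.merge(w, 1, Integer::sum);
        for (String w : leaving) counts.merge(w, -1, Integer::sum);
        return counts;  // keys that net to zero are kept, not removed
    }
}
```

When the window empties, `recompute` returns an empty map while `incremental` returns every old key mapped to 0, which is exactly the divergence the issue describes and why a trailing `filter` on the incremental result restores agreement.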
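The difference between the two reduceByKeyAndWindow code paths, and the suggested {{filter}} workaround, can be sketched in plain Python. This simulates the word-count semantics only; it is not the Spark API, and the function names are illustrative:

```python
# Plain-Python model of SPARK-6605: on an empty window, recomputing from
# scratch yields no keys, while the incremental (add/subtract) path keeps
# keys whose counts have dropped to zero.

def recompute_window(batches):
    """Naive path: re-reduce every batch currently in the window."""
    counts = {}
    for batch in batches:
        for word in batch:
            counts[word] = counts.get(word, 0) + 1
    return counts

def incremental_window(prev_counts, entering, leaving):
    """Incremental path: add entering words, subtract leaving ones.
    Keys whose count drops to zero are retained, mirroring the
    ReducedWindowedDStream behavior described above."""
    counts = dict(prev_counts)
    for word in entering:
        counts[word] = counts.get(word, 0) + 1
    for word in leaving:
        counts[word] = counts.get(word, 0) - 1
    return counts

prev = {"a": 2, "b": 1}
# The window slides until it is empty: nothing enters, everything leaves.
naive = recompute_window([])                              # no keys at all
incremental = incremental_window(prev, [], ["a", "a", "b"])  # zero-valued keys
# The workaround from the resolution: filter out the zero-valued keys.
filtered = {k: v for k, v in incremental.items() if v != 0}
```

In Spark itself the equivalent of the final step is passing a filter function so that zero-count entries are dropped from the incremental window state.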
[jira] [Commented] (SPARK-6619) Improve Jar caching on executors
[ https://issues.apache.org/jira/browse/SPARK-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387783#comment-14387783 ] Mingyu Kim commented on SPARK-6619: --- [~li-zhihui], [~joshrosen], since you worked on SPARK-2713: I'll prepare a PR in the next couple of days, but wanted to get your thoughts in the meantime. > Improve Jar caching on executors > > > Key: SPARK-6619 > URL: https://issues.apache.org/jira/browse/SPARK-6619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Mingyu Kim > > Taking SPARK-2713 one step further so that > - The cached jars can be used by multiple applications. To do that, > I'm planning to use MD5 as the cache key as opposed to URL hash and timestamp. > - The cached jars are hard-linked to the work directory as opposed to being > copied. > Re: perf. Computing MD5 using "openssl" on my local MacBook Pro took 1.2s for > 158 jars with a total size of 56MB, and these take ~10s to ship to the > executor at start-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6619) Improve Jar caching on executors
[ https://issues.apache.org/jira/browse/SPARK-6619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mingyu Kim updated SPARK-6619: -- Description: Taking SPARK-2713 one step further so that - The cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. - The cached jars are hard-linked to the work directory as opposed to being copied. Re: perf. Computing MD5 using "openssl" on my local Macbook Pro took 1.2s for 158 jars with the total size of 56MB, and this takes ~10s to ship to the executor at the start-up. was: Taking SPARK-2713 one step further so that the cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. Re: perf. Computing MD5 using "openssl" on my local Macbook Pro took 1.2s for 158 jars with the total size of 56MB, and this takes 5~10s to Summary: Improve Jar caching on executors (was: Jar cache on Executors should use file content hash as the key) > Improve Jar caching on executors > > > Key: SPARK-6619 > URL: https://issues.apache.org/jira/browse/SPARK-6619 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Mingyu Kim > > Taking SPARK-2713 one step further so that > - The cached jars can be used by multiple applications. In order to do that, > I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. > - The cached jars are hard-linked to the work directory as opposed to being > copied. > Re: perf. Computing MD5 using "openssl" on my local Macbook Pro took 1.2s for > 158 jars with the total size of 56MB, and this takes ~10s to ship to the > executor at the start-up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6619) Jar cache on Executors should use file content hash as the key
Mingyu Kim created SPARK-6619: - Summary: Jar cache on Executors should use file content hash as the key Key: SPARK-6619 URL: https://issues.apache.org/jira/browse/SPARK-6619 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Mingyu Kim Taking SPARK-2713 one step further so that the cached jars can be used by multiple applications. In order to do that, I'm planning to use MD5 as the cache key as opposed to url hash and timestamp. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
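The content-hash cache key proposed here can be sketched in a few lines. This is a minimal Python illustration rather than Spark's Scala utilities, and the helper name is made up:

```python
import hashlib

# Sketch of a content-addressed jar cache key (the SPARK-6619 proposal):
# hash the jar's bytes (MD5, per the description) instead of keying on
# URL hash + timestamp, so byte-identical jars fetched by different
# applications map to the same cache entry.
def content_key(path, chunk_size=1 << 20):
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        # Stream in chunks so large jars are not read into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()
```

Two byte-identical jars produce the same key regardless of where or when they were fetched, which is what would let the cache be shared across applications.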
[jira] [Comment Edited] (SPARK-6239) Spark MLlib fpm#FPGrowth minSupport should use long instead
[ https://issues.apache.org/jira/browse/SPARK-6239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387748#comment-14387748 ] Littlestar edited comment on SPARK-6239 at 3/31/15 1:09 AM: >>>I would imagine a relative value is more usually useful. When recnum=12345678 and minsupport=0.003, recnum*minsupport is close to an integer, and results near that boundary can be lost because of double precision. was (Author: cnstar9988): >>If I want to set minCount=2, I must use .setMinSupport(1.99/(rdd.count())), >>because of double's precision. How can I reopen this PR and mark its relation to pull/5246? Thanks. > Spark MLlib fpm#FPGrowth minSupport should use long instead > --- > > Key: SPARK-6239 > URL: https://issues.apache.org/jira/browse/SPARK-6239 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Littlestar >Priority: Minor > > Spark MLlib fpm#FPGrowth minSupport should use long instead > == > val minCount = math.ceil(minSupport * count).toLong > because: > 1. [count] the number of records in the dataset is not known before reading. > 2. [minSupport] double precision. > from mahout#FPGrowthDriver.java > addOption("minSupport", "s", "(Optional) The minimum number of times a > co-occurrence must be present." > + " Default Value: 3", "3"); > I just want to set minCount=2 for a test. > Thanks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
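The precision problem being discussed can be reproduced in a few lines of plain Python standing in for the Scala expression `math.ceil(minSupport * count).toLong`:

```python
import math

# Sketch of the SPARK-6239 complaint: FPGrowth derives an absolute
# minCount from a fractional minSupport via ceil(minSupport * count).
# Recovering an exact integer threshold through a double round-trip is
# fragile, which is why the reporter resorts to 1.99 instead of 2.0.
count = 12345678
naive = math.ceil((2.0 / count) * count)        # may round up to 3
workaround = math.ceil((1.99 / count) * count)  # lands safely below 2, ceil = 2
```

An API that accepted an absolute long minCount directly, as the issue title suggests, would avoid the double round-trip entirely.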
[jira] [Commented] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387738#comment-14387738 ] Yin Huai commented on SPARK-6618: - cc [~marmbrus] and [~lian cheng]. I am going to address this today. > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
[ https://issues.apache.org/jira/browse/SPARK-6618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6618: Target Version/s: 1.3.1 (was: 1.3.0) > HiveMetastoreCatalog.lookupRelation should use fine-grained lock > > > Key: SPARK-6618 > URL: https://issues.apache.org/jira/browse/SPARK-6618 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.3.0 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Blocker > > Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) > and the scope of lock will cover resolving data source tables > (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). > So, lookupRelation can be extremely expensive when we are doing expensive > operations like parquet schema discovery. So, we should use fine-grained lock > for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6618) HiveMetastoreCatalog.lookupRelation should use fine-grained lock
Yin Huai created SPARK-6618: --- Summary: HiveMetastoreCatalog.lookupRelation should use fine-grained lock Key: SPARK-6618 URL: https://issues.apache.org/jira/browse/SPARK-6618 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yin Huai Assignee: Yin Huai Priority: Blocker Right now the entire method of HiveMetastoreCatalog.lookupRelation has a lock (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L173) and the scope of lock will cover resolving data source tables (https://github.com/apache/spark/blob/branch-1.3/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala#L93). So, lookupRelation can be extremely expensive when we are doing expensive operations like parquet schema discovery. So, we should use fine-grained lock for lookupRelation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
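The locking change being proposed can be sketched abstractly: hold the lock only around shared-state access, not around the expensive resolution work. This is a plain-Python illustration of the pattern, with made-up names, not the actual HiveMetastoreCatalog code:

```python
import threading

# Coarse vs. fine-grained locking around an expensive lookup (the shape
# of the SPARK-6618 change). In the coarse version, every lookup of an
# unresolved table serializes all other lookups behind the expensive
# resolve step; in the fine version, only cache reads/writes are locked.
class Catalog:
    def __init__(self):
        self._lock = threading.Lock()
        self._cache = {}

    def lookup_relation_coarse(self, name, resolve):
        with self._lock:                 # lock held for the whole resolution
            if name not in self._cache:
                self._cache[name] = resolve(name)   # expensive, serialized
            return self._cache[name]

    def lookup_relation_fine(self, name, resolve):
        with self._lock:                 # fast cache check under the lock
            cached = self._cache.get(name)
        if cached is not None:
            return cached
        resolved = resolve(name)         # expensive work outside the lock
        with self._lock:                 # publish; another thread may have won
            return self._cache.setdefault(name, resolved)
```

The fine-grained version trades a possible duplicated resolution under contention for never blocking other lookups on expensive work such as Parquet schema discovery.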
[jira] [Updated] (SPARK-6617) Word2Vec is nondeterministic
[ https://issues.apache.org/jira/browse/SPARK-6617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6617: - Summary: Word2Vec is nondeterministic (was: Word2Vec is not deterministic) > Word2Vec is nondeterministic > > > Key: SPARK-6617 > URL: https://issues.apache.org/jira/browse/SPARK-6617 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Xiangrui Meng >Priority: Minor > > Word2Vec uses repartition: > https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L291, > which doesn't provide deterministic ordering. This makes QA a little harder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6617) Word2Vec is not deterministic
Xiangrui Meng created SPARK-6617: Summary: Word2Vec is not deterministic Key: SPARK-6617 URL: https://issues.apache.org/jira/browse/SPARK-6617 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.0 Reporter: Xiangrui Meng Priority: Minor Word2Vec uses repartition: https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L291, which doesn't provide deterministic ordering. This makes QA a little harder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
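Why repartition-style redistribution makes results order-dependent can be modeled in a few lines. This sketches the idea only (round-robin placement depends on arrival order, content-based placement does not); it is not Spark's shuffle implementation:

```python
# Model of the SPARK-6617 nondeterminism: round-robin assignment of
# elements to partitions depends on the order elements arrive, whereas a
# content-based (hash) assignment gives the same layout for any order.

def round_robin(items, n):
    parts = [[] for _ in range(n)]
    for i, x in enumerate(items):
        parts[i % n].append(x)       # placement depends on position i
    return parts

def hash_partition(items, n):
    parts = [[] for _ in range(n)]
    for x in items:
        parts[hash(x) % n].append(x)  # placement depends only on content
    return [sorted(p) for p in parts]

order_a = ["spark", "word", "vector", "model"]
order_b = list(reversed(order_a))
# Same data, different input order: round-robin layouts differ,
# hash-based layouts agree.
```

A deterministic redistribution (or a fixed ordering before training) is the kind of change that would make Word2Vec runs reproducible for QA.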
[jira] [Resolved] (SPARK-6369) InsertIntoHiveTable and Parquet Relation should use logic from SparkHadoopWriter
[ https://issues.apache.org/jira/browse/SPARK-6369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-6369. --- Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 Issue resolved by pull request 5139 [https://github.com/apache/spark/pull/5139] > InsertIntoHiveTable and Parquet Relation should use logic from > SparkHadoopWriter > > > Key: SPARK-6369 > URL: https://issues.apache.org/jira/browse/SPARK-6369 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Cheng Lian >Priority: Blocker > Fix For: 1.3.1, 1.4.0 > > > Right now it is possible that we will corrupt the output if there is a race > between competing speculative tasks. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6616) IsStopped set to true before stop() is complete.
[ https://issues.apache.org/jira/browse/SPARK-6616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ilya Ganelin updated SPARK-6616: Description: There are numerous instances throughout the code base of the following: {code} if (!stopped) { stopped = true ... } {code} In general, this is bad practice since it can cause an incomplete cleanup if there is an error during shutdown and not all code executes. Incomplete cleanup is harder to track down than a double cleanup that triggers some error. I propose fixing this throughout the code, starting with the cleanup sequence with {code}SparkContext.stop() {code}. A cursory examination reveals this in {code}SparkContext.stop(), SparkEnv.stop(), and ContextCleaner.stop() {code}. was: There are numerous instances throughout the code base of the following: {code} if (!stopped) { stopped = true ... } {code} In general, this is bad practice since it can cause an incomplete cleanup if there is an error during shutdown and not all code executes. Incomplete cleanup is harder to track down than a double cleanup that triggers some error. I propose fixing this throughout the code, starting with the cleanup sequence with {{code}}SparkContext.stop() {{code}}. A cursory examination reveals this in {{code}}SparkContext.stop(), SparkEnv.stop(), and ContextCleaner.stop() {{code}}. > IsStopped set to true in before stop() is complete. > --- > > Key: SPARK-6616 > URL: https://issues.apache.org/jira/browse/SPARK-6616 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.3.0 >Reporter: Ilya Ganelin > > There are numerous instances throughout the code base of the following: > {code} > if (!stopped) { > stopped = true > ... > } > {code} > In general, this is bad practice since it can cause an incomplete cleanup if > there is an error during shutdown and not all code executes. Incomplete > cleanup is harder to track down than a double cleanup that triggers some > error. 
I propose fixing this throughout the code, starting with the cleanup > sequence in {{SparkContext.stop()}}. > A cursory examination reveals this pattern in {{SparkContext.stop()}}, > {{SparkEnv.stop()}}, and {{ContextCleaner.stop()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6616) IsStopped set to true before stop() is complete.
Ilya Ganelin created SPARK-6616: --- Summary: IsStopped set to true in before stop() is complete. Key: SPARK-6616 URL: https://issues.apache.org/jira/browse/SPARK-6616 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.3.0 Reporter: Ilya Ganelin There are numerous instances throughout the code base of the following: ``` if (!stopped) { stopped = true ... } ``` In general, this is bad practice since it can cause an incomplete cleanup if there is an error during shutdown and not all code executes. Incomplete cleanup is harder to track down than a double cleanup that triggers some error. I propose fixing this throughout the code, starting with the cleanup sequence with ```SparkContext.stop()```. A cursory examination reveals this in ```SparkContext.stop(), SparkEnv.stop(), and ContextCleaner.stop()```. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
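One safer shape for the pattern criticized above is to guard re-entry with a separate "stopping" latch and flip the stopped flag only after cleanup finishes. This is an illustrative plain-Python sketch of that idea, not Spark's actual stop() code:

```python
import threading

# SPARK-6616's concern: setting `stopped = true` before cleanup runs means
# a failure mid-shutdown leaves the object claiming to be stopped while
# resources leak. Here, a private `_stopping` latch makes stop() idempotent,
# and `stopped` becomes true only once every cleanup step has completed.
class Service:
    def __init__(self):
        self._lock = threading.Lock()
        self._stopping = False
        self.stopped = False
        self.cleaned = []

    def stop(self):
        with self._lock:
            if self._stopping:          # idempotent: a second call is a no-op
                return
            self._stopping = True
        self.cleaned.append("event_loop")  # cleanup steps run exactly once...
        self.cleaned.append("context")
        self.stopped = True                # ...and the flag flips only at the end
```

If a cleanup step throws, `stopped` stays false, so a retry (rather than a silent partial shutdown) is still possible.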
[jira] [Created] (SPARK-6615) Python API for Word2Vec
Kai Sasaki created SPARK-6615: - Summary: Python API for Word2Vec Key: SPARK-6615 URL: https://issues.apache.org/jira/browse/SPARK-6615 Project: Spark Issue Type: Task Components: MLlib, PySpark Affects Versions: 1.3.0 Reporter: Kai Sasaki Priority: Minor Fix For: 1.4.0 This is a sub-task of [SPARK-6254|https://issues.apache.org/jira/browse/SPARK-6254]. Wrap the missing methods for {{Word2Vec}} and {{Word2VecModel}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387568#comment-14387568 ] Josh Rosen commented on SPARK-6492: --- Timeouts are one way to fix this, but I wonder if we could also try to remove the circular wait condition by modifying EventLoop so that {{stopped}} is set before we call {{onError}}. This would prevent calls to {{EventLoop.stop()}} from blocking while the event loop is in the process of shutting down, which should prevent this race. > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
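The suggestion in the comment above — set {{stopped}} before invoking {{onError}} — can be sketched as follows. This is a simplified, hypothetical Java model (names like MiniEventLoop are invented, not Spark's EventLoop): because the event thread marks itself stopped before reporting the error, any stop() call made from inside onError() returns immediately instead of attempting to join the event thread, which removes the circular wait.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the proposed ordering: the event thread sets `stopped`
// BEFORE invoking the error callback, so a stop() issued from within
// onError() (or racing with it) is a no-op rather than a blocking
// join on a thread that is itself waiting -- the deadlock in the
// attached thread dumps.
abstract class MiniEventLoop {
    private final AtomicBoolean stopped = new AtomicBoolean(false);
    private final Thread eventThread = new Thread(() -> {
        try {
            runLoop();
        } catch (Throwable t) {
            stopped.set(true); // mark stopped first...
            onError(t);        // ...then report; onError may call stop() safely
        }
    });

    void start() { eventThread.start(); }

    void stop() throws InterruptedException {
        if (stopped.compareAndSet(false, true)) {
            eventThread.interrupt();
            eventThread.join(); // only the first external caller joins
        }
        // already stopped (possibly by the event thread itself):
        // no join, hence no self-join deadlock
    }

    protected abstract void runLoop() throws Exception;
    protected abstract void onError(Throwable t);
}
```

With the original ordering (onError first, stopped second), a stop() reached from onError would try to join the event thread from the event thread itself and block forever.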
[jira] [Commented] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387567#comment-14387567 ] Joseph K. Bradley commented on SPARK-5886: -- Also, should this index native types other than Strings? It would be a shame to need a separate class for each of the other native types, such as Double and Int. Maybe the two distinctions we need are: * This class indexes native types. * [SPARK-4081] indexes Vector and Array types. > Add LabelIndexer > > > Key: SPARK-5886 > URL: https://issues.apache.org/jira/browse/SPARK-5886 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `LabelIndexer` takes a column of labels (raw categories) and outputs an > integer column with labels indexed by their frequency. > {code} > val li = new LabelIndexer() > .setInputCol("country") > .setOutputCol("countryIndex") > {code} > In the output column, we should store the label-to-index map as an ML > attribute. The index should be ordered by frequency, where the most frequent > label gets index 0, to enhance sparsity. > We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
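The frequency-ordered indexing that the issue describes can be sketched in a few lines. This is an illustrative Java sketch (class and method names are invented, not Spark's API): count each label, then assign index 0 to the most frequent label, index 1 to the next, and so on, which improves sparsity when the index is later one-hot encoded.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch of frequency-ordered label indexing:
// the most frequent label gets index 0, ties broken alphabetically
// for determinism (a detail the issue leaves open).
class LabelIndexSketch {
    static Map<String, Integer> fit(List<String> labels) {
        Map<String, Long> counts = labels.stream()
            .collect(Collectors.groupingBy(l -> l, Collectors.counting()));
        List<String> ordered = new ArrayList<>(counts.keySet());
        ordered.sort((a, b) -> {
            int c = Long.compare(counts.get(b), counts.get(a)); // descending count
            return c != 0 ? c : a.compareTo(b);                 // alphabetical tie-break
        });
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < ordered.size(); i++) {
            index.put(ordered.get(i), i);
        }
        return index;
    }
}
```

For a country column with counts US=3, FR=2, DE=1, this yields US→0, FR→1, DE→2, matching the "most frequent label gets index 0" requirement.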
[jira] [Assigned] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6492: --- Assignee: (was: Apache Spark) > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5205: --- Assignee: Apache Spark > Inconsistent behaviour between Streaming job and others, when click kill link > in WebUI > -- > > Key: SPARK-5205 > URL: https://issues.apache.org/jira/browse/SPARK-5205 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: uncleGen >Assignee: Apache Spark > > The "kill" link is used to kill a stage in a job. It works for every kind of > Spark job except Spark Streaming. To be specific, we can only kill the stage > that runs the "Receiver", but not kill the "Receivers" themselves. The > stage can be killed and cleaned from the UI, but the receivers are still > alive and receiving data. I think this does not fit with common sense. > IMHO, killing the "receiver" stage should mean killing the "receivers" and > stopping data reception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5205) Inconsistent behaviour between Streaming job and others, when click kill link in WebUI
[ https://issues.apache.org/jira/browse/SPARK-5205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5205: --- Assignee: (was: Apache Spark) > Inconsistent behaviour between Streaming job and others, when click kill link > in WebUI > -- > > Key: SPARK-5205 > URL: https://issues.apache.org/jira/browse/SPARK-5205 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: uncleGen > > The "kill" link is used to kill a stage in a job. It works for every kind of > Spark job except Spark Streaming. To be specific, we can only kill the stage > that runs the "Receiver", but not kill the "Receivers" themselves. The > stage can be killed and cleaned from the UI, but the receivers are still > alive and receiving data. I think this does not fit with common sense. > IMHO, killing the "receiver" stage should mean killing the "receivers" and > stopping data reception. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6492: --- Assignee: Apache Spark > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6492) SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies
[ https://issues.apache.org/jira/browse/SPARK-6492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387558#comment-14387558 ] Apache Spark commented on SPARK-6492: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5277 > SparkContext.stop() can deadlock when DAGSchedulerEventProcessLoop dies > --- > > Key: SPARK-6492 > URL: https://issues.apache.org/jira/browse/SPARK-6492 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.0, 1.4.0 >Reporter: Josh Rosen >Priority: Critical > > A deadlock can occur when DAGScheduler death causes a SparkContext to be shut > down while user code is concurrently racing to stop the SparkContext in a > finally block. > For example: > {code} > try { > sc = new SparkContext("local", "test") > // start running a job that causes the DAGSchedulerEventProcessLoop to > crash > someRDD.doStuff() > } finally { > sc.stop() // stop the SparkContext once the failure in DAGScheduler causes > the above job to fail with an exception > } > {code} > This leads to a deadlock. 
The event processor thread tries to lock on the > {{SparkContext.SPARK_CONTEXT_CONSTRUCTOR_LOCK}} and becomes blocked because > the thread that holds that lock is waiting for the event processor thread to > join: > {code} > "dag-scheduler-event-loop" daemon prio=5 tid=0x7ffa69456000 nid=0x9403 > waiting for monitor entry [0x0001223ad000] >java.lang.Thread.State: BLOCKED (on object monitor) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1398) > - waiting to lock <0x0007f5037b08> (a java.lang.Object) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onError(DAGScheduler.scala:1412) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:52) > {code} > {code} > "pool-1-thread-1-ScalaTest-running-SparkContextSuite" prio=5 > tid=0x7ffa69864800 nid=0x5903 in Object.wait() [0x0001202dc000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1281) > - locked <0x0007f4b28000> (a > org.apache.spark.util.EventLoop$$anon$1) > at java.lang.Thread.join(Thread.java:1355) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:79) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1352) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1405) > - locked <0x0007f5037b08> (a java.lang.Object) > [...] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5886) Add LabelIndexer
[ https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387554#comment-14387554 ] Joseph K. Bradley commented on SPARK-5886: -- Was there any discussion about this indexing multiple columns at the same time? I think it should be able to, since that sounds easier and more efficient when indexing many columns. > Add LabelIndexer > > > Key: SPARK-5886 > URL: https://issues.apache.org/jira/browse/SPARK-5886 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > `LabelIndexer` takes a column of labels (raw categories) and outputs an > integer column with labels indexed by their frequency. > {code} > val li = new LabelIndexer() > .setInputCol("country") > .setOutputCol("countryIndex") > {code} > In the output column, we should store the label-to-index map as an ML > attribute. The index should be ordered by frequency, where the most frequent > label gets index 0, to enhance sparsity. > We can discuss whether this should index multiple columns at the same time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6603) SQLContext.registerFunction -> SQLContext.udf.register
[ https://issues.apache.org/jira/browse/SPARK-6603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6603. Resolution: Fixed Fix Version/s: 1.4.0 1.3.1 > SQLContext.registerFunction -> SQLContext.udf.register > -- > > Key: SPARK-6603 > URL: https://issues.apache.org/jira/browse/SPARK-6603 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Reynold Xin >Assignee: Davies Liu > Fix For: 1.3.1, 1.4.0 > > > We didn't change the Python implementation to use that. Maybe the best > strategy is to deprecate SQLContext.registerFunction, and just add > SQLContext.udf.register. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
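The API reorganization resolved above — moving a flat registerFunction onto a nested udf namespace — can be illustrated with a tiny sketch. This is a hypothetical Java model (MiniSQLContext and UDFRegistration are invented names, not Spark classes), showing only the namespacing idea: the old entry point stays as a deprecated forwarder while new registrations go through ctx.udf().register(...).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the API move: registrations are grouped under
// a `udf` namespace object instead of living directly on the context,
// and the old flat method forwards to it so existing callers keep working.
class MiniSQLContext {
    private final UDFRegistration udf = new UDFRegistration();

    UDFRegistration udf() { return udf; }

    /** Old entry point, kept for compatibility but deprecated. */
    @Deprecated
    void registerFunction(String name, Function<Object, Object> f) {
        udf.register(name, f); // forward to the new namespace
    }

    static class UDFRegistration {
        private final Map<String, Function<Object, Object>> fns = new HashMap<>();
        void register(String name, Function<Object, Object> f) { fns.put(name, f); }
        Function<Object, Object> lookup(String name) { return fns.get(name); }
    }
}
```

The forwarding approach means both code paths land in one registry, so deprecating the old method later costs nothing.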
[jira] [Commented] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387541#comment-14387541 ] Joseph K. Bradley commented on SPARK-6251: -- I'm closing this since we need to revamp the optimization API anyways. > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-6251. Resolution: Won't Fix > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6251: --- Assignee: Joseph K. Bradley (was: Apache Spark) > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6251) Mark parts of LBFGS, GradientDescent as DeveloperApi
[ https://issues.apache.org/jira/browse/SPARK-6251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6251: --- Assignee: Apache Spark (was: Joseph K. Bradley) > Mark parts of LBFGS, GradientDescent as DeveloperApi > > > Key: SPARK-6251 > URL: https://issues.apache.org/jira/browse/SPARK-6251 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Trivial > > Should be DeveloperApi: > * optimize > * setGradient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6614: --- Assignee: Apache Spark (was: Josh Rosen) > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
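The corrected bookkeeping described in the issue can be sketched as follows. This is a simplified, hypothetical Java model (not Spark's OutputCommitCoordinator): the coordinator remembers which task attempt was authorized to commit each partition, and clears that authorization only when that specific attempt fails — failures of other attempts leave the authorized committer untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of the fix: track the authorized attempt per
// partition and clear it ONLY when the authorized attempt itself
// fails. The buggy version cleared the authorization on ANY task
// failure for that partition, letting a second committer race in.
class CommitCoordinatorSketch {
    // partition -> attempt number currently authorized to commit
    private final Map<Integer, Long> authorized = new HashMap<>();

    synchronized boolean canCommit(int partition, long attempt) {
        Long current = authorized.get(partition);
        if (current == null) {
            authorized.put(partition, attempt); // first asker wins
            return true;
        }
        return current == attempt; // only the authorized attempt may commit
    }

    synchronized void taskFailed(int partition, long attempt) {
        Long current = authorized.get(partition);
        if (current != null && current == attempt) {
            authorized.remove(partition); // free the slot for a new attempt
        }
        // an unrelated attempt's failure changes nothing
    }
}
```

With this version, an unrelated speculative attempt failing after the authorized committer was chosen no longer opens the window in which two attempts could both believe they may commit.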
[jira] [Assigned] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6614: --- Assignee: Josh Rosen (was: Apache Spark) > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387534#comment-14387534 ] Apache Spark commented on SPARK-6614: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/5276 > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format
[ https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387532#comment-14387532 ] Apache Spark commented on SPARK-2883: - User 'zhzhan' has created a pull request for this issue: https://github.com/apache/spark/pull/5275 > Spark Support for ORCFile format > > > Key: SPARK-2883 > URL: https://issues.apache.org/jira/browse/SPARK-2883 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Reporter: Zhan Zhang >Priority: Blocker > Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 > pm jobtracker.png, orc.diff > > > Verify the support of OrcInputFormat in Spark, fix issues if any exist, and > add documentation of its usage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6614) OutputCommitCoordinator should clear authorized committers only after authorized committer fails, not after any failure
[ https://issues.apache.org/jira/browse/SPARK-6614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-6614: -- Affects Version/s: 1.4.0 1.3.1 > OutputCommitCoordinator should clear authorized committers only after > authorized committer fails, not after any failure > --- > > Key: SPARK-6614 > URL: https://issues.apache.org/jira/browse/SPARK-6614 > Project: Spark > Issue Type: Bug >Affects Versions: 1.3.0, 1.3.1, 1.4.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > In OutputCommitCoordinator, there is some logic to clear the authorized > committer's lock on committing in case it fails. However, it looks like the > current code also clears this lock if _other_ tasks fail, which is an obvious > bug: > https://github.com/apache/spark/blob/df3550084c9975f999ed370dd9f7c495181a68ba/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala#L118. > In theory, it's possible that this could allow a new committer to start, > run to completion, and commit output before the authorized committer > finished, but it's unlikely that this race occurs often in practice due to > the complex combination of failure and timing conditions that would be > required to expose it. Still, we should fix this issue. > This was discovered by [~adav] while reading the OutputCommitCoordinator code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org