[jira] [Commented] (SPARK-11960) User guide section for streaming a/b testing

2015-11-24 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025593#comment-15025593
 ] 

Feynman Liang commented on SPARK-11960:
---

[~josephkb] Happy to work on it; when is the 1.6 QA deadline?

> User guide section for streaming a/b testing
> 
>
> Key: SPARK-11960
> URL: https://issues.apache.org/jira/browse/SPARK-11960
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Feynman Liang
>
> [~fliang] Assigning since you added the feature.  Will you have a chance to 
> do this soon?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11969) SQL UI does not work with PySpark

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11969:


Assignee: Davies Liu  (was: Apache Spark)

> SQL UI does not work with PySpark
> -
>
> Key: SPARK-11969
> URL: https://issues.apache.org/jira/browse/SPARK-11969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11969) SQL UI does not work with PySpark

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025732#comment-15025732
 ] 

Apache Spark commented on SPARK-11969:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9949

> SQL UI does not work with PySpark
> -
>
> Key: SPARK-11969
> URL: https://issues.apache.org/jira/browse/SPARK-11969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11969) SQL UI does not work with PySpark

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11969:


Assignee: Apache Spark  (was: Davies Liu)

> SQL UI does not work with PySpark
> -
>
> Key: SPARK-11969
> URL: https://issues.apache.org/jira/browse/SPARK-11969
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Apache Spark
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-11-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11601:
--
Assignee: Timothy Hunter  (was: Tim Hunter)

> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11970) Add missing APIs in Dataset

2015-11-24 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025795#comment-15025795
 ] 

Xiao Li commented on SPARK-11970:
-

Working on it. Thanks! 

> Add missing APIs in Dataset
> ---
>
> Key: SPARK-11970
> URL: https://issues.apache.org/jira/browse/SPARK-11970
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We should add the following functions to Dataset:
> 1. show
> 2. cache / persist / unpersist
> 3. sample
> 4. join with outer join support



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11805) SpillableIterator should free the in-memory sorter while spilling

2015-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11805.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9793
[https://github.com/apache/spark/pull/9793]

> SpillableIterator should free the in-memory sorter while spilling
> -
>
> Key: SPARK-11805
> URL: https://issues.apache.org/jira/browse/SPARK-11805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.6.0
>
>
> This array buffer will not be used after spilling, should be freed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6518) Add example code and user guide for bisecting k-means

2015-11-24 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025717#comment-15025717
 ] 

Yu Ishikawa commented on SPARK-6518:


All right. I'll send a PR soon. Thanks!

> Add example code and user guide for bisecting k-means
> -
>
> Key: SPARK-6518
> URL: https://issues.apache.org/jira/browse/SPARK-6518
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9670) Examples: Check for new APIs requiring example code

2015-11-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9670:
-
Assignee: Timothy Hunter  (was: Tim Hunter)

> Examples: Check for new APIs requiring example code
> ---
>
> Key: SPARK-9670
> URL: https://issues.apache.org/jira/browse/SPARK-9670
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
>Priority: Minor
>
> Audit list of new features added to MLlib, and see which major items are 
> missing example code (in the examples folder).  We do not need examples for 
> everything, only for major items such as new ML algorithms.
> For any such items:
> * Create a JIRA for that feature, and assign it to the author of the feature 
> (or yourself if interested).
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8517:
-
Assignee: Timothy Hunter  (was: Tim Hunter)

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11960) User guide section for streaming a/b testing

2015-11-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025750#comment-15025750
 ] 

Joseph K. Bradley commented on SPARK-11960:
---

Would you be able to do this by next Monday?  I appreciate it!

> User guide section for streaming a/b testing
> 
>
> Key: SPARK-11960
> URL: https://issues.apache.org/jira/browse/SPARK-11960
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Feynman Liang
>
> [~fliang] Assigning since you added the feature.  Will you have a chance to 
> do this soon?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6518) Add example code and user guide for bisecting k-means

2015-11-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025751#comment-15025751
 ] 

Joseph K. Bradley commented on SPARK-6518:
--

Great, thanks!

> Add example code and user guide for bisecting k-means
> -
>
> Key: SPARK-6518
> URL: https://issues.apache.org/jira/browse/SPARK-6518
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11934) [SQL] Adding joinType into joinWith

2015-11-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11934:

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-

> [SQL] Adding joinType into joinWith 
> 
>
> Key: SPARK-11934
> URL: https://issues.apache.org/jira/browse/SPARK-11934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>
> Add a joinType parameter to the existing joinWith function in the Dataset API. 
> When using the joinWith function, users can then specify one of the following 
> join types: `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.
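
A hedged sketch of what the proposed call site might look like (`ds1`, `ds2`, and `cond` are placeholders for two Datasets and a join-condition Column; only the three-argument form is the new piece being proposed here):

{code}
// Existing API: joinWith returns a Dataset of pairs using an inner join.
val pairs = ds1.joinWith(ds2, cond)

// Proposed: let callers pick the join type explicitly, e.g. a left outer join.
val leftOuter = ds1.joinWith(ds2, cond, "left_outer")
{code}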



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10621) Audit function names in FunctionRegistry and corresponding method names shown in functions.scala and functions.py

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10621:


Assignee: Apache Spark

> Audit function names in FunctionRegistry and corresponding method names shown 
> in functions.scala and functions.py
> -
>
> Key: SPARK-10621
> URL: https://issues.apache.org/jira/browse/SPARK-10621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> Right now, there are a few places where we are not very consistent.
> * There are a few functions that are registered in {{FunctionRegistry}} but 
> not provided in {{functions.scala}} and {{functions.py}}. Examples are 
> {{isnull}} and {{get_json_object}}.
> * There are a few functions that have different names in FunctionRegistry 
> and in the DataFrame API. {{spark_partition_id}} is an example: in 
> FunctionRegistry it is called {{spark_partition_id}}, but in the DataFrame API 
> the method is called {{sparkPartitionId}}.
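
To make the inconsistency concrete, a minimal spark-shell sketch (the DataFrame {{df}} and its columns are hypothetical; the camel-cased method is the 1.5-era name quoted above):

{code}
// SQL side: the FunctionRegistry name is snake_cased and works in SQL / selectExpr.
val withPid = df.selectExpr("*", "spark_partition_id() AS pid")

// DataFrame side: the equivalent method in functions.scala is camel-cased.
import org.apache.spark.sql.functions._
val withPid2 = df.select(col("*"), sparkPartitionId().as("pid"))

// Some registered functions (e.g. isnull, get_json_object) have no
// functions.scala / functions.py counterpart at all, so selectExpr/SQL is the
// only way to call them from the DataFrame API today.
val nulls = df.selectExpr("isnull(name) AS name_is_null")
{code}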



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10621) Audit function names in FunctionRegistry and corresponding method names shown in functions.scala and functions.py

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10621:


Assignee: (was: Apache Spark)

> Audit function names in FunctionRegistry and corresponding method names shown 
> in functions.scala and functions.py
> -
>
> Key: SPARK-10621
> URL: https://issues.apache.org/jira/browse/SPARK-10621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now, there are a few places where we are not very consistent.
> * There are a few functions that are registered in {{FunctionRegistry}} but 
> not provided in {{functions.scala}} and {{functions.py}}. Examples are 
> {{isnull}} and {{get_json_object}}.
> * There are a few functions that have different names in FunctionRegistry 
> and in the DataFrame API. {{spark_partition_id}} is an example: in 
> FunctionRegistry it is called {{spark_partition_id}}, but in the DataFrame API 
> the method is called {{sparkPartitionId}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11968) ALS recommend all methods spend most of time in GC

2015-11-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11968:
-

 Summary: ALS recommend all methods spend most of time in GC
 Key: SPARK-11968
 URL: https://issues.apache.org/jira/browse/SPARK-11968
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.5.2, 1.6.0
Reporter: Joseph K. Bradley


After adding recommendUsersForProducts and recommendProductsForUsers to ALS in 
spark-perf, I noticed that they take much longer than ALS itself.  Looking at 
the monitoring page, I can see they spend about 8 minutes doing GC for each 
10-minute task.  That sounds fixable.  Looking at the implementation, there is 
clearly an opportunity to avoid extra allocations: 
[https://github.com/apache/spark/blob/e6dd237463d2de8c506f0735dfdb3f43e8122513/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala#L283]

CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10621) Audit function names in FunctionRegistry and corresponding method names shown in functions.scala and functions.py

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025539#comment-15025539
 ] 

Apache Spark commented on SPARK-10621:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9948

> Audit function names in FunctionRegistry and corresponding method names shown 
> in functions.scala and functions.py
> -
>
> Key: SPARK-10621
> URL: https://issues.apache.org/jira/browse/SPARK-10621
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>Priority: Critical
>
> Right now, there are a few places where we are not very consistent.
> * There are a few functions that are registered in {{FunctionRegistry}} but 
> not provided in {{functions.scala}} and {{functions.py}}. Examples are 
> {{isnull}} and {{get_json_object}}.
> * There are a few functions that have different names in FunctionRegistry 
> and in the DataFrame API. {{spark_partition_id}} is an example: in 
> FunctionRegistry it is called {{spark_partition_id}}, but in the DataFrame API 
> the method is called {{sparkPartitionId}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10802) Let ALS recommend for subset of data

2015-11-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025541#comment-15025541
 ] 

Joseph K. Bradley commented on SPARK-10802:
---

Linking related issue: recommend all could be faster.

> Let ALS recommend for subset of data
> 
>
> Key: SPARK-10802
> URL: https://issues.apache.org/jira/browse/SPARK-10802
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Tomasz Bartczak
>Priority: Minor
>
> Currently MatrixFactorizationModel allows to get recommendations for
> - single user 
> - single product 
> - all users
> - all products
> Recommendations for all users/products do a cartesian join inside.
> It would be useful in some cases to get recommendations for a subset of 
> users/products by providing an RDD with which MatrixFactorizationModel could 
> do an intersection before doing the cartesian join. This would make it much 
> faster in situations where recommendations are needed only for a subset of 
> users/products, and when the subset is still too large to make it feasible to 
> recommend one-by-one.
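
For context, a minimal workaround sketch under the current 1.5 API (names are illustrative; this is not the API change the issue proposes): restrict the user factors to the subset before rebuilding the model, so the cartesian join only touches that subset.

{code}
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.rdd.RDD

def recommendForUserSubset(
    model: MatrixFactorizationModel,
    userSubset: RDD[Int],
    num: Int) = {
  // Keep only the factor vectors of the requested users, then reuse recommend-all.
  val keep = userSubset.distinct().map(u => (u, ()))
  val filteredUserFeatures = model.userFeatures.join(keep).mapValues(_._1)
  new MatrixFactorizationModel(model.rank, filteredUserFeatures, model.productFeatures)
    .recommendProductsForUsers(num)
}
{code}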



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9328) Netty IO layer should implement read timeouts

2015-11-24 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025711#comment-15025711
 ] 

Josh Rosen commented on SPARK-9328:
---

Actually, I spoke slightly too soon: there were some timeouts that had to be 
lowered in order for the master branch test to pass (my test was originally 
created for Spark 1.2.x for a backport). It looks like SPARK-7003 has addressed 
this for Spark 1.4.x+, so I'm going to resolve this as fixed in 1.4.0+.

> Netty IO layer should implement read timeouts
> -
>
> Key: SPARK-9328
> URL: https://issues.apache.org/jira/browse/SPARK-9328
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.1, 1.3.1
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.4.0
>
>
> Spark's network layer does not implement read timeouts which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
>   The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may 
> have to do something like this.
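
For reference, a minimal Scala sketch (purely illustrative, not Spark's actual transport code) of where a Netty ReadTimeoutHandler would sit in a channel pipeline:

{code}
import io.netty.channel.ChannelInitializer
import io.netty.channel.socket.SocketChannel
import io.netty.handler.timeout.ReadTimeoutHandler

// If no bytes are read from the socket for 120 seconds, Netty fires a
// ReadTimeoutException, which a later handler can translate into failing the
// outstanding fetch instead of hanging until an OS-level timeout.
class TimeoutAwareInitializer extends ChannelInitializer[SocketChannel] {
  override def initChannel(ch: SocketChannel): Unit = {
    ch.pipeline().addFirst("readTimeout", new ReadTimeoutHandler(120))
    // ... the usual codec / message handlers would be added after this ...
  }
}
{code}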



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9328) Netty IO layer should implement read timeouts

2015-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-9328:
-

Assignee: Josh Rosen

> Netty IO layer should implement read timeouts
> -
>
> Key: SPARK-9328
> URL: https://issues.apache.org/jira/browse/SPARK-9328
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.1, 1.3.1
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Blocker
> Fix For: 1.4.0
>
>
> Spark's network layer does not implement read timeouts which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
>   The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may 
> have to do something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results

2015-11-24 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025739#comment-15025739
 ] 

Yin Huai commented on SPARK-11885:
--

I tried 
{code}
val q1 = sql("""
select   store_country,
 store_region,
 gm(amount)
from receipts
where    amount > 50
 and store_country = 'italy'
group by store_country, store_region
""")
{code}

The result seems correct. It looks like combining built-in aggregate functions 
with the UDAF somehow triggers the problem.

> UDAF may nondeterministically generate wrong results
> 
>
> Key: SPARK-11885
> URL: https://issues.apache.org/jira/browse/SPARK-11885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5). 
> I think it is an issue in 1.5 branch.
> Try the following in spark 1.5 (with a cluster) and you can see the problem.
> {code}
> import java.math.BigDecimal
> import org.apache.spark.sql.expressions.MutableAggregationBuffer
> import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType, StructField, DataType, 
> DoubleType, LongType}
> class GeometricMean extends UserDefinedAggregateFunction {
>   def inputSchema: StructType =
> StructType(StructField("value", DoubleType) :: Nil)
>   def bufferSchema: StructType = StructType(
> StructField("count", LongType) ::
>   StructField("product", DoubleType) :: Nil
>   )
>   def dataType: DataType = DoubleType
>   def deterministic: Boolean = true
>   def initialize(buffer: MutableAggregationBuffer): Unit = {
> buffer(0) = 0L
> buffer(1) = 1.0
>   }
>   def update(buffer: MutableAggregationBuffer,input: Row): Unit = {
> buffer(0) = buffer.getAs[Long](0) + 1
> buffer(1) = buffer.getAs[Double](1) * input.getAs[Double](0)
>   }
>   def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
> buffer1(0) = buffer1.getAs[Long](0) + buffer2.getAs[Long](0)
> buffer1(1) = buffer1.getAs[Double](1) * buffer2.getAs[Double](1)
>   }
>   def evaluate(buffer: Row): Any = {
> math.pow(buffer.getDouble(1), 1.0d / buffer.getLong(0))
>   }
> }
> sqlContext.udf.register("gm", new GeometricMean)
> val df = Seq(
>   (1, "italy", "emilia", 42, BigDecimal.valueOf(100, 0), "john"),
>   (2, "italy", "toscana", 42, BigDecimal.valueOf(505, 1), "jim"),
>   (3, "italy", "puglia", 42, BigDecimal.valueOf(70, 0), "jenn"),
>   (4, "italy", "emilia", 42, BigDecimal.valueOf(75 ,0), "jack"),
>   (5, "uk", "london", 42, BigDecimal.valueOf(200 ,0), "carl"),
>   (6, "italy", "emilia", 42, BigDecimal.valueOf(42, 0), "john")).
>   toDF("receipt_id", "store_country", "store_region", "store_id", "amount", 
> "seller_name")
> df.registerTempTable("receipts")
>   
> val q = sql("""
> select   store_country,
>  store_region,
>  avg(amount),
>  sum(amount),
>  gm(amount)
> from receipts
> where    amount > 50
>  and store_country = 'italy'
> group by store_country, store_region
> """)
> q.show
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11914) [SQL] Support coalesce and repartition in Dataset APIs

2015-11-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11914:

Assignee: Xiao Li

> [SQL] Support coalesce and repartition in Dataset APIs
> --
>
> Key: SPARK-11914
> URL: https://issues.apache.org/jira/browse/SPARK-11914
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 1.6.0
>
>
> repartition: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions.
> coalesce: Returns a new [[Dataset]] that has exactly `numPartitions` 
> partitions. Similar to coalesce defined on an [[RDD]], this operation results 
> in a narrow dependency, e.g. if you go from 1000 partitions to 100 
> partitions, there will not be a shuffle, instead each of the 100 new 
> partitions will claim 10 of the current partitions.
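
A short usage sketch of the two methods as described above (spark-shell style; `ds` stands for a hypothetical Dataset, and the methods are the ones this issue adds for 1.6):

{code}
// repartition: full shuffle to exactly the requested number of partitions.
val wide = ds.repartition(1000)

// coalesce: narrow dependency; going from 1000 partitions down to 100 avoids a
// shuffle, with each of the 100 new partitions claiming 10 of the current ones.
val narrow = wide.coalesce(100)

println(narrow.rdd.partitions.length)  // 100
{code}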



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9325) Support `collect` on DataFrame columns

2015-11-24 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025611#comment-15025611
 ] 

Hossein Falaki edited comment on SPARK-9325 at 11/24/15 10:50 PM:
--

To help R users without opening up the API, how about adding head and collect 
functions for Column that just print out a warning explaining how to do it 
the "right" way:

{code}
> collect(df$Col)
Warning: DataFrame Column may not be materialized. Please use 
collect(select(df, df$Col))
{code}

Right now, users will get a Java exception which is pretty confusing for most R 
users.


was (Author: falaki):
To help R users and not open up the API, how about adding head and collect 
functions for Column, that just print out a warning that explain how to do it 
the "right" way:

{code}
collect(df$Col)
Warning: DataFrame Column may be not be materialized. Please use 
collect(select(df, df$Col))
{code}

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-11-24 Thread Hossein Falaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025611#comment-15025611
 ] 

Hossein Falaki commented on SPARK-9325:
---

To help R users without opening up the API, how about adding head and collect 
functions for Column that just print out a warning explaining how to do it 
the "right" way:

{code}
collect(df$Col)
Warning: DataFrame Column may not be materialized. Please use 
collect(select(df, df$Col))
{code}

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8517:
-
Assignee: Tim Hunter

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Tim Hunter
>
> The current MLlib user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11783) When deployed against remote Hive metastore, HiveContext.executionHive points to wrong metastore

2015-11-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11783.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9895
[https://github.com/apache/spark/pull/9895]

> When deployed against remote Hive metastore, HiveContext.executionHive points 
> to wrong metastore
> 
>
> Key: SPARK-11783
> URL: https://issues.apache.org/jira/browse/SPARK-11783
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 1.7.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.6.0
>
>
> When using a remote metastore, the execution Hive client is somehow initialized 
> to point to the actual remote metastore instead of the dummy local Derby 
> metastore.
> To reproduce this issue:
> # Configure {{conf/hive-site.xml}} to point to a remote Hive 1.2.1 
> metastore.
> # Set {{hive.metastore.uris}} to {{thrift://localhost:9083}}.
> # Start metastore service using {{$HIVE_HOME/bin/hive --service metastore}}
> # Start Thrift server with remote debugging options
> # Attach the debugger to the Thrift server driver process, we can verify that 
> {{executionHive}} points to the remote metastore rather than the local 
> execution Derby metastore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11953) CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug

2015-11-24 Thread Siva Gudavalli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025538#comment-15025538
 ] 

Siva Gudavalli commented on SPARK-11953:


I agree. It depends on how we define SaveMode.Append.
I am looking for an option similar to InsertIntoJdbc in 1.4.1.

> CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug
> 
>
> Key: SPARK-11953
> URL: https://issues.apache.org/jira/browse/SPARK-11953
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Submit, SQL
>Affects Versions: 1.4.1, 1.5.1
> Environment: Spark stand alone cluster
>Reporter: Siva Gudavalli
>
> In Spark 1.3.1 we had two methods, i.e. CreateJdbcTable and InsertIntoJdbc.
> They were replaced with write.jdbc() in Spark 1.4.1.
> When we specify SaveMode.Append we are letting the application know that there 
> is a table in the database, which means "tableExists = true", and we do not 
> need to perform "JdbcUtils.tableExists(conn, table)".
> Please let me know if you think differently.
> Regards
> Shiv
> {code}
> def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
>   val conn = JdbcUtils.createConnection(url, connectionProperties)
>   try {
>     var tableExists = JdbcUtils.tableExists(conn, table)
>     if (mode == SaveMode.Ignore && tableExists) {
>       return
>     }
>     if (mode == SaveMode.ErrorIfExists && tableExists) {
>       sys.error(s"Table $table already exists.")
>     }
>     if (mode == SaveMode.Overwrite && tableExists) {
>       JdbcUtils.dropTable(conn, table)
>       tableExists = false
>     }
>     // Create the table if the table didn't exist.
>     if (!tableExists) {
>       val schema = JDBCWriteDetails.schemaString(df, url)
>       val sql = s"CREATE TABLE $table ($schema)"
>       conn.prepareStatement(sql).executeUpdate()
>     }
>   } finally {
>     conn.close()
>   }
>   JDBCWriteDetails.saveTable(df, url, table, connectionProperties)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11970) Add missing APIs in Dataset

2015-11-24 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11970:
---

 Summary: Add missing APIs in Dataset
 Key: SPARK-11970
 URL: https://issues.apache.org/jira/browse/SPARK-11970
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11970) Add missing APIs in Dataset

2015-11-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11970:

Description: 
We should add the following functions to Dataset:

1. show

2. cache / persist / unpersist

3. sample

4. join with outer join support
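
A sketch of how those calls would look on a typed Dataset once added (spark-shell style; the case class and data are made up, and the outer-join form mirrors the joinType proposal in SPARK-11934):

{code}
case class Point(id: Long, x: Double)
val ds = Seq(Point(1L, 0.5), Point(2L, 1.5)).toDS()   // assumes import sqlContext.implicits._

ds.show()                                                           // 1. show
ds.cache()                                                          // 2. cache / persist / unpersist
val sampled = ds.sample(withReplacement = false, fraction = 0.5)    // 3. sample
// 4. join with outer join support, e.g. via a joinType argument on joinWith:
// ds.joinWith(other, cond, "outer")
{code}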





> Add missing APIs in Dataset
> ---
>
> Key: SPARK-11970
> URL: https://issues.apache.org/jira/browse/SPARK-11970
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>
> We should add the following functions to Dataset:
> 1. show
> 2. cache / persist / unpersist
> 3. sample
> 4. join with outer join support



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9328) Netty IO layer should implement read timeouts

2015-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9328:
--
Affects Version/s: (was: 1.4.1)
   (was: 1.5.0)

> Netty IO layer should implement read timeouts
> -
>
> Key: SPARK-9328
> URL: https://issues.apache.org/jira/browse/SPARK-9328
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.1, 1.3.1
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.4.0
>
>
> Spark's network layer does not implement read timeouts which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
>   The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may 
> have to do something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9328) Netty IO layer should implement read timeouts

2015-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-9328.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

> Netty IO layer should implement read timeouts
> -
>
> Key: SPARK-9328
> URL: https://issues.apache.org/jira/browse/SPARK-9328
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.1, 1.3.1
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.4.0
>
>
> Spark's network layer does not implement read timeouts which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
>   The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may 
> have to do something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9328) Netty IO layer should implement read timeouts

2015-11-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-9328:
--
Target Version/s:   (was: 1.6.0)

> Netty IO layer should implement read timeouts
> -
>
> Key: SPARK-9328
> URL: https://issues.apache.org/jira/browse/SPARK-9328
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.1, 1.3.1
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.4.0
>
>
> Spark's network layer does not implement read timeouts which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
>   The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may 
> have to do something like this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11969) SQL UI does not work with PySpark

2015-11-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11969:
--

 Summary: SQL UI does not work with PySpark
 Key: SPARK-11969
 URL: https://issues.apache.org/jira/browse/SPARK-11969
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11966) Spark API for UDTFs

2015-11-24 Thread Jaka Jancar (JIRA)
Jaka Jancar created SPARK-11966:
---

 Summary: Spark API for UDTFs
 Key: SPARK-11966
 URL: https://issues.apache.org/jira/browse/SPARK-11966
 Project: Spark
  Issue Type: New Feature
Reporter: Jaka Jancar


Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
functions. For those you still have to use these horrendous Hive interfaces:

https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java
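
For contrast, a minimal sketch of the scalar-UDF path that already exists (function and table names are made up), next to a note on what a UDTF currently requires:

{code}
// Scalar UDF: one input row -> one output value, registered in a single line.
sqlContext.udf.register("double_it", (x: Int) => x * 2)
sqlContext.sql("SELECT double_it(value) FROM t").show()

// Table-generating function (one input row -> many output rows): there is no
// sqlContext-level registration hook today, so it still means implementing
// Hive's GenericUDTF (initialize / process / forward / close) as in the link above.
{code}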




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11966) Spark API for UDTFs

2015-11-24 Thread Jaka Jancar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jaka Jancar updated SPARK-11966:

Priority: Minor  (was: Major)

> Spark API for UDTFs
> ---
>
> Key: SPARK-11966
> URL: https://issues.apache.org/jira/browse/SPARK-11966
> Project: Spark
>  Issue Type: New Feature
>Reporter: Jaka Jancar
>Priority: Minor
>
> Defining UDFs is easy using sqlContext.udf.register, but not table-generating 
> functions. For those you still have to use these horrendous Hive interfaces:
> https://github.com/prongs/apache-hive/blob/master/contrib/src/java/org/apache/hadoop/hive/contrib/udtf/example/GenericUDTFCount2.java



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11967:


Assignee: Reynold Xin  (was: Apache Spark)

> Use varargs for multiple paths in DataFrameReader
> -
>
> Key: SPARK-11967
> URL: https://issues.apache.org/jira/browse/SPARK-11967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025416#comment-15025416
 ] 

Reynold Xin commented on SPARK-11967:
-

It was for consistency with only one function :) Varargs is much easier to use here.
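
Roughly, the call-site difference (paths are hypothetical, and the varargs overload is the one the linked PR adds):

{code}
// With varargs, multiple inputs read naturally in one call:
val logs = sqlContext.read.json("/data/2015-11-23.json", "/data/2015-11-24.json")

// Without it, callers have to read each path separately and union the results:
val logs2 = sqlContext.read.json("/data/2015-11-23.json")
  .unionAll(sqlContext.read.json("/data/2015-11-24.json"))
{code}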

> Use varargs for multiple paths in DataFrameReader
> -
>
> Key: SPARK-11967
> URL: https://issues.apache.org/jira/browse/SPARK-11967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4944:
---

Assignee: Apache Spark

> Table Not Found exception in "Create Table Like registered RDD table"
> -
>
> Key: SPARK-4944
> URL: https://issues.apache.org/jira/browse/SPARK-4944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>Assignee: Apache Spark
>
> {code}
> rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
> hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
> hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION 
> '/user/spark/my_data.parquet'")
> {code}
> {noformat}
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not 
> found rdd_table
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11140) Replace file server in driver with RPC-based alternative

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025524#comment-15025524
 ] 

Apache Spark commented on SPARK-11140:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9947

> Replace file server in driver with RPC-based alternative
> 
>
> Key: SPARK-11140
> URL: https://issues.apache.org/jira/browse/SPARK-11140
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
> Fix For: 1.7.0
>
>
> As part of making configuring encryption easy in Spark, it would be better to 
> use the existing RPC channel between driver and executors to transfer files 
> and jars added to the application.
> This would remove the need to start the HTTP server currently used for that 
> purpose, which needs to be configured to use SSL if encryption is wanted. SSL 
> is kinda hard to configure correctly in a multi-user, distributed environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11872) Prevent the call to SparkContext#stop() in the listener bus's thread

2015-11-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-11872.
--
  Resolution: Fixed
   Fix Version/s: 1.6.0
Target Version/s: 1.6.0

> Prevent the call to SparkContext#stop() in the listener bus's thread
> 
>
> Key: SPARK-11872
> URL: https://issues.apache.org/jira/browse/SPARK-11872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
> Fix For: 1.6.0
>
>
> This is continuation of SPARK-11761
> Andrew suggested adding this protection. See tail of PR #9741



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10911) Executors should System.exit on clean shutdown

2015-11-24 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-10911:
--
Assignee: Zhuo Liu

> Executors should System.exit on clean shutdown
> --
>
> Key: SPARK-10911
> URL: https://issues.apache.org/jira/browse/SPARK-10911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
>Priority: Minor
>
> Executors should call System.exit on clean shutdown to make sure all user 
> threads exit and the JVM shuts down.
> We ran into a case where an Executor was left around for days trying to 
> shut down because the user code was using a non-daemon thread pool and one of 
> those threads wasn't exiting.  We should force the JVM to go away with 
> System.exit.
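
A tiny illustration (not Spark code; names are made up) of how a non-daemon user thread pool keeps the JVM alive after a clean shutdown, which is what the forced System.exit guards against:

{code}
import java.util.concurrent.Executors

object StuckOnExit {
  def main(args: Array[String]): Unit = {
    // Non-daemon worker threads: the JVM will not exit while they are alive.
    val pool = Executors.newFixedThreadPool(1)
    pool.submit(new Runnable {
      override def run(): Unit = Thread.sleep(Long.MaxValue)
    })
    println("main() is done, but the process keeps running...")
    // Only an explicit pool.shutdownNow() or System.exit(0) would end the process here.
  }
}
{code}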



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10911) Executors should System.exit on clean shutdown

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10911:


Assignee: Apache Spark  (was: Zhuo Liu)

> Executors should System.exit on clean shutdown
> --
>
> Key: SPARK-10911
> URL: https://issues.apache.org/jira/browse/SPARK-10911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>Priority: Minor
>
> Executors should call System.exit on clean shutdown to make sure all user 
> threads exit and the JVM shuts down.
> We ran into a case where an Executor was left around for days trying to 
> shut down because the user code was using a non-daemon thread pool and one of 
> those threads wasn't exiting.  We should force the JVM to go away with 
> System.exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10911) Executors should System.exit on clean shutdown

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10911:


Assignee: Zhuo Liu  (was: Apache Spark)

> Executors should System.exit on clean shutdown
> --
>
> Key: SPARK-10911
> URL: https://issues.apache.org/jira/browse/SPARK-10911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
>Priority: Minor
>
> Executors should call System.exit on clean shutdown to make sure all user 
> threads exit and the JVM shuts down.
> We ran into a case where an Executor was left around for days trying to 
> shut down because the user code was using a non-daemon thread pool and one of 
> those threads wasn't exiting.  We should force the JVM to go away with 
> System.exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10911) Executors should System.exit on clean shutdown

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025441#comment-15025441
 ] 

Apache Spark commented on SPARK-10911:
--

User 'zhuoliu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9946

> Executors should System.exit on clean shutdown
> --
>
> Key: SPARK-10911
> URL: https://issues.apache.org/jira/browse/SPARK-10911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
>Priority: Minor
>
> Executors should call System.exit on clean shutdown to make sure all user 
> threads exit and the JVM shuts down.
> We ran into a case where an Executor was left around for days trying to 
> shut down because the user code was using a non-daemon thread pool and one of 
> those threads wasn't exiting.  We should force the JVM to go away with 
> System.exit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11872) Prevent the call to SparkContext#stop() in the listener bus's thread

2015-11-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-11872:
-
Assignee: Ted Yu  (was: Shixiong Zhu)

> Prevent the call to SparkContext#stop() in the listener bus's thread
> 
>
> Key: SPARK-11872
> URL: https://issues.apache.org/jira/browse/SPARK-11872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Ted Yu
> Fix For: 1.6.0
>
>
> This is continuation of SPARK-11761
> Andrew suggested adding this protection. See tail of PR #9741



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11872) Prevent the call to SparkContext#stop() in the listener bus's thread

2015-11-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-11872:


Assignee: Shixiong Zhu

> Prevent the call to SparkContext#stop() in the listener bus's thread
> 
>
> Key: SPARK-11872
> URL: https://issues.apache.org/jira/browse/SPARK-11872
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Ted Yu
>Assignee: Shixiong Zhu
> Fix For: 1.6.0
>
>
> This is continuation of SPARK-11761
> Andrew suggested adding this protection. See tail of PR #9741



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11946) Audit pivot API for 1.6

2015-11-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-11946.
-
   Resolution: Fixed
Fix Version/s: 1.6.0

> Audit pivot API for 1.6
> ---
>
> Key: SPARK-11946
> URL: https://issues.apache.org/jira/browse/SPARK-11946
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.6.0
>
>
> Currently pivot's signature looks like
> {code}
> @scala.annotation.varargs
> def pivot(pivotColumn: Column, values: Column*): GroupedData
> @scala.annotation.varargs
> def pivot(pivotColumn: String, values: Any*): GroupedData
> {code}
> I think we can remove the one that takes "Column" types, since callers should 
> always be passing in literals. It'd also be more clear if the values are not 
> varargs, but rather Seq or java.util.List.
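
For illustration, a hedged sketch of the call site under the narrowed shape discussed above (the DataFrame, column names, and values are made up; this shows the direction of the proposal, not necessarily the final signature):

{code}
// String column name plus a Seq of literal values instead of Column varargs.
val pivoted = df.groupBy("year")
  .pivot("course", Seq("dotNET", "Java"))
  .sum("earnings")
{code}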



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11967:


Assignee: Apache Spark  (was: Reynold Xin)

> Use varargs for multiple paths in DataFrameReader
> -
>
> Key: SPARK-11967
> URL: https://issues.apache.org/jira/browse/SPARK-11967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025373#comment-15025373
 ] 

Apache Spark commented on SPARK-11967:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9945

> Use varargs for multiple paths in DataFrameReader
> -
>
> Key: SPARK-11967
> URL: https://issues.apache.org/jira/browse/SPARK-11967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11929) spark-shell log level customization is lost if user provides a log4j.properties file

2015-11-24 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-11929.
--
   Resolution: Fixed
Fix Version/s: 1.7.0

Issue resolved by pull request 9816
[https://github.com/apache/spark/pull/9816]

> spark-shell log level customization is lost if user provides a 
> log4j.properties file
> 
>
> Key: SPARK-11929
> URL: https://issues.apache.org/jira/browse/SPARK-11929
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Shell
>Reporter: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.7.0
>
>
> {{Logging.scala}} has code that defines the default log level for the 
> spark-shell to WARN, to avoid lots of noise in the output.
> But if a user provides a log4j.properties file in the Spark configuration, 
> that customization is lost. That means that without a log4j.properties, there 
> are two different configurations (one for regular apps, one for the shell). 
> But if you have a custom file, you lose the ability to easily differentiate 
> between those two, and you're stuck with a single config for both.
> It would be nice to allow different configurations also in the second case.
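
As a rough sketch of the kind of logic being discussed (log4j 1.x API; illustrative only, not the actual Logging.scala code), the shell default could be applied only when no user-provided log4j.properties is on the classpath:

{code}
import org.apache.log4j.{Level, LogManager}

// Respect a user-supplied log4j.properties; otherwise fall back to the quieter
// default that spark-shell wants.
val userConfig = Thread.currentThread().getContextClassLoader
  .getResource("log4j.properties")
if (userConfig == null) {
  LogManager.getRootLogger.setLevel(Level.WARN)
}
{code}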



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025421#comment-15025421
 ] 

koert kuipers commented on SPARK-11967:
---

i found the comment in my pullreq:
* calling the function load(paths: Array[String]) would be more consistent with 
the rest of the reader API. This precludes using varargs, but that is probably 
not the most common use of this function.

anyhow, i like varargs
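
For illustration, the difference at the call site (the json method and paths below are placeholders; the two calls assume the two alternative signatures shown in the comments, so only one would exist at a time):

{code}
// Varargs signature (sketch): def json(paths: String*): DataFrame
val a = sqlContext.read.json("/data/day1", "/data/day2")

// Array signature raised in the PR: def json(paths: Array[String]): DataFrame
val b = sqlContext.read.json(Array("/data/day1", "/data/day2"))
{code}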

> Use varargs for multiple paths in DataFrameReader
> -
>
> Key: SPARK-11967
> URL: https://issues.apache.org/jira/browse/SPARK-11967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11926) unify GetStructField and GetInternalRowField

2015-11-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11926:

Assignee: Wenchen Fan

> unify GetStructField and GetInternalRowField
> 
>
> Key: SPARK-11926
> URL: https://issues.apache.org/jira/browse/SPARK-11926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025424#comment-15025424
 ] 

koert kuipers commented on SPARK-11967:
---

agreed that varargs is easier. thanks

> Use varargs for multiple paths in DataFrameReader
> -
>
> Key: SPARK-11967
> URL: https://issues.apache.org/jira/browse/SPARK-11967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4944:
---

Assignee: (was: Apache Spark)

> Table Not Found exception in "Create Table Like registered RDD table"
> -
>
> Key: SPARK-4944
> URL: https://issues.apache.org/jira/browse/SPARK-4944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> {code}
> rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
> hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
> hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION 
> '/user/spark/my_data.parquet'")
> {code}
> {noformat}
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not 
> found rdd_table
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
> {noformat}
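
The error suggests that the registered table exists only in Spark SQL's catalog, while {{CREATE TABLE ... LIKE}} is executed by Hive's DDLTask, which cannot see it. A possible workaround, sketched below with made-up column names, is to declare the schema explicitly instead of using LIKE:

{code}
// Workaround sketch only: point an external table at the written Parquet files
// with an explicit schema, avoiding LIKE against the in-memory registration.
hiveContext.sql(
  """CREATE EXTERNAL TABLE my_data (key STRING, value INT)
    |STORED AS PARQUET
    |LOCATION '/user/spark/my_data.parquet'""".stripMargin)
{code}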



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4944) Table Not Found exception in "Create Table Like registered RDD table"

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025194#comment-15025194
 ] 

Apache Spark commented on SPARK-4944:
-

User 'dereksabryfb' has created a pull request for this issue:
https://github.com/apache/spark/pull/9944

> Table Not Found exception in "Create Table Like registered RDD table"
> -
>
> Key: SPARK-4944
> URL: https://issues.apache.org/jira/browse/SPARK-4944
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Cheng Hao
>
> {code}
> rdd_table.saveAsParquetFile("/user/spark/my_data.parquet")
> hiveContext.registerRDDAsTable(rdd_table, "rdd_table")
> hiveContext.sql("CREATE EXTERNAL TABLE my_data LIKE rdd_table LOCATION 
> '/user/spark/my_data.parquet'")
> {code}
> {noformat}
> org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution 
> Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Table not 
> found rdd_table
>   at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:322)
>   at 
> org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:284)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult$lzycompute(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.sideEffectResult(NativeCommand.scala:35)
>   at 
> org.apache.spark.sql.hive.execution.NativeCommand.execute(NativeCommand.scala:38)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd$lzycompute(HiveContext.scala:382)
>   at 
> org.apache.spark.sql.hive.HiveContext$QueryExecution.toRdd(HiveContext.scala:382)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-11967:
---

 Summary: Use varargs for multiple paths in DataFrameReader
 Key: SPARK-11967
 URL: https://issues.apache.org/jira/browse/SPARK-11967
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11967) Use varargs for multiple paths in DataFrameReader

2015-11-24 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025413#comment-15025413
 ] 

koert kuipers commented on SPARK-11967:
---

i think i had varargs originally, and then someone asked to change it to Array 
for API consistency?

> Use varargs for multiple paths in DataFrameReader
> -
>
> Key: SPARK-11967
> URL: https://issues.apache.org/jira/browse/SPARK-11967
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11929) spark-shell log level customization is lost if user provides a log4j.properties file

2015-11-24 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-11929:
-
Assignee: Marcelo Vanzin

> spark-shell log level customization is lost if user provides a 
> log4j.properties file
> 
>
> Key: SPARK-11929
> URL: https://issues.apache.org/jira/browse/SPARK-11929
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Shell
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 1.7.0
>
>
> {{Logging.scala}} has code that defines the default log level for the 
> spark-shell to WARN, to avoid lots of noise in the output.
> But if a user provides a log4j.properties file in the Spark configuration, 
> that customization is lost. That means that without a log4j.properties, there 
> are two different configurations (one for regular apps, one for the shell). 
> But if you have a custom file, you lose the ability to easily differentiate 
> between those two, and you're stuck with a single config for both.
> It would be nice to allow different configurations also in the second case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11982) Improve performance of CartesianProduct

2015-11-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11982:
--

 Summary: Improve performance of CartesianProduct
 Key: SPARK-11982
 URL: https://issues.apache.org/jira/browse/SPARK-11982
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


RDD.cartesian() is very slow; we should improve it or create a specialized 
version for SQL.
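
For reference, a minimal illustration of the call in question (assumes a live SparkContext {{sc}}; the sizes are arbitrary):

{code}
val left  = sc.parallelize(1 to 1000)
val right = sc.parallelize(1 to 1000)

// cartesian() produces |left| x |right| pairs, and every output partition has to
// pull a full partition from each parent, which is what makes it so expensive.
val pairs = left.cartesian(right)
println(pairs.count())   // 1000000
{code}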



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9141) DataFrame recomputed instead of using cached parent.

2015-11-24 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated SPARK-9141:
---
Comment: was deleted

(was: When you run this code, you can see that the UDF is called 20 times, not 
the expected 10 times.)

> DataFrame recomputed instead of using cached parent.
> 
>
> Key: SPARK-9141
> URL: https://issues.apache.org/jira/browse/SPARK-9141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Nick Pritchard
>Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: cache, dataframe
> Fix For: 1.5.0
>
>
> As I understand, DataFrame.cache() is supposed to work the same as 
> RDD.cache(), so that repeated operations on it will use the cached results 
> and not recompute the entire lineage. However, it seems that some DataFrame 
> operations (e.g. withColumn) change the underlying RDD lineage so that cache 
> doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDF's 
> that  use println so that it is easy to see when they are being called. Next, 
> I create a simple data frame with one row and two columns. Next, I add a 
> column, cache it, and call count() to force the computation. Lastly, I add 
> another column, cache it, and call count().
> I would have expected the last statement to only compute the last column, 
> since everything else was cached. However, because withColumn() changes the 
> lineage, the whole data frame is recomputed.
> {code}
> // Examples udf's that println when called 
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } 
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } 
> // Initial dataset 
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") 
> // Add column by applying twice udf 
> val df2 = df1.withColumn("twice", twice($"value")) 
> df2.cache() 
> df2.count() //prints Computed: twice(1) 
> // Add column by applying triple udf 
> val df3 = df2.withColumn("triple", triple($"value")) 
> df3.cache() 
> df3.count() //prints Computed: twice(1)\nComputed: triple(1) 
> {code}
> I found a workaround, which helped me understand what was going on behind the 
> scenes, but doesn't seem like an ideal solution. Basically, I convert to an 
> RDD and then back to a DataFrame, which seems to freeze the lineage. The code below shows 
> the workaround for creating the second data frame so cache will work as 
> expected.
> {code}
> val df2 = {
>   val tmp = df1.withColumn("twice", twice($"value"))
>   sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9141) DataFrame recomputed instead of using cached parent.

2015-11-24 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026105#comment-15026105
 ] 

Yi Tian edited comment on SPARK-9141 at 11/25/15 7:00 AM:
--

[~marmbrus] Here is my codes:

{code}
val rdd = sc.parallelize(1 to 10).map{line => new 
GenericRow(Array[Any]("a","b")).asInstanceOf[Row]}
val df = hc.createDataFrame(rdd, 
StructType(Seq(StructField("a",StringType),StructField("b",StringType))))
val mkArrayUDF = 
org.apache.spark.sql.functions.udf[Array[String],String,String] ((s1: String, 
s2: String) => {
println("udf called")
Array[String](s1, s2)
  })
val df2 = df.withColumn("arr",mkArrayUDF(df("a"),df("b")))
val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", df2("arr")(1))
df3.collect().foreach(println)
{code}


was (Author: tianyi):
[~marmbrus] Here is my codes:

{code:scala}
val rdd = sc.parallelize(1 to 10).map{line => new 
GenericRow(Array[Any]("a","b")).asInstanceOf[Row]}
val df = hc.createDataFrame(rdd, 
StructType(Seq(StructField("a",StringType),StructField("b",StringType))))
val mkArrayUDF = 
org.apache.spark.sql.functions.udf[Array[String],String,String] ((s1: String, 
s2: String) => {
println("udf called")
Array[String](s1, s2)
  })
val df2 = df.withColumn("arr",mkArrayUDF(df("a"),df("b")))
val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", df2("arr")(1))
df3.collect().foreach(println)
{code}

> DataFrame recomputed instead of using cached parent.
> 
>
> Key: SPARK-9141
> URL: https://issues.apache.org/jira/browse/SPARK-9141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Nick Pritchard
>Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: cache, dataframe
> Fix For: 1.5.0
>
>
> As I understand, DataFrame.cache() is supposed to work the same as 
> RDD.cache(), so that repeated operations on it will use the cached results 
> and not recompute the entire lineage. However, it seems that some DataFrame 
> operations (e.g. withColumn) change the underlying RDD lineage so that cache 
> doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDF's 
> that  use println so that it is easy to see when they are being called. Next, 
> I create a simple data frame with one row and two columns. Next, I add a 
> column, cache it, and call count() to force the computation. Lastly, I add 
> another column, cache it, and call count().
> I would have expected the last statement to only compute the last column, 
> since everything else was cached. However, because withColumn() changes the 
> lineage, the whole data frame is recomputed.
> {code}
> // Examples udf's that println when called 
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } 
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } 
> // Initial dataset 
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") 
> // Add column by applying twice udf 
> val df2 = df1.withColumn("twice", twice($"value")) 
> df2.cache() 
> df2.count() //prints Computed: twice(1) 
> // Add column by applying triple udf 
> val df3 = df2.withColumn("triple", triple($"value")) 
> df3.cache() 
> df3.count() //prints Computed: twice(1)\nComputed: triple(1) 
> {code}
> I found a workaround, which helped me understand what was going on behind the 
> scenes, but doesn't seem like an ideal solution. Basically, I convert to an 
> RDD and then back to a DataFrame, which seems to freeze the lineage. The code below shows 
> the workaround for creating the second data frame so cache will work as 
> expected.
> {code}
> val df2 = {
>   val tmp = df1.withColumn("twice", twice($"value"))
>   sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11961) User guide section for ChiSqSelector transformer

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11961:


Assignee: Xusen Yin  (was: Apache Spark)

> User guide section for ChiSqSelector transformer
> 
>
> Key: SPARK-11961
> URL: https://issues.apache.org/jira/browse/SPARK-11961
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> [~yinxusen] Assigning this to you since you added the feature.  Will you have 
> time to add a section?  Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11961) User guide section for ChiSqSelector transformer

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11961:


Assignee: Apache Spark  (was: Xusen Yin)

> User guide section for ChiSqSelector transformer
> 
>
> Key: SPARK-11961
> URL: https://issues.apache.org/jira/browse/SPARK-11961
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> [~yinxusen] Assigning this to you since you added the feature.  Will you have 
> time to add a section?  Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11961) User guide section for ChiSqSelector transformer

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026329#comment-15026329
 ] 

Apache Spark commented on SPARK-11961:
--

User 'yinxusen' has created a pull request for this issue:
https://github.com/apache/spark/pull/9965

> User guide section for ChiSqSelector transformer
> 
>
> Key: SPARK-11961
> URL: https://issues.apache.org/jira/browse/SPARK-11961
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Xusen Yin
>
> [~yinxusen] Assigning this to you since you added the feature.  Will you have 
> time to add a section?  Thank you!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11979) Empty TrackStateRDD cannot be checkpointed and recovered from checkpoint file

2015-11-24 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-11979.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

> Empty TrackStateRDD cannot be checkpointed and recovered from checkpoint file 
> --
>
> Key: SPARK-11979
> URL: https://issues.apache.org/jira/browse/SPARK-11979
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Critical
> Fix For: 1.6.0
>
>
> {code}
> Job aborted due to stage failure: Task 0 in stage 6.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 6.0 (TID 20, localhost): 
> java.lang.IllegalArgumentException: requirement failed: Invalid initial 
> capacity
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:96)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:86)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.readObject(StateMap.scala:291)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at 
> java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
>   at 
> org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:181)
>   at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:921)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:921)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:744)
> Driver stacktrace:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 6.0 failed 1 times, most recent failure: Lost task 0.0 in stage 6.0 
> (TID 20, localhost): java.lang.IllegalArgumentException: requirement failed: 
> Invalid initial capacity
>   at scala.Predef$.require(Predef.scala:233)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:96)
>   at 
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.<init>(StateMap.scala:86)
>   at 
> 

[jira] [Comment Edited] (SPARK-9141) DataFrame recomputed instead of using cached parent.

2015-11-24 Thread Yi Tian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026105#comment-15026105
 ] 

Yi Tian edited comment on SPARK-9141 at 11/25/15 7:24 AM:
--

[~marmbrus] Sorry, it's our fault.


was (Author: tianyi):
[~marmbrus] Here is my codes:

{code}
val rdd = sc.parallelize(1 to 10).map{line => new 
GenericRow(Array[Any]("a","b")).asInstanceOf[Row]}
val df = hc.createDataFrame(rdd, 
StructType(Seq(StructField("a",StringType),StructField("b",StringType))))
val mkArrayUDF = 
org.apache.spark.sql.functions.udf[Array[String],String,String] ((s1: String, 
s2: String) => {
println("udf called")
Array[String](s1, s2)
  })
val df2 = df.withColumn("arr",mkArrayUDF(df("a"),df("b")))
val df3 = df2.withColumn("e0", df2("arr")(0)).withColumn("e1", df2("arr")(1))
df3.collect().foreach(println)
{code}

> DataFrame recomputed instead of using cached parent.
> 
>
> Key: SPARK-9141
> URL: https://issues.apache.org/jira/browse/SPARK-9141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Nick Pritchard
>Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: cache, dataframe
> Fix For: 1.5.0
>
>
> As I understand, DataFrame.cache() is supposed to work the same as 
> RDD.cache(), so that repeated operations on it will use the cached results 
> and not recompute the entire lineage. However, it seems that some DataFrame 
> operations (e.g. withColumn) change the underlying RDD lineage so that cache 
> doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDF's 
> that  use println so that it is easy to see when they are being called. Next, 
> I create a simple data frame with one row and two columns. Next, I add a 
> column, cache it, and call count() to force the computation. Lastly, I add 
> another column, cache it, and call count().
> I would have expected the last statement to only compute the last column, 
> since everything else was cached. However, because withColumn() changes the 
> lineage, the whole data frame is recomputed.
> {code}
> // Examples udf's that println when called 
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } 
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } 
> // Initial dataset 
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") 
> // Add column by applying twice udf 
> val df2 = df1.withColumn("twice", twice($"value")) 
> df2.cache() 
> df2.count() //prints Computed: twice(1) 
> // Add column by applying triple udf 
> val df3 = df2.withColumn("triple", triple($"value")) 
> df3.cache() 
> df3.count() //prints Computed: twice(1)\nComputed: triple(1) 
> {code}
> I found a workaround, which helped me understand what was going on behind the 
> scenes, but doesn't seem like an ideal solution. Basically, I convert to an 
> RDD and then back to a DataFrame, which seems to freeze the lineage. The code below shows 
> the workaround for creating the second data frame so cache will work as 
> expected.
> {code}
> val df2 = {
>   val tmp = df1.withColumn("twice", twice($"value"))
>   sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11329) Expand Star when creating a struct

2015-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15026372#comment-15026372
 ] 

Maciej Bryński commented on SPARK-11329:


I'm using Spark 1.6.

I did some additional tests.

This select uses TungstenAggregate:
{code}
sqlCtx.sql('select id, max(data) as max from table group by id').collect()
{code}
When I add struct() it switches to the ConvertToSafe path,
so I think the problem lies in the struct() function.

> Expand Star when creating a struct
> --
>
> Key: SPARK-11329
> URL: https://issues.apache.org/jira/browse/SPARK-11329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Nong Li
> Fix For: 1.6.0
>
>
> It is pretty common for customers to do regular extractions of update data 
> from an external datasource (e.g. mysql or postgres). While this is possible 
> today, the syntax is a little onerous. With some small improvements to the 
> analyzer I think we could make this much easier.
> Goal: Allow users to execute the following two queries as well as their 
> dataframe equivalents
> to find the most recent record for each key
> {{SELECT max(struct(timestamp, *)) as mostRecentRecord GROUP BY key}}
> to unnest the struct from above.
> {{SELECT mostRecentRecord.* FROM data}}
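
A hedged sketch of what the DataFrame equivalents could look like once star expansion works inside {{struct}} (assumes a DataFrame {{data}} with {{key}} and {{timestamp}} columns):

{code}
import org.apache.spark.sql.functions._

// Most recent record per key; relies on struct() expanding "*" as described above.
val mostRecent = data
  .groupBy("key")
  .agg(max(struct(col("timestamp"), col("*"))).as("mostRecentRecord"))

// Unnest the struct back into top-level columns.
val flattened = mostRecent.select("mostRecentRecord.*")
{code}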



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11818) ExecutorClassLoader cannot see any resources from parent class loader

2015-11-24 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11818.

   Resolution: Fixed
 Assignee: Jungtaek Lim
Fix Version/s: 1.6.0

> ExecutorClassLoader cannot see any resources from parent class loader
> -
>
> Key: SPARK-11818
> URL: https://issues.apache.org/jira/browse/SPARK-11818
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.4.1
> Environment: CentOS 6, spark 1.4.1-hadoop2.4, mesos 0.22.1
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
> Fix For: 1.6.0
>
>
> This issue started with tracking down the root cause of a strange problem in 
> spark-shell (and Zeppelin) that does not occur with spark-submit.
> https://mail-archives.apache.org/mod_mbox/spark-user/201511.mbox/%3CCAF5108jMXyOjiGmCgr%3Ds%2BNvTMcyKWMBVM1GsrH7Pz4xUj48LfA%40mail.gmail.com%3E
> After some hours (over days) of digging into the details, I found that 
> ExecutorClassLoader cannot see any resource files that are visible from the 
> parent class loader.
> ExecutorClassLoader itself doesn't need to look up resource files, because the 
> REPL doesn't generate any, but it should delegate the lookup to the parent 
> class loader.
> I'll soon provide a pull request that includes tests which fail on master.
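
A bare-bones sketch of the delegation being described (illustrative only, not the actual Spark patch): whatever the REPL class loader does for classes, resource lookups should simply be forwarded to the parent, since the REPL never generates resource files.

{code}
import java.net.URL

class ReplClassLoader(parent: ClassLoader) extends ClassLoader(parent) {
  // Classes generated by the REPL would be resolved elsewhere; resources are
  // always delegated to the parent loader.
  override def getResource(name: String): URL =
    parent.getResource(name)

  override def getResources(name: String): java.util.Enumeration[URL] =
    parent.getResources(name)
}
{code}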



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024893#comment-15024893
 ] 

Apache Spark commented on SPARK-11955:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9940

> Mark one side fields in merging schema for safely pushdowning filters in 
> parquet
> 
>
> Key: SPARK-11955
> URL: https://issues.apache.org/jira/browse/SPARK-11955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently we simply skip pushing filters down to Parquet if schema merging is 
> enabled.
> However, we can actually mark the fields that exist on only one side of the 
> merged schema so that filters can still be pushed down to Parquet safely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet

2015-11-24 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-11955:
---

 Summary: Mark one side fields in merging schema for safely 
pushdowning filters in parquet
 Key: SPARK-11955
 URL: https://issues.apache.org/jira/browse/SPARK-11955
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh


Currently we simply skip pushing filters down to Parquet if schema merging is 
enabled.

However, we can actually mark the fields that exist on only one side of the 
merged schema so that filters can still be pushed down to Parquet safely.
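
A small illustration of the situation (paths and column names are made up; assumes a SQLContext named {{sqlContext}}): with schema merging on, a filter may reference a column that exists in only some of the Parquet files, which is why pushdown is currently disabled wholesale.

{code}
// Two parts of the same table written with different schemas.
sqlContext.range(0, 10).selectExpr("id", "id AS a").write.parquet("/tmp/t/part=1")
sqlContext.range(0, 10).selectExpr("id", "id AS b").write.parquet("/tmp/t/part=2")

// Schema merging exposes both `a` and `b`, but `b` is physically present only in
// part=2; a filter on `b` is only safe to push down to the files that contain it.
val merged = sqlContext.read.option("mergeSchema", "true").parquet("/tmp/t")
merged.filter("b = 3").show()
{code}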



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11942) fix encoder life cycle for CoGroup

2015-11-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-11942:
-
Assignee: Wenchen Fan

> fix encoder life cycle for CoGroup
> --
>
> Key: SPARK-11942
> URL: https://issues.apache.org/jira/browse/SPARK-11942
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11955:


Assignee: (was: Apache Spark)

> Mark one side fields in merging schema for safely pushdowning filters in 
> parquet
> 
>
> Key: SPARK-11955
> URL: https://issues.apache.org/jira/browse/SPARK-11955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>
> Currently we simply skip pushing filters down to Parquet if schema merging is 
> enabled.
> However, we can actually mark the fields that exist on only one side of the 
> merged schema so that filters can still be pushed down to Parquet safely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11955) Mark one side fields in merging schema for safely pushdowning filters in parquet

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11955:


Assignee: Apache Spark

> Mark one side fields in merging schema for safely pushdowning filters in 
> parquet
> 
>
> Key: SPARK-11955
> URL: https://issues.apache.org/jira/browse/SPARK-11955
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> Currently we simply skip pushing filters down to Parquet if schema merging is 
> enabled.
> However, we can actually mark the fields that exist on only one side of the 
> merged schema so that filters can still be pushed down to Parquet safely.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11942) fix encoder life cycle for CoGroup

2015-11-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11942.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9928
[https://github.com/apache/spark/pull/9928]

> fix encoder life cycle for CoGroup
> --
>
> Key: SPARK-11942
> URL: https://issues.apache.org/jira/browse/SPARK-11942
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11956) Test failures potentially related to SPARK-11140

2015-11-24 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-11956:
--

 Summary: Test failures potentially related to SPARK-11140
 Key: SPARK-11956
 URL: https://issues.apache.org/jira/browse/SPARK-11956
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.7.0
Reporter: Marcelo Vanzin


[~joshrosen] pointed out that some YARN tests started failing intermittently 
after that change went in. Here's a suspicious excerpt from one of the logs on 
Jenkins:

{noformat}
15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for 
/jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256
15/11/24 08:58:18 INFO Utils: Fetching 
spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to 
/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp
15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown 
(amp-jenkins-worker-07.amp:53256) driver disconnected.
15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream 
/jars/sparkJar2657865636759819960.tmp.
java.nio.channels.ClosedChannelException
at 
org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
at 
io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
at 
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:745)
15/11/24 09:00:00 INFO Utils: Copying 
/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache
 to 
/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
15/11/24 09:00:00 INFO Executor: Adding 
file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
 to class loader
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9141) DataFrame recomputed instead of using cached parent.

2015-11-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024927#comment-15024927
 ] 

Michael Armbrust commented on SPARK-9141:
-

[~tianyi] please provide a reproduction of the issue you are hitting.  The 
example from the description works for me.  In particular please include 
explain for the cache and failing dataframe.

> DataFrame recomputed instead of using cached parent.
> 
>
> Key: SPARK-9141
> URL: https://issues.apache.org/jira/browse/SPARK-9141
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Nick Pritchard
>Assignee: Michael Armbrust
>Priority: Blocker
>  Labels: cache, dataframe
> Fix For: 1.5.0
>
>
> As I understand, DataFrame.cache() is supposed to work the same as 
> RDD.cache(), so that repeated operations on it will use the cached results 
> and not recompute the entire lineage. However, it seems that some DataFrame 
> operations (e.g. withColumn) change the underlying RDD lineage so that cache 
> doesn't work as expected.
> Below is a Scala example that demonstrates this. First, I define two UDF's 
> that  use println so that it is easy to see when they are being called. Next, 
> I create a simple data frame with one row and two columns. Next, I add a 
> column, cache it, and call count() to force the computation. Lastly, I add 
> another column, cache it, and call count().
> I would have expected the last statement to only compute the last column, 
> since everything else was cached. However, because withColumn() changes the 
> lineage, the whole data frame is recomputed.
> {code}
> // Examples udf's that println when called 
> val twice = udf { (x: Int) => println(s"Computed: twice($x)"); x * 2 } 
> val triple = udf { (x: Int) => println(s"Computed: triple($x)"); x * 3 } 
> // Initial dataset 
> val df1 = sc.parallelize(Seq(("a", 1))).toDF("name", "value") 
> // Add column by applying twice udf 
> val df2 = df1.withColumn("twice", twice($"value")) 
> df2.cache() 
> df2.count() //prints Computed: twice(1) 
> // Add column by applying triple udf 
> val df3 = df2.withColumn("triple", triple($"value")) 
> df3.cache() 
> df3.count() //prints Computed: twice(1)\nComputed: triple(1) 
> {code}
> I found a workaround, which helped me understand what was going on behind the 
> scenes, but doesn't seem like an ideal solution. Basically, I convert to an 
> RDD and then back to a DataFrame, which seems to freeze the lineage. The code below shows 
> the workaround for creating the second data frame so cache will work as 
> expected.
> {code}
> val df2 = {
>   val tmp = df1.withColumn("twice", twice($"value"))
>   sqlContext.createDataFrame(tmp.rdd, tmp.schema)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11952) Remove duplicate ml examples

2015-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11952.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9933
[https://github.com/apache/spark/pull/9933]

> Remove duplicate ml examples
> 
>
> Key: SPARK-11952
> URL: https://issues.apache.org/jira/browse/SPARK-11952
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, ML
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Remove duplicate ml examples (only for ML)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11952) Remove duplicate ml examples (GBT/RF/logistic regression in Python)

2015-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11952:
--
Assignee: Yanbo Liang

> Remove duplicate ml examples (GBT/RF/logistic regression in Python)
> ---
>
> Key: SPARK-11952
> URL: https://issues.apache.org/jira/browse/SPARK-11952
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Remove duplicate ml examples (only for ML)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11521) LinearRegressionSummary needs to clarify which metrics are weighted in the documentation

2015-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11521.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9927
[https://github.com/apache/spark/pull/9927]

> LinearRegressionSummary needs to clarify which metrics are weighted in the 
> documentation
> 
>
> Key: SPARK-11521
> URL: https://issues.apache.org/jira/browse/SPARK-11521
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 1.6.0
>
>
> Some metrics in the summary are weighted (e.g., devianceResiduals), but the 
> ones computed via RegressionMetrics are not.  This should be documented very 
> clearly (unless this gets fixed before the next release in [SPARK-11520]).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11952) Remove duplicate ml examples (GBT/RF/logistic regression in Python)

2015-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-11952:
--
Summary: Remove duplicate ml examples (GBT/RF/logistic regression in 
Python)  (was: Remove duplicate ml examples)

> Remove duplicate ml examples (GBT/RF/logistic regression in Python)
> ---
>
> Key: SPARK-11952
> URL: https://issues.apache.org/jira/browse/SPARK-11952
> Project: Spark
>  Issue Type: Sub-task
>  Components: Examples, ML
>Reporter: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> Remove duplicate ml examples (only for ML)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11730) Feature Importance for GBT

2015-11-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025084#comment-15025084
 ] 

Joseph K. Bradley commented on SPARK-11730:
---

OK, following Friedman sounds good.  : )
I agree it'd be nice to wait for GBT to be moved to spark.ml.

> Feature Importance for GBT
> --
>
> Key: SPARK-11730
> URL: https://issues.apache.org/jira/browse/SPARK-11730
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Reporter: Brian Webb
>
> Random Forests have feature importance, but GBTs do not. It would be great if 
> we could add feature importance to GBTs as well. Perhaps the code in Random 
> Forests can be refactored to apply to both types of ensembles.
> See https://issues.apache.org/jira/browse/SPARK-5133



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA

2015-11-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10022:
--
Assignee: Yanbo Liang

> Scala-Python method/parameter inconsistency check for ML during 1.5 QA
> --
>
> Key: SPARK-10022
> URL: https://issues.apache.org/jira/browse/SPARK-10022
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
> Attachments: python-1.4.txt, python-1.5.txt, python1.4-to-1.5.diff
>
>
> The missing classes for PySpark were listed at SPARK-9663.
> Here we check and list the missing method/parameter for ML of PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11604) ML 1.6 QA: API: Python API coverage

2015-11-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025086#comment-15025086
 ] 

Joseph K. Bradley commented on SPARK-11604:
---

Great thank you!

> ML 1.6 QA: API: Python API coverage
> ---
>
> Key: SPARK-11604
> URL: https://issues.apache.org/jira/browse/SPARK-11604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below) for this list of to-do items.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10022) Scala-Python method/parameter inconsistency check for ML during 1.5 QA

2015-11-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-10022.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> Scala-Python method/parameter inconsistency check for ML during 1.5 QA
> --
>
> Key: SPARK-10022
> URL: https://issues.apache.org/jira/browse/SPARK-10022
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.6.0
>
> Attachments: python-1.4.txt, python-1.5.txt, python1.4-to-1.5.diff
>
>
> The missing classes for PySpark were listed at SPARK-9663.
> Here we check and list the missing method/parameter for ML of PySpark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11884) Drop multiple columns in the DataFrame API

2015-11-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024837#comment-15024837
 ] 

Ted Yu commented on SPARK-11884:


Is there interest in moving forward with the PR?

> Drop multiple columns in the DataFrame API
> --
>
> Key: SPARK-11884
> URL: https://issues.apache.org/jira/browse/SPARK-11884
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Ted Yu
>Priority: Minor
>
> See the thread Ben started:
> http://search-hadoop.com/m/q3RTtveEuhjsr7g/
> This issue adds a drop() method to DataFrame that accepts multiple column names.
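
For illustration, a minimal sketch of what dropping several columns at once could look like as a helper function today; the name dropColumns and its behavior are hypothetical and are not the API proposed in the PR:

{code}
import org.apache.spark.sql.DataFrame

// Hypothetical helper: drop several columns at once by selecting the rest.
// Assumes at least one column remains after dropping.
def dropColumns(df: DataFrame, colNames: String*): DataFrame = {
  val remaining = df.columns.filterNot(c => colNames.contains(c))
  df.select(remaining.head, remaining.tail: _*)
}
{code}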



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11602) ML 1.6 QA: API: New Scala APIs, docs

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11602:


Assignee: yuhao yang  (was: Apache Spark)

> ML 1.6 QA: API: New Scala APIs, docs
> 
>
> Key: SPARK-11602
> URL: https://issues.apache.org/jira/browse/SPARK-11602
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Audit new public Scala APIs added to MLlib.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please comment here, or better yet create JIRAs and link 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11602) ML 1.6 QA: API: New Scala APIs, docs

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024838#comment-15024838
 ] 

Apache Spark commented on SPARK-11602:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/9939

> ML 1.6 QA: API: New Scala APIs, docs
> 
>
> Key: SPARK-11602
> URL: https://issues.apache.org/jira/browse/SPARK-11602
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>
> Audit new public Scala APIs added to MLlib.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please comment here, or better yet create JIRAs and link 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11855) Catalyst breaks backwards compatibility in branch-1.6

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11855:


Assignee: (was: Apache Spark)

> Catalyst breaks backwards compatibility in branch-1.6
> -
>
> Key: SPARK-11855
> URL: https://issues.apache.org/jira/browse/SPARK-11855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Santiago M. Mola
>Priority: Critical
>
> There are a number of APIs broken in Catalyst 1.6.0. I'm trying to compile most 
> of the cases here:
> *UnresolvedRelation*'s constructor has been changed from taking a Seq to a 
> TableIdentifier. A deprecated constructor taking a Seq would be needed to stay 
> backwards compatible.
> {code}
>  case class UnresolvedRelation(
> -tableIdentifier: Seq[String],
> +tableIdentifier: TableIdentifier,
>  alias: Option[String] = None) extends LeafNode {
> {code}
> The situation is similar for *UnresolvedStar*:
> {code}
> -case class UnresolvedStar(table: Option[String]) extends Star with 
> Unevaluable {
> +case class UnresolvedStar(target: Option[Seq[String]]) extends Star with 
> Unevaluable {
> {code}
> *Catalog* also had many of its signatures changed (because of 
> TableIdentifier). Providing the older methods as deprecated seems viable here 
> as well.
> Spark 1.5 already broke backwards compatibility of part of the Catalyst API 
> with respect to 1.4. I understand there are good reasons for some cases, but 
> we should try to minimize backwards compatibility breakages for 1.x, 
> especially now that 2.x is on the horizon and there will soon be an 
> opportunity to remove deprecated code.
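
For illustration only, one shape a source-compatibility shim could take, based on the signatures quoted above; this is a hypothetical sketch, not the change made in the linked PR, and it assumes a one- or two-part table identifier:

{code}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation

object CatalystCompat {
  // Hypothetical bridge from the old Seq[String] form to the new
  // TableIdentifier-based constructor.
  @deprecated("Use the TableIdentifier constructor instead", "1.6.0")
  def unresolvedRelation(
      tableIdentifier: Seq[String],
      alias: Option[String] = None): UnresolvedRelation = {
    val tableId = tableIdentifier match {
      case Seq(table) => TableIdentifier(table, None)
      case Seq(db, table) => TableIdentifier(table, Some(db))
      case other => sys.error(s"Unexpected table identifier: $other")
    }
    UnresolvedRelation(tableId, alias)
  }
}
{code}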



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11956) Test failures potentially related to SPARK-11140

2015-11-24 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024974#comment-15024974
 ] 

Marcelo Vanzin commented on SPARK-11956:


This seems to be an issue I identified as part of working on SPARK-11563; there 
are fixes for it as part of that PR. I'll try to pull them out so we can unblock 
tests without having to push everything.

> Test failures potentially related to SPARK-11140
> 
>
> Key: SPARK-11956
> URL: https://issues.apache.org/jira/browse/SPARK-11956
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.7.0
>Reporter: Marcelo Vanzin
>
> [~joshrosen] pointed out that some YARN tests started failing intermittently 
> after that change went in. Here's a suspicious excerpt from one of the logs 
> on Jenkins:
> {noformat}
> 15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for 
> /jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256
> 15/11/24 08:58:18 INFO Utils: Fetching 
> spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp
> 15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown 
> (amp-jenkins-worker-07.amp:53256) driver disconnected.
> 15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream 
> /jars/sparkJar2657865636759819960.tmp.
> java.nio.channels.ClosedChannelException
> at 
> org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 15/11/24 09:00:00 INFO Utils: Copying 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache
>  to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
> 15/11/24 09:00:00 INFO Executor: Adding 
> file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
>  to class loader
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11953) CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug

2015-11-24 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024991#comment-15024991
 ] 

Huaxin Gao commented on SPARK-11953:


My understanding is that SaveMode.Append doesn't mean the table exists for sure. 
When SaveMode is Append, if the table exists, we append to the existing table; if 
the table doesn't exist, we create the table and then append to the newly created 
table. So the code looks right to me. 
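
For reference, a minimal usage sketch of the semantics described above; the JDBC URL, table name, and credentials are placeholders, and df is assumed to be an existing DataFrame:

{code}
import java.util.Properties
import org.apache.spark.sql.SaveMode

val props = new Properties()
props.setProperty("user", "username")       // placeholder
props.setProperty("password", "password")   // placeholder

// With SaveMode.Append: append if the table already exists,
// otherwise create the table first and then append.
df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://host:5432/db", "my_table", props)
{code}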

> CLONE - Sparksql-1.4.1 DataFrameWrite.jdbc() SaveMode.Append Bug
> 
>
> Key: SPARK-11953
> URL: https://issues.apache.org/jira/browse/SPARK-11953
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, Spark Submit, SQL
>Affects Versions: 1.4.1, 1.5.1
> Environment: Spark stand alone cluster
>Reporter: Siva Gudavalli
>
> In Spark 1.3.1 we have two methods, i.e. CreateJdbcTable and InsertIntoJdbc.
> They were replaced with write.jdbc() in Spark 1.4.1.
> When we specify SaveMode.Append we are letting the application know that there 
> is a table in the database, which means "tableExists = true", so we should not 
> need to perform "JdbcUtils.tableExists(conn, table)".
> Please let me know if you think differently.
> Regards
> Shiv
> {code}
> def jdbc(url: String, table: String, connectionProperties: Properties): Unit = {
>   val conn = JdbcUtils.createConnection(url, connectionProperties)
>   try {
>     var tableExists = JdbcUtils.tableExists(conn, table)
>     if (mode == SaveMode.Ignore && tableExists) {
>       return
>     }
>     if (mode == SaveMode.ErrorIfExists && tableExists) {
>       sys.error(s"Table $table already exists.")
>     }
>     if (mode == SaveMode.Overwrite && tableExists) {
>       JdbcUtils.dropTable(conn, table)
>       tableExists = false
>     }
>     // Create the table if the table didn't exist.
>     if (!tableExists) {
>       val schema = JDBCWriteDetails.schemaString(df, url)
>       val sql = s"CREATE TABLE $table ($schema)"
>       conn.prepareStatement(sql).executeUpdate()
>     }
>   } finally {
>     conn.close()
>   }
>   JDBCWriteDetails.saveTable(df, url, table, connectionProperties)
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11847) Model export/import for spark.ml: LDA

2015-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11847.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9894
[https://github.com/apache/spark/pull/9894]

> Model export/import for spark.ml: LDA
> -
>
> Key: SPARK-11847
> URL: https://issues.apache.org/jira/browse/SPARK-11847
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
> Fix For: 1.6.0
>
>
> Add read/write support to LDA, similar to ALS.
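
A sketch of the expected usage, assuming the new LDA read/write support follows the same MLWritable/MLReadable pattern as ALS; the paths, parameters, and dataset below are placeholders:

{code}
import org.apache.spark.ml.clustering.{LDA, LocalLDAModel}

val lda = new LDA().setK(10).setMaxIter(20)
// Assuming `dataset` is a DataFrame with a "features" column:
// val model = lda.fit(dataset)
// model.save("/tmp/lda-model")                          // persist the fitted model
// val restored = LocalLDAModel.load("/tmp/lda-model")   // read it back
{code}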



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11956) Test failures potentially related to SPARK-11140

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025006#comment-15025006
 ] 

Apache Spark commented on SPARK-11956:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/9941

> Test failures potentially related to SPARK-11140
> 
>
> Key: SPARK-11956
> URL: https://issues.apache.org/jira/browse/SPARK-11956
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.7.0
>Reporter: Marcelo Vanzin
>
> [~joshrosen] pointed out that some YARN tests started failing intermittently 
> after that change went in. Here's a suspicious excerpt from one of the logs 
> on Jenkins:
> {noformat}
> 15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for 
> /jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256
> 15/11/24 08:58:18 INFO Utils: Fetching 
> spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp
> 15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown 
> (amp-jenkins-worker-07.amp:53256) driver disconnected.
> 15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream 
> /jars/sparkJar2657865636759819960.tmp.
> java.nio.channels.ClosedChannelException
> at 
> org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 15/11/24 09:00:00 INFO Utils: Copying 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache
>  to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
> 15/11/24 09:00:00 INFO Executor: Adding 
> file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
>  to class loader
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11956) Test failures potentially related to SPARK-11140

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11956:


Assignee: (was: Apache Spark)

> Test failures potentially related to SPARK-11140
> 
>
> Key: SPARK-11956
> URL: https://issues.apache.org/jira/browse/SPARK-11956
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.7.0
>Reporter: Marcelo Vanzin
>
> [~joshrosen] pointed out that some YARN tests started failing intermittently 
> after that change went in. Here's a suspicious excerpt from one of the logs 
> on Jenkins:
> {noformat}
> 15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for 
> /jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256
> 15/11/24 08:58:18 INFO Utils: Fetching 
> spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp
> 15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown 
> (amp-jenkins-worker-07.amp:53256) driver disconnected.
> 15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream 
> /jars/sparkJar2657865636759819960.tmp.
> java.nio.channels.ClosedChannelException
> at 
> org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 15/11/24 09:00:00 INFO Utils: Copying 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache
>  to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
> 15/11/24 09:00:00 INFO Executor: Adding 
> file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
>  to class loader
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11956) Test failures potentially related to SPARK-11140

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11956:


Assignee: Apache Spark

> Test failures potentially related to SPARK-11140
> 
>
> Key: SPARK-11956
> URL: https://issues.apache.org/jira/browse/SPARK-11956
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.7.0
>Reporter: Marcelo Vanzin
>Assignee: Apache Spark
>
> [~joshrosen] pointed out that some YARN tests started failing intermittently 
> after that change went in. Here's a suspicious excerpt from one of the logs 
> on Jenkins:
> {noformat}
> 15/11/24 08:58:18 DEBUG TransportClient: Sending stream request for 
> /jars/sparkJar2657865636759819960.tmp to /192.168.10.27:53256
> 15/11/24 08:58:18 INFO Utils: Fetching 
> spark://192.168.10.27:53256/jars/sparkJar2657865636759819960.tmp to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/fetchFileTemp4632692089398180695.tmp
> 15/11/24 09:00:00 WARN CoarseGrainedExecutorBackend: An unknown 
> (amp-jenkins-worker-07.amp:53256) driver disconnected.
> 15/11/24 09:00:00 ERROR NettyRpcEnv: Error downloading stream 
> /jars/sparkJar2657865636759819960.tmp.
> java.nio.channels.ClosedChannelException
> at 
> org.apache.spark.network.client.StreamInterceptor.channelInactive(StreamInterceptor.java:61)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelInactive(TransportFrameDecoder.java:123)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:208)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:194)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:828)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe$7.run(AbstractChannel.java:621)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:357)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 15/11/24 09:00:00 INFO Utils: Copying 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/spark-466b6740-bbb5-40ed-8abb-b084a6150115/-4120705061448384283661_cache
>  to 
> /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
> 15/11/24 09:00:00 INFO Executor: Adding 
> file:/home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/spark-test/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/usercache/jenkins/appcache/application_1448384095183_0006/container_1448384095183_0006_01_02/./sparkJar2657865636759819960.tmp
>  to class loader
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11855) Catalyst breaks backwards compatibility in branch-1.6

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15024832#comment-15024832
 ] 

Apache Spark commented on SPARK-11855:
--

User 'smola' has created a pull request for this issue:
https://github.com/apache/spark/pull/9938

> Catalyst breaks backwards compatibility in branch-1.6
> -
>
> Key: SPARK-11855
> URL: https://issues.apache.org/jira/browse/SPARK-11855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Santiago M. Mola
>Priority: Critical
>
> There are a number of APIs broken in Catalyst 1.6.0. I'm trying to compile most 
> of the cases here:
> *UnresolvedRelation*'s constructor has been changed from taking a Seq to a 
> TableIdentifier. A deprecated constructor taking a Seq would be needed to stay 
> backwards compatible.
> {code}
>  case class UnresolvedRelation(
> -tableIdentifier: Seq[String],
> +tableIdentifier: TableIdentifier,
>  alias: Option[String] = None) extends LeafNode {
> {code}
> The situation is similar for *UnresolvedStar*:
> {code}
> -case class UnresolvedStar(table: Option[String]) extends Star with 
> Unevaluable {
> +case class UnresolvedStar(target: Option[Seq[String]]) extends Star with 
> Unevaluable {
> {code}
> *Catalog* also had many of its signatures changed (because of 
> TableIdentifier). Providing the older methods as deprecated seems viable here 
> as well.
> Spark 1.5 already broke backwards compatibility of part of the Catalyst API 
> with respect to 1.4. I understand there are good reasons for some cases, but 
> we should try to minimize backwards compatibility breakages for 1.x, 
> especially now that 2.x is on the horizon and there will soon be an 
> opportunity to remove deprecated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11602) ML 1.6 QA: API: New Scala APIs, docs

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11602:


Assignee: Apache Spark  (was: yuhao yang)

> ML 1.6 QA: API: New Scala APIs, docs
> 
>
> Key: SPARK-11602
> URL: https://issues.apache.org/jira/browse/SPARK-11602
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> Audit new public Scala APIs added to MLlib.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please comment here, or better yet create JIRAs and link 
> them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11855) Catalyst breaks backwards compatibility in branch-1.6

2015-11-24 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11855:


Assignee: Apache Spark

> Catalyst breaks backwards compatibility in branch-1.6
> -
>
> Key: SPARK-11855
> URL: https://issues.apache.org/jira/browse/SPARK-11855
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Santiago M. Mola
>Assignee: Apache Spark
>Priority: Critical
>
> There are a number of APIs broken in Catalyst 1.6.0. I'm trying to compile most 
> of the cases here:
> *UnresolvedRelation*'s constructor has been changed from taking a Seq to a 
> TableIdentifier. A deprecated constructor taking a Seq would be needed to stay 
> backwards compatible.
> {code}
>  case class UnresolvedRelation(
> -tableIdentifier: Seq[String],
> +tableIdentifier: TableIdentifier,
>  alias: Option[String] = None) extends LeafNode {
> {code}
> The situation is similar for *UnresolvedStar*:
> {code}
> -case class UnresolvedStar(table: Option[String]) extends Star with 
> Unevaluable {
> +case class UnresolvedStar(target: Option[Seq[String]]) extends Star with 
> Unevaluable {
> {code}
> *Catalog* also had many of its signatures changed (because of 
> TableIdentifier). Providing the older methods as deprecated seems viable here 
> as well.
> Spark 1.5 already broke backwards compatibility of part of the Catalyst API 
> with respect to 1.4. I understand there are good reasons for some cases, but 
> we should try to minimize backwards compatibility breakages for 1.x, 
> especially now that 2.x is on the horizon and there will soon be an 
> opportunity to remove deprecated code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9328) Netty IO layer should implement read timeouts

2015-11-24 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025107#comment-15025107
 ] 

Michael Armbrust commented on SPARK-9328:
-

[~joshrosen] is this actually a 1.6 blocker?

> Netty IO layer should implement read timeouts
> -
>
> Key: SPARK-9328
> URL: https://issues.apache.org/jira/browse/SPARK-9328
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.2.1, 1.3.1, 1.4.1, 1.5.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> Spark's network layer does not implement read timeouts which may lead to 
> stalls during shuffle: if a remote shuffle server stalls while responding to 
> a shuffle block fetch request but does not close the socket then the job may 
> block until an OS-level socket timeout occurs.
> I think that we can fix this using Netty's ReadTimeoutHandler 
> (http://stackoverflow.com/questions/13390363/netty-connecttimeoutmillis-vs-readtimeouthandler).
>   The tricky part of working on this will be figuring out the right place to 
> add the handler and ensuring that we don't introduce performance issues by 
> not re-using sockets.
> Quoting from that linked StackOverflow question:
> {quote}
> Note that the ReadTimeoutHandler is also unaware of whether you have sent a 
> request - it only cares whether data has been read from the socket. If your 
> connection is persistent, and you only want read timeouts to fire when a 
> request has been sent, you'll need to build a request / response aware 
> timeout handler.
> {quote}
> If we want to avoid tearing down connections between shuffles then we may 
> have to do something like this.
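
For illustration, a generic sketch of the Netty pattern being discussed, not Spark's actual transport code; the handler names and the 120-second timeout are arbitrary:

{code}
import io.netty.channel.{ChannelHandlerContext, ChannelInboundHandlerAdapter, ChannelInitializer}
import io.netty.channel.socket.SocketChannel
import io.netty.handler.timeout.{ReadTimeoutException, ReadTimeoutHandler}

class TimeoutAwareInitializer extends ChannelInitializer[SocketChannel] {
  override def initChannel(ch: SocketChannel): Unit = {
    ch.pipeline()
      // Fires a ReadTimeoutException if nothing is read from the channel for 120 seconds.
      .addLast("readTimeout", new ReadTimeoutHandler(120))
      .addLast("handler", new ChannelInboundHandlerAdapter {
        override def exceptionCaught(ctx: ChannelHandlerContext, cause: Throwable): Unit =
          cause match {
            case _: ReadTimeoutException => ctx.close() // fail fast instead of stalling
            case _ => ctx.fireExceptionCaught(cause)
          }
      })
  }
}
{code}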



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11382) Replace example code in mllib-decision-tree.md using include_example

2015-11-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15025109#comment-15025109
 ] 

Apache Spark commented on SPARK-11382:
--

User 'nongli' has created a pull request for this issue:
https://github.com/apache/spark/pull/9942

> Replace example code in mllib-decision-tree.md using include_example
> 
>
> Key: SPARK-11382
> URL: https://issues.apache.org/jira/browse/SPARK-11382
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Reporter: Xusen Yin
>Assignee: Xusen Yin
>  Labels: starter
> Fix For: 1.6.0
>
>
> This is similar to SPARK-11289 but for the example code in 
> mllib-decision-tree.md.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11957) SQLTransformer docs are unclear about generality of SQL statements

2015-11-24 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-11957:
-

 Summary: SQLTransformer docs are unclear about generality of SQL 
statements
 Key: SPARK-11957
 URL: https://issues.apache.org/jira/browse/SPARK-11957
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Reporter: Joseph K. Bradley
Priority: Minor


See discussion here for context [SPARK-11234].  The Scala doc needs to be 
clearer about what SQL statements are supported.
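
For context, a minimal SQLTransformer usage sketch; the statement shown is just one example, and exactly which class of statements is supported is what the doc should spell out (the column names v1 and v2 are placeholders):

{code}
import org.apache.spark.ml.feature.SQLTransformer

// __THIS__ stands for the temporary table backed by the input DataFrame.
val sqlTrans = new SQLTransformer()
  .setStatement("SELECT *, (v1 + v2) AS v3 FROM __THIS__")

// Assuming `df` has numeric columns v1 and v2:
// val transformed = sqlTrans.transform(df)
{code}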



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


