[jira] [Updated] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16144:
--
Assignee: Xin Ren

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xin Ren
>
> After we grouped the generic methods by algorithm, it would be nice to add a 
> separate Rd for each ML generic method, in particular write.ml, read.ml, 
> summary, and predict, and link the implementations with seealso.






[jira] [Comment Edited] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-27 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352409#comment-15352409
 ] 

Xiangrui Meng edited comment on SPARK-16144 at 6/28/16 5:57 AM:


Please hold because this -should- must be combined with SPARK-16140.


was (Author: mengxr):
Please hold because this should be combined with SPARK-16140.

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> After we grouped the generic methods by algorithm, it would be nice to add a 
> separate Rd for each ML generic method, in particular write.ml, read.ml, 
> summary, and predict, and link the implementations with seealso.






[jira] [Updated] (SPARK-16248) Whitelist the list of Hive fallback functions

2016-06-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16248:

Description: 
This patch removes the blind fallback into Hive for functions. Instead, it 
creates a whitelist and adds only a small number of functions to the whitelist, 
i.e. the ones we intend to support in the long run in Spark. 


> Whitelist the list of Hive fallback functions
> -
>
> Key: SPARK-16248
> URL: https://issues.apache.org/jira/browse/SPARK-16248
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This patch removes the blind fallback into Hive for functions. Instead, it 
> creates a whitelist and adds only a small number of functions to the 
> whitelist, i.e. the ones we intend to support in the long run in Spark. 






[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-27 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352409#comment-15352409
 ] 

Xiangrui Meng commented on SPARK-16144:
---

Please hold because this should be combined with SPARK-16140.

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> After we grouped the generic methods by algorithm, it would be nice to add a 
> separate Rd for each ML generic method, in particular write.ml, read.ml, 
> summary, and predict, and link the implementations with seealso.






[jira] [Assigned] (SPARK-16248) Whitelist the list of Hive fallback functions

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16248:


Assignee: Apache Spark  (was: Reynold Xin)

> Whitelist the list of Hive fallback functions
> -
>
> Key: SPARK-16248
> URL: https://issues.apache.org/jira/browse/SPARK-16248
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-16248) Whitelist the list of Hive fallback functions

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352405#comment-15352405
 ] 

Apache Spark commented on SPARK-16248:
--

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/13939

> Whitelist the list of Hive fallback functions
> -
>
> Key: SPARK-16248
> URL: https://issues.apache.org/jira/browse/SPARK-16248
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Assigned] (SPARK-16248) Whitelist the list of Hive fallback functions

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16248:


Assignee: Reynold Xin  (was: Apache Spark)

> Whitelist the list of Hive fallback functions
> -
>
> Key: SPARK-16248
> URL: https://issues.apache.org/jira/browse/SPARK-16248
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>







[jira] [Created] (SPARK-16248) Whitelist the list of Hive fallback functions

2016-06-27 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-16248:
---

 Summary: Whitelist the list of Hive fallback functions
 Key: SPARK-16248
 URL: https://issues.apache.org/jira/browse/SPARK-16248
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin









[jira] [Commented] (SPARK-15863) Update SQL programming guide for Spark 2.0

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352390#comment-15352390
 ] 

Apache Spark commented on SPARK-15863:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/13938

> Update SQL programming guide for Spark 2.0
> --
>
> Key: SPARK-15863
> URL: https://issues.apache.org/jira/browse/SPARK-15863
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-16132) model loading backward compatibility for tree model (DecisionTree, RF, GBT)

2016-06-27 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-16132.

Resolution: Not A Problem

> model loading backward compatibility for tree model (DecisionTree, RF, GBT)
> ---
>
> Key: SPARK-16132
> URL: https://issues.apache.org/jira/browse/SPARK-16132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Please help check model loading compatibility for tree models, including 
> DecisionTree, RandomForest and GBT. (load models saved in Spark 1.6). 






[jira] [Commented] (SPARK-16243) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352382#comment-15352382
 ] 

yuhao yang commented on SPARK-16243:


Closing this as a duplicate of 
https://issues.apache.org/jira/browse/SPARK-16245

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16243
> URL: https://issues.apache.org/jira/browse/SPARK-16243
> Project: Spark
>  Issue Type: Improvement
>Reporter: yuhao yang
>Priority: Minor
>
> Fix PCA to load 1.6 models.






[jira] [Resolved] (SPARK-16243) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread yuhao yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yuhao yang resolved SPARK-16243.

Resolution: Duplicate

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16243
> URL: https://issues.apache.org/jira/browse/SPARK-16243
> Project: Spark
>  Issue Type: Improvement
>Reporter: yuhao yang
>Priority: Minor
>
> Fix PCA to load 1.6 models.






[jira] [Updated] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-27 Thread Edward Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Ma updated SPARK-16247:
--
Description: 
I am using pyspark with DataFrames, using a Pipeline to train a model and 
predict results. It works fine for a single train/test run.

However, I run into an issue when combining the Pipeline with CrossValidator. 
I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label 
and feature columns. Those columns are produced by StringIndexer and 
VectorIndexer and should exist once the pipeline has been executed.

When I dig into the pyspark library [python/pyspark/ml/tuning.py] (line 222, 
the _fit function, and line 239, est.fit), I found that the pipeline stages 
are not executed, so I cannot get "indexedLabel" and "indexedMsg".

Would you mind advising whether my usage is correct?

Thanks.

Here is the code snippet:

// # Indexing
labelIndexer = StringIndexer(inputCol="label", 
outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", 
outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)

// # Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", 
featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])

// # Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)

  was:
I am using pyspark with DataFrames, using a Pipeline to train a model and 
predict results. It works fine for a single train/test run.

However, I run into an issue when combining the Pipeline with CrossValidator. 
I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label 
and feature columns. Those columns are produced by StringIndexer and 
VectorIndexer and should exist once the pipeline has been executed.

When I dig into the pyspark library (line 222, the _fit function, and line 
239, est.fit), I found that the pipeline stages are not executed, so I cannot 
get "indexedLabel" and "indexedMsg".

Would you mind advising whether my usage is correct?

Thanks.

Here is the code snippet:

// # Indexing
labelIndexer = StringIndexer(inputCol="label", 
outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", 
outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)

// # Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", 
featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])

// # Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)


> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using pyspark with DataFrames, using a Pipeline to train a model and 
> predict results. It works fine for a single train/test run.
> However, I run into an issue when combining the Pipeline with CrossValidator. 
> I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label 
> and feature columns. Those columns are produced by StringIndexer and 
> VectorIndexer and should exist once the pipeline has been executed.
> When I dig into the pyspark library [python/pyspark/ml/tuning.py] (line 222, 
> the _fit function, and line 239, est.fit), I found that the pipeline stages 
> are not executed, so I cannot get "indexedLabel" and "indexedMsg".
> Would you mind advising whether my usage is correct?
> Thanks.
> Here is the code snippet:
> // # Indexing
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", 
> outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> // # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", 
> featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, 
> classification_model])
> // # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
> evaluator=cvEvaluator, numFolds=10)
> cvModel 

[jira] [Updated] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-27 Thread Edward Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Ma updated SPARK-16247:
--
Description: 
I am using pyspark with DataFrames, using a Pipeline to train a model and 
predict results. It works fine for a single train/test run.

However, I run into an issue when combining the Pipeline with CrossValidator. 
I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label 
and feature columns. Those columns are produced by StringIndexer and 
VectorIndexer and should exist once the pipeline has been executed.

When I dig into the pyspark library (line 222, the _fit function, and line 
239, est.fit), I found that the pipeline stages are not executed, so I cannot 
get "indexedLabel" and "indexedMsg".

Would you mind advising whether my usage is correct?

Thanks.

Here is the code snippet:

// # Indexing
labelIndexer = StringIndexer(inputCol="label", 
outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", 
outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)

// # Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", 
featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])

// # Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)

  was:
I am using pyspark with DataFrames, using a Pipeline to train a model and 
predict results. It works fine for a single train/test run.

However, I run into an issue when combining the Pipeline with CrossValidator. 
I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label 
and feature columns. Those columns are produced by StringIndexer and 
VectorIndexer and should exist once the pipeline has been executed.

When I dig into the pyspark library (line 222, the _fit function, and line 
239, est.fit), I found that the pipeline stages are not executed, so I cannot 
get "indexedLabel" and "indexedMsg".

Would you mind advising whether my usage is correct?

Thanks.

Here is the code snippet:

# Indexing
labelIndexer = StringIndexer(inputCol="label", 
outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", 
outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)

# Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", 
featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])

# Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)


> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using pyspark with DataFrames, using a Pipeline to train a model and 
> predict results. It works fine for a single train/test run.
> However, I run into an issue when combining the Pipeline with CrossValidator. 
> I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label 
> and feature columns. Those columns are produced by StringIndexer and 
> VectorIndexer and should exist once the pipeline has been executed.
> When I dig into the pyspark library (line 222, the _fit function, and line 
> 239, est.fit), I found that the pipeline stages are not executed, so I 
> cannot get "indexedLabel" and "indexedMsg".
> Would you mind advising whether my usage is correct?
> Thanks.
> Here is the code snippet:
> // # Indexing
> labelIndexer = StringIndexer(inputCol="label", 
> outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", 
> outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> // # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", 
> featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, 
> classification_model])
> // # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
> evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)




[jira] [Created] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator

2016-06-27 Thread Edward Ma (JIRA)
Edward Ma created SPARK-16247:
-

 Summary: Using pyspark dataframe with pipeline and cross validator
 Key: SPARK-16247
 URL: https://issues.apache.org/jira/browse/SPARK-16247
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.6.1
Reporter: Edward Ma


I am using pyspark with DataFrames, using a Pipeline to train a model and 
predict results. It works fine for a single train/test run.

However, I run into an issue when combining the Pipeline with CrossValidator. 
I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label 
and feature columns. Those columns are produced by StringIndexer and 
VectorIndexer and should exist once the pipeline has been executed.

When I dig into the pyspark library (line 222, the _fit function, and line 
239, est.fit), I found that the pipeline stages are not executed, so I cannot 
get "indexedLabel" and "indexedMsg".

Would you mind advising whether my usage is correct?

Thanks.

Here is the code snippet:

# Indexing
labelIndexer = StringIndexer(inputCol="label", 
outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", 
outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)

# Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", 
featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])

# Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, 
evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)
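
For reference, ParamGridBuilder.addGrid expects an estimator Param plus a list 
of candidate values (for example rf.numTrees) rather than a bare literal such 
as 1000, and CrossValidator re-fits every pipeline stage on each fold as long 
as the un-fitted indexers are passed as stages instead of pre-fitted models. A 
minimal sketch along those lines, reusing the column names from the snippet 
above (that numTrees is the parameter meant to be tuned is an assumption):

{code}
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Keep the indexers as un-fitted pipeline stages so each CV fold fits them
# on that fold's training split before the classifier runs.
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel")
featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg",
                               maxCategories=3000)
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg",
                            maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, rf])

# addGrid takes a Param object and a list of values to try.
paramGrid = ParamGridBuilder().addGrid(rf.numTrees, [10, 100, 1000]).build()

cvEvaluator = MulticlassClassificationEvaluator(labelCol="indexedLabel",
                                                metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                    evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)  # trainingData: the raw, un-indexed DataFrame
{code}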






[jira] [Created] (SPARK-16246) Too many block-manager-slave-async-thread opened (TIMED_WAITING) for spark Kafka streaming

2016-06-27 Thread Alex Jiang (JIRA)
Alex Jiang created SPARK-16246:
--

 Summary: Too many block-manager-slave-async-thread opened 
(TIMED_WAITING) for spark Kafka streaming
 Key: SPARK-16246
 URL: https://issues.apache.org/jira/browse/SPARK-16246
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Streaming
Affects Versions: 1.6.1
Reporter: Alex Jiang


I don't know if our Spark Streaming issue is related to this 
(https://issues.apache.org/jira/browse/SPARK-15558).

Basically we have one Kafka receiver on each executor, and it ran fine for a 
while. Then the executor accumulated a lot of waiting threads (Thread 1224: 
block-manager-slave-async-thread-pool-1083 (TIMED_WAITING)) and kept opening 
new ones. Eventually it reached the maximum number of threads on that executor 
and the Kafka receiver on that executor failed.

Could someone please shed some light on this?






[jira] [Resolved] (SPARK-16221) Redirect Parquet JUL logger via SLF4J for WRITE operations

2016-06-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-16221.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13918
[https://github.com/apache/spark/pull/13918]

> Redirect Parquet JUL logger via SLF4J for WRITE operations
> --
>
> Key: SPARK-16221
> URL: https://issues.apache.org/jira/browse/SPARK-16221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> SPARK-8118 implements redirecting the Parquet JUL logger via SLF4J, but it is 
> currently applied only when READ operations occur. If users perform only WRITE 
> operations, many Parquet log messages still appear.
> This issue makes the redirection work for WRITE operations, too.
> **Before**
> {code}
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> ... about 70 lines Parquet Log ...
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> ... about 70 lines Parquet Log ...
> {code}
> **After**
> {code}
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> [Stage 0:>  (0 + 8) / 
> 8]
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.  
>   
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> {code}






[jira] [Updated] (SPARK-16221) Redirect Parquet JUL logger via SLF4J for WRITE operations

2016-06-27 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-16221:
---
Assignee: Dongjoon Hyun

> Redirect Parquet JUL logger via SLF4J for WRITE operations
> --
>
> Key: SPARK-16221
> URL: https://issues.apache.org/jira/browse/SPARK-16221
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> SPARK-8118 implements redirecting the Parquet JUL logger via SLF4J, but it is 
> currently applied only when READ operations occur. If users perform only WRITE 
> operations, many Parquet log messages still appear.
> This issue makes the redirection work for WRITE operations, too.
> **Before**
> {code}
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
> Compression: SNAPPY
> ... about 70 lines Parquet Log ...
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> ... about 70 lines Parquet Log ...
> {code}
> **After**
> {code}
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> [Stage 0:>  (0 + 8) / 
> 8]
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.  
>   
> scala> 
> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> {code}






[jira] [Resolved] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs

2016-06-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16111.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.0.0

> Hide SparkOrcNewRecordReader in API docs
> 
>
> Key: SPARK-16111
> URL: https://issues.apache.org/jira/browse/SPARK-16111
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Reporter: Xiangrui Meng
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.0
>
>
> We should exclude SparkOrcNewRecordReader from the API docs. Otherwise, it 
> appears at the top of the list in the Scala API doc.






[jira] [Commented] (SPARK-15558) Deadlock when retreiving shuffled cached data

2016-06-27 Thread Alex Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352361#comment-15352361
 ] 

Alex Jiang commented on SPARK-15558:


I don't know if our Spark Streaming issue is related to this. 

Basically we have one Kafka receiver on each executor, and it ran fine for a 
while. Then the executor accumulated a lot of waiting threads (Thread 1224: 
block-manager-slave-async-thread-pool-1083 (TIMED_WAITING)) and kept opening 
new ones. Eventually it reached the maximum number of threads on that executor 
and the Kafka receiver on that executor failed.

Could someone please shed some light on this?

> Deadlock when retreiving shuffled cached data
> -
>
> Key: SPARK-15558
> URL: https://issues.apache.org/jira/browse/SPARK-15558
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Fabiano Francesconi
> Attachments: screenshot-1.png
>
>
> Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached 
> data from another host. The job I am currently executing fetches data 
> using async actions and persists these RDDs into main memory (they all 
> fit). Later on, at the point at which it is currently hanging, the 
> application retrieves this cached data but hangs. Once the timeout set in 
> the Await.results call is met, the application crashes.
> This problem is reproducible on every execution, although the point at which 
> it hangs is not.
> I have also tried activating:
> {code}
> spark.memory.useLegacyMode=true
> {code}
> as mentioned in SPARK-13566 guessing a similar deadlock as the one given 
> between MemoryStore and BlockManager. Unfortunately, this didn't help.
> The only plausible (albeit debatable) solution would be to use speculation 
> mode.
> Configuration:
> {code}
> /usr/local/tl/spark-latest/bin/spark-submit \
>   --executor-memory 80G \
>   --total-executor-cores 90 \
>   --driver-memory 8G \
> {code}
> Stack trace:
> {code}
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 
> daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on 
> condition [0x7f9946bfb000]
>java.lang.Thread.State: TIMED_WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 
> daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on 
> condition [0x7f98c86b6000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f9b541a6570> (a 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool)
> at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "Executor task launch worker-43" #236 daemon prio=5 os_prio=0 
> tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7fab3f081300> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102)
> at 
> 

[jira] [Assigned] (SPARK-16245) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16245:


Assignee: Apache Spark

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16245
> URL: https://issues.apache.org/jira/browse/SPARK-16245
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> model loading backward compatibility for ml.feature.PCA






[jira] [Commented] (SPARK-16245) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352350#comment-15352350
 ] 

Apache Spark commented on SPARK-16245:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/13937

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16245
> URL: https://issues.apache.org/jira/browse/SPARK-16245
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> model loading backward compatibility for ml.feature.PCA






[jira] [Assigned] (SPARK-16245) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16245:


Assignee: (was: Apache Spark)

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16245
> URL: https://issues.apache.org/jira/browse/SPARK-16245
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> model loading backward compatibility for ml.feature.PCA






[jira] [Created] (SPARK-16245) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-16245:
---

 Summary: model loading backward compatibility for ml.feature.PCA
 Key: SPARK-16245
 URL: https://issues.apache.org/jira/browse/SPARK-16245
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Yanbo Liang
Priority: Minor


model loading backward compatibility for ml.feature.PCA
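
A quick manual check for this kind of backward compatibility is to point the 
Spark 2.0 loader at a PCAModel directory produced by Spark 1.6 and confirm the 
principal components come back intact; a minimal sketch (the path below is 
only a placeholder):

{code}
from pyspark.ml.feature import PCAModel

# Placeholder path: wherever a PCAModel saved by Spark 1.6 actually lives.
old_model_path = "/tmp/pca-model-saved-by-spark-1.6"

# With backward compatibility in place, loading under 2.0 should succeed and
# the principal components matrix should match the values computed by 1.6.
model = PCAModel.load(old_model_path)
print(model.pc)
{code}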






[jira] [Updated] (SPARK-16244) Failed job/stage couldn't stop JobGenerator immediately.

2016-06-27 Thread Liyin Tang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liyin Tang updated SPARK-16244:
---
Description: 
This streaming job has a very simple DAG. Each batch has only 1 job, and each 
job has only 1 stage.

Based on the following logs, we observed a potential race condition. Stage 1 
failed due to some task failures, which triggers failJobAndIndependentStages.

Meanwhile, the next stage (job), 2, was submitted and was able to successfully 
run a few tasks before the JobGenerator was stopped via the shutdown hook.

Since the next job was able to run a few tasks successfully, it messed up the 
checkpoint / offset management.

Here is the log from my job:


{color:red}
Stage 227 started:
{color}
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo 
- Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at 
foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo 
- Final stage: ResultStage 227(foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo 
- Submitting ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at 
DBExportStreaming.java:55), which has no missing parents
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo 
- Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at 
foreachRDD at DBExportStreaming.java:55)

{color:red}
Stage 227 failed:
{color}
[ERROR] 2016-06-25 19:01:34,083 org.apache.spark.scheduler.TaskSetManager 
logError - Task 26 in stage 227.0 failed 4 times; aborting job
[INFO] 2016-06-25 19:01:34,086 org.apache.spark.scheduler.cluster.YarnScheduler 
logInfo - Cancelling stage 227
[INFO] 2016-06-25 19:01:34,088 org.apache.spark.scheduler.cluster.YarnScheduler 
logInfo - Stage 227 was cancelled
[INFO] 2016-06-25 19:01:34,089 org.apache.spark.scheduler.DAGScheduler logInfo 
- ResultStage 227 (foreachRDD at DBExportStreaming.java:55) failed in 153.914 s
[INFO] 2016-06-25 19:01:34,090 org.apache.spark.scheduler.DAGScheduler logInfo 
- Job 227 failed: foreachRDD at DBExportStreaming.java:55, took 153.930462 s
[INFO] 2016-06-25 19:01:34,091 
org.apache.spark.streaming.scheduler.JobScheduler logInfo - Finished job 
streaming job 146688114 ms.0 from job set of time 14
6688114 ms
[INFO] 2016-06-25 19:01:34,091 
org.apache.spark.streaming.scheduler.JobScheduler logInfo - Total delay: 
154.091 s for time 146688114 ms (execution: 153.935
s)

{color:red}
Stage 228 started:
{color}

[INFO] 2016-06-25 19:01:34,094 org.apache.spark.SparkContext logInfo - Starting 
job: foreachRDD at DBExportStreaming.java:55
[INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo 
- Got job 228 (foreachRDD at DBExportStreaming.java:55) with 1495 output 
partitions
[INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo 
- Final stage: ResultStage 228(foreachRDD at DBExportStreaming.java:55)
Exception in thread "main" [INFO] 2016-06-25 19:01:34,095 
org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List()

{color:red}
Shutdown hook was called after stage 228 started:
{color}

[INFO] 2016-06-25 19:01:34,099 org.apache.spark.streaming.StreamingContext 
logInfo - Invoking stop(stopGracefully=false) from shutdown hook
[INFO] 2016-06-25 19:01:34,101 
org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopping 
JobGenerator immediately
[INFO] 2016-06-25 19:01:34,102 org.apache.spark.streaming.util.RecurringTimer 
logInfo - Stopped timer for JobGenerator after time 146688126
[INFO] 2016-06-25 19:01:34,103 
org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopped JobGenerator
[INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - 
ensureFreeSpace(133720) called with curMem=344903, maxMem=1159641169
[INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - 
Block broadcast_229 stored as values in memory (estimated size 130.6 KB, free 
1105.5 MB)
[INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - 
ensureFreeSpace(51478) called with curMem=478623, maxMem=1159641169
[INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - 
Block broadcast_229_piece0 stored as bytes in memory (estimated size 50.3 KB, 
free 1105.4 MB)
[INFO] 2016-06-25 19:01:34,108 org.apache.spark.storage.BlockManagerInfo 
logInfo - Added broadcast_229_piece0 in memory on 10.123.209.8:42154 (size: 
50.3 KB, free: 1105.8 MB)
[INFO] 2016-06-25 19:01:34,109 org.apache.spark.SparkContext logInfo - Created 
broadcast 229 from broadcast at DAGScheduler.scala:861
[INFO] 2016-06-25 19:01:34,110 org.apache.spark.scheduler.DAGScheduler logInfo 
- Submitting 1495 missing tasks from ResultStage 228 (MapPartitionsRDD[458] at 
foreachRDD at 

[jira] [Created] (SPARK-16244) Failed job/stage couldn't stop JobGenerator immediately.

2016-06-27 Thread Liyin Tang (JIRA)
Liyin Tang created SPARK-16244:
--

 Summary: Failed job/stage couldn't stop JobGenerator immediately.
 Key: SPARK-16244
 URL: https://issues.apache.org/jira/browse/SPARK-16244
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.2
Reporter: Liyin Tang


This streaming job has a very simple DAG. Each batch has only 1 job, and each 
job has only 1 stage.

Based on the following logs, we observed a potential race condition. Stage 1 
failed due to some task failures, which triggers failJobAndIndependentStages.

Meanwhile, the next stage (job), 2, was submitted and was able to successfully 
run a few tasks before the JobGenerator was stopped via the shutdown hook.

Since the next job was able to run a few tasks successfully, it messed up the 
checkpoint / offset management.

I will attach the log in the jira as well.








[jira] [Commented] (SPARK-16132) model loading backward compatibility for tree model (DecisionTree, RF, GBT)

2016-06-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352301#comment-15352301
 ] 

Yanbo Liang commented on SPARK-16132:
-

Since we did not support DecisionTree, RandomForest, or GBT persistence in 1.6, 
there is no model loading compatibility issue for these Estimators/Models. I 
think we can close this JIRA.

> model loading backward compatibility for tree model (DecisionTree, RF, GBT)
> ---
>
> Key: SPARK-16132
> URL: https://issues.apache.org/jira/browse/SPARK-16132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Please help check model loading compatibility for tree models, including 
> DecisionTree, RandomForest and GBT. (load models saved in Spark 1.6). 






[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-06-27 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352296#comment-15352296
 ] 

Yanbo Liang commented on SPARK-16240:
-

Since we did not back-port https://github.com/apache/spark/pull/12065 to 1.6, 
we need to implement our own {{LDA.load}} with special handling for the param 
{{topicDistribution}}, replacing it with {{topicDistributionCol}} when setting 
params.

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Priority: Minor
>
> After resolving the matrix conversion issue, the LDA model still cannot load 
> 1.6 models because one of the parameter names was changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.






[jira] [Commented] (SPARK-15799) Release SparkR on CRAN

2016-06-27 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352289#comment-15352289
 ] 

Xiangrui Meng commented on SPARK-15799:
---

[~sunrui] Maybe we can download the corresponding Spark jars from Maven if no 
SPARK_HOME is specified, just to make it really simple to use.

> Release SparkR on CRAN
> --
>
> Key: SPARK-15799
> URL: https://issues.apache.org/jira/browse/SPARK-15799
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Xiangrui Meng
>
> Story: "As an R user, I would like to see SparkR released on CRAN, so I can 
> use SparkR easily in an existing R environment and have other packages built 
> on top of SparkR."
> I made this JIRA with the following questions in mind:
> * Are there known issues that prevent us releasing SparkR on CRAN?
> * Do we want to package Spark jars in the SparkR release?
> * Are there license issues?
> * How does it fit into Spark's release process?






[jira] [Updated] (SPARK-16242) Wrap the Matrix conversion utils in Python

2016-06-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16242:
--
Assignee: Yanbo Liang

> Wrap the Matrix conversion utils in Python
> --
>
> Key: SPARK-16242
> URL: https://issues.apache.org/jira/browse/SPARK-16242
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> This is to wrap SPARK-16187 in Python. So Python users can use it to convert 
> DataFrames with matrix columns.






[jira] [Resolved] (SPARK-16143) Group survival analysis methods in generated doc

2016-06-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16143.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13927
[https://github.com/apache/spark/pull/13927]

> Group survival analysis methods in generated doc
> 
>
> Key: SPARK-16143
> URL: https://issues.apache.org/jira/browse/SPARK-16143
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Junyang Qian
> Fix For: 2.0.1, 2.1.0
>
>
> Follow SPARK-16107 and group the doc of spark.survreg, predict(SR), 
> summary(SR), read/write.ml(SR) under Rd spark.survreg.






[jira] [Commented] (SPARK-16243) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352277#comment-15352277
 ] 

Apache Spark commented on SPARK-16243:
--

User 'hhbyyh' has created a pull request for this issue:
https://github.com/apache/spark/pull/13936

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16243
> URL: https://issues.apache.org/jira/browse/SPARK-16243
> Project: Spark
>  Issue Type: Improvement
>Reporter: yuhao yang
>Priority: Minor
>
> Fix PCA to load 1.6 models.






[jira] [Assigned] (SPARK-16243) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16243:


Assignee: (was: Apache Spark)

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16243
> URL: https://issues.apache.org/jira/browse/SPARK-16243
> Project: Spark
>  Issue Type: Improvement
>Reporter: yuhao yang
>Priority: Minor
>
> Fix PCA to load 1.6 models.






[jira] [Assigned] (SPARK-16243) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16243:


Assignee: Apache Spark

> model loading backward compatibility for ml.feature.PCA
> ---
>
> Key: SPARK-16243
> URL: https://issues.apache.org/jira/browse/SPARK-16243
> Project: Spark
>  Issue Type: Improvement
>Reporter: yuhao yang
>Assignee: Apache Spark
>Priority: Minor
>
> Fix PCA to load 1.6 models.






[jira] [Created] (SPARK-16243) model loading backward compatibility for ml.feature.PCA

2016-06-27 Thread yuhao yang (JIRA)
yuhao yang created SPARK-16243:
--

 Summary: model loading backward compatibility for ml.feature.PCA
 Key: SPARK-16243
 URL: https://issues.apache.org/jira/browse/SPARK-16243
 Project: Spark
  Issue Type: Improvement
Reporter: yuhao yang
Priority: Minor


Fix PCA to load 1.6 models.






[jira] [Assigned] (SPARK-16242) Wrap the Matrix conversion utils in Python

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16242:


Assignee: (was: Apache Spark)

> Wrap the Matrix conversion utils in Python
> --
>
> Key: SPARK-16242
> URL: https://issues.apache.org/jira/browse/SPARK-16242
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> This is to wrap SPARK-16187 in Python. So Python users can use it to convert 
> DataFrames with matrix columns.






[jira] [Commented] (SPARK-16242) Wrap the Matrix conversion utils in Python

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352269#comment-15352269
 ] 

Apache Spark commented on SPARK-16242:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/13935

> Wrap the Matrix conversion utils in Python
> --
>
> Key: SPARK-16242
> URL: https://issues.apache.org/jira/browse/SPARK-16242
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> This is to wrap SPARK-16187 in Python. So Python users can use it to convert 
> DataFrames with matrix columns.






[jira] [Assigned] (SPARK-16242) Wrap the Matrix conversion utils in Python

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16242:


Assignee: Apache Spark

> Wrap the Matrix conversion utils in Python
> --
>
> Key: SPARK-16242
> URL: https://issues.apache.org/jira/browse/SPARK-16242
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> This is to wrap SPARK-16187 in Python. So Python users can use it to convert 
> DataFrames with matrix columns.






[jira] [Commented] (SPARK-16241) model loading backward compatibility for ml NaiveBayes

2016-06-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352266#comment-15352266
 ] 

yuhao yang commented on SPARK-16241:


Sure, please refer to the fix for 
https://issues.apache.org/jira/browse/SPARK-16130. Thanks.

> model loading backward compatibility for ml NaiveBayes
> --
>
> Key: SPARK-16241
> URL: https://issues.apache.org/jira/browse/SPARK-16241
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. Please manually verify the fix.






[jira] [Commented] (SPARK-16241) model loading backward compatibility for ml NaiveBayes

2016-06-27 Thread Li Ping Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352263#comment-15352263
 ] 

Li Ping Zhang commented on SPARK-16241:
---

Can I help on this?

> model loading backward compatibility for ml NaiveBayes
> --
>
> Key: SPARK-16241
> URL: https://issues.apache.org/jira/browse/SPARK-16241
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. Please manually verify the fix.






[jira] [Created] (SPARK-16242) Wrap the Matrix conversion utils in Python

2016-06-27 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-16242:
---

 Summary: Wrap the Matrix conversion utils in Python
 Key: SPARK-16242
 URL: https://issues.apache.org/jira/browse/SPARK-16242
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib, PySpark
Reporter: Yanbo Liang
Priority: Minor


This is to wrap SPARK-16187 in Python, so that Python users can use it to convert 
DataFrames with matrix columns.
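
As a rough illustration, here is a usage sketch of what the Python wrapper might look like, assuming it simply mirrors the Scala MLUtils helper from SPARK-16187; the module location and function name below are assumptions, not an existing PySpark API at the time of writing:

{code}
# Hypothetical usage sketch; assumes the wrapper mirrors the Scala
# MLUtils.convertMatrixColumnsToML helper from SPARK-16187.
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Matrices        # old mllib matrix type
from pyspark.mllib.util import MLUtils           # assumed home of the wrapper

spark = SparkSession.builder.appName("matrix-conversion-sketch").getOrCreate()

# A DataFrame with an old-style mllib matrix column.
df = spark.createDataFrame(
    [(0, Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0]))], ["id", "features"])

# Proposed call: convert mllib.linalg matrix columns to the new ml.linalg types.
converted = MLUtils.convertMatrixColumnsToML(df, "features")
converted.printSchema()
{code}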




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16241) model loading backward compatibility for ml NaiveBayes

2016-06-27 Thread yuhao yang (JIRA)
yuhao yang created SPARK-16241:
--

 Summary: model loading backward compatibility for ml NaiveBayes
 Key: SPARK-16241
 URL: https://issues.apache.org/jira/browse/SPARK-16241
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: yuhao yang
Priority: Minor


To help users migrate from Spark 1.6 to 2.0, we should make model loading 
backward compatible with models saved in 1.6. Please manually verify the fix.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-06-27 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352225#comment-15352225
 ] 

yuhao yang commented on SPARK-16240:


cc [~yanboliang] [~josephkb]

> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
>  Issue Type: Bug
>Reporter: yuhao yang
>Priority: Minor
>
> After resolving the matrix conversion issue, LDA model still cannot load 1.6 
> models as one of the parameter names has changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA

2016-06-27 Thread yuhao yang (JIRA)
yuhao yang created SPARK-16240:
--

 Summary: model loading backward compatibility for ml.clustering.LDA
 Key: SPARK-16240
 URL: https://issues.apache.org/jira/browse/SPARK-16240
 Project: Spark
  Issue Type: Bug
Reporter: yuhao yang
Priority: Minor


After resolving the matrix conversion issue, LDA model still cannot load 1.6 
models as one of the parameter names has changed.

https://github.com/apache/spark/pull/12065

We can perhaps add some special logic in the loading code.
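
For illustration, a minimal sketch (shown in Python, with made-up key names) of the kind of special-casing the loader could apply when a saved parameter was renamed between releases; the real logic would live in Spark's Scala ML persistence code, and the metadata layout here is hypothetical:

{code}
# Sketch only: remap a renamed parameter in saved model metadata before
# applying it. The key names and metadata layout here are hypothetical.
import json

def remap_renamed_params(metadata_json, renames):
    """Rewrite old parameter names in saved metadata to their new names."""
    metadata = json.loads(metadata_json)
    params = metadata.get("paramMap", {})
    for old_name, new_name in renames.items():
        if old_name in params and new_name not in params:
            params[new_name] = params.pop(old_name)
    return metadata

# A 1.6-era metadata blob with a hypothetical old parameter name.
old_metadata = '{"sparkVersion": "1.6.1", "paramMap": {"oldParamName": 0.5}}'
print(remap_renamed_params(old_metadata, {"oldParamName": "newParamName"}))
{code}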






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16239) SQL issues with cast from date to string around daylight savings time

2016-06-27 Thread Glen Maisey (JIRA)
Glen Maisey created SPARK-16239:
---

 Summary: SQL issues with cast from date to string around daylight 
savings time
 Key: SPARK-16239
 URL: https://issues.apache.org/jira/browse/SPARK-16239
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.1
Reporter: Glen Maisey
Priority: Critical


Hi all,

I have a dataframe with a date column. When I cast it to a string using the Spark 
SQL cast function, it converts to the wrong date on certain days. Looking into 
it, this occurs once a year, when summer daylight savings starts.

I've tried to show this issue with the code below. The toString() function works 
correctly, whereas the cast does not.

Unfortunately my users are writing SQL code rather than Scala dataframes, so 
this workaround does not apply. This was actually picked up where a user wrote 
something like "SELECT date1 UNION ALL SELECT date2", where date1 was a string 
and date2 was a date type. Spark must be implicitly converting the date to a 
string, which gives this error.

I'm in the Australia/Sydney timezone (see the time changes here 
http://www.timeanddate.com/time/zone/australia/sydney) 

val dates = Array("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06",
  "2015-10-02", "2015-10-03", "2015-10-04", "2015-10-05")

val df = sc.parallelize(dates)
  .toDF("txn_date")
  .select(col("txn_date").cast("Date"))

df.select(
    col("txn_date"),
    col("txn_date").cast("Timestamp").alias("txn_date_timestamp"),
    col("txn_date").cast("String").alias("txn_date_str_cast"),
    col("txn_date".toString()).alias("txn_date_str_toString")
  )
  .show()

+----------+--------------------+-----------------+---------------------+
|  txn_date|  txn_date_timestamp|txn_date_str_cast|txn_date_str_toString|
+----------+--------------------+-----------------+---------------------+
|2014-10-03|2014-10-02 14:00:...|       2014-10-03|           2014-10-03|
|2014-10-04|2014-10-03 14:00:...|       2014-10-04|           2014-10-04|
|2014-10-05|2014-10-04 13:00:...|       2014-10-04|           2014-10-05|
|2014-10-06|2014-10-05 13:00:...|       2014-10-06|           2014-10-06|
|2015-10-02|2015-10-01 14:00:...|       2015-10-02|           2015-10-02|
|2015-10-03|2015-10-02 14:00:...|       2015-10-03|           2015-10-03|
|2015-10-04|2015-10-03 13:00:...|       2015-10-03|           2015-10-04|
|2015-10-05|2015-10-04 13:00:...|       2015-10-05|           2015-10-05|
+----------+--------------------+-----------------+---------------------+
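
For the SQL-only path described above, a minimal PySpark sketch of the same cast (assuming Spark 1.6 and a driver running in the Australia/Sydney timezone, as in this report):

{code}
# Minimal sketch of the SQL cast path (assumes Spark 1.6 and the JVM timezone
# set to Australia/Sydney); 2015-10-04 is a DST-transition date in the table.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="date-cast-dst-sketch")
sqlContext = SQLContext(sc)

sqlContext.sql(
    "SELECT CAST(CAST('2015-10-04' AS DATE) AS STRING) AS casted"
).show()
{code}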



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16238) Metrics for generated method bytecode size

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352136#comment-15352136
 ] 

Apache Spark commented on SPARK-16238:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/13934

> Metrics for generated method bytecode size
> --
>
> Key: SPARK-16238
> URL: https://issues.apache.org/jira/browse/SPARK-16238
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>Priority: Minor
>
> Add metrics for the generated method size, to increase visibility into when 
> we come close to exceeding the JVM limit of 64KB per method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16238) Metrics for generated method bytecode size

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16238:


Assignee: Apache Spark

> Metrics for generated method bytecode size
> --
>
> Key: SPARK-16238
> URL: https://issues.apache.org/jira/browse/SPARK-16238
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Add metrics for the generated method size, to increase visibility into when 
> we come close to exceeding the JVM limit of 64KB per method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16238) Metrics for generated method bytecode size

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16238:


Assignee: (was: Apache Spark)

> Metrics for generated method bytecode size
> --
>
> Key: SPARK-16238
> URL: https://issues.apache.org/jira/browse/SPARK-16238
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Eric Liang
>Priority: Minor
>
> Add metrics for the generated method size, to increase visibility into when 
> we come close to exceeding the JVM limit of 64KB per method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16238) Metrics for generated method bytecode size

2016-06-27 Thread Eric Liang (JIRA)
Eric Liang created SPARK-16238:
--

 Summary: Metrics for generated method bytecode size
 Key: SPARK-16238
 URL: https://issues.apache.org/jira/browse/SPARK-16238
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Eric Liang
Priority: Minor


Add metrics for the generated method size, to increase visibility into when we 
come close to exceeding the JVM limit of 64KB per method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16237) PySpark gapply

2016-06-27 Thread Vladimir Feinberg (JIRA)
Vladimir Feinberg created SPARK-16237:
-

 Summary: PySpark gapply
 Key: SPARK-16237
 URL: https://issues.apache.org/jira/browse/SPARK-16237
 Project: Spark
  Issue Type: New Feature
  Components: PySpark, SQL
Reporter: Vladimir Feinberg


To maintain feature parity, `gapply` functionality should be added to 
`pyspark`'s `GroupedData` with a matching interface.

The implementation already exists because it fulfilled a need in another 
package: 
https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

It needs to be migrated (to become a GroupedData method, with its first argument 
becoming self).
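
As a rough illustration of the semantics `gapply` is meant to provide (the eventual method name and signature on GroupedData are not fixed here), the same per-group apply can be sketched with plain PySpark primitives:

{code}
# Sketch of gapply-style semantics with plain PySpark primitives: run a Python
# function over each group's rows and build a new DataFrame from the results.
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("gapply-sketch").getOrCreate()
df = spark.createDataFrame([("a", 1.0), ("a", 2.0), ("b", 3.0)], ["key", "value"])

def per_group_mean(key, rows):
    values = [r["value"] for r in rows]
    return [Row(key=key, mean_value=sum(values) / len(values))]

result = (df.rdd
          .groupBy(lambda r: r["key"])
          .flatMap(lambda kv: per_group_mean(kv[0], list(kv[1])))
          .toDF())
result.show()
{code}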



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16236) Add Path Option back to Load API in DataFrameReader

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352085#comment-15352085
 ] 

Apache Spark commented on SPARK-16236:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/13933

> Add Path Option back to Load API in DataFrameReader
> ---
>
> Key: SPARK-16236
> URL: https://issues.apache.org/jira/browse/SPARK-16236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> @koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ 
> changed the behavior of the `load` API. After the change, the `load` API does not 
> add the value of `path` into the `options`.  Thank you!
> We should add the option `path` back to `load()` API in `DataFrameReader`, if 
> and only if users specify one and only one path in the load API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16236) Add Path Option back to Load API in DataFrameReader

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16236:


Assignee: (was: Apache Spark)

> Add Path Option back to Load API in DataFrameReader
> ---
>
> Key: SPARK-16236
> URL: https://issues.apache.org/jira/browse/SPARK-16236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> @koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ 
> changed the behavior of the `load` API. After the change, the `load` API does not 
> add the value of `path` into the `options`.  Thank you!
> We should add the option `path` back to `load()` API in `DataFrameReader`, if 
> and only if users specify one and only one path in the load API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16236) Add Path Option back to Load API in DataFrameReader

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16236:


Assignee: Apache Spark

> Add Path Option back to Load API in DataFrameReader
> ---
>
> Key: SPARK-16236
> URL: https://issues.apache.org/jira/browse/SPARK-16236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> @koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ 
> changed the behavior of the `load` API. After the change, the `load` API does not 
> add the value of `path` into the `options`.  Thank you!
> We should add the option `path` back to `load()` API in `DataFrameReader`, if 
> and only if users specify one and only one path in the load API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality

2016-06-27 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-16220.
-
   Resolution: Fixed
 Assignee: Herman van Hovell
Fix Version/s: 2.0.1

> Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
> --
>
> Key: SPARK-16220
> URL: https://issues.apache.org/jira/browse/SPARK-16220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Bill Chambers
>Assignee: Herman van Hovell
> Fix For: 2.0.1
>
>
> After discussing this with [~marmbrus] and [~rxin]. We've decided to revert 
> SPARK-15663. After doing some research it seems like this is an unnecessary 
> departure from 1.X functionality and does not have a reasonable substitute 
> that gives the same functionality.
> The first step is to revert the change. After doing that, there are a couple 
> of different approaches to getting at user-defined functions.
> 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does 
> this)
> 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS
> 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar)
> 4. SHOW FUNCTIONS + some column to designate if it's system designed or user 
> defined.
> 1. This aligns with previous functionality and then supplements it with 
> something a bit more specific. 
> 2. Is unclear because "all" is ambiguous, and it is not obvious why the default 
> would refer only to user-defined functions. This doesn't seem like the right approach.
> 3. Same kind of issue, I'm not sure why the user functions should be the 
> default over the system functions. That doesn't seem like the correct 
> approach.
> 4. This one seems nice because it kind of achieves #1, keeps existing 
> functionality, but then supplements it with more. This also allows you, 
> for example, to create your own set of date functions and then search them 
> all in one go as opposed to searching system and then user functions. This 
> would have to return two columns though, which could potentially be an issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16236) Add Path Option back to Load API in DataFrameReader

2016-06-27 Thread Xiao Li (JIRA)
Xiao Li created SPARK-16236:
---

 Summary: Add Path Option back to Load API in DataFrameReader
 Key: SPARK-16236
 URL: https://issues.apache.org/jira/browse/SPARK-16236
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


@koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ 
changed the behavior of the `load` API. After the change, the `load` API does not 
add the value of `path` into the `options`.  Thank you!

We should add the option `path` back to `load()` API in `DataFrameReader`, if 
and only if users specify one and only one path in the load API.
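
A short sketch of the two single-path forms that this change would keep equivalent (the file path below is illustrative only):

{code}
# The two single-path load forms that should behave the same once `path` is
# added back into the reader options (illustrative path only).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("load-path-option-sketch").getOrCreate()

# Path passed positionally; per this issue it should also be recorded as the
# "path" option when exactly one path is given.
df1 = spark.read.format("json").load("/tmp/example.json")

# Path passed explicitly as an option, then load() with no arguments.
df2 = spark.read.format("json").option("path", "/tmp/example.json").load()
{code}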



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16055) sparkR.init() can not load sparkPackages when executing an R file

2016-06-27 Thread Krishna Kalyan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352051#comment-15352051
 ] 

Krishna Kalyan commented on SPARK-16055:


Hi [~shivaram]
(Spark 1.6) - I could replicate the issue, with the error above.

Log stack trace below (Spark 1.5) - everything works fine / unable to replicate
the issue:
https://gist.github.com/krishnakalyan3/4a433cc854def9cb13925b431bd2dfd2

Could you please help me understand why there is a problem for version 1.6,
while everything works fine for 1.5?

Thanks,
Krishna

> sparkR.init() can not load sparkPackages when executing an R file
> -
>
> Key: SPARK-16055
> URL: https://issues.apache.org/jira/browse/SPARK-16055
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SparkR
>Affects Versions: 1.6.1
>Reporter: Sun Rui
>Priority: Minor
>
> This is an issue reported in the Spark user mailing list. Refer to 
> http://comments.gmane.org/gmane.comp.lang.scala.spark.user/35742
> This issue does not occur in an interactive SparkR session, while it does 
> occur when executing an R file.
> The following example code can be put into an R file to reproduce this issue:
> {code}
> .libPaths(c("/home/user/spark-1.6.1-bin-hadoop2.6/R/lib",.libPaths()))
> Sys.setenv(SPARK_HOME="/home/user/spark-1.6.1-bin-hadoop2.6")
> library("SparkR")
> sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.4.0")
> sqlContext <- sparkRSQL.init(sc)
> df <- read.df(sqlContext, 
> "file:///home/user/spark-1.6.1-bin-hadoop2.6/data/mllib/sample_tree_data.csv","csv")
> showDF(df)
> {code}
> The error message is as such:
> {panel}
> 16/06/19 15:48:56 ERROR RBackendHandler: loadDF on 
> org.apache.spark.sql.api.r.SQLUtils failed
> Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
>   java.lang.ClassNotFoundException: Failed to find data source: csv. Please 
> find packages at http://spark-packages.org
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77)
>   at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119)
>   at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160)
>   at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala
> Calls: read.df -> callJStatic -> invokeJava
> Execution halted
> {panel}
> The reason behind this is that when you execute an R file, the R backend 
> launches before the R interpreter, so there is no opportunity for packages 
> specified with ‘sparkPackages’ to be processed.
> This JIRA issue tracks the problem. An appropriate solution is to be 
> discussed; maybe just document the limitation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15354) Topology aware block replication strategies

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15354:


Assignee: Apache Spark

> Topology aware block replication strategies
> ---
>
> Key: SPARK-15354
> URL: https://issues.apache.org/jira/browse/SPARK-15354
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>Assignee: Apache Spark
>
> Implementations of strategies for resilient block replication for different 
> resource managers that replicate the 3-replica strategy used by HDFS, where 
> the first replica is on an executor, the second replica within the same rack 
> as the executor and a third replica on a different rack. 
> The implementation involves providing two pluggable classes, one running in 
> the driver that provides topology information for every host at cluster start 
> and the second prioritizing a list of peer BlockManagerIds. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15354) Topology aware block replication strategies

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352038#comment-15352038
 ] 

Apache Spark commented on SPARK-15354:
--

User 'shubhamchopra' has created a pull request for this issue:
https://github.com/apache/spark/pull/13932

> Topology aware block replication strategies
> ---
>
> Key: SPARK-15354
> URL: https://issues.apache.org/jira/browse/SPARK-15354
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>
> Implementations of strategies for resilient block replication for different 
> resource managers that replicate the 3-replica strategy used by HDFS, where 
> the first replica is on an executor, the second replica within the same rack 
> as the executor and a third replica on a different rack. 
> The implementation involves providing two pluggable classes, one running in 
> the driver that provides topology information for every host at cluster start 
> and the second prioritizing a list of peer BlockManagerIds. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15354) Topology aware block replication strategies

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-15354:


Assignee: (was: Apache Spark)

> Topology aware block replication strategies
> ---
>
> Key: SPARK-15354
> URL: https://issues.apache.org/jira/browse/SPARK-15354
> Project: Spark
>  Issue Type: Sub-task
>  Components: Mesos, Spark Core, YARN
>Reporter: Shubham Chopra
>
> Implementations of strategies for resilient block replication for different 
> resource managers that replicate the 3-replica strategy used by HDFS, where 
> the first replica is on an executor, the second replica within the same rack 
> as the executor and a third replica on a different rack. 
> The implementation involves providing two pluggable classes, one running in 
> the driver that provides topology information for every host at cluster start 
> and the second prioritizing a list of peer BlockManagerIds. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16235) "evaluateEachIteration" is returning wrong results when calculated for classification model.

2016-06-27 Thread Mahmoud Rawas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352022#comment-15352022
 ] 

Mahmoud Rawas commented on SPARK-16235:
---

I am working on a fix 


> "evaluateEachIteration" is returning wrong results when calculated for 
> classification model.
> 
>
> Key: SPARK-16235
> URL: https://issues.apache.org/jira/browse/SPARK-16235
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1, 1.6.2, 2.0.0
>Reporter: Mahmoud Rawas
>
> Basically, within the mentioned function there is code to map the actual 
> value, which is supposed to be in the range of \[0,1], into the range of 
> \[-1,1], in order to make it compatible with the predicted value produced by 
> a classification model. 
> {code}
> val remappedData = algo match {
>   case Classification => data.map(x => new LabeledPoint((x.label * 2) - 
> 1, x.features))
>   case _ => data
> }
> {code}
> The problem with this approach is that it will calculate an incorrect error; 
> for example, the MSE will be 4 times larger than the actual expected MSE. 
> Instead, we should map the predicted value into a probability value in [0,1].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16235) "evaluateEachIteration" is returning wrong results when calculated for classification model.

2016-06-27 Thread Mahmoud Rawas (JIRA)
Mahmoud Rawas created SPARK-16235:
-

 Summary: "evaluateEachIteration" is returning wrong results when 
calculated for classification model.
 Key: SPARK-16235
 URL: https://issues.apache.org/jira/browse/SPARK-16235
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.2, 1.6.1, 2.0.0
Reporter: Mahmoud Rawas


Basically, within the mentioned function there is code to map the actual value, 
which is supposed to be in the range of \[0,1], into the range of \[-1,1], in 
order to make it compatible with the predicted value produced by a 
classification model. 
{code}
val remappedData = algo match {
  case Classification => data.map(x => new LabeledPoint((x.label * 2) - 1, 
x.features))
  case _ => data
}
{code}

The problem with this approach is that it will calculate an incorrect error; 
for example, the MSE will be 4 times larger than the actual expected MSE. 

Instead, we should map the predicted value into a probability value in [0,1].
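
A quick numeric check of the 4x claim (toy numbers, not Spark code), assuming errors are measured after both labels and predictions are remapped from [0,1] to [-1,1]:

{code}
# Remapping x -> 2*x - 1 doubles every residual, so squared errors (and the
# MSE) grow by a factor of exactly 4.
labels = [0.0, 1.0, 1.0, 0.0]
preds = [0.2, 0.7, 0.9, 0.1]

mse = sum((p - l) ** 2 for p, l in zip(preds, labels)) / len(labels)

remap = lambda x: 2.0 * x - 1.0
mse_remapped = sum((remap(p) - remap(l)) ** 2
                   for p, l in zip(preds, labels)) / len(labels)

print(mse, mse_remapped, mse_remapped / mse)  # ratio is 4.0
{code}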





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15700) Spark 2.0 dataframes using more memory (reading/writing parquet)

2016-06-27 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352016#comment-15352016
 ] 

Davies Liu commented on SPARK-15700:


My guess is that the SQL metrics required more memory than before. 

cc [~cloud_fan], could you test a join with this many partitions to measure 
the memory used by SQL metrics?
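
A possible shape for that measurement (partition and row counts below are illustrative only), watching driver heap and GC while the job runs:

{code}
# Sketch of a shuffle-heavy join with many partitions, for measuring how much
# driver memory the per-task SQL metrics consume (sizes are illustrative).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-metrics-memory-sketch")
         .config("spark.sql.shuffle.partitions", "20000")
         .getOrCreate())

a = spark.range(0, 10 * 1000 * 1000)
b = spark.range(0, 10 * 1000 * 1000)
a.join(b, "id").count()  # observe driver heap usage / GC while this runs
{code}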

> Spark 2.0 dataframes using more memory (reading/writing parquet)
> 
>
> Key: SPARK-15700
> URL: https://issues.apache.org/jira/browse/SPARK-15700
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Thomas Graves
>
> I was running a large 15TB join job with 10 map tasks, 2 reducers 
> that I frequently have run on Spark 1.6 successfully (with very little GC) 
> and it failed with an out-of-memory error on the driver. The driver had a 10GB heap 
> with 3GB overhead.
> 16/05/31 22:47:44 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOfRange(Arrays.java:3520)
> at 
> org.apache.parquet.io.api.Binary$ByteArrayBackedBinary.getBytes(Binary.java:262)
> at 
> org.apache.parquet.column.statistics.BinaryStatistics.getMinBytes(BinaryStatistics.java:67)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:242)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:184)
> at 
> org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:95)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:472)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:500)
> at 
> org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:490)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:63)
> at 
> org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:221)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234)
> at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626)
> I haven't had a chance to look into this further yet; just reporting it for 
> now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-06-27 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15352004#comment-15352004
 ] 

Davies Liu commented on SPARK-15621:


The number of rows in the queue will be bounded by the number of values in the 
input/output buffers of the Python process, together with some rows buffered by 
the Python process (still being processed), so it's not exactly unbounded.

Do you have a way to reproduce the issue?

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>Priority: Critical
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with the 
> identity `lambda x: x` UDF.
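
For reference, a minimal sketch of the workload shape described above (an identity Python UDF over a large single-column DataFrame); whether this alone reproduces the OOM is not established in this thread:

{code}
# Minimal shape of the reported workload: an identity Python UDF applied to a
# large single-column DataFrame (not a confirmed reproduction of the OOM).
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.appName("batch-eval-python-sketch").getOrCreate()
identity = udf(lambda x: x, LongType())

df = spark.range(0, 100 * 1000 * 1000)
df.select(identity(df["id"]).alias("id")).count()
{code}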



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15621) BatchEvalPythonExec fails with OOM

2016-06-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-15621:
---
Priority: Major  (was: Critical)

> BatchEvalPythonExec fails with OOM
> --
>
> Key: SPARK-15621
> URL: https://issues.apache.org/jira/browse/SPARK-15621
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Krisztian Szucs
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40
> No matter what, the queue grows unboundedly and fails with OOM, even with the 
> identity `lambda x: x` UDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched by spark-submit

2016-06-27 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-16224:
---
Description: 
Hi,
This is a continuation of a resolved bug 
[SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345]

I can access databases when using new methodology, i.e:

{code}
from pyspark.sql import SparkSession
from pyspark import SparkConf

if __name__ == "__main__":
conf = SparkConf()
hc = 
SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
print(hc.sql("show databases").collect())
{code}
This shows all databases in Hive.

However, using HiveContext, i.e.:
{code}
from pyspark.sql import HiveContext
from pyspark import SparkContext, SparkConf

if __name__ == "__main__":
conf = SparkConf()
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
print(hive_context.sql("show databases").collect())

# The result is
#[Row(result='default')]
{code}
prints only default database.

I have {{hive-site.xml}} file configured.

Those snippets are for scripts launched with {{spark-submit}} command. With 
pyspark those code fragments work fine, displaying all the databases.

  was:
Hi,
This is a continuation of a resolved bug 
[SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345]

I can access databases when using new methodology, i.e:

{code}
from pyspark.sql import SparkSession
from pyspark import SparkConf

if __name__ == "__main__":
conf = SparkConf()
hc = 
SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
print(hc.sql("show databases").collect())
{code}
This shows all databases in Hive.

However, using HiveContext, i.e.:
{code}
from pyspark.sql import HiveContext
from pyspark improt SparkContext, SparkConf

if __name__ == "__main__":
conf = SparkConf()
sc = SparkContext(conf=conf)
hive_context = HiveContext(sc)
print(hive_context.sql("show databases").collect())

# The result is
#[Row(result='default')]
{code}
prints only default database.

I have {{hive-site.xml}} file configured.

Those snippets are for scripts launched with {{spark-submit}} command. With 
pyspark those code fragments work fine, displaying all the databases.


> Hive context created by HiveContext can't access Hive databases when used in 
> a script launched by spark-submit
> --
>
> Key: SPARK-16224
> URL: https://issues.apache.org/jira/browse/SPARK-16224
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: branch-2.0
>Reporter: Piotr Milanowski
>Assignee: Yin Huai
>Priority: Blocker
>
> Hi,
> This is a continuation of a resolved bug 
> [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345]
> I can access databases when using new methodology, i.e:
> {code}
> from pyspark.sql import SparkSession
> from pyspark import SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> hc = 
> SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
> print(hc.sql("show databases").collect())
> {code}
> This shows all databases in Hive.
> However, using HiveContext, i.e.:
> {code}
> from pyspark.sql import HiveContext
> from pyspark import SparkContext, SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> sc = SparkContext(conf=conf)
> hive_context = HiveContext(sc)
> print(hive_context.sql("show databases").collect())
> # The result is
> #[Row(result='default')]
> {code}
> prints only default database.
> I have {{hive-site.xml}} file configured.
> Those snippets are for scripts launched with {{spark-submit}} command. With 
> pyspark those code fragments work fine, displaying all the databases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict

2016-06-27 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351942#comment-15351942
 ] 

Xusen Yin commented on SPARK-16144:
---

I'd like to work on this.

> Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
> -
>
> Key: SPARK-16144
> URL: https://issues.apache.org/jira/browse/SPARK-16144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> After we grouped generic methods by the algorithm, it would be nice to add a 
> separate Rd for each ML generic methods, in particular, write.ml, read.ml, 
> summary, and predict and link the implementations with seealso.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10408) Autoencoder

2016-06-27 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351922#comment-15351922
 ] 

Alexander Ulanov commented on SPARK-10408:
--

Here is the PR https://github.com/apache/spark/pull/13621

> Autoencoder
> ---
>
> Key: SPARK-10408
> URL: https://issues.apache.org/jira/browse/SPARK-10408
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.5.0
>Reporter: Alexander Ulanov
>Assignee: Alexander Ulanov
>Priority: Minor
>
> Goal: Implement various types of autoencoders 
> Requirements:
> 1) Basic (deep) autoencoder that supports different types of inputs: binary, 
> real in [0, 1], real in [-inf, +inf] 
> 2) Sparse autoencoder, i.e. L1 regularization. It should be added as a feature 
> to the MLP and then used here 
> 3) Denoising autoencoder 
> 4) Stacked autoencoder for pre-training of deep networks. It should support 
> arbitrary network layers
> References: 
> 1. Vincent, Pascal, et al. "Extracting and composing robust features with 
> denoising autoencoders." Proceedings of the 25th international conference on 
> Machine learning. ACM, 2008. 
> http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf
>  
> 2. 
> http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, 
> 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. 
> (2010). Stacked denoising autoencoders: Learning useful representations in a 
> deep network with a local denoising criterion. Journal of Machine Learning 
> Research, 11(3371–3408). 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484=rep1=pdf
> 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep 
> networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-16234:
--
Description: resolved...  (was: given spark.speculative set to true, I'm 
running a large spark job with parquet and savemode overwrite.

Spark will speculatively try to create a task to deal with a straggler. 
However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
selected, if the straggler completes before the original task or the original 
task completes before the straggler then the job will fail due to the file 
already existing.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists)

> Speculative Task may not be able to overwrite file
> --
>
> Key: SPARK-16234
> URL: https://issues.apache.org/jira/browse/SPARK-16234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>
> resolved...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers closed SPARK-16234.
-
Resolution: Resolved

> Speculative Task may not be able to overwrite file
> --
>
> Key: SPARK-16234
> URL: https://issues.apache.org/jira/browse/SPARK-16234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>
> given spark.speculative set to true, I'm running a large spark job with 
> parquet and savemode overwrite.
> Spark will speculatively try to create a task to deal with a straggler. 
> However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
> selected, if the straggler completes before the original task or the original 
> task completes before the straggler then the job will fail due to the file 
> already existing.
> java.io.IOException: 
> /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
> already exists



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts

2016-06-27 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-16106.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 13826
[https://github.com/apache/spark/pull/13826]

> TaskSchedulerImpl does not correctly handle new executors on existing hosts
> ---
>
> Key: SPARK-16106
> URL: https://issues.apache.org/jira/browse/SPARK-16106
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Priority: Trivial
> Fix For: 2.1.0
>
>
> The TaskSchedulerImpl updates the set of executors and hosts in each call to 
> {{resourceOffers}}.  During this call, it also tracks whether there are any 
> new executors observed in {{newExecAvail}}:
> {code}
>   executorIdToHost(o.executorId) = o.host
>   executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
>   if (!executorsByHost.contains(o.host)) {
> executorsByHost(o.host) = new HashSet[String]()
> executorAdded(o.executorId, o.host)
> newExecAvail = true
>   }
> {code}
> However, this only detects when a new *host* is added, not when an additional 
> executor is added to an existing host (a relatively common event in dynamic 
> allocation).
> The end result is that task locality and {{failedEpochs}} are not updated 
> correctly for new executors.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16136) Flaky Test: TaskManagerSuite "Kill other task attempts when one attempt belonging to the same task succeeds"

2016-06-27 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-16136.
--
   Resolution: Fixed
Fix Version/s: 2.1.0

Fixed by https://github.com/apache/spark/pull/13848

> Flaky Test: TaskManagerSuite "Kill other task attempts when one attempt 
> belonging to the same task succeeds"
> 
>
> Key: SPARK-16136
> URL: https://issues.apache.org/jira/browse/SPARK-16136
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.0.0
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>Priority: Minor
> Fix For: 2.1.0
>
>
> TaskManagerSuite "Kill other task attempts when one attempt belonging to the 
> same task succeeds" is flaky because it requires at least one millisecond to 
> elapse between when the tasks are scheduled and when the check is made for 
> speculatable tasks.
> Fix this by using a manual clock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Chambers updated SPARK-16234:
--
Description: 
given spark.speculative set to true, I'm running a large spark job with parquet 
and savemode overwrite.

Spark will speculatively try to create a task to deal with a straggler. 
However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
selected, if the straggler completes before the original task or the original 
task completes before the straggler then the job will fail due to the file 
already existing.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists

  was:
given spark.speculative set to true, I'm running a large spark job with parquet 
and savemode overwrite.

Spark will speculatively try to create a task to deal with this straggler. 
However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
selected, if the straggler completes before the original task or the original 
task completes before the straggler then the job will fail due to the file 
already existing.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists


> Speculative Task may not be able to overwrite file
> --
>
> Key: SPARK-16234
> URL: https://issues.apache.org/jira/browse/SPARK-16234
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Bill Chambers
>
> given spark.speculative set to true, I'm running a large spark job with 
> parquet and savemode overwrite.
> Spark will speculatively try to create a task to deal with a straggler. 
> However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
> selected, if the straggler completes before the original task or the original 
> task completes before the straggler then the job will fail due to the file 
> already existing.
> java.io.IOException: 
> /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
> already exists



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16234) Speculative Task may not be able to overwrite file

2016-06-27 Thread Bill Chambers (JIRA)
Bill Chambers created SPARK-16234:
-

 Summary: Speculative Task may not be able to overwrite file
 Key: SPARK-16234
 URL: https://issues.apache.org/jira/browse/SPARK-16234
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.0.0
Reporter: Bill Chambers


given spark.speculative set to true, I'm running a large spark job with parquet 
and savemode overwrite.

Spark will speculatively try to create a task to deal with this straggler. 
However, doing this comes with risk because EVEN THOUGH savemode overwrite is 
selected, if the straggler completes before the original task or the original 
task completes before the straggler then the job will fail due to the file 
already existing.

java.io.IOException: 
/...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet 
already exists



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR

2016-06-27 Thread Kai Jiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351826#comment-15351826
 ] 

Kai Jiang commented on SPARK-15767:
---

ping [~shvenkat]

> Decision Tree Regression wrapper in SparkR
> --
>
> Key: SPARK-15767
> URL: https://issues.apache.org/jira/browse/SPARK-15767
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Kai Jiang
>Assignee: Kai Jiang
>
> Implement a wrapper in SparkR to support decision tree regression. R's native 
> Decision Tree Regression implementation is in the rpart package, with signature 
> rpart(formula, dataframe, method="anova"). I propose we implement an API like 
> spark.rpart(dataframe, formula, ...). After having implemented decision tree 
> classification, we could refactor these two into an API more like rpart().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16233) test_sparkSQL.R is failing

2016-06-27 Thread Xin Ren (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xin Ren updated SPARK-16233:

Description: 
By running 
{code}
./R/run-tests.sh 
{code}

Getting error:
{code}
xin:spark xr$ ./R/run-tests.sh
Warning: Ignoring non-spark config property: SPARK_SCALA_VERSION=2.11
Loading required package: methods

Attaching package: ‘SparkR’

The following object is masked from ‘package:testthat’:

describe

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
rank, rbind, sample, startsWith, subset, summary, transform, union

binary functions: ...
functions on binary files: 
broadcast variables: ..
functions in client.R: .
test functions in sparkR.R: .Re-using existing Spark Context. Call 
sparkR.session.stop() or restart R to create a new Spark Context
Re-using existing Spark Context. Call sparkR.session.stop() or restart R to 
create a new Spark Context
...
include an external JAR in SparkContext: Warning: Ignoring non-spark config 
property: SPARK_SCALA_VERSION=2.11
..
include R packages:
MLlib functions: .SLF4J: Failed to load class 
"org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
.27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet dictionary page size to 1048576
27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Dictionary is on
27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Validation is off
27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Writer version is: PARQUET_1_0
27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Maximum row group padding size is 0 bytes
27-Jun-2016 1:51:25 PM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore 
to file. allocated memory: 65,622
27-Jun-2016 1:51:25 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] 
BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, RLE, 
BIT_PACKED]
27-Jun-2016 1:51:25 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, 
list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, 
encodings: [PLAIN, RLE]
27-Jun-2016 1:51:25 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for 
[hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: [PLAIN, 
BIT_PACKED]
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet dictionary page size to 1048576
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Dictionary is on
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Validation is off
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Writer version is: PARQUET_1_0
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Maximum row group padding size is 0 bytes
27-Jun-2016 1:51:26 PM INFO: 
org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore 
to file. allocated memory: 49
27-Jun-2016 1:51:26 PM INFO: 
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, 
list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: [PLAIN, 
RLE]
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: 
Compression: SNAPPY
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet block size to 134217728
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet page size to 1048576
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Parquet dictionary page size to 1048576
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Dictionary is on
27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: 
Validation is off
27-Jun-2016 1:51:26 PM INFO: 

[jira] [Commented] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched by spark-submit

2016-06-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351784#comment-15351784
 ] 

Yin Huai commented on SPARK-16224:
--

https://github.com/apache/spark/pull/13931 should fix the issue. Can you try it?

> Hive context created by HiveContext can't access Hive databases when used in 
> a script launched by spark-submit
> --
>
> Key: SPARK-16224
> URL: https://issues.apache.org/jira/browse/SPARK-16224
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: branch-2.0
>Reporter: Piotr Milanowski
>Assignee: Yin Huai
>Priority: Blocker
>
> Hi,
> This is a continuation of a resolved bug 
> [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345]
> I can access databases when using the new approach, i.e.:
> {code}
> from pyspark.sql import SparkSession
> from pyspark import SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> hc = 
> SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
> print(hc.sql("show databases").collect())
> {code}
> This shows all databases in Hive.
> However, using HiveContext, i.e.:
> {code}
> from pyspark.sql import HiveContext
> from pyspark import SparkContext, SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> sc = SparkContext(conf=conf)
> hive_context = HiveContext(sc)
> print(hive_context.sql("show databases").collect())
> # The result is
> #[Row(result='default')]
> {code}
> prints only the default database.
> I have the {{hive-site.xml}} file configured.
> These snippets are for scripts launched with the {{spark-submit}} command. With 
> pyspark, the same code fragments work fine, displaying all the databases.
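A minimal sketch for double-checking which databases a Hive-enabled session sees on branch-2.0; the app name is illustrative, and both calls should agree once {{hive-site.xml}} is picked up:

{code}
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Hive support is required for the session to read hive-site.xml and
    # talk to the Hive metastore.
    spark = (SparkSession.builder
             .appName("check-hive-databases")   # illustrative name
             .enableHiveSupport()
             .getOrCreate())

    # Both calls should list the same databases when the metastore is reachable.
    print(spark.sql("SHOW DATABASES").collect())
    print([db.name for db in spark.catalog.listDatabases()])
{code}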



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched by spark-submit

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351783#comment-15351783
 ] 

Apache Spark commented on SPARK-16224:
--

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/13931

> Hive context created by HiveContext can't access Hive databases when used in 
> a script launched by spark-submit
> --
>
> Key: SPARK-16224
> URL: https://issues.apache.org/jira/browse/SPARK-16224
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: branch-2.0
>Reporter: Piotr Milanowski
>Assignee: Yin Huai
>Priority: Blocker
>
> Hi,
> This is a continuation of a resolved bug 
> [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345]
> I can access databases when using the new approach, i.e.:
> {code}
> from pyspark.sql import SparkSession
> from pyspark import SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> hc = 
> SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
> print(hc.sql("show databases").collect())
> {code}
> This shows all databases in Hive.
> However, using HiveContext, i.e.:
> {code}
> from pyspark.sql import HiveContext
> from pyspark import SparkContext, SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> sc = SparkContext(conf=conf)
> hive_context = HiveContext(sc)
> print(hive_context.sql("show databases").collect())
> # The result is
> #[Row(result='default')]
> {code}
> prints only the default database.
> I have the {{hive-site.xml}} file configured.
> These snippets are for scripts launched with the {{spark-submit}} command. With 
> pyspark, the same code fragments work fine, displaying all the databases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched by spark-submit

2016-06-27 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351780#comment-15351780
 ] 

Yin Huai commented on SPARK-16224:
--

https://github.com/apache/spark/pull/13931 should fix the issue. Can you try it?

> Hive context created by HiveContext can't access Hive databases when used in 
> a script launched by spark-submit
> --
>
> Key: SPARK-16224
> URL: https://issues.apache.org/jira/browse/SPARK-16224
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: branch-2.0
>Reporter: Piotr Milanowski
>Assignee: Yin Huai
>Priority: Blocker
>
> Hi,
> This is a continuation of a resolved bug 
> [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345]
> I can access databases when using the new approach, i.e.:
> {code}
> from pyspark.sql import SparkSession
> from pyspark import SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> hc = 
> SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate()
> print(hc.sql("show databases").collect())
> {code}
> This shows all databases in Hive.
> However, using HiveContext, i.e.:
> {code}
> from pyspark.sql import HiveContext
> from pyspark import SparkContext, SparkConf
> if __name__ == "__main__":
> conf = SparkConf()
> sc = SparkContext(conf=conf)
> hive_context = HiveContext(sc)
> print(hive_context.sql("show databases").collect())
> # The result is
> #[Row(result='default')]
> {code}
> prints only the default database.
> I have the {{hive-site.xml}} file configured.
> These snippets are for scripts launched with the {{spark-submit}} command. With 
> pyspark, the same code fragments work fine, displaying all the databases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15581:
--
Description: 
This is a master list for MLlib improvements we are working on for the next 
release. Please view this as a wish list rather than a definite plan, for we 
don't have an accurate estimate of available resources. Due to limited review 
bandwidth, features appearing on this list will get higher priority during code 
review. But feel free to suggest new items to the list in comments. We are 
experimenting with this process. Your feedback would be greatly appreciated.

h1. Instructions

h2. For contributors:

* Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
carefully. Code style, documentation, and unit tests are important.
* If you are a first-time Spark contributor, please always start with a 
[starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
than a medium/big feature. Based on our experience, mixing the development 
process with a big feature usually causes long delays in code review.
* Never work silently. Let everyone know on the corresponding JIRA page when 
you start working on some features. This is to avoid duplicate work. For small 
features, you don't need to wait to get JIRA assigned.
* For medium/big features or features with dependencies, please get assigned 
first before coding and keep the ETA updated on the JIRA. If there is no 
activity on the JIRA page for a certain amount of time, the JIRA should be 
released for other contributors.
* Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
after another.
* Remember to add the `@Since("VERSION")` annotation to new public APIs.
* Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review 
greatly helps to improve others' code as well as yours.

h2. For committers:

* Try to break down big features into small and specific JIRA tasks and link 
them properly.
* Add a "starter" label to starter tasks.
* Put a rough estimate for medium/big features and track the progress.
* If you start reviewing a PR, please add yourself to the Shepherd field on 
JIRA.
* If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
please ping a maintainer to make a final pass.
* After merging a PR, create and link JIRAs for Python, example code, and 
documentation if applicable.

h1. Roadmap (*WIP*)

This is NOT [a complete list of MLlib JIRAs for 2.1| 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
 We only include umbrella JIRAs and high-level tasks.

Major efforts in this release:
* Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
RDD-based API
* ML persistence
* Python API feature parity and test coverage
* R API expansion and improvements
* Note about new features: As usual, we expect to expand the feature set of 
MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
over new features.

Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
it, but new features, APIs, and improvements will only be added to `spark.ml`.

h2. Critical feature parity in DataFrame-based API

* Umbrella JIRA: [SPARK-4591]

h2. Persistence

* Complete persistence within MLlib
** Python tuning (SPARK-13786)
* MLlib in R format: compatibility with other languages (SPARK-15572)
* Impose backwards compatibility for persistence (SPARK-15573)

h2. Python API
* Standardize unit tests for Scala and Python to improve and consolidate test 
coverage for Params, persistence, and other common functionality (SPARK-15571)
* Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706)
** Note: The linked JIRAs for this are incomplete.  More to be created...
** Related: Implement Python meta-algorithms in Scala (to simplify persistence) 
(SPARK-15574)
* Feature parity: The main goal of the Python API is to have feature parity 
with the Scala/Java API. You can find a [complete list here| 
https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC].
 The tasks fall into two major categories:
** Python API for missing methods (SPARK-14813)
** Python API for new algorithms. Committers should create a JIRA for the 
Python API after merging a public feature in Scala/Java.

h2. SparkR
* Improve R formula support and implementation (SPARK-15540)
* 

[jira] [Commented] (SPARK-16228) "Percentile" needs explicit cast to double

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351727#comment-15351727
 ] 

Apache Spark commented on SPARK-16228:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/13930

> "Percentile" needs explicit cast to double
> --
>
> Key: SPARK-16228
> URL: https://issues.apache.org/jira/browse/SPARK-16228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> {quote}
>  select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla
> {quote}
> Works.
> {quote}
>  select percentile(cast(id as bigint), 0.5 ) from temp.bla
> {quote}
> Throws
> {quote}
> Error in query: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, 
> decimal(38,18)). Possible choices: _FUNC_(bigint, array)  
> _FUNC_(bigint, double)  ; line 1 pos 7
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16228) "Percentile" needs explicit cast to double

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16228:


Assignee: (was: Apache Spark)

> "Percentile" needs explicit cast to double
> --
>
> Key: SPARK-16228
> URL: https://issues.apache.org/jira/browse/SPARK-16228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> {quote}
>  select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla
> {quote}
> Works.
> {quote}
>  select percentile(cast(id as bigint), 0.5 ) from temp.bla
> {quote}
> Throws
> {quote}
> Error in query: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, 
> decimal(38,18)). Possible choices: _FUNC_(bigint, array)  
> _FUNC_(bigint, double)  ; line 1 pos 7
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16228) "Percentile" needs explicit cast to double

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16228:


Assignee: Apache Spark

> "Percentile" needs explicit cast to double
> --
>
> Key: SPARK-16228
> URL: https://issues.apache.org/jira/browse/SPARK-16228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Assignee: Apache Spark
>
> {quote}
>  select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla
> {quote}
> Works.
> {quote}
>  select percentile(cast(id as bigint), 0.5 ) from temp.bla
> {quote}
> Throws
> {quote}
> Error in query: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, 
> decimal(38,18)). Possible choices: _FUNC_(bigint, array)  
> _FUNC_(bigint, double)  ; line 1 pos 7
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16228) "Percentile" needs explicit cast to double

2016-06-27 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351708#comment-15351708
 ] 

Dongjoon Hyun commented on SPARK-16228:
---

Hi, [~epahomov] and [~srowen].
The root cause is that Spark 2.0 uses `Decimal` as the default type for the literal 
'0.5'.
This happens for `percentile_approx`, too. I guess it will happen for all 
double-type-only external functions.
I'll make a PR for this soon.
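Until the fix lands, a minimal PySpark sketch of the workaround, reusing the table from the report ({{temp.bla}} with a numeric {{id}} column): cast the percentile argument to DOUBLE explicitly so the literal is not parsed as decimal(38,18).

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Fails on 2.0.0: the bare literal 0.5 becomes decimal(38,18), which the Hive
# UDAFPercentile does not accept.
# spark.sql("SELECT percentile(CAST(id AS BIGINT), 0.5) FROM temp.bla")

# Works: cast the second argument to DOUBLE explicitly.
spark.sql(
    "SELECT percentile(CAST(id AS BIGINT), CAST(0.5 AS DOUBLE)) FROM temp.bla"
).show()
{code}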

> "Percentile" needs explicit cast to double
> --
>
> Key: SPARK-16228
> URL: https://issues.apache.org/jira/browse/SPARK-16228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> {quote}
>  select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla
> {quote}
> Works.
> {quote}
>  select percentile(cast(id as bigint), 0.5 ) from temp.bla
> {quote}
> Throws
> {quote}
> Error in query: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, 
> decimal(38,18)). Possible choices: _FUNC_(bigint, array)  
> _FUNC_(bigint, double)  ; line 1 pos 7
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351705#comment-15351705
 ] 

Apache Spark commented on SPARK-16220:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/13929

> Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
> --
>
> Key: SPARK-16220
> URL: https://issues.apache.org/jira/browse/SPARK-16220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Bill Chambers
>
> After discussing this with [~marmbrus] and [~rxin]. We've decided to revert 
> SPARK-15663. After doing some research it seems like this is an unnecessary 
> departure from 1.X functionality and does not have a reasonable substitute 
> that gives the same functionality.
> The first step is to revert the change. After doing that, there are a couple 
> of different approaches to getting at user-defined functions.
> 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does 
> this)
> 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS
> 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar)
> 4. SHOW FUNCTIONS + some column to designate if it's system designed or user 
> defined.
> 1. This aligns with previous functionality and then supplements it with 
> something a bit more specific. 
> 2. Is unclear, because "all" is ambiguous: why should the default refer to 
> only user-defined functions? This doesn't seem like the right approach.
> 3. Same kind of issue, I'm not sure why the user functions should be the 
> default over the system functions. That doesn't seem like the correct 
> approach.
> 4. This one seems nice because it kind of achieves #1, keeps existing 
> functionality, but then supplements it with some more. This also allows you, 
> for example, to create your own set of date functions and then search them 
> all in one go as opposed to searching system and then user functions. This 
> would have to return two columns though, which could potentially be an issue?
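For context, a minimal sketch of how the function list can be inspected from PySpark today; treating {{isTemporary}} as a stand-in for "user defined" is an assumption made for illustration, not the proposed design.

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The command whose output SPARK-15663 changed and this issue reverts.
spark.sql("SHOW FUNCTIONS").show(truncate=False)

# The Catalog API already exposes a per-function flag that could back a
# "system vs. user defined" column like option 4 above.
for fn in spark.catalog.listFunctions():
    if fn.isTemporary:
        print(fn.name)
{code}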



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16233) test_sparkSQL.R is failing

2016-06-27 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351703#comment-15351703
 ] 

Xin Ren commented on SPARK-16233:
-

I'm working on this

> test_sparkSQL.R is failing
> --
>
> Key: SPARK-16233
> URL: https://issues.apache.org/jira/browse/SPARK-16233
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, Tests
>Affects Versions: 2.0.0
>Reporter: Xin Ren
>Priority: Minor
>
> By running 
> {code}
> ./R/run-tests.sh 
> {code}
> Getting error:
> {code}
> 15. Error: create DataFrame from list or data.frame (@test_sparkSQL.R#277) 
> -
> java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/PreInsertCastAndRename$
>   at 
> org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:69)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
>   at 
> org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
>   at 
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:533)
>   at 
> org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:293)
>   at org.apache.spark.sql.api.r.SQLUtils$.createDF(SQLUtils.scala:135)
>   at org.apache.spark.sql.api.r.SQLUtils.createDF(SQLUtils.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:483)
>   at 
> org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
>   at 
> org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
>   at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
>   at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
>   at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
>   at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
>   at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
>   at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
>   at 
> io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
>   at java.lang.Thread.run(Thread.java:745)
> 1: createDataFrame(l, c("a", "b")) at 
> /Users/quickmobile/workspace/spark/R/lib/SparkR/tests/testthat/test_sparkSQL.R:277
> 2: dispatchFunc("createDataFrame(data, schema = NULL, samplingRatio = 1.0)", 
> x, ...)
> 3: f(x, ...)
> 4: callJStatic("org.apache.spark.sql.api.r.SQLUtils", "createDF", srdd, 
> schema$jobj,
>sparkSession)
> 5: invokeJava(isStatic = TRUE, className, methodName, ...)
> 6: stop(readString(conn))
> DONE 
> ===
> Execution halted
> {code}
> Cause: most probably these tests are using 'createDataFrame(sqlContext...)' 
> which is deprecated. The tests' method invocations should be updated. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-16233) test_sparkSQL.R is failing

2016-06-27 Thread Xin Ren (JIRA)
Xin Ren created SPARK-16233:
---

 Summary: test_sparkSQL.R is failing
 Key: SPARK-16233
 URL: https://issues.apache.org/jira/browse/SPARK-16233
 Project: Spark
  Issue Type: Bug
  Components: SparkR, Tests
Affects Versions: 2.0.0
Reporter: Xin Ren
Priority: Minor


By running 
{code}
./R/run-tests.sh 
{code}

Getting error:
{code}
15. Error: create DataFrame from list or data.frame (@test_sparkSQL.R#277) -
java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/PreInsertCastAndRename$
at 
org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:69)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63)
at 
org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:533)
at 
org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:293)
at org.apache.spark.sql.api.r.SQLUtils$.createDF(SQLUtils.scala:135)
at org.apache.spark.sql.api.r.SQLUtils.createDF(SQLUtils.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86)
at 
org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38)
at 
io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at 
io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
1: createDataFrame(l, c("a", "b")) at 
/Users/quickmobile/workspace/spark/R/lib/SparkR/tests/testthat/test_sparkSQL.R:277
2: dispatchFunc("createDataFrame(data, schema = NULL, samplingRatio = 1.0)", x, 
...)
3: f(x, ...)
4: callJStatic("org.apache.spark.sql.api.r.SQLUtils", "createDF", srdd, 
schema$jobj,
   sparkSession)
5: invokeJava(isStatic = TRUE, className, methodName, ...)
6: stop(readString(conn))

DONE ===
Execution halted
{code}

Cause: most probably these tests are using 'createDataFrame(sqlContext...)' 
which is deprecated. The tests' method invocations should be updated. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16231) PySpark ML DataFrame example fails on Vector conversion

2016-06-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16231:
--
Assignee: Bryan Cutler

> PySpark ML DataFrame example fails on Vector conversion
> ---
>
> Key: SPARK-16231
> URL: https://issues.apache.org/jira/browse/SPARK-16231
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.0.1, 2.1.0
>
>
> The PySpark example dataframe_example.py fails when attempting to convert an 
> ML-style Vector (as loaded from libsvm format) to an MLlib-style Vector to be 
> used in stat calculations.  Before the stat calculations, the ML Vectors need 
> to be converted to the old MLlib style with the utility function 
> MLUtils.convertVectorColumnsFromML
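A minimal sketch of the conversion the example needs, assuming Spark 2.0 and the sample libsvm file shipped with Spark (the path is illustrative):

{code}
from pyspark.sql import SparkSession
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.getOrCreate()

# "features" is loaded as the new ml.linalg vector type.
df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Convert the named columns back to the old mllib.linalg vector type before
# handing them to spark.mllib statistics routines.
converted = MLUtils.convertVectorColumnsFromML(df, "features")
converted.printSchema()
{code}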



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16231) PySpark ML DataFrame example fails on Vector conversion

2016-06-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16231.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13928
[https://github.com/apache/spark/pull/13928]

> PySpark ML DataFrame example fails on Vector conversion
> ---
>
> Key: SPARK-16231
> URL: https://issues.apache.org/jira/browse/SPARK-16231
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
> Fix For: 2.0.1, 2.1.0
>
>
> The PySpark example dataframe_example.py fails when attempting to convert an 
> ML-style Vector (as loaded from libsvm format) to an MLlib-style Vector to be 
> used in stat calculations.  Before the stat calculations, the ML Vectors need 
> to be converted to the old MLlib style with the utility function 
> MLUtils.convertVectorColumnsFromML



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16228) "Percentile" needs explicit cast to double

2016-06-27 Thread Egor Pahomov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351697#comment-15351697
 ] 

Egor Pahomov commented on SPARK-16228:
--

[~srowen] "blocker" is questionable, I agree. I just believe, that everything 
which prevent you from moving from 1.6.1 to 2.0 without major code changes is 
"blocker". It's just if I move and such bug would be there a lot of my analysts 
notebook would be invalid. 

> "Percentile" needs explicit cast to double
> --
>
> Key: SPARK-16228
> URL: https://issues.apache.org/jira/browse/SPARK-16228
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>
> {quote}
>  select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla
> {quote}
> Works.
> {quote}
>  select percentile(cast(id as bigint), 0.5 ) from temp.bla
> {quote}
> Throws
> {quote}
> Error in query: No handler for Hive UDF 
> 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': 
> org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method 
> for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, 
> decimal(38,18)). Possible choices: _FUNC_(bigint, array)  
> _FUNC_(bigint, double)  ; line 1 pos 7
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality

2016-06-27 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351686#comment-15351686
 ] 

Bill Chambers edited comment on SPARK-16220 at 6/27/16 7:53 PM:


happy to take a look when it's all done :)


was (Author: bill_chambers):
[~hvanhovell] I imagine I should just resolve this as well?

> Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
> --
>
> Key: SPARK-16220
> URL: https://issues.apache.org/jira/browse/SPARK-16220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Bill Chambers
>
> After discussing this with [~marmbrus] and [~rxin]. We've decided to revert 
> SPARK-15663. After doing some research it seems like this is an unnecessary 
> departure from 1.X functionality and does not have a reasonable substitute 
> that gives the same functionality.
> The first step is to revert the change. After doing that, there are a couple 
> of different approaches to getting at user-defined functions.
> 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does 
> this)
> 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS
> 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar)
> 4. SHOW FUNCTIONS + some column to designate if it's system designed or user 
> defined.
> 1. This aligns with previous functionality and then supplements it with 
> something a bit more specific. 
> 2. Is unclear, because "all" is ambiguous: why should the default refer to 
> only user-defined functions? This doesn't seem like the right approach.
> 3. Same kind of issue, I'm not sure why the user functions should be the 
> default over the system functions. That doesn't seem like the correct 
> approach.
> 4. This one seems nice because it kind of achieves #1, keeps existing 
> functionality, but then supplements it with some more. This also allows you, 
> for example, to create your own set of date functions and then search them 
> all in one go as opposed to searching system and then user functions. This 
> would have to return two columns though, which could potentially be an issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality

2016-06-27 Thread Bill Chambers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351686#comment-15351686
 ] 

Bill Chambers commented on SPARK-16220:
---

[~hvanhovell] I imagine I should just resolve this as well?

> Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
> --
>
> Key: SPARK-16220
> URL: https://issues.apache.org/jira/browse/SPARK-16220
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Bill Chambers
>
> After discussing this with [~marmbrus] and [~rxin]. We've decided to revert 
> SPARK-15663. After doing some research it seems like this is an unnecessary 
> departure from 1.X functionality and does not have a reasonable substitute 
> that gives the same functionality.
> The first step is to revert the change. After doing that, there are a couple 
> of different approaches to getting at user-defined functions.
> 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does 
> this)
> 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS
> 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar)
> 4. SHOW FUNCTIONS + some column to designate if it's system designed or user 
> defined.
> 1. This aligns with previous functionality and then supplements it with 
> something a bit more specific. 
> 2. Is unclear, because "all" is ambiguous: why should the default refer to 
> only user-defined functions? This doesn't seem like the right approach.
> 3. Same kind of issue, I'm not sure why the user functions should be the 
> default over the system functions. That doesn't seem like the correct 
> approach.
> 4. This one seems nice because it kind of achieves #1, keeps existing 
> functionality, but then supplements it with some more. This also allows you, 
> for example, to create your own set of date functions and then search them 
> all in one go as opposed to searching system and then user functions. This 
> would have to return two columns though, which could potentially be an issue?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16232) Getting error by making columns using DataFrame

2016-06-27 Thread Inam Ur Rehman (JIRA)
Inam Ur Rehman created SPARK-16232:
--

 Summary: Getting error by making columns using DataFrame 
 Key: SPARK-16232
 URL: https://issues.apache.org/jira/browse/SPARK-16232
 Project: Spark
  Issue Type: Question
  Components: MLlib, PySpark
Affects Versions: 1.5.1
 Environment: Windows, IPython notebook
Reporter: Inam Ur Rehman


I am using pyspark in an IPython notebook for analysis. 
I am following this example tutorial: 
http://nbviewer.jupyter.org/github/bensadeghi/pyspark-churn-prediction/blob/master/churn-prediction.ipynb

I am getting an error at this step:
pd.DataFrame(CV_data.take(5), columns=CV_data.columns) 

Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 
(TID 18, localhost): org.apache.spark.api.python.PythonException: Traceback 
(most recent call last):




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16232) Getting error by making columns using DataFrame

2016-06-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351655#comment-15351655
 ] 

Sean Owen commented on SPARK-16232:
---

This isn't nearly enough information, doesn't show any actual error.

> Getting error by making columns using DataFrame 
> 
>
> Key: SPARK-16232
> URL: https://issues.apache.org/jira/browse/SPARK-16232
> Project: Spark
>  Issue Type: Question
>  Components: MLlib, PySpark
>Affects Versions: 1.5.1
> Environment: Windows, IPython notebook
>Reporter: Inam Ur Rehman
>  Labels: ipython, pandas, pyspark, python
>
> I am using pyspark in an IPython notebook for analysis. 
> I am following this example tutorial: 
> http://nbviewer.jupyter.org/github/bensadeghi/pyspark-churn-prediction/blob/master/churn-prediction.ipynb
> I am getting an error at this step:
> pd.DataFrame(CV_data.take(5), columns=CV_data.columns) 
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 
> 11.0 (TID 18, localhost): org.apache.spark.api.python.PythonException: 
> Traceback (most recent call last):



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16231) PySpark ML DataFrame example fails on Vector conversion

2016-06-27 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351652#comment-15351652
 ] 

Apache Spark commented on SPARK-16231:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/13928

> PySpark ML DataFrame example fails on Vector conversion
> ---
>
> Key: SPARK-16231
> URL: https://issues.apache.org/jira/browse/SPARK-16231
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>
> The PySpark example dataframe_example.py fails when attempting to convert an 
> ML-style Vector (as loaded from libsvm format) to an MLlib-style Vector to be 
> used in stat calculations.  Before the stat calculations, the ML Vectors need 
> to be converted to the old MLlib style with the utility function 
> MLUtils.convertVectorColumnsFromML



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16231) PySpark ML DataFrame example fails on Vector conversion

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16231:


Assignee: Apache Spark

> PySpark ML DataFrame example fails on Vector conversion
> ---
>
> Key: SPARK-16231
> URL: https://issues.apache.org/jira/browse/SPARK-16231
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>
> The PySpark example dataframe_example.py fails when attempting to convert an 
> ML-style Vector (as loaded from libsvm format) to an MLlib-style Vector to be 
> used in stat calculations.  Before the stat calculations, the ML Vectors need 
> to be converted to the old MLlib style with the utility function 
> MLUtils.convertVectorColumnsFromML



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16231) PySpark ML DataFrame example fails on Vector conversion

2016-06-27 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16231:


Assignee: (was: Apache Spark)

> PySpark ML DataFrame example fails on Vector conversion
> ---
>
> Key: SPARK-16231
> URL: https://issues.apache.org/jira/browse/SPARK-16231
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>
> The PySpark example dataframe_example.py fails when attempting to convert an 
> ML-style Vector (as loaded from libsvm format) to an MLlib-style Vector to be 
> used in stat calculations.  Before the stat calculations, the ML Vectors need 
> to be converted to the old MLlib style with the utility function 
> MLUtils.convertVectorColumnsFromML



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16187) Implement util method for ML Matrix conversion in scala/java

2016-06-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16187.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 13888
[https://github.com/apache/spark/pull/13888]

> Implement util method for ML Matrix conversion in scala/java
> 
>
> Key: SPARK-16187
> URL: https://issues.apache.org/jira/browse/SPARK-16187
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
> Fix For: 2.0.1, 2.1.0
>
>
> This is to provide conversion utils between old/new vector columns in a 
> DataFrame. So users can use it to migrate their datasets and pipelines 
> manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16187) Implement util method for ML Matrix conversion in scala/java

2016-06-27 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16187:
--
Assignee: yuhao yang

> Implement util method for ML Matrix conversion in scala/java
> 
>
> Key: SPARK-16187
> URL: https://issues.apache.org/jira/browse/SPARK-16187
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>Assignee: yuhao yang
> Fix For: 2.0.1, 2.1.0
>
>
> This is to provide conversion utils between old/new vector columns in a 
> DataFrame. So users can use it to migrate their datasets and pipelines 
> manually.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16231) PySpark ML DataFrame example fails on Vector conversion

2016-06-27 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-16231:


 Summary: PySpark ML DataFrame example fails on Vector conversion
 Key: SPARK-16231
 URL: https://issues.apache.org/jira/browse/SPARK-16231
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Reporter: Bryan Cutler


The PySpark example dataframe_example.py fails when attempting to convert an 
ML-style Vector (as loaded from libsvm format) to an MLlib-style Vector to be used in 
stat calculations.  Before the stat calculations, the ML Vectors need to be 
converted to the old MLlib style with the utility function 
MLUtils.convertVectorColumnsFromML



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16230) Executors self-killing after being assigned tasks while still in init

2016-06-27 Thread Tejas Patil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351636#comment-15351636
 ] 

Tejas Patil commented on SPARK-16230:
-

For now, I can think of these two options:

* Change the scheduler to not launch tasks on executors until it sees the first 
heartbeat from the executor.
OR
* Queue the `LaunchTask` events if `RegisteredExecutor` was received but the 
Executor is not up yet. The change would be entirely on the executor side (a rough 
sketch follows below).
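A rough, schematic sketch of the second option; this is not actual Spark code (the real change would live in the Scala CoarseGrainedExecutorBackend), and every name below is invented for illustration.

{code}
# Schematic only: buffer launch requests that arrive before initialization
# finishes and replay them once the executor is ready.
class FakeExecutor(object):
    def launch_task(self, task):
        print("running %s" % task)

class ExecutorBackendStub(object):
    def __init__(self):
        self.executor = None          # Executor not created yet
        self.pending_launches = []    # LaunchTask events that arrived too early

    def on_launch_task(self, task):
        if self.executor is None:
            # Today the backend exits here; instead, queue the task.
            self.pending_launches.append(task)
        else:
            self.executor.launch_task(task)

    def on_executor_ready(self):
        # Called once init (e.g. shuffle-service registration) completes.
        self.executor = FakeExecutor()
        for task in self.pending_launches:
            self.executor.launch_task(task)
        self.pending_launches = []

backend = ExecutorBackendStub()
backend.on_launch_task("task-0")   # arrives before init finishes
backend.on_executor_ready()        # task-0 is replayed instead of killing the JVM
{code}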

> Executors self-killing after being assigned tasks while still in init
> -
>
> Key: SPARK-16230
> URL: https://issues.apache.org/jira/browse/SPARK-16230
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Tejas Patil
>Priority: Minor
>
> I see this happening frequently in our prod clusters:
> * EXECUTOR:   
> [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61]
>  sends request to register itself to the driver.
> * DRIVER: Registers executor and 
> [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179]
> * EXECUTOR:  ExecutorBackend receives ACK and [starts creating an 
> Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81]
> * DRIVER:  Tries to launch a task as it knows there is a new executor. Sends 
> a 
> [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268]
>  to this new executor.
> * EXECUTOR:  Executor is not init'ed (one of the reasons I have seen is 
> because it was still trying to register with the local external shuffle service). 
> Meanwhile, receives a `LaunchTask`. [Kills 
> itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90]
>  as Executor is not init'ed.
> The driver assumes that Executor is ready to accept tasks as soon as it is 
> registered, but that's not true.
> How this affects jobs / cluster:
> * We waste time + resources with these executors but they don't do any 
> meaningful computation.
> * The driver thinks that the executor has started running the task, but since the 
> Executor has killed itself, it does not tell the driver (BTW: this is also another 
> issue which I think could be fixed separately). The driver waits for 10 mins and 
> then declares the executor dead. This adds to the latency of the job. 
> Plus, failure attempts also get bumped up for the tasks even though the tasks 
> were never started. For unlucky tasks, this might cause the job to fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


