[jira] [Commented] (SPARK-16235) "evaluateEachIteration" is returning wrong results when calculated for classification model.
[ https://issues.apache.org/jira/browse/SPARK-16235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352487#comment-15352487 ] Mahmoud Rawas commented on SPARK-16235: --- On one hand, the API does not prevent us from calculating MSE on a classification model, and I gave MSE only as an example; fundamentally the model predicts a probability that a categorical value falls into one of the 2 cases (0,1) (true or false), and calculating MSE gives an indication of how close the model's predicted values are to the actual values. On the other hand, when we calculate this error on each iteration we can figure out the point where the model starts to over-fit by over-training it, and then pick the iteration with the minimum error. It is also worth mentioning the probability itself: it would be a good idea to expose it to the user so they can change the cut-off value, instead of MLlib doing this on the user's behalf at mid-range (this will be a different discussion; I will move it to a new ticket) > "evaluateEachIteration" is returning wrong results when calculated for > classification model. > > > Key: SPARK-16235 > URL: https://issues.apache.org/jira/browse/SPARK-16235 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Mahmoud Rawas > > Basically, within the mentioned function there is code to map the actual > value, which is supposed to be in the range \[0,1], into the range \[-1,1], > in order to make it compatible with the predicted value produced by a > classification model. > {code} > val remappedData = algo match { > case Classification => data.map(x => new LabeledPoint((x.label * 2) - > 1, x.features)) > case _ => data > } > {code} > The problem with this approach is that it calculates an > incorrect error; for example, the MSE will be 4 times larger than the > expected MSE. > Instead we should map the predicted value into a probability value in [0,1]. 
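The factor-of-4 claim follows directly from the remapping: shifting both labels and predictions from [0,1] to [-1,1] via x -> 2x - 1 doubles every residual, so every squared error quadruples. A minimal sketch in plain Python (not Spark code; the labels and probabilities are made up):

```python
# Labels in {0, 1} and predicted probabilities in [0, 1].
labels = [0, 1, 1, 0, 1]
preds = [0.2, 0.9, 0.6, 0.1, 0.8]

def mse(ys, ps):
    return sum((y - p) ** 2 for y, p in zip(ys, ps)) / len(ys)

# MSE on the original probability scale.
mse_prob = mse(labels, preds)

# Remap both labels and predictions to [-1, 1] via x -> 2x - 1:
# each residual becomes 2*(y - p), so each squared error becomes 4*(y - p)^2.
mse_remapped = mse([2 * y - 1 for y in labels], [2 * p - 1 for p in preds])

print(round(mse_remapped / mse_prob, 6))  # 4.0
```

The same algebra holds for any inputs: ((2y - 1) - (2p - 1))^2 = 4(y - p)^2, which is why evaluating on the remapped scale inflates the reported MSE by exactly 4x.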
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16246) Too many block-manager-slave-async-thread opened (TIMED_WAITING) for spark Kafka streaming
[ https://issues.apache.org/jira/browse/SPARK-16246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352481#comment-15352481 ] Saisai Shao commented on SPARK-16246: - It would be better to have a thread dump of the running executors; then we could identify the real cause. > Too many block-manager-slave-async-thread opened (TIMED_WAITING) for spark > Kafka streaming > -- > > Key: SPARK-16246 > URL: https://issues.apache.org/jira/browse/SPARK-16246 > Project: Spark > Issue Type: Bug > Components: Spark Core, Streaming >Affects Versions: 1.6.1 >Reporter: Alex Jiang > > I don't know if our spark streaming issue is related to this > (https://issues.apache.org/jira/browse/SPARK-15558). > Basically we have one Kafka receiver on each executor, and it ran fine for a > while. Then, the executor had a lot of waiting threads accumulated (Thread > 1224: block-manager-slave-async-thread-pool-1083 (TIMED_WAITING)). And the > executor kept opening such new threads. Eventually, it reached the maximum > number of threads on that executor and the Kafka receiver on that executor > failed. > Could someone please shed some light on this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
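One concrete way to act on that suggestion is to take a `jstack <executor-pid>` dump and tally threads by pool name and state. The helper below is a hypothetical illustration (not Spark code), run against a tiny hand-written excerpt in jstack's usual format:

```python
import re
from collections import Counter

# A tiny hand-written excerpt in jstack's usual format; a real dump would
# come from running `jstack <executor-pid>` on the affected executor.
dump = """\
"block-manager-slave-async-thread-pool-1082" daemon prio=5 tid=0x1 nid=0x2 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"block-manager-slave-async-thread-pool-1083" daemon prio=5 tid=0x3 nid=0x4 waiting on condition
   java.lang.Thread.State: TIMED_WAITING (parking)
"dispatcher-event-loop-0" daemon prio=5 tid=0x5 nid=0x6 runnable
   java.lang.Thread.State: RUNNABLE
"""

def count_states(dump_text, name_pattern):
    """Count java.lang.Thread.State values for threads whose name matches."""
    counts = Counter()
    current_match = False
    for line in dump_text.splitlines():
        header = re.match(r'"([^"]+)"', line)
        if header:
            # A quoted name at line start opens a new thread's section.
            current_match = re.search(name_pattern, header.group(1)) is not None
        elif current_match:
            state = re.search(r"Thread\.State: (\w+)", line)
            if state:
                counts[state.group(1)] += 1
    return counts

print(count_states(dump, r"block-manager-slave-async"))  # Counter({'TIMED_WAITING': 2})
```

Counting how fast the `block-manager-slave-async-thread-pool-*` tally grows across successive dumps would show whether threads are leaking, which is exactly the evidence this ticket needs.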
[jira] [Reopened] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-16234: --- You can disable speculation too. The right resolution is "Not A Problem". > Speculative Task may not be able to overwrite file > -- > > Key: SPARK-16234 > URL: https://issues.apache.org/jira/browse/SPARK-16234 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bill Chambers > > resolved... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-16234. --- Resolution: Not A Problem > Speculative Task may not be able to overwrite file > -- > > Key: SPARK-16234 > URL: https://issues.apache.org/jira/browse/SPARK-16234 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bill Chambers > > resolved... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16235) "evaluateEachIteration" is returning wrong results when calculated for classification model.
[ https://issues.apache.org/jira/browse/SPARK-16235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352473#comment-15352473 ] Sean Owen commented on SPARK-16235: --- [~mahmoudr] but MSE is an error metric for regression, not classification. Why would that be relevant here then? > "evaluateEachIteration" is returning wrong results when calculated for > classification model. > > > Key: SPARK-16235 > URL: https://issues.apache.org/jira/browse/SPARK-16235 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Mahmoud Rawas > > Basically, within the mentioned function there is code to map the actual > value, which is supposed to be in the range \[0,1], into the range \[-1,1], > in order to make it compatible with the predicted value produced by a > classification model. > {code} > val remappedData = algo match { > case Classification => data.map(x => new LabeledPoint((x.label * 2) - > 1, x.features)) > case _ => data > } > {code} > The problem with this approach is that it calculates an > incorrect error; for example, the MSE will be 4 times larger than the > expected MSE. > Instead we should map the predicted value into a probability value in [0,1]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16246) Too many block-manager-slave-async-thread opened (TIMED_WAITING) for spark Kafka streaming
[ https://issues.apache.org/jira/browse/SPARK-16246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352470#comment-15352470 ] Sean Owen commented on SPARK-16246: --- You'd have to say a lot more about how you're running this, including the number of partitions, workers, and where the thread is waiting (stack trace). > Too many block-manager-slave-async-thread opened (TIMED_WAITING) for spark > Kafka streaming > -- > > Key: SPARK-16246 > URL: https://issues.apache.org/jira/browse/SPARK-16246 > Project: Spark > Issue Type: Bug > Components: Spark Core, Streaming >Affects Versions: 1.6.1 >Reporter: Alex Jiang > > I don't know if our spark streaming issue is related to this > (https://issues.apache.org/jira/browse/SPARK-15558). > Basically we have one Kafka receiver on each executor, and it ran fine for a > while. Then, the executor had a lot of waiting threads accumulated (Thread > 1224: block-manager-slave-async-thread-pool-1083 (TIMED_WAITING)). And the > executor kept opening such new threads. Eventually, it reached the maximum > number of threads on that executor and the Kafka receiver on that executor > failed. > Could someone please shed some light on this? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16218) spark 1.4.1 "Storage Fraction Cached" was greater than 120%. And I recached the table in memory and found querying faster than before. Maybe it's a bug.
[ https://issues.apache.org/jira/browse/SPARK-16218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352463#comment-15352463 ] Vincent zhao commented on SPARK-16218: -- Now I see, thank you > spark 1.4.1 "Storage Fraction Cached" was greater than 120%. And I recached > the table in memory and found querying faster than before. Maybe it's a bug. > - > > Key: SPARK-16218 > URL: https://issues.apache.org/jira/browse/SPARK-16218 > Project: Spark > Issue Type: Bug > Components: Block Manager, SQL >Affects Versions: 1.4.1 > Environment: Java Version 1.7.0_71 (Oracle Corporation) > Scala Version 2.10.4 >Reporter: Vincent zhao >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > We cached a Hive table in memory using CACHE TABLE tablename. At first, > we queried the data at a very high speed. But after many runs, we found > queries became very slow, even though the SQL was the same as before. > So we checked the reason in the Jobs tab of the Spark Web UI, and found that > some tasks took a very long time even though their input size was not > large. And in the Storage tab of the Spark Web UI of Spark 1.4.1, we saw a > case where the "Fraction Cached" was greater than 120%. So > I recached the table in memory and found querying faster than before. This may > be a bug in this version; if so, we could also add a monitor for it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16239) SQL issues with cast from date to string around daylight savings time
[ https://issues.apache.org/jira/browse/SPARK-16239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352448#comment-15352448 ] Sean Owen commented on SPARK-16239: --- These dates are ambiguous though. They don't have a timezone, which in some cases determines whether DST is in effect. > SQL issues with cast from date to string around daylight savings time > - > > Key: SPARK-16239 > URL: https://issues.apache.org/jira/browse/SPARK-16239 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Glen Maisey >Priority: Critical > > Hi all, > I have a dataframe with a date column. When I cast it to a string using the > spark sql cast function it converts it to the wrong date on certain days. > Looking into it, it occurs once a year, when summer daylight savings starts. > I've tried to show this issue in the code below. The toString() function works > correctly whereas the cast does not. > Unfortunately my users are using SQL code rather than scala dataframes and > therefore this workaround does not apply. This was actually picked up where a > user was writing something like "SELECT date1 UNION ALL SELECT date2" where > date1 was a string and date2 was a date type. It must be implicitly > converting the date to a string, which gives this error. 
> I'm in the Australia/Sydney timezone (see the time changes here > http://www.timeanddate.com/time/zone/australia/sydney) > val dates = > Array("2014-10-03","2014-10-04","2014-10-05","2014-10-06","2015-10-02","2015-10-03", > "2015-10-04", "2015-10-05") > val df = sc.parallelize(dates) > .toDF("txn_date") > .select(col("txn_date").cast("Date")) > df.select( > col("txn_date"), > col("txn_date").cast("Timestamp").alias("txn_date_timestamp"), > col("txn_date").cast("String").alias("txn_date_str_cast"), > col("txn_date".toString()).alias("txn_date_str_toString") > ) > .show() > +----------+--------------------+-----------------+---------------------+ > | txn_date| txn_date_timestamp|txn_date_str_cast|txn_date_str_toString| > +----------+--------------------+-----------------+---------------------+ > |2014-10-03|2014-10-02 14:00:...| 2014-10-03| 2014-10-03| > |2014-10-04|2014-10-03 14:00:...| 2014-10-04| 2014-10-04| > |2014-10-05|2014-10-04 13:00:...| 2014-10-04| 2014-10-05| > |2014-10-06|2014-10-05 13:00:...| 2014-10-06| 2014-10-06| > |2015-10-02|2015-10-01 14:00:...| 2015-10-02| 2015-10-02| > |2015-10-03|2015-10-02 14:00:...| 2015-10-03| 2015-10-03| > |2015-10-04|2015-10-03 13:00:...| 2015-10-03| 2015-10-04| > |2015-10-05|2015-10-04 13:00:...| 2015-10-05| 2015-10-05| > +----------+--------------------+-----------------+---------------------+ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
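The mechanics behind rows like 2015-10-04 becoming 2015-10-03 can be shown without Spark: a calendar day in Australia/Sydney is not always 86,400 seconds long, so any date/timestamp conversion that assumes fixed-length days can land on the previous local day around the DST switch. A sketch in plain Python using the standard zoneinfo database — this illustrates the timezone arithmetic, not Spark's actual cast implementation:

```python
from datetime import datetime, date
from zoneinfo import ZoneInfo

syd = ZoneInfo("Australia/Sydney")

# DST started at 2am on 2015-10-04 in Sydney, so that local day is 23 hours.
day_start = datetime(2015, 10, 4, tzinfo=syd)
day_end = datetime(2015, 10, 5, tzinfo=syd)
seconds_in_day = day_end.timestamp() - day_start.timestamp()
print(seconds_in_day)  # 82800.0, i.e. 23 hours, not 86400

# Naively stepping back "one day" by a fixed 86400 seconds from local
# midnight 2015-10-05 overshoots into 2015-10-03 local time:
back = datetime.fromtimestamp(day_end.timestamp() - 86400, tz=syd)
print(back.date())  # 2015-10-03, not 2015-10-04
```

Any conversion path that round-trips a date through an epoch timestamp with fixed-day arithmetic will show exactly the off-by-one-day behavior in the table above on DST-start days.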
[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
[ https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352436#comment-15352436 ] Xin Ren commented on SPARK-16144: - Sorry, still trying to solve the merge conflicts; should be close to finished... > Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict > - > > Key: SPARK-16144 > URL: https://issues.apache.org/jira/browse/SPARK-16144 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xin Ren > > After we grouped generic methods by algorithm, it would be nice to add a > separate Rd for each ML generic method, in particular write.ml, read.ml, > summary, and predict, and link the implementations with seealso. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16241) model loading backward compatibility for ml NaiveBayes
[ https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16241: Assignee: (was: Apache Spark) > model loading backward compatibility for ml NaiveBayes > -- > > Key: SPARK-16241 > URL: https://issues.apache.org/jira/browse/SPARK-16241 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > To help users migrate from Spark 1.6. to 2.0, we should make model loading > backward compatible with models saved in 1.6. Please manually verify the fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16241) model loading backward compatibility for ml NaiveBayes
[ https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16241: Assignee: Apache Spark > model loading backward compatibility for ml NaiveBayes > -- > > Key: SPARK-16241 > URL: https://issues.apache.org/jira/browse/SPARK-16241 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Assignee: Apache Spark >Priority: Minor > > To help users migrate from Spark 1.6. to 2.0, we should make model loading > backward compatible with models saved in 1.6. Please manually verify the fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16241) model loading backward compatibility for ml NaiveBayes
[ https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352429#comment-15352429 ] Apache Spark commented on SPARK-16241: -- User 'zlpmichelle' has created a pull request for this issue: https://github.com/apache/spark/pull/13940 > model loading backward compatibility for ml NaiveBayes > -- > > Key: SPARK-16241 > URL: https://issues.apache.org/jira/browse/SPARK-16241 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > To help users migrate from Spark 1.6. to 2.0, we should make model loading > backward compatible with models saved in 1.6. Please manually verify the fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16202) Misleading Description of CreatableRelationProvider's createRelation
[ https://issues.apache.org/jira/browse/SPARK-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16202. - Resolution: Fixed Assignee: Xiao Li Fix Version/s: 2.1.0 > Misleading Description of CreatableRelationProvider's createRelation > > > Key: SPARK-16202 > URL: https://issues.apache.org/jira/browse/SPARK-16202 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > Fix For: 2.1.0 > > > The API description of {{createRelation}} in {{CreatableRelationProvider}} is > misleading. The current description only expects users to return the > relation. However, the major goal of this API should also include saving the > Dataframe. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
[ https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16144: -- Assignee: Xin Ren > Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict > - > > Key: SPARK-16144 > URL: https://issues.apache.org/jira/browse/SPARK-16144 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xin Ren > > After we grouped generic methods by the algorithm, it would be nice to add a > separate Rd for each ML generic methods, in particular, write.ml, read.ml, > summary, and predict and link the implementations with seealso. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
[ https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352409#comment-15352409 ] Xiangrui Meng edited comment on SPARK-16144 at 6/28/16 5:57 AM: Please hold because this -should- must be combined with SPARK-16140. was (Author: mengxr): Please hold because this should be combined with SPARK-16140. > Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict > - > > Key: SPARK-16144 > URL: https://issues.apache.org/jira/browse/SPARK-16144 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > After we grouped generic methods by the algorithm, it would be nice to add a > separate Rd for each ML generic methods, in particular, write.ml, read.ml, > summary, and predict and link the implementations with seealso. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16248) Whitelist the list of Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16248: Description: This patch removes the blind fallback into Hive for functions. Instead, it creates a whitelist and adds only a small number of functions to the whitelist, i.e. the ones we intend to support in the long run in Spark. > Whitelist the list of Hive fallback functions > - > > Key: SPARK-16248 > URL: https://issues.apache.org/jira/browse/SPARK-16248 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > > This patch removes the blind fallback into Hive for functions. Instead, it > creates a whitelist and adds only a small number of functions to the > whitelist, i.e. the ones we intend to support in the long run in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
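The change described here (replace a blind fallback with an explicit whitelist check) can be sketched generically. This is illustrative Python, not Spark's actual FunctionRegistry, and the whitelist entries are placeholder names:

```python
# Hypothetical whitelist: only these names may fall back to Hive.
HIVE_FALLBACK_WHITELIST = {"histogram_numeric", "percentile", "java_method"}

def lookup_function(name, native_registry, hive_registry):
    """Resolve a SQL function: prefer the native registry; fall back to Hive
    only for whitelisted names instead of blindly delegating every unknown one."""
    if name in native_registry:
        return native_registry[name]
    if name in HIVE_FALLBACK_WHITELIST and name in hive_registry:
        return hive_registry[name]
    raise KeyError(f"Undefined function: {name}")

native = {"upper": str.upper}
hive = {"histogram_numeric": lambda *a: "hive impl", "weird_udf": lambda *a: "x"}

print(lookup_function("upper", native, hive)("abc"))  # ABC
# "weird_udf" exists in Hive but is not whitelisted, so it now raises KeyError
# instead of silently resolving to the Hive implementation.
```

The design benefit is that every function Spark exposes is an explicit, reviewable decision, rather than whatever happens to exist in the Hive build on the classpath.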
[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
[ https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352409#comment-15352409 ] Xiangrui Meng commented on SPARK-16144: --- Please hold because this should be combined with SPARK-16140. > Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict > - > > Key: SPARK-16144 > URL: https://issues.apache.org/jira/browse/SPARK-16144 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > After we grouped generic methods by the algorithm, it would be nice to add a > separate Rd for each ML generic methods, in particular, write.ml, read.ml, > summary, and predict and link the implementations with seealso. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16248) Whitelist the list of Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16248: Assignee: Apache Spark (was: Reynold Xin) > Whitelist the list of Hive fallback functions > - > > Key: SPARK-16248 > URL: https://issues.apache.org/jira/browse/SPARK-16248 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16248) Whitelist the list of Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352405#comment-15352405 ] Apache Spark commented on SPARK-16248: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/13939 > Whitelist the list of Hive fallback functions > - > > Key: SPARK-16248 > URL: https://issues.apache.org/jira/browse/SPARK-16248 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16248) Whitelist the list of Hive fallback functions
[ https://issues.apache.org/jira/browse/SPARK-16248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16248: Assignee: Reynold Xin (was: Apache Spark) > Whitelist the list of Hive fallback functions > - > > Key: SPARK-16248 > URL: https://issues.apache.org/jira/browse/SPARK-16248 > Project: Spark > Issue Type: Improvement >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16248) Whitelist the list of Hive fallback functions
Reynold Xin created SPARK-16248: --- Summary: Whitelist the list of Hive fallback functions Key: SPARK-16248 URL: https://issues.apache.org/jira/browse/SPARK-16248 Project: Spark Issue Type: Improvement Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15863) Update SQL programming guide for Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-15863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352390#comment-15352390 ] Apache Spark commented on SPARK-15863: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13938 > Update SQL programming guide for Spark 2.0 > -- > > Key: SPARK-15863 > URL: https://issues.apache.org/jira/browse/SPARK-15863 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian >Priority: Blocker > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16132) model loading backward compatibility for tree model (DecisionTree, RF, GBT)
[ https://issues.apache.org/jira/browse/SPARK-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-16132. Resolution: Not A Problem > model loading backward compatibility for tree model (DecisionTree, RF, GBT) > --- > > Key: SPARK-16132 > URL: https://issues.apache.org/jira/browse/SPARK-16132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > Please help check model loading compatibility for tree models, including > DecisionTree, RandomForest and GBT. (load models saved in Spark 1.6). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16243) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352382#comment-15352382 ] yuhao yang commented on SPARK-16243: Closing this as a duplicate of https://issues.apache.org/jira/browse/SPARK-16245 > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16243 > URL: https://issues.apache.org/jira/browse/SPARK-16243 > Project: Spark > Issue Type: Improvement >Reporter: yuhao yang >Priority: Minor > > Fix PCA to load 1.6 models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16243) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yuhao yang resolved SPARK-16243. Resolution: Duplicate > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16243 > URL: https://issues.apache.org/jira/browse/SPARK-16243 > Project: Spark > Issue Type: Improvement >Reporter: yuhao yang >Priority: Minor > > Fix PCA to load 1.6 models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator
[ https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Ma updated SPARK-16247: -- Description: I am using pyspark with dataframes, using pipeline operations to train and predict the result. It works fine for a single test. However, I hit an issue when using a pipeline with CrossValidator. The issue is that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and feature columns. Those fields are built by StringIndexer and VectorIndexer, and are supposed to exist after executing the pipeline. Then I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, _fit function and line 239, est.fit), and found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". Would you mind advising whether my usage is correct or not? Thanks. Here is a code snippet // # Indexing labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data) featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data) // # Training classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20) pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model]) // # Cross Validation paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build() cvEvaluator = MulticlassClassificationEvaluator(metricName="precision") cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10) cvModel = cv.fit(trainingData) was: I am using pyspark with dataframes, using pipeline operations to train and predict the result. It works fine for a single test. However, I hit an issue when using a pipeline with CrossValidator. The issue is that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and feature columns. Those fields are built by StringIndexer and VectorIndexer, and are supposed to exist after executing the pipeline. Then I dug into the pyspark library (line 222, _fit function and line 239, est.fit), and found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". Would you mind advising whether my usage is correct or not? Thanks. Here is a code snippet // # Indexing labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data) featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data) // # Training classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20) pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model]) // # Cross Validation paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build() cvEvaluator = MulticlassClassificationEvaluator(metricName="precision") cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10) cvModel = cv.fit(trainingData) > Using pyspark dataframe with pipeline and cross validator > - > > Key: SPARK-16247 > URL: https://issues.apache.org/jira/browse/SPARK-16247 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 >Reporter: Edward Ma > > I am using pyspark with dataframes, using pipeline operations to train and > predict the result. It works fine for a single test. > However, I hit an issue when using a pipeline with CrossValidator. The issue is > that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label > and feature columns. Those fields are built by StringIndexer and VectorIndexer, > and are supposed to exist after executing the pipeline. > Then I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, _fit > function and line 239, est.fit), and found that it does not execute the pipeline > stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". 
They are supposed to exist after executing the pipeline. Then I dug into the pyspark library (line 222, _fit function and line 239, est.fit), and found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". Would you mind advising whether my usage is correct or not? Thanks. Here is a code snippet // # Indexing labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data) featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data) // # Training classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20) pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model]) // # Cross Validation paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build() cvEvaluator = MulticlassClassificationEvaluator(metricName="precision") cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10) cvModel = cv.fit(trainingData) > Using pyspark dataframe with pipeline and cross validator > - > > Key: SPARK-16247 > URL: https://issues.apache.org/jira/browse/SPARK-16247 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 1.6.1 >Reporter: Edward Ma > > I am using pyspark with dataframes, using pipeline operations to train and > predict the result. It works fine for a single test. > However, I hit an issue when using a pipeline with CrossValidator. The issue is > that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label > and feature columns. Those fields are built by StringIndexer and VectorIndexer, > and are supposed to exist after executing the pipeline. > Then I dug into the pyspark library [python/pyspark/ml/tuning.py] (line 222, _fit > function and line 239, est.fit), and found that it does not execute the pipeline > stages. Therefore, I cannot get "indexedLabel" and "indexedMsg". 
> Would you mind advising whether my usage is correct or not? > Thanks. > Here is a code snippet > // # Indexing > labelIndexer = StringIndexer(inputCol="label", > outputCol="indexedLabel").fit(extracted_data) > featureIndexer = VectorIndexer(inputCol="extracted_msg", > outputCol="indexedMsg", maxCategories=3000).fit(extracted_data) > // # Training > classification_model = RandomForestClassifier(labelCol="indexedLabel", > featuresCol="indexedMsg", numTrees=50, maxDepth=20) > pipeline = Pipeline(stages=[labelIndexer, featureIndexer, > classification_model]) > // # Cross Validation > paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build() > cvEvaluator = MulticlassClassificationEvaluator(metricName="precision") > cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, > evaluator=cvEvaluator, numFolds=10) > cvModel =
[jira] [Updated] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator
[ https://issues.apache.org/jira/browse/SPARK-16247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Edward Ma updated SPARK-16247: -- Description:
I am using pyspark with dataframes and a pipeline operation to train and predict results. It works fine for a single run.
However, I hit an issue when using the pipeline together with CrossValidator. The issue is that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and features. Those columns are built by StringIndexer and VectorIndexer and are supposed to exist after the pipeline is executed.
Then I dug into the pyspark library (line 222, the _fit function, and line 239, est.fit) and found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg".
Would you mind advising whether my usage is correct?
Thanks.
Here is the code snippet:
{code}
# Indexing
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
# Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])
# Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)
{code}
was:
I am using pyspark with dataframes and a pipeline operation to train and predict results. It works fine for a single run.
However, I hit an issue when using the pipeline together with CrossValidator. The issue is that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and features. Those columns are built by StringIndexer and VectorIndexer and are supposed to exist after the pipeline is executed.
Then I dug into the pyspark library (line 222, the _fit function, and line 239, est.fit) and found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg".
Would you mind advising whether my usage is correct?
Thanks.
Here is the code snippet:
# Indexing
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
# Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])
# Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)
> Using pyspark dataframe with pipeline and cross validator
> -
>
> Key: SPARK-16247
> URL: https://issues.apache.org/jira/browse/SPARK-16247
> Project: Spark
> Issue Type: Bug
> Components: ML
>Affects Versions: 1.6.1
>Reporter: Edward Ma
>
> I am using pyspark with dataframes and a pipeline operation to train and predict results. It works fine for a single run.
> However, I hit an issue when using the pipeline together with CrossValidator. The issue is that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and features. Those columns are built by StringIndexer and VectorIndexer and are supposed to exist after the pipeline is executed.
> Then I dug into the pyspark library (line 222, the _fit function, and line 239, est.fit) and found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg".
> Would you mind advising whether my usage is correct?
> Thanks.
> Here is the code snippet:
> {code}
> # Indexing
> labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data)
> featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
> # Training
> classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20)
> pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])
> # Cross Validation
> paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
> cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
> cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10)
> cvModel = cv.fit(trainingData)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SPARK-16247) Using pyspark dataframe with pipeline and cross validator
Edward Ma created SPARK-16247: - Summary: Using pyspark dataframe with pipeline and cross validator Key: SPARK-16247 URL: https://issues.apache.org/jira/browse/SPARK-16247 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.6.1 Reporter: Edward Ma
I am using pyspark with dataframes and a pipeline operation to train and predict results. It works fine for a single run.
However, I hit an issue when using the pipeline together with CrossValidator. The issue is that I expect CrossValidator to use "indexedLabel" and "indexedMsg" as the label and features. Those columns are built by StringIndexer and VectorIndexer and are supposed to exist after the pipeline is executed.
Then I dug into the pyspark library (line 222, the _fit function, and line 239, est.fit) and found that it does not execute the pipeline stages. Therefore, I cannot get "indexedLabel" and "indexedMsg".
Would you mind advising whether my usage is correct?
Thanks.
Here is the code snippet:
{code}
# Indexing
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(extracted_data)
featureIndexer = VectorIndexer(inputCol="extracted_msg", outputCol="indexedMsg", maxCategories=3000).fit(extracted_data)
# Training
classification_model = RandomForestClassifier(labelCol="indexedLabel", featuresCol="indexedMsg", numTrees=50, maxDepth=20)
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, classification_model])
# Cross Validation
paramGrid = ParamGridBuilder().addGrid(1000, (10, 100, 1000)).build()
cvEvaluator = MulticlassClassificationEvaluator(metricName="precision")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid, evaluator=cvEvaluator, numFolds=10)
cvModel = cv.fit(trainingData)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
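One likely problem in the snippet above is the grid itself: `ParamGridBuilder.addGrid` expects a `Param` object as its first argument (for example `classification_model.numTrees`), not the literal `1000`; with the pipeline passed as the estimator, CrossValidator then re-fits the indexer stages inside each fold. As a dependency-free illustration of what `build()` produces, here is a pure-Python stand-in (the `build_param_grid` helper and its dict layout are ours for illustration, not pyspark API):

```python
from itertools import product

def build_param_grid(grids):
    # Pure-Python stand-in for pyspark's ParamGridBuilder.build():
    # one mapping per combination of the supplied parameter values.
    names = list(grids)
    return [dict(zip(names, combo)) for combo in product(*grids.values())]

# With real pyspark, the grid must be keyed by Param objects, e.g. (sketch):
#   paramGrid = (ParamGridBuilder()
#                .addGrid(classification_model.numTrees, [10, 100, 1000])
#                .build())
grid = build_param_grid({"numTrees": [10, 100, 1000], "maxDepth": [5, 20]})
print(len(grid))  # 3 x 2 = 6 candidate models per CV fold
```

Each entry of the grid corresponds to one candidate model that CrossValidator evaluates per fold, which is why large grids multiply training cost quickly.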
[jira] [Created] (SPARK-16246) Too many block-manager-slave-async-thread opened (TIMED_WAITING) for spark Kafka streaming
Alex Jiang created SPARK-16246: -- Summary: Too many block-manager-slave-async-thread opened (TIMED_WAITING) for spark Kafka streaming Key: SPARK-16246 URL: https://issues.apache.org/jira/browse/SPARK-16246 Project: Spark Issue Type: Bug Components: Spark Core, Streaming Affects Versions: 1.6.1 Reporter: Alex Jiang
I don't know if our Spark Streaming issue is related to this (https://issues.apache.org/jira/browse/SPARK-15558). Basically, we have one Kafka receiver on each executor, and it ran fine for a while. Then the executor accumulated a lot of waiting threads (Thread 1224: block-manager-slave-async-thread-pool-1083 (TIMED_WAITING)) and kept opening new ones. Eventually it reached the maximum number of threads on that executor, and the Kafka receiver on that executor failed. Could someone please shed some light on this?
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16221) Redirect Parquet JUL logger via SLF4J for WRITE operations
[ https://issues.apache.org/jira/browse/SPARK-16221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian resolved SPARK-16221. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13918 [https://github.com/apache/spark/pull/13918]
> Redirect Parquet JUL logger via SLF4J for WRITE operations
> --
>
> Key: SPARK-16221
> URL: https://issues.apache.org/jira/browse/SPARK-16221
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> SPARK-8118 implemented redirecting the Parquet JUL logger via SLF4J, but the redirection is currently applied only when READ operations occur. If users perform only WRITE operations, many Parquet log lines are emitted.
> This issue makes the redirection work for WRITE operations, too.
> **Before**
> {code}
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> ... about 70 lines of Parquet log ...
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> ... about 70 lines of Parquet log ...
> {code}
> **After**
> {code}
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> [Stage 0:> (0 + 8) / 8]
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16221) Redirect Parquet JUL logger via SLF4J for WRITE operations
[ https://issues.apache.org/jira/browse/SPARK-16221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cheng Lian updated SPARK-16221: --- Assignee: Dongjoon Hyun
> Redirect Parquet JUL logger via SLF4J for WRITE operations
> --
>
> Key: SPARK-16221
> URL: https://issues.apache.org/jira/browse/SPARK-16221
> Project: Spark
> Issue Type: Improvement
> Components: SQL
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
>
> SPARK-8118 implemented redirecting the Parquet JUL logger via SLF4J, but the redirection is currently applied only when READ operations occur. If users perform only WRITE operations, many Parquet log lines are emitted.
> This issue makes the redirection work for WRITE operations, too.
> **Before**
> {code}
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> Jun 26, 2016 9:04:38 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> ... about 70 lines of Parquet log ...
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> ... about 70 lines of Parquet log ...
> {code}
> **After**
> {code}
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> [Stage 0:> (0 + 8) / 8]
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
>
> scala> spark.range(10).write.format("parquet").mode("overwrite").save("/tmp/p")
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs
[ https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16111. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.0.0 > Hide SparkOrcNewRecordReader in API docs > > > Key: SPARK-16111 > URL: https://issues.apache.org/jira/browse/SPARK-16111 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Reporter: Xiangrui Meng >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 2.0.0 > > > We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it > appears on the top of the list in the Scala API doc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15558) Deadlock when retrieving shuffled cached data
[ https://issues.apache.org/jira/browse/SPARK-15558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352361#comment-15352361 ] Alex Jiang commented on SPARK-15558: I don't know if our Spark Streaming issue is related to this. Basically, we have one Kafka receiver on each executor, and it ran fine for a while. Then the executor accumulated a lot of waiting threads (Thread 1224: block-manager-slave-async-thread-pool-1083 (TIMED_WAITING)) and kept opening new ones. Eventually it reached the maximum number of threads on that executor, and the Kafka receiver on that executor failed. Could someone please shed some light on this?
> Deadlock when retrieving shuffled cached data
> -
>
> Key: SPARK-15558
> URL: https://issues.apache.org/jira/browse/SPARK-15558
> Project: Spark
> Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Fabiano Francesconi
> Attachments: screenshot-1.png
>
> Spark-1.6.1-bin-hadoop2.6 hangs when trying to retrieve shuffled cached data from another host. The job I am currently executing fetches data using async actions and persists these RDDs into main memory (they all fit). Later on, at the point where it is currently hanging, the application retrieves this cached data and hangs. Once the timeout set in the Await.result call is met, the application crashes.
> This problem is reproducible on every execution, although the point at which it hangs is not.
> I have also tried activating:
> {code}
> spark.memory.useLegacyMode=true
> {code}
> as mentioned in SPARK-13566, guessing a deadlock similar to the one between MemoryStore and BlockManager. Unfortunately, this didn't help.
> The only plausible (albeit debatable) solution would be to use speculation mode.
> Configuration: > {code} > /usr/local/tl/spark-latest/bin/spark-submit \ > --executor-memory 80G \ > --total-executor-cores 90 \ > --driver-memory 8G \ > {code} > Stack trace: > {code} > "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-55" #293 > daemon prio=5 os_prio=0 tid=0x7f99d4004000 nid=0x4e80 waiting on > condition [0x7f9946bfb000] >java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f9b541a6570> (a > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool) > at > scala.concurrent.forkjoin.ForkJoinPool.idleAwaitWork(ForkJoinPool.java:2135) > at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2067) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > "sparkExecutorActorSystem-akka.remote.default-remote-dispatcher-54" #292 > daemon prio=5 os_prio=0 tid=0x7f99d4002000 nid=0x4e6d waiting on > condition [0x7f98c86b6000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7f9b541a6570> (a > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinPool) > at scala.concurrent.forkjoin.ForkJoinPool.scan(ForkJoinPool.java:2075) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > "Executor task launch worker-43" #236 daemon prio=5 os_prio=0 > tid=0x7f9950001800 nid=0x4acc waiting on condition [0x7f9a2c4be000] >java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x7fab3f081300> (a > scala.concurrent.impl.Promise$CompletionLatch) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > at > 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > at > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > at > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > at > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > at scala.concurrent.Await$.result(package.scala:107) > at > org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:102) > at > org.apache.spark.storage.BlockManager$$
[jira] [Assigned] (SPARK-16245) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16245: Assignee: Apache Spark > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16245 > URL: https://issues.apache.org/jira/browse/SPARK-16245 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > model loading backward compatibility for ml.feature.PCA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16245) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352350#comment-15352350 ] Apache Spark commented on SPARK-16245: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/13937 > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16245 > URL: https://issues.apache.org/jira/browse/SPARK-16245 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > model loading backward compatibility for ml.feature.PCA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16245) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16245: Assignee: (was: Apache Spark) > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16245 > URL: https://issues.apache.org/jira/browse/SPARK-16245 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Yanbo Liang >Priority: Minor > > model loading backward compatibility for ml.feature.PCA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16245) model loading backward compatibility for ml.feature.PCA
Yanbo Liang created SPARK-16245: --- Summary: model loading backward compatibility for ml.feature.PCA Key: SPARK-16245 URL: https://issues.apache.org/jira/browse/SPARK-16245 Project: Spark Issue Type: Improvement Components: ML Reporter: Yanbo Liang Priority: Minor model loading backward compatibility for ml.feature.PCA -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16244) Failed job/stage couldn't stop JobGenerator immediately.
[ https://issues.apache.org/jira/browse/SPARK-16244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liyin Tang updated SPARK-16244: --- Description: This streaming job has a very simple DAG. Each batch has only 1 job, and each job has only 1 stage. Based on the following logs, we observed a potential race condition: stage 1 failed due to task failures, which triggers failJobAndIndependentStages. Meanwhile, the next stage (job), 2, was submitted and managed to run a few tasks successfully before the JobGenerator was stopped via the shutdown hook. Since the next job ran a few tasks successfully, it corrupted the checkpoint / offset management. Here is the log from my job:
{color:red} Stage 227 started: {color}
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 227(foreachRDD at DBExportStreaming.java:55)
[INFO] 2016-06-25 18:59:00,160 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55), which has no missing parents
[INFO] 2016-06-25 18:59:00,171 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 227 (MapPartitionsRDD[455] at foreachRDD at DBExportStreaming.java:55)
{color:red} Stage 227 failed: {color}
[ERROR] 2016-06-25 19:01:34,083 org.apache.spark.scheduler.TaskSetManager logError - Task 26 in stage 227.0 failed 4 times; aborting job
[INFO] 2016-06-25 19:01:34,086 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Cancelling stage 227
[INFO] 2016-06-25 19:01:34,088 org.apache.spark.scheduler.cluster.YarnScheduler logInfo - Stage 227 was cancelled
[INFO] 2016-06-25 19:01:34,089 org.apache.spark.scheduler.DAGScheduler
logInfo - ResultStage 227 (foreachRDD at DBExportStreaming.java:55) failed in 153.914 s [INFO] 2016-06-25 19:01:34,090 org.apache.spark.scheduler.DAGScheduler logInfo - Job 227 failed: foreachRDD at DBExportStreaming.java:55, took 153.930462 s [INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Finished job streaming job 146688114 ms.0 from job set of time 14 6688114 ms [INFO] 2016-06-25 19:01:34,091 org.apache.spark.streaming.scheduler.JobScheduler logInfo - Total delay: 154.091 s for time 146688114 ms (execution: 153.935 s) {color:red} Stage 228 started: {color} [INFO] 2016-06-25 19:01:34,094 org.apache.spark.SparkContext logInfo - Starting job: foreachRDD at DBExportStreaming.java:55 [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Got job 228 (foreachRDD at DBExportStreaming.java:55) with 1495 output partitions [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Final stage: ResultStage 228(foreachRDD at DBExportStreaming.java:55) Exception in thread "main" [INFO] 2016-06-25 19:01:34,095 org.apache.spark.scheduler.DAGScheduler logInfo - Parents of final stage: List() {color:red} Shutdown hook was called after stage 228 started: {color} [INFO] 2016-06-25 19:01:34,099 org.apache.spark.streaming.StreamingContext logInfo - Invoking stop(stopGracefully=false) from shutdown hook [INFO] 2016-06-25 19:01:34,101 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopping JobGenerator immediately [INFO] 2016-06-25 19:01:34,102 org.apache.spark.streaming.util.RecurringTimer logInfo - Stopped timer for JobGenerator after time 146688126 [INFO] 2016-06-25 19:01:34,103 org.apache.spark.streaming.scheduler.JobGenerator logInfo - Stopped JobGenerator [INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(133720) called with curMem=344903, maxMem=1159641169 [INFO] 2016-06-25 19:01:34,106 org.apache.spark.storage.MemoryStore logInfo - 
Block broadcast_229 stored as values in memory (estimated size 130.6 KB, free 1105.5 MB) [INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - ensureFreeSpace(51478) called with curMem=478623, maxMem=1159641169 [INFO] 2016-06-25 19:01:34,107 org.apache.spark.storage.MemoryStore logInfo - Block broadcast_229_piece0 stored as bytes in memory (estimated size 50.3 KB, free 1105.4 MB) [INFO] 2016-06-25 19:01:34,108 org.apache.spark.storage.BlockManagerInfo logInfo - Added broadcast_229_piece0 in memory on 10.123.209.8:42154 (size: 50.3 KB, free: 1105.8 MB) [INFO] 2016-06-25 19:01:34,109 org.apache.spark.SparkContext logInfo - Created broadcast 229 from broadcast at DAGScheduler.scala:861 [INFO] 2016-06-25 19:01:34,110 org.apache.spark.scheduler.DAGScheduler logInfo - Submitting 1495 missing tasks from ResultStage 228 (MapPartitionsRDD[458] at foreachRDD at DBExpor
[jira] [Created] (SPARK-16244) Failed job/stage couldn't stop JobGenerator immediately.
Liyin Tang created SPARK-16244: -- Summary: Failed job/stage couldn't stop JobGenerator immediately. Key: SPARK-16244 URL: https://issues.apache.org/jira/browse/SPARK-16244 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.2 Reporter: Liyin Tang
This streaming job has a very simple DAG. Each batch has only 1 job, and each job has only 1 stage. Based on the following logs, we observed a potential race condition: stage 1 failed due to task failures, which triggers failJobAndIndependentStages. Meanwhile, the next stage (job), 2, was submitted and managed to run a few tasks successfully before the JobGenerator was stopped via the shutdown hook. Since the next job ran a few tasks successfully, it corrupted the checkpoint / offset management. I will attach the log to the JIRA as well.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
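The report above describes the shutdown hook stopping the JobGenerator immediately while a new job had already started. One commonly suggested knob in this area is Spark Streaming's graceful-shutdown setting, which lets in-flight batches drain before the context stops; whether it avoids this particular race is an assumption on our part, not something this report confirms. A minimal configuration sketch:

```python
# Hedged mitigation sketch. With a live SparkConf this would be:
#   conf = SparkConf().set("spark.streaming.stopGracefullyOnShutdown", "true")
# or equivalently StreamingContext.stop(stopGracefully=True) from application
# code. Shown here as a plain dict so the fragment stands alone.
streaming_conf = {"spark.streaming.stopGracefullyOnShutdown": "true"}
print(streaming_conf)
```

Graceful shutdown trades shutdown latency for consistency of checkpoints and offsets, which is the failure mode described in this issue.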
[jira] [Commented] (SPARK-16132) model loading backward compatibility for tree model (DecisionTree, RF, GBT)
[ https://issues.apache.org/jira/browse/SPARK-16132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352301#comment-15352301 ] Yanbo Liang commented on SPARK-16132: - Since we did not support DecisionTree, RandomForest, and GBT persistence in 1.6, there is no model-loading compatibility issue for these Estimators/Models. I think we can close this JIRA.
> model loading backward compatibility for tree model (DecisionTree, RF, GBT)
> ---
>
> Key: SPARK-16132
> URL: https://issues.apache.org/jira/browse/SPARK-16132
> Project: Spark
> Issue Type: Improvement
> Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Please help check model loading compatibility for tree models, including DecisionTree, RandomForest, and GBT (load models saved in Spark 1.6).
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352296#comment-15352296 ] Yanbo Liang commented on SPARK-16240: - Since we did not back-port https://github.com/apache/spark/pull/12065 to 1.6, we need to implement our own {{LDA.load}} with special handling for the param {{topicDistribution}}, replacing it with {{topicDistributionCol}} when setting params.
> model loading backward compatibility for ml.clustering.LDA
> --
>
> Key: SPARK-16240
> URL: https://issues.apache.org/jira/browse/SPARK-16240
> Project: Spark
> Issue Type: Bug
>Reporter: yuhao yang
>Priority: Minor
>
> After resolving the matrix conversion issue, the LDA model still cannot load 1.6 models, as one of the parameter names was changed.
> https://github.com/apache/spark/pull/12065
> We can perhaps add some special logic in the loading code.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
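One way to add the special handling described above is to rewrite the old param name in the saved model's metadata before the params are set. A minimal sketch, assuming the metadata is a JSON blob with a param map inside (the `remap_params` helper and the dict layout are illustrative; they are not Spark's internal reader API):

```python
import json

def remap_params(metadata_json, renames):
    # Rename old 1.6-era param keys (e.g. "topicDistribution") to their
    # current names (e.g. "topicDistributionCol") before setting params.
    meta = json.loads(metadata_json)
    param_map = meta.get("paramMap", {})
    for old, new in renames.items():
        if old in param_map:
            param_map[new] = param_map.pop(old)
    return meta

# Hypothetical 1.6-style metadata for an LDA model:
saved = json.dumps({"class": "org.apache.spark.ml.clustering.LDA",
                    "paramMap": {"k": 10, "topicDistribution": "topics"}})
meta = remap_params(saved, {"topicDistribution": "topicDistributionCol"})
print(meta["paramMap"])
```

Keeping the rename table in one place makes it easy to extend if other params are renamed between versions.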
[jira] [Commented] (SPARK-15799) Release SparkR on CRAN
[ https://issues.apache.org/jira/browse/SPARK-15799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352289#comment-15352289 ] Xiangrui Meng commented on SPARK-15799: --- [~sunrui] Maybe we can download the corresponding Spark jars from Maven if no SPARK_HOME is specified, just to make it really simple to use. > Release SparkR on CRAN > -- > > Key: SPARK-15799 > URL: https://issues.apache.org/jira/browse/SPARK-15799 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Xiangrui Meng > > Story: "As an R user, I would like to see SparkR released on CRAN, so I can > use SparkR easily in an existing R environment and have other packages built > on top of SparkR." > I made this JIRA with the following questions in mind: > * Are there known issues that prevent us releasing SparkR on CRAN? > * Do we want to package Spark jars in the SparkR release? > * Are there license issues? > * How does it fit into Spark's release process? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16242) Wrap the Matrix conversion utils in Python
[ https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16242: -- Assignee: Yanbo Liang > Wrap the Matrix conversion utils in Python > -- > > Key: SPARK-16242 > URL: https://issues.apache.org/jira/browse/SPARK-16242 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib, PySpark >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > This is to wrap SPARK-16187 in Python. So Python users can use it to convert > DataFrames with matrix columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16143) Group survival analysis methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16143. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 13927 [https://github.com/apache/spark/pull/13927] > Group survival analysis methods in generated doc > > > Key: SPARK-16143 > URL: https://issues.apache.org/jira/browse/SPARK-16143 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Fix For: 2.0.1, 2.1.0 > > > Follow SPARK-16107 and group the doc of spark.survreg, predict(SR), > summary(SR), read/write.ml(SR) under Rd spark.survreg. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16243) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352277#comment-15352277 ] Apache Spark commented on SPARK-16243: -- User 'hhbyyh' has created a pull request for this issue: https://github.com/apache/spark/pull/13936 > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16243 > URL: https://issues.apache.org/jira/browse/SPARK-16243 > Project: Spark > Issue Type: Improvement >Reporter: yuhao yang >Priority: Minor > > Fix PCA to load 1.6 models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16243) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16243: Assignee: (was: Apache Spark) > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16243 > URL: https://issues.apache.org/jira/browse/SPARK-16243 > Project: Spark > Issue Type: Improvement >Reporter: yuhao yang >Priority: Minor > > Fix PCA to load 1.6 models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16243) model loading backward compatibility for ml.feature.PCA
[ https://issues.apache.org/jira/browse/SPARK-16243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16243: Assignee: Apache Spark > model loading backward compatibility for ml.feature.PCA > --- > > Key: SPARK-16243 > URL: https://issues.apache.org/jira/browse/SPARK-16243 > Project: Spark > Issue Type: Improvement >Reporter: yuhao yang >Assignee: Apache Spark >Priority: Minor > > Fix PCA to load 1.6 models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16243) model loading backward compatibility for ml.feature.PCA
yuhao yang created SPARK-16243: -- Summary: model loading backward compatibility for ml.feature.PCA Key: SPARK-16243 URL: https://issues.apache.org/jira/browse/SPARK-16243 Project: Spark Issue Type: Improvement Reporter: yuhao yang Priority: Minor Fix PCA to load 1.6 models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16242) Wrap the Matrix conversion utils in Python
[ https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16242: Assignee: (was: Apache Spark) > Wrap the Matrix conversion utils in Python > -- > > Key: SPARK-16242 > URL: https://issues.apache.org/jira/browse/SPARK-16242 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > This is to wrap SPARK-16187 in Python. So Python users can use it to convert > DataFrames with matrix columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16242) Wrap the Matrix conversion utils in Python
[ https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352269#comment-15352269 ] Apache Spark commented on SPARK-16242: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/13935 > Wrap the Matrix conversion utils in Python > -- > > Key: SPARK-16242 > URL: https://issues.apache.org/jira/browse/SPARK-16242 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib, PySpark >Reporter: Yanbo Liang >Priority: Minor > > This is to wrap SPARK-16187 in Python. So Python users can use it to convert > DataFrames with matrix columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16242) Wrap the Matrix conversion utils in Python
[ https://issues.apache.org/jira/browse/SPARK-16242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16242: Assignee: Apache Spark > Wrap the Matrix conversion utils in Python > -- > > Key: SPARK-16242 > URL: https://issues.apache.org/jira/browse/SPARK-16242 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > This is to wrap SPARK-16187 in Python. So Python users can use it to convert > DataFrames with matrix columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16241) model loading backward compatibility for ml NaiveBayes
[ https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352266#comment-15352266 ] yuhao yang commented on SPARK-16241: Sure, please refer to the fix of https://issues.apache.org/jira/browse/SPARK-16130. Thanks. > model loading backward compatibility for ml NaiveBayes > -- > > Key: SPARK-16241 > URL: https://issues.apache.org/jira/browse/SPARK-16241 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. Please manually verify the fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16241) model loading backward compatibility for ml NaiveBayes
[ https://issues.apache.org/jira/browse/SPARK-16241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352263#comment-15352263 ] Li Ping Zhang commented on SPARK-16241: --- Can I help on this? > model loading backward compatibility for ml NaiveBayes > -- > > Key: SPARK-16241 > URL: https://issues.apache.org/jira/browse/SPARK-16241 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: yuhao yang >Priority: Minor > > To help users migrate from Spark 1.6. to 2.0, we should make model loading > backward compatible with models saved in 1.6. Please manually verify the fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16242) Wrap the Matrix conversion utils in Python
Yanbo Liang created SPARK-16242: --- Summary: Wrap the Matrix conversion utils in Python Key: SPARK-16242 URL: https://issues.apache.org/jira/browse/SPARK-16242 Project: Spark Issue Type: Sub-task Components: ML, MLlib, PySpark Reporter: Yanbo Liang Priority: Minor This is to wrap SPARK-16187 in Python. So Python users can use it to convert DataFrames with matrix columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16241) model loading backward compatibility for ml NaiveBayes
yuhao yang created SPARK-16241: -- Summary: model loading backward compatibility for ml NaiveBayes Key: SPARK-16241 URL: https://issues.apache.org/jira/browse/SPARK-16241 Project: Spark Issue Type: Improvement Components: ML Reporter: yuhao yang Priority: Minor To help users migrate from Spark 1.6 to 2.0, we should make model loading backward compatible with models saved in 1.6. Please manually verify the fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
[ https://issues.apache.org/jira/browse/SPARK-16240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352225#comment-15352225 ] yuhao yang commented on SPARK-16240: cc [~yanboliang] [~josephkb] > model loading backward compatibility for ml.clustering.LDA > -- > > Key: SPARK-16240 > URL: https://issues.apache.org/jira/browse/SPARK-16240 > Project: Spark > Issue Type: Bug >Reporter: yuhao yang >Priority: Minor > > After resolving the matrix conversion issue, LDA model still cannot load 1.6 > models as one of the parameter name is changed. > https://github.com/apache/spark/pull/12065 > We can perhaps add some special logic in the loading code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16240) model loading backward compatibility for ml.clustering.LDA
yuhao yang created SPARK-16240: -- Summary: model loading backward compatibility for ml.clustering.LDA Key: SPARK-16240 URL: https://issues.apache.org/jira/browse/SPARK-16240 Project: Spark Issue Type: Bug Reporter: yuhao yang Priority: Minor After resolving the matrix conversion issue, the LDA model still cannot load 1.6 models, as one of the parameter names was changed. https://github.com/apache/spark/pull/12065 We can perhaps add some special logic in the loading code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
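One possible shape for the "special logic in the loading code" mentioned above is a key-translation pass applied to metadata read from a 1.6 model before it is fed to the 2.0 parameter setters. This is an illustrative sketch only; the parameter names below are placeholders, not the actual LDA parameter that was renamed.

```python
# Hypothetical rename table: 1.6 metadata key -> 2.0 parameter name.
RENAMES_1_6_TO_2_0 = {"oldParamName": "newParamName"}  # placeholder names

def remap_loaded_params(params):
    # Translate any renamed keys; keys without a rename pass through as-is.
    return {RENAMES_1_6_TO_2_0.get(k, k): v for k, v in params.items()}

print(remap_loaded_params({"oldParamName": 0.5, "k": 10}))
# {'newParamName': 0.5, 'k': 10}
```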
[jira] [Created] (SPARK-16239) SQL issues with cast from date to string around daylight savings time
Glen Maisey created SPARK-16239: --- Summary: SQL issues with cast from date to string around daylight savings time Key: SPARK-16239 URL: https://issues.apache.org/jira/browse/SPARK-16239 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.1 Reporter: Glen Maisey Priority: Critical Hi all, I have a dataframe with a date column. When I cast it to a string using the Spark SQL cast function, it converts to the wrong date on certain days. Looking into it, it occurs once a year, when summer daylight savings starts. I've tried to show this issue in the code below. The toString() function works correctly, whereas the cast does not. Unfortunately my users are using SQL code rather than Scala dataframes, so this workaround does not apply. This was actually picked up where a user was writing something like "SELECT date1 UNION ALL SELECT date2", where date1 was a string and date2 was a date type. It must be implicitly converting the date to a string, which gives this error. I'm in the Australia/Sydney timezone (see the time changes here: http://www.timeanddate.com/time/zone/australia/sydney)
{code}
val dates = Array("2014-10-03", "2014-10-04", "2014-10-05", "2014-10-06", "2015-10-02", "2015-10-03", "2015-10-04", "2015-10-05")
val df = sc.parallelize(dates)
  .toDF("txn_date")
  .select(col("txn_date").cast("Date"))

df.select(
  col("txn_date"),
  col("txn_date").cast("Timestamp").alias("txn_date_timestamp"),
  col("txn_date").cast("String").alias("txn_date_str_cast"),
  col("txn_date".toString()).alias("txn_date_str_toString")
).show()
{code}
{panel}
+----------+--------------------+-----------------+---------------------+
|  txn_date|  txn_date_timestamp|txn_date_str_cast|txn_date_str_toString|
+----------+--------------------+-----------------+---------------------+
|2014-10-03|2014-10-02 14:00:...|       2014-10-03|           2014-10-03|
|2014-10-04|2014-10-03 14:00:...|       2014-10-04|           2014-10-04|
|2014-10-05|2014-10-04 13:00:...|       2014-10-04|           2014-10-05|
|2014-10-06|2014-10-05 13:00:...|       2014-10-06|           2014-10-06|
|2015-10-02|2015-10-01 14:00:...|       2015-10-02|           2015-10-02|
|2015-10-03|2015-10-02 14:00:...|       2015-10-03|           2015-10-03|
|2015-10-04|2015-10-03 13:00:...|       2015-10-03|           2015-10-04|
|2015-10-05|2015-10-04 13:00:...|       2015-10-05|           2015-10-05|
+----------+--------------------+-----------------+---------------------+
{panel}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
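The mechanism behind the off-by-one-day rows in the report can be reproduced outside Spark. The sketch below is a speculative reconstruction, not Spark's actual code path: it assumes the date-to-timestamp conversion used the daylight UTC+11 offset for a local midnight that was actually still at standard UTC+10 (DST in Sydney started at 02:00 on 2015-10-04), and that the timestamp-to-string conversion then applied the correct wall-clock rules.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+, requires system tzdata

syd = ZoneInfo("Australia/Sydney")

# Midnight 2015-10-04 computed with the WRONG (daylight, UTC+11) offset
# gives 2015-10-03 13:00 UTC, matching the timestamp column in the report.
buggy_utc = datetime(2015, 10, 4) - timedelta(hours=11)

# Rendering that instant back through the correct time zone rules lands
# on the previous calendar day: the off-by-one the cast-to-string shows.
local = buggy_utc.replace(tzinfo=timezone.utc).astimezone(syd)
print(local)         # 2015-10-03 23:00:00+10:00
print(local.date())  # 2015-10-03, not 2015-10-04
```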
[jira] [Commented] (SPARK-16238) Metrics for generated method bytecode size
[ https://issues.apache.org/jira/browse/SPARK-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352136#comment-15352136 ] Apache Spark commented on SPARK-16238: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/13934 > Metrics for generated method bytecode size > -- > > Key: SPARK-16238 > URL: https://issues.apache.org/jira/browse/SPARK-16238 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Eric Liang >Priority: Minor > > Add metrics for the generated method size, too increase visibility into when > we come close to exceeding the JVM limit of 64KB per method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16238) Metrics for generated method bytecode size
[ https://issues.apache.org/jira/browse/SPARK-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16238: Assignee: Apache Spark > Metrics for generated method bytecode size > -- > > Key: SPARK-16238 > URL: https://issues.apache.org/jira/browse/SPARK-16238 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > > Add metrics for the generated method size, too increase visibility into when > we come close to exceeding the JVM limit of 64KB per method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16238) Metrics for generated method bytecode size
[ https://issues.apache.org/jira/browse/SPARK-16238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16238: Assignee: (was: Apache Spark) > Metrics for generated method bytecode size > -- > > Key: SPARK-16238 > URL: https://issues.apache.org/jira/browse/SPARK-16238 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Eric Liang >Priority: Minor > > Add metrics for the generated method size, too increase visibility into when > we come close to exceeding the JVM limit of 64KB per method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16238) Metrics for generated method bytecode size
Eric Liang created SPARK-16238: -- Summary: Metrics for generated method bytecode size Key: SPARK-16238 URL: https://issues.apache.org/jira/browse/SPARK-16238 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Eric Liang Priority: Minor Add metrics for the generated method size, to increase visibility into when we come close to exceeding the JVM limit of 64KB per method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
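The 64KB ceiling referenced above is the JVM's hard limit of 65,535 bytes of bytecode per method; codegen that crosses it fails at compile time. A minimal sketch of the kind of metric this issue asks for, with illustrative names that are not Spark's actual API:

```python
# Track generated-method bytecode sizes and flag when the largest one
# approaches the JVM's per-method bytecode ceiling.
JVM_METHOD_BYTECODE_LIMIT = 65535

class MethodSizeMetrics:
    def __init__(self):
        self.sizes = []

    def record(self, size):
        self.sizes.append(size)

    def max_size(self):
        return max(self.sizes, default=0)

    def near_limit(self, threshold=0.9):
        # True when any recorded method uses >= 90% of the budget.
        return self.max_size() >= threshold * JVM_METHOD_BYTECODE_LIMIT

m = MethodSizeMetrics()
for s in (1200, 48000, 60000):
    m.record(s)
print(m.max_size(), m.near_limit())  # 60000 True
```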
[jira] [Created] (SPARK-16237) PySpark gapply
Vladimir Feinberg created SPARK-16237: - Summary: PySpark gapply Key: SPARK-16237 URL: https://issues.apache.org/jira/browse/SPARK-16237 Project: Spark Issue Type: New Feature Components: PySpark, SQL Reporter: Vladimir Feinberg To maintain feature parity, `gapply` functionality should be added to `pyspark`'s `GroupedData` with a matching interface. The implementation already exists because it fulfilled a need in another package: https://github.com/vlad17/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py It needs to be migrated (to become a `GroupedData` method, with the first argument renamed to `self`). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
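The gapply contract being ported can be sketched framework-free: group rows by a key, apply a user function to each group, and concatenate the per-group results. This is an illustration of the semantics only, not the spark-sklearn implementation (which operates on distributed DataFrames).

```python
from itertools import groupby

def gapply(rows, key, func):
    # groupby requires input sorted by the grouping key.
    ordered = sorted(rows, key=key)
    out = []
    for k, group in groupby(ordered, key=key):
        # func maps (key, list_of_rows) -> list_of_result_rows.
        out.extend(func(k, list(group)))
    return out

# Example user function: per-group mean of the value column.
rows = [("a", 1), ("a", 3), ("b", 10)]
means = gapply(rows, key=lambda r: r[0],
               func=lambda k, g: [(k, sum(v for _, v in g) / len(g))])
print(means)  # [('a', 2.0), ('b', 10.0)]
```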
[jira] [Commented] (SPARK-16236) Add Path Option back to Load API in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-16236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352085#comment-15352085 ] Apache Spark commented on SPARK-16236: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/13933 > Add Path Option back to Load API in DataFrameReader > --- > > Key: SPARK-16236 > URL: https://issues.apache.org/jira/browse/SPARK-16236 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > @koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ > changed the behavior of `load` API. After the change, the `load` API does not > add the value of `path` into the `options`. Thank you! > We should add the option `path` back to `load()` API in `DataFrameReader`, if > and only if users specify one and only one path in the load API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16236) Add Path Option back to Load API in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-16236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16236: Assignee: (was: Apache Spark) > Add Path Option back to Load API in DataFrameReader > --- > > Key: SPARK-16236 > URL: https://issues.apache.org/jira/browse/SPARK-16236 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > @koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ > changed the behavior of `load` API. After the change, the `load` API does not > add the value of `path` into the `options`. Thank you! > We should add the option `path` back to `load()` API in `DataFrameReader`, if > and only if users specify one and only one path in the load API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16236) Add Path Option back to Load API in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-16236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16236: Assignee: Apache Spark > Add Path Option back to Load API in DataFrameReader > --- > > Key: SPARK-16236 > URL: https://issues.apache.org/jira/browse/SPARK-16236 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li >Assignee: Apache Spark > > @koertkuipers identified the PR https://github.com/apache/spark/pull/13727/ > changed the behavior of `load` API. After the change, the `load` API does not > add the value of `path` into the `options`. Thank you! > We should add the option `path` back to `load()` API in `DataFrameReader`, if > and only if users specify one and only one path in the load API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
[ https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-16220. - Resolution: Fixed Assignee: Herman van Hovell Fix Version/s: 2.0.1 > Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality > -- > > Key: SPARK-16220 > URL: https://issues.apache.org/jira/browse/SPARK-16220 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Bill Chambers >Assignee: Herman van Hovell > Fix For: 2.0.1 > > > After discussing this with [~marmbrus] and [~rxin], we've decided to revert > SPARK-15663. After doing some research it seems like this is an unnecessary > departure from 1.X functionality and does not have a reasonable substitute > that gives the same functionality. > The first step is to revert the change. After doing that, there are a couple > of different approaches to getting at user-defined functions: > 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does > this) > 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS > 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar) > 4. SHOW FUNCTIONS + some column to designate whether it's system-defined or > user-defined. > 1. This aligns with previous functionality and then supplements it with > something a bit more specific. > 2. This is unclear, because "all" is ambiguous: it is not obvious why the > default should refer to only user-defined functions. This doesn't seem like > the right approach. > 3. Same kind of issue; I'm not sure why the user functions should be the > default over the system functions. That doesn't seem like the correct > approach. > 4. This one seems nice because it largely achieves #1, keeps existing > functionality, but then supplements it with some more. This also allows you, > for example, to create your own set of date functions and then search them > all in one go, as opposed to searching system and then user functions. This > would have to return two columns though, which could potentially be an issue? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16236) Add Path Option back to Load API in DataFrameReader
Xiao Li created SPARK-16236: --- Summary: Add Path Option back to Load API in DataFrameReader Key: SPARK-16236 URL: https://issues.apache.org/jira/browse/SPARK-16236 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiao Li @koertkuipers identified that the PR https://github.com/apache/spark/pull/13727/ changed the behavior of the `load` API. After the change, the `load` API does not add the value of `path` into the `options`. Thank you! We should add the option `path` back to the `load()` API in `DataFrameReader`, if and only if users specify one and only one path in the load API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
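The "if and only if one path" rule proposed above can be written down precisely. A hedged sketch with a hypothetical helper name, not the actual DataFrameReader code:

```python
def merge_path_option(paths, options):
    # Fold the positional path into the options map only when the caller
    # supplied exactly one path; with zero or multiple paths, options are
    # passed through unchanged.
    merged = dict(options)
    if len(paths) == 1:
        merged["path"] = paths[0]
    return merged

print(merge_path_option(["/data/a"], {}))             # {'path': '/data/a'}
print(merge_path_option(["/data/a", "/data/b"], {}))  # {}
```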
[jira] [Commented] (SPARK-16055) sparkR.init() can not load sparkPackages when executing an R file
[ https://issues.apache.org/jira/browse/SPARK-16055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352051#comment-15352051 ] Krishna Kalyan commented on SPARK-16055: Hi [~shivaram], (Spark 1.6): I could replicate the issue, with the error above; the log/stack trace is linked below. (Spark 1.5): Everything works fine / I was unable to replicate the issue. https://gist.github.com/krishnakalyan3/4a433cc854def9cb13925b431bd2dfd2 Could you please help me understand why there is a problem in version 1.6 while everything works fine in 1.5. Thanks, Krishna > sparkR.init() can not load sparkPackages when executing an R file > - > > Key: SPARK-16055 > URL: https://issues.apache.org/jira/browse/SPARK-16055 > Project: Spark > Issue Type: Brainstorming > Components: SparkR >Affects Versions: 1.6.1 >Reporter: Sun Rui >Priority: Minor > > This is an issue reported in the Spark user mailing list. Refer to > http://comments.gmane.org/gmane.comp.lang.scala.spark.user/35742 > This issue does not occur in an interactive SparkR session, while it does > occur when executing an R file. > The following example code can be put into an R file to reproduce this issue: > {code} > .libPaths(c("/home/user/spark-1.6.1-bin-hadoop2.6/R/lib",.libPaths())) > Sys.setenv(SPARK_HOME="/home/user/spark-1.6.1-bin-hadoop2.6") > library("SparkR") > sc <- sparkR.init(sparkPackages = "com.databricks:spark-csv_2.11:1.4.0") > sqlContext <- sparkRSQL.init(sc) > df <- read.df(sqlContext, > "file:///home/user/spark-1.6.1-bin-hadoop2.6/data/mllib/sample_tree_data.csv","csv") > showDF(df) > {code} > The error message is as such: > {panel} > 16/06/19 15:48:56 ERROR RBackendHandler: loadDF on > org.apache.spark.sql.api.r.SQLUtils failed > Error in invokeJava(isStatic = TRUE, className, methodName, ...) : > java.lang.ClassNotFoundException: Failed to find data source: csv. 
Please > find packages at http://spark-packages.org > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.lookupDataSource(ResolvedDataSource.scala:77) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:102) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:119) > at org.apache.spark.sql.api.r.SQLUtils$.loadDF(SQLUtils.scala:160) > at org.apache.spark.sql.api.r.SQLUtils.loadDF(SQLUtils.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at > org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141) > at > org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala > Calls: read.df -> callJStatic -> invokeJava > Execution halted > {panel} > The reason behind this is that in case you execute an R file, the R backend > launches before the R interpreter, so there is no opportunity for packages > specified with ‘sparkPackages’ to be processed. > This JIRA issue is to track this issue. An appropriate solution is to be > discussed. Maybe documentation the limitation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15354) Topology aware block replication strategies
[ https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15354: Assignee: Apache Spark > Topology aware block replication strategies > --- > > Key: SPARK-15354 > URL: https://issues.apache.org/jira/browse/SPARK-15354 > Project: Spark > Issue Type: Sub-task > Components: Mesos, Spark Core, YARN >Reporter: Shubham Chopra >Assignee: Apache Spark > > Implementations of strategies for resilient block replication for different > resource managers that replicate the 3-replica strategy used by HDFS, where > the first replica is on an executor, the second replica within the same rack > as the executor and a third replica on a different rack. > The implementation involves providing two pluggable classes, one running in > the driver that provides topology information for every host at cluster start > and the second prioritizing a list of peer BlockManagerIds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15354) Topology aware block replication strategies
[ https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352038#comment-15352038 ] Apache Spark commented on SPARK-15354: -- User 'shubhamchopra' has created a pull request for this issue: https://github.com/apache/spark/pull/13932 > Topology aware block replication strategies > --- > > Key: SPARK-15354 > URL: https://issues.apache.org/jira/browse/SPARK-15354 > Project: Spark > Issue Type: Sub-task > Components: Mesos, Spark Core, YARN >Reporter: Shubham Chopra > > Implementations of strategies for resilient block replication for different > resource managers that replicate the 3-replica strategy used by HDFS, where > the first replica is on an executor, the second replica within the same rack > as the executor and a third replica on a different rack. > The implementation involves providing two pluggable classes, one running in > the driver that provides topology information for every host at cluster start > and the second prioritizing a list of peer BlockManagerIds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15354) Topology aware block replication strategies
[ https://issues.apache.org/jira/browse/SPARK-15354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15354: Assignee: (was: Apache Spark) > Topology aware block replication strategies > --- > > Key: SPARK-15354 > URL: https://issues.apache.org/jira/browse/SPARK-15354 > Project: Spark > Issue Type: Sub-task > Components: Mesos, Spark Core, YARN >Reporter: Shubham Chopra > > Implementations of strategies for resilient block replication for different > resource managers that replicate the 3-replica strategy used by HDFS, where > the first replica is on an executor, the second replica within the same rack > as the executor and a third replica on a different rack. > The implementation involves providing two pluggable classes, one running in > the driver that provides topology information for every host at cluster start > and the second prioritizing a list of peer BlockManagerIds. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
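The HDFS-style placement described above (first replica on the writing executor, second in the same rack, third in a different rack) reduces, for the peer-prioritization half, to partitioning candidate peers by rack. An illustrative sketch with made-up names, not Spark's BlockManager API:

```python
def pick_replica_peers(local_rack, peers):
    # peers: list of (peer_id, rack) tuples for candidate block managers.
    same_rack = [p for p, rack in peers if rack == local_rack]
    off_rack = [p for p, rack in peers if rack != local_rack]
    chosen = []
    if same_rack:
        chosen.append(same_rack[0])  # 2nd replica: same rack as the writer
    if off_rack:
        chosen.append(off_rack[0])   # 3rd replica: a different rack
    return chosen

peers = [("exec-2", "rack-1"), ("exec-3", "rack-2"), ("exec-4", "rack-1")]
print(pick_replica_peers("rack-1", peers))  # ['exec-2', 'exec-3']
```

A real implementation would also randomize within each partition and fall back gracefully when a rack has no candidates.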
[jira] [Commented] (SPARK-16235) "evaluateEachIteration" is returning wrong results when calculated for classification model.
[ https://issues.apache.org/jira/browse/SPARK-16235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352022#comment-15352022 ] Mahmoud Rawas commented on SPARK-16235: --- I am working on a fix > "evaluateEachIteration" is returning wrong results when calculated for > classification model. > > > Key: SPARK-16235 > URL: https://issues.apache.org/jira/browse/SPARK-16235 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1, 1.6.2, 2.0.0 >Reporter: Mahmoud Rawas > > Basically within the mentioned function there is a code to map the actual > value which supposed to be in the range of \[0,1] into the range of \[-1,1], > in order to make it compatible with the predicted value produces by a > classification mode. > {code} > val remappedData = algo match { > case Classification => data.map(x => new LabeledPoint((x.label * 2) - > 1, x.features)) > case _ => data > } > {code} > the problem with this approach is the fact that it will calculate an > incorrect error for an example mse will be be 4 time larger than the actual > expected mse > Instead we should map the predicted value into probability value in [0,1]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16235) "evaluateEachIteration" is returning wrong results when calculated for classification model.
Mahmoud Rawas created SPARK-16235: - Summary: "evaluateEachIteration" is returning wrong results when calculated for classification model. Key: SPARK-16235 URL: https://issues.apache.org/jira/browse/SPARK-16235 Project: Spark Issue Type: Bug Affects Versions: 1.6.2, 1.6.1, 2.0.0 Reporter: Mahmoud Rawas Basically, within the mentioned function there is code that maps the actual value, which is supposed to be in the range \[0,1], into the range \[-1,1], in order to make it compatible with the predicted value produced by a classification model. {code} val remappedData = algo match { case Classification => data.map(x => new LabeledPoint((x.label * 2) - 1, x.features)) case _ => data } {code} The problem with this approach is that it will calculate an incorrect error; for example, the MSE will be 4 times larger than the actual expected MSE. Instead we should map the predicted value into a probability value in [0,1].
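The factor-of-four claim follows directly from the remapping: if labels and predictions both end up on the \[-1,1] scale, every residual is doubled relative to the \[0,1] scale, so every squared error (and hence the MSE) is multiplied by 4. A minimal standalone sketch in plain Python (not the Spark implementation):

```python
def mse(labels, preds):
    """Mean squared error over paired sequences."""
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)

# 0/1 labels with probabilistic predictions, as in binary classification.
labels = [0.0, 1.0, 1.0, 0.0]
preds = [0.2, 0.9, 0.6, 0.1]

# The remapping from the ticket: [0,1] -> [-1,1] via x -> 2x - 1.
remapped_labels = [2 * y - 1 for y in labels]
remapped_preds = [2 * p - 1 for p in preds]

plain = mse(labels, preds)
remapped = mse(remapped_labels, remapped_preds)
# (2p-1) - (2y-1) = 2(p-y), so each squared residual is 4x larger.
assert abs(remapped - 4 * plain) < 1e-12
```

This is why evaluating the error on the remapped scale inflates the reported MSE by exactly 4 relative to evaluating on probabilities in [0,1].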
[jira] [Commented] (SPARK-15700) Spark 2.0 dataframes using more memory (reading/writing parquet)
[ https://issues.apache.org/jira/browse/SPARK-15700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352016#comment-15352016 ] Davies Liu commented on SPARK-15700: My guess is that the SQL metrics required more memory than before. cc [~cloud_fan] , could you test a join with these many partitions to measure the memory used by SQL metrics? > Spark 2.0 dataframes using more memory (reading/writing parquet) > > > Key: SPARK-15700 > URL: https://issues.apache.org/jira/browse/SPARK-15700 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Thomas Graves > > I was running a large 15TB join job with 10 map tasks, 2 reducers > that I frequently have run on Spark 1.6 successfully (with very little GC) > and it failed with an out of heap memory on the driver. Driver had 10GB heap > with 3GB overhead. > 16/05/31 22:47:44 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > java.lang.OutOfMemoryError: Java heap space > at java.util.Arrays.copyOfRange(Arrays.java:3520) > at > org.apache.parquet.io.api.Binary$ByteArrayBackedBinary.getBytes(Binary.java:262) > at > org.apache.parquet.column.statistics.BinaryStatistics.getMinBytes(BinaryStatistics.java:67) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetStatistics(ParquetMetadataConverter.java:242) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.addRowGroup(ParquetMetadataConverter.java:184) > at > org.apache.parquet.format.converter.ParquetMetadataConverter.toParquetMetadata(ParquetMetadataConverter.java:95) > at > org.apache.parquet.hadoop.ParquetFileWriter.serializeFooter(ParquetFileWriter.java:472) > at > org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:500) > at > org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:490) > at > org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:63) > 
at > org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:221) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:55) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:479) > 
at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:252) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:234) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:626) > I haven't had a chance to look into this further yet; just reporting it for > now.
[jira] [Commented] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352004#comment-15352004 ] Davies Liu commented on SPARK-15621: The number of rows in the queue will be bounded by the number of values in the input/output buffers of the Python process, together with some rows buffered by the Python process (under processing), so it's not exactly unbounded. Do you have a way to reproduce the issue? > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs >Priority: Critical > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what, the queue grows unboundedly and fails with OOM, even with > the identity `lambda x: x` udf function.
[jira] [Updated] (SPARK-15621) BatchEvalPythonExec fails with OOM
[ https://issues.apache.org/jira/browse/SPARK-15621?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-15621: --- Priority: Major (was: Critical) > BatchEvalPythonExec fails with OOM > -- > > Key: SPARK-15621 > URL: https://issues.apache.org/jira/browse/SPARK-15621 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Krisztian Szucs > > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python/BatchEvalPythonExec.scala#L40 > No matter what the queue grows unboundedly and fails with OOM, even with > identity `lambda x: x` udf function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched be spark-submit
[ https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-16224: --- Description: Hi, This is a continuation of a resolved bug [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345] I can access databases when using new methodology, i.e: {code} from pyspark.sql import SparkSession from pyspark import SparkConf if __name__ == "__main__": conf = SparkConf() hc = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate() print(hc.sql("show databases").collect()) {code} This shows all database in hive. However, using HiveContext, i.e.: {code} from pyspark.sql import HiveContext from pyspark import SparkContext, SparkConf if __name__ == "__main__": conf = SparkConf() sc = SparkContext(conf=conf) hive_context = HiveContext(sc) print(hive_context.sql("show databases").collect()) # The result is #[Row(result='default')] {code} prints only default database. I have {{hive-site.xml}} file configured. Those snippets are for scripts launched with {{spark-submit}} command. With pyspark those code fragments work fine, displaying all the databases. was: Hi, This is a continuation of a resolved bug [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345] I can access databases when using new methodology, i.e: {code} from pyspark.sql import SparkSession from pyspark import SparkConf if __name__ == "__main__": conf = SparkConf() hc = SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate() print(hc.sql("show databases").collect()) {code} This shows all database in hive. However, using HiveContext, i.e.: {code} from pyspark.sql import HiveContext from pyspark improt SparkContext, SparkConf if __name__ == "__main__": conf = SparkConf() sc = SparkContext(conf=conf) hive_context = HiveContext(sc) print(hive_context.sql("show databases").collect()) # The result is #[Row(result='default')] {code} prints only default database. 
I have {{hive-site.xml}} file configured. Those snippets are for scripts launched with {{spark-submit}} command. With pyspark those code fragments work fine, displaying all the databases. > Hive context created by HiveContext can't access Hive databases when used in > a script launched be spark-submit > -- > > Key: SPARK-16224 > URL: https://issues.apache.org/jira/browse/SPARK-16224 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: branch-2.0 >Reporter: Piotr Milanowski >Assignee: Yin Huai >Priority: Blocker > > Hi, > This is a continuation of a resolved bug > [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345] > I can access databases when using new methodology, i.e: > {code} > from pyspark.sql import SparkSession > from pyspark import SparkConf > if __name__ == "__main__": > conf = SparkConf() > hc = > SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate() > print(hc.sql("show databases").collect()) > {code} > This shows all database in hive. > However, using HiveContext, i.e.: > {code} > from pyspark.sql import HiveContext > from pyspark import SparkContext, SparkConf > if __name__ == "__main__": > conf = SparkConf() > sc = SparkContext(conf=conf) > hive_context = HiveContext(sc) > print(hive_context.sql("show databases").collect()) > # The result is > #[Row(result='default')] > {code} > prints only default database. > I have {{hive-site.xml}} file configured. > Those snippets are for scripts launched with {{spark-submit}} command. With > pyspark those code fragments work fine, displaying all the databases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16144) Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict
[ https://issues.apache.org/jira/browse/SPARK-16144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351942#comment-15351942 ] Xusen Yin commented on SPARK-16144: --- I'd like to work on this. > Add a separate Rd for ML generic methods: read.ml, write.ml, summary, predict > - > > Key: SPARK-16144 > URL: https://issues.apache.org/jira/browse/SPARK-16144 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > After we grouped generic methods by the algorithm, it would be nice to add a > separate Rd for each ML generic methods, in particular, write.ml, read.ml, > summary, and predict and link the implementations with seealso. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10408) Autoencoder
[ https://issues.apache.org/jira/browse/SPARK-10408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351922#comment-15351922 ] Alexander Ulanov commented on SPARK-10408: -- Here is the PR https://github.com/apache/spark/pull/13621 > Autoencoder > --- > > Key: SPARK-10408 > URL: https://issues.apache.org/jira/browse/SPARK-10408 > Project: Spark > Issue Type: Umbrella > Components: ML >Affects Versions: 1.5.0 >Reporter: Alexander Ulanov >Assignee: Alexander Ulanov >Priority: Minor > > Goal: Implement various types of autoencoders > Requirements: > 1)Basic (deep) autoencoder that supports different types of inputs: binary, > real in [0..1]. real in [-inf, +inf] > 2)Sparse autoencoder i.e. L1 regularization. It should be added as a feature > to the MLP and then used here > 3)Denoising autoencoder > 4)Stacked autoencoder for pre-training of deep networks. It should support > arbitrary network layers > References: > 1. Vincent, Pascal, et al. "Extracting and composing robust features with > denoising autoencoders." Proceedings of the 25th international conference on > Machine learning. ACM, 2008. > http://www.iro.umontreal.ca/~vincentp/Publications/denoising_autoencoders_tr1316.pdf > > 2. > http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Rifai_455.pdf, > 3. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., and Manzagol, P.-A. > (2010). Stacked denoising autoencoders: Learning useful representations in a > deep network with a local denoising criterion. Journal of Machine Learning > Research, 11(3371–3408). > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.297.3484&rep=rep1&type=pdf > 4, 5, 6. Bengio, Yoshua, et al. "Greedy layer-wise training of deep > networks." Advances in neural information processing systems 19 (2007): 153. 
> http://www.iro.umontreal.ca/~lisa/pointeurs/dbn_supervised_tr1282.pdf
[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-16234: -- Description: resolved... (was: given spark.speculative set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with a straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists) > Speculative Task may not be able to overwrite file > -- > > Key: SPARK-16234 > URL: https://issues.apache.org/jira/browse/SPARK-16234 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bill Chambers > > resolved... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers closed SPARK-16234. - Resolution: Resolved > Speculative Task may not be able to overwrite file > -- > > Key: SPARK-16234 > URL: https://issues.apache.org/jira/browse/SPARK-16234 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bill Chambers > > given spark.speculative set to true, I'm running a large spark job with > parquet and savemode overwrite. > Spark will speculatively try to create a task to deal with a straggler. > However, doing this comes with risk because EVEN THOUGH savemode overwrite is > selected, if the straggler completes before the original task or the original > task completes before the straggler then the job will fail due to the file > already existing. > java.io.IOException: > /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet > already exists -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16106) TaskSchedulerImpl does not correctly handle new executors on existing hosts
[ https://issues.apache.org/jira/browse/SPARK-16106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-16106. -- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 13826 [https://github.com/apache/spark/pull/13826] > TaskSchedulerImpl does not correctly handle new executors on existing hosts > --- > > Key: SPARK-16106 > URL: https://issues.apache.org/jira/browse/SPARK-16106 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Priority: Trivial > Fix For: 2.1.0 > > > The TaskSchedulerImpl updates the set of executors and hosts in each call to > {{resourceOffers}}. During this call, it also tracks whether there are any > new executors observed in {{newExecAvail}}: > {code} > executorIdToHost(o.executorId) = o.host > executorIdToTaskCount.getOrElseUpdate(o.executorId, 0) > if (!executorsByHost.contains(o.host)) { > executorsByHost(o.host) = new HashSet[String]() > executorAdded(o.executorId, o.host) > newExecAvail = true > } > {code} > However, this only detects when a new *host* is added, not when an additional > executor is added to an existing host (a relatively common event in dynamic > allocation). > The end result is that task locality and {{failedEpochs}} is not updated > correctly for new executors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
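The flaw in the quoted snippet is that the new-executor check is keyed on the host: a second executor arriving on an already-known host never triggers `executorAdded`. A minimal Python model of the corrected check, keyed on the executor rather than the host (illustrative only, not the Spark source):

```python
executors_by_host = {}

def offer(executor_id, host):
    """Return True when the resource offer introduces a previously unseen
    executor. Checking executor membership (not just host membership)
    catches a new executor landing on an existing host, the common case
    under dynamic allocation that the original code missed."""
    execs = executors_by_host.setdefault(host, set())
    is_new = executor_id not in execs
    execs.add(executor_id)
    return is_new

assert offer("exec-1", "hostA") is True    # new host, new executor
assert offer("exec-2", "hostA") is True    # same host, NEW executor: the missed case
assert offer("exec-1", "hostA") is False   # nothing new
```

With the host-keyed check from the ticket, the second call above would return False, which is exactly how task locality and `failedEpochs` ended up stale for the new executor.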
[jira] [Resolved] (SPARK-16136) Flaky Test: TaskManagerSuite "Kill other task attempts when one attempt belonging to the same task succeeds"
[ https://issues.apache.org/jira/browse/SPARK-16136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Imran Rashid resolved SPARK-16136. -- Resolution: Fixed Fix Version/s: 2.1.0 Fixed by https://github.com/apache/spark/pull/13848 > Flaky Test: TaskManagerSuite "Kill other task attempts when one attempt > belonging to the same task succeeds" > > > Key: SPARK-16136 > URL: https://issues.apache.org/jira/browse/SPARK-16136 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.0.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.1.0 > > > TaskManagerSuite "Kill other task attempts when one attempt belonging to the > same task succeeds" is flaky because it requires at least one millisecond to > elapse between when the tasks are scheduled and when the check is made for > speculatable tasks. > Fix this by using a manual clock.
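The fix pattern described here, replacing the wall clock with a manually advanced one so that elapsed-time checks become deterministic, can be sketched as follows (hypothetical names; the real suite uses Spark's own `ManualClock`):

```python
class ManualClock:
    """Deterministic clock: time advances only when the test says so."""
    def __init__(self, now_ms=0):
        self.now_ms = now_ms

    def advance(self, ms):
        self.now_ms += ms

def speculatable(clock, scheduled_at_ms, min_elapsed_ms=1):
    # The speculation check requires at least min_elapsed_ms since scheduling;
    # with a real clock, whether that millisecond has passed is timing-dependent.
    return clock.now_ms - scheduled_at_ms >= min_elapsed_ms

clock = ManualClock()
t0 = clock.now_ms
assert not speculatable(clock, t0)  # deterministic: no time has "passed" yet
clock.advance(1)                    # the test itself controls elapsed time
assert speculatable(clock, t0)
```

With the manual clock injected, the test result no longer depends on how fast the test machine runs between scheduling and the check.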
[jira] [Updated] (SPARK-16234) Speculative Task may not be able to overwrite file
[ https://issues.apache.org/jira/browse/SPARK-16234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bill Chambers updated SPARK-16234: -- Description: given spark.speculative set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with a straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists was: given spark.speculative set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with this straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists > Speculative Task may not be able to overwrite file > -- > > Key: SPARK-16234 > URL: https://issues.apache.org/jira/browse/SPARK-16234 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Bill Chambers > > given spark.speculative set to true, I'm running a large spark job with > parquet and savemode overwrite. > Spark will speculatively try to create a task to deal with a straggler. > However, doing this comes with risk because EVEN THOUGH savemode overwrite is > selected, if the straggler completes before the original task or the original > task completes before the straggler then the job will fail due to the file > already existing. 
> java.io.IOException: > /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet > already exists
[jira] [Created] (SPARK-16234) Speculative Task may not be able to overwrite file
Bill Chambers created SPARK-16234: - Summary: Speculative Task may not be able to overwrite file Key: SPARK-16234 URL: https://issues.apache.org/jira/browse/SPARK-16234 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.0.0 Reporter: Bill Chambers given spark.speculative set to true, I'm running a large spark job with parquet and savemode overwrite. Spark will speculatively try to create a task to deal with this straggler. However, doing this comes with risk because EVEN THOUGH savemode overwrite is selected, if the straggler completes before the original task or the original task completes before the straggler then the job will fail due to the file already existing. java.io.IOException: /...some-file.../part-r-00049-401da178-3343-43a4-9c8d-277cc0173bf9.gz.parquet already exists -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
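The failure mode described above is an exclusive-create race: whichever of the original task and its speculative copy finishes second finds the part file already present, regardless of the overwrite save mode. A hedged sketch of that race, using POSIX `O_EXCL` in place of the Hadoop output committer's create call:

```python
import os
import tempfile

# Hypothetical stand-in for the part file from the quoted stack trace.
path = os.path.join(tempfile.mkdtemp(), "part-r-00049.gz.parquet")

def exclusive_create(p):
    """Create `p` only if it does not exist; raise if another attempt got
    there first, mirroring the reported java.io.IOException."""
    fd = os.open(p, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)

exclusive_create(path)       # first attempt to finish (say, the speculative copy) wins
try:
    exclusive_create(path)   # the other attempt now fails: "... already exists"
    raced = False
except FileExistsError:
    raced = True
assert raced
```

Writing each attempt to an attempt-unique temporary file and committing exactly one of them is the usual way such committers avoid this race.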
[jira] [Commented] (SPARK-15767) Decision Tree Regression wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351826#comment-15351826 ] Kai Jiang commented on SPARK-15767: --- ping [~shvenkat] > Decision Tree Regression wrapper in SparkR > -- > > Key: SPARK-15767 > URL: https://issues.apache.org/jira/browse/SPARK-15767 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Reporter: Kai Jiang >Assignee: Kai Jiang > > Implement a wrapper in SparkR to support decision tree regression. R's naive > Decision Tree Regression implementation is from package rpart with signature > rpart(formula, dataframe, method="anova"). I propose we could implement an API > like spark.rpart(dataframe, formula, ...). After having implemented > decision tree classification, we could refactor these two into an API more > like rpart().
[jira] [Updated] (SPARK-16233) test_sparkSQL.R is failing
[ https://issues.apache.org/jira/browse/SPARK-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin Ren updated SPARK-16233: Description: By running {code} ./R/run-tests.sh {code} Getting error: {code} xin:spark xr$ ./R/run-tests.sh Warning: Ignoring non-spark config property: SPARK_SCALA_VERSION=2.11 Loading required package: methods Attaching package: ‘SparkR’ The following object is masked from ‘package:testthat’: describe The following objects are masked from ‘package:stats’: cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from ‘package:base’: as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union binary functions: ... functions on binary files: broadcast variables: .. functions in client.R: . test functions in sparkR.R: .Re-using existing Spark Context. Call sparkR.session.stop() or restart R to create a new Spark Context Re-using existing Spark Context. Call sparkR.session.stop() or restart R to create a new Spark Context ... include an external JAR in SparkContext: Warning: Ignoring non-spark config property: SPARK_SCALA_VERSION=2.11 .. include R packages: MLlib functions: .SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". SLF4J: Defaulting to no-operation (NOP) logger implementation SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details. 
.27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 0 bytes 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 65,622 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 70B for [label] BINARY: 1 values, 21B raw, 23B comp, 1 pages, encodings: [PLAIN, RLE, BIT_PACKED] 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 87B for [terms, list, element, list, element] BINARY: 2 values, 42B raw, 43B comp, 1 pages, encodings: [PLAIN, RLE] 27-Jun-2016 1:51:25 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 30B for [hasIntercept] BOOLEAN: 1 values, 1B raw, 3B comp, 1 pages, encodings: [PLAIN, BIT_PACKED] 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576 27-Jun-2016 
1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 0 bytes 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 49 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 90B for [labels, list, element] BINARY: 3 values, 50B raw, 50B comp, 1 pages, encodings: [PLAIN, RLE] 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on 27-Jun-2016 1:51:26 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off 27-Jun-2016 1:51:26 PM INFO: org.apache.parque
[jira] [Commented] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched be spark-submit
[ https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351784#comment-15351784 ] Yin Huai commented on SPARK-16224: -- https://github.com/apache/spark/pull/13931 should fix the issue. Can you try it? > Hive context created by HiveContext can't access Hive databases when used in > a script launched be spark-submit > -- > > Key: SPARK-16224 > URL: https://issues.apache.org/jira/browse/SPARK-16224 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: branch-2.0 >Reporter: Piotr Milanowski >Assignee: Yin Huai >Priority: Blocker > > Hi, > This is a continuation of a resolved bug > [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345] > I can access databases when using new methodology, i.e: > {code} > from pyspark.sql import SparkSession > from pyspark import SparkConf > if __name__ == "__main__": > conf = SparkConf() > hc = > SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate() > print(hc.sql("show databases").collect()) > {code} > This shows all database in hive. > However, using HiveContext, i.e.: > {code} > from pyspark.sql import HiveContext > from pyspark improt SparkContext, SparkConf > if __name__ == "__main__": > conf = SparkConf() > sc = SparkContext(conf=conf) > hive_context = HiveContext(sc) > print(hive_context.sql("show databases").collect()) > # The result is > #[Row(result='default')] > {code} > prints only default database. > I have {{hive-site.xml}} file configured. > Those snippets are for scripts launched with {{spark-submit}} command. With > pyspark those code fragments work fine, displaying all the databases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched be spark-submit
[ https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351783#comment-15351783 ] Apache Spark commented on SPARK-16224: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/13931 > Hive context created by HiveContext can't access Hive databases when used in > a script launched be spark-submit > -- > > Key: SPARK-16224 > URL: https://issues.apache.org/jira/browse/SPARK-16224 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: branch-2.0 >Reporter: Piotr Milanowski >Assignee: Yin Huai >Priority: Blocker > > Hi, > This is a continuation of a resolved bug > [SPARK-15345|https://issues.apache.org/jira/browse/SPARK-15345] > I can access databases when using new methodology, i.e: > {code} > from pyspark.sql import SparkSession > from pyspark import SparkConf > if __name__ == "__main__": > conf = SparkConf() > hc = > SparkSession.builder.config(conf=conf).enableHiveSupport().getOrCreate() > print(hc.sql("show databases").collect()) > {code} > This shows all database in hive. > However, using HiveContext, i.e.: > {code} > from pyspark.sql import HiveContext > from pyspark improt SparkContext, SparkConf > if __name__ == "__main__": > conf = SparkConf() > sc = SparkContext(conf=conf) > hive_context = HiveContext(sc) > print(hive_context.sql("show databases").collect()) > # The result is > #[Row(result='default')] > {code} > prints only default database. > I have {{hive-site.xml}} file configured. > Those snippets are for scripts launched with {{spark-submit}} command. With > pyspark those code fragments work fine, displaying all the databases. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16224) Hive context created by HiveContext can't access Hive databases when used in a script launched by spark-submit
[ https://issues.apache.org/jira/browse/SPARK-16224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351780#comment-15351780 ] Yin Huai commented on SPARK-16224: -- https://github.com/apache/spark/pull/13931 should fix the issue. Can you try it? > Hive context created by HiveContext can't access Hive databases when used in > a script launched by spark-submit
[jira] [Updated] (SPARK-15581) MLlib 2.1 Roadmap
[ https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15581: -- Description: This is a master list for MLlib improvements we are working on for the next release. Please view this as a wish list rather than a definite plan, since we don't have an accurate estimate of available resources. Due to limited review bandwidth, features appearing on this list will get higher priority during code review. But feel free to suggest new items for the list in comments. We are experimenting with this process. Your feedback would be greatly appreciated. h1. Instructions h2. For contributors: * Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark carefully. Code style, documentation, and unit tests are important. * If you are a first-time Spark contributor, please always start with a [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather than a medium/big feature. Based on our experience, mixing the development process with a big feature usually causes long delays in code review. * Never work silently. Let everyone know on the corresponding JIRA page when you start working on a feature. This is to avoid duplicate work. For small features, you don't need to wait to get the JIRA assigned. * For medium/big features or features with dependencies, please get assigned first before coding and keep the ETA updated on the JIRA. If there is no activity on the JIRA page for a certain amount of time, the JIRA should be released for other contributors. * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one after another. * Remember to add the `@Since("VERSION")` annotation to new public APIs. * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code review greatly helps to improve others' code as well as yours. h2. 
For committers: * Try to break down big features into small and specific JIRA tasks and link them properly. * Add a "starter" label to starter tasks. * Put a rough estimate for medium/big features and track the progress. * If you start reviewing a PR, please add yourself to the Shepherd field on JIRA. * If the code looks good to you, please comment "LGTM". For non-trivial PRs, please ping a maintainer to make a final pass. * After merging a PR, create and link JIRAs for Python, example code, and documentation if applicable. h1. Roadmap (*WIP*) This is NOT [a complete list of MLlib JIRAs for 2.1| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority]. We only include umbrella JIRAs and high-level tasks. Major efforts in this release: * Feature parity for the DataFrames-based API (`spark.ml`), relative to the RDD-based API * ML persistence * Python API feature parity and test coverage * R API expansion and improvements * Note about new features: As usual, we expect to expand the feature set of MLlib. However, we will prioritize API parity, bug fixes, and improvements over new features. Note `spark.mllib` is in maintenance mode now. We will accept bug fixes for it, but new features, APIs, and improvements will only be added to `spark.ml`. h2. Critical feature parity in DataFrame-based API * Umbrella JIRA: [SPARK-4591] h2. Persistence * Complete persistence within MLlib ** Python tuning (SPARK-13786) * MLlib in R format: compatibility with other languages (SPARK-15572) * Impose backwards compatibility for persistence (SPARK-15573) h2. 
Python API * Standardize unit tests for Scala and Python to improve and consolidate test coverage for Params, persistence, and other common functionality (SPARK-15571) * Improve Python API handling of Params, persistence (SPARK-14771) (SPARK-14706) ** Note: The linked JIRAs for this are incomplete. More to be created... ** Related: Implement Python meta-algorithms in Scala (to simplify persistence) (SPARK-15574) * Feature parity: The main goal of the Python API is to have feature parity with the Scala/Java API. You can find a [complete list here| https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(PySpark)%20AND%20"Target%20Version%2Fs"%20%3D%202.1.0%20ORDER%20BY%20priority%20DESC]. The tasks fall into two major categories: ** Python API for missing methods (SPARK-14813) ** Python API for new algorithms. Committers should create a JIRA for the Python API after merging a public feature in Scala/Java. h2. SparkR * Improve R formula support and implementation (SPARK-15540) *
[jira] [Commented] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351727#comment-15351727 ] Apache Spark commented on SPARK-16228: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/13930 > "Percentile" needs explicit cast to double > -- > > Key: SPARK-16228 > URL: https://issues.apache.org/jira/browse/SPARK-16228 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov > > {quote} > select percentile(cast(id as bigint), cast(0.5 as double)) from temp.bla > {quote} > Works. > {quote} > select percentile(cast(id as bigint), 0.5 ) from temp.bla > {quote} > Throws > {quote} > Error in query: No handler for Hive UDF > 'org.apache.hadoop.hive.ql.udf.UDAFPercentile': > org.apache.hadoop.hive.ql.exec.NoMatchingMethodException: No matching method > for class org.apache.hadoop.hive.ql.udf.UDAFPercentile with (bigint, > decimal(38,18)). Possible choices: _FUNC_(bigint, array<double>) > _FUNC_(bigint, double) ; line 1 pos 7 > {quote}
[jira] [Assigned] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16228: Assignee: (was: Apache Spark) > "Percentile" needs explicit cast to double
[jira] [Assigned] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16228: Assignee: Apache Spark > "Percentile" needs explicit cast to double
[jira] [Commented] (SPARK-16228) "Percentile" needs explicit cast to double
[ https://issues.apache.org/jira/browse/SPARK-16228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351708#comment-15351708 ] Dongjoon Hyun commented on SPARK-16228: --- Hi, [~epahomov] and [~srowen]. The root cause is that Spark 2.0 uses `Decimal` as a default type for literal '0.5'. This happens for `percentile_approx`, too. I guess it will happen for all double-type-only external functions. I'll make a PR for this soon. > "Percentile" needs explicit cast to double
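The type-resolution failure above can be illustrated outside of Spark. The sketch below is a pure-Python analogy, not Spark code: `percentile_stub` is a hypothetical stand-in for a double-only function like Hive's `UDAFPercentile`, and it shows why a literal typed as `Decimal` (Spark 2.0's default for `0.5`) fails dispatch while an explicit cast to double succeeds, mirroring the `cast(0.5 as double)` workaround from the report.

```python
from decimal import Decimal

def percentile_stub(values, fraction):
    """Toy stand-in for a double-only function: it accepts only float
    fractions, mirroring the _FUNC_(bigint, double) signature that Hive's
    UDAFPercentile exposes (this helper itself is hypothetical)."""
    if not isinstance(fraction, float):
        raise TypeError(
            f"no matching method for fraction of type {type(fraction).__name__}"
        )
    values = sorted(values)
    # Nearest-rank percentile: just enough logic for the illustration.
    idx = min(int(fraction * len(values)), len(values) - 1)
    return values[idx]

data = [10, 20, 30, 40]

# A literal carried as Decimal fails dispatch, like decimal(38,18) in the report...
try:
    percentile_stub(data, Decimal("0.5"))
except TypeError as e:
    print("rejected:", e)

# ...while an explicit cast to double (float) matches the expected signature.
print("accepted:", percentile_stub(data, float(Decimal("0.5"))))
```

The analogy is loose (Hive resolves overloads at query analysis time, not at call time), but it captures why the fix is either an explicit cast in the query or, as the linked PR pursues, smarter implicit coercion of decimal literals.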
[jira] [Commented] (SPARK-16220) Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality
[ https://issues.apache.org/jira/browse/SPARK-16220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351705#comment-15351705 ] Apache Spark commented on SPARK-16220: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/13929 > Revert ShowFunctions/ListFunctions in 2.0 to Reflect 1.6 Functionality > -- > > Key: SPARK-16220 > URL: https://issues.apache.org/jira/browse/SPARK-16220 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Bill Chambers > > After discussing this with [~marmbrus] and [~rxin], we've decided to revert > SPARK-15663. After doing some research it seems like this is an unnecessary > departure from 1.x functionality and does not have a reasonable substitute > that gives the same functionality. > The first step is to revert the change. After doing that, there are a few > different approaches to exposing user-defined functions: > 1. SHOW FUNCTIONS (shows all of them) + SHOW USER FUNCTIONS (Snowflake does > this) > 2. SHOW FUNCTIONS + SHOW USER FUNCTIONS + SHOW ALL FUNCTIONS > 3. SHOW FUNCTIONS + SHOW SYSTEM FUNCTIONS (or something similar) > 4. SHOW FUNCTIONS + some column designating whether a function is system-defined or user-defined. > 1. This aligns with previous functionality and then supplements it with > something a bit more specific. > 2. Is unclear because "all" is ambiguous: it is not obvious why the default would refer > only to user-defined functions. This doesn't seem like the right approach. > 3. Has the same kind of issue: it's not clear why user functions should be the > default over system functions. That doesn't seem like the correct > approach. > 4. This one seems nice because it largely achieves #1, keeps existing > functionality, and then supplements it with more. 
This also allows you, > for example, to create your own set of date functions and then search them > all in one go, as opposed to searching system and then user functions. This > would have to return two columns though, which could potentially be an issue?
[jira] [Created] (SPARK-16233) test_sparkSQL.R is failing
Xin Ren created SPARK-16233: --- Summary: test_sparkSQL.R is failing Key: SPARK-16233 URL: https://issues.apache.org/jira/browse/SPARK-16233 Project: Spark Issue Type: Bug Components: SparkR, Tests Affects Versions: 2.0.0 Reporter: Xin Ren Priority: Minor By running {code} ./R/run-tests.sh {code} Getting error: {code} 15. Error: create DataFrame from list or data.frame (@test_sparkSQL.R#277) - java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/PreInsertCastAndRename$ at org.apache.spark.sql.hive.HiveSessionState$$anon$1.<init>(HiveSessionState.scala:69) at org.apache.spark.sql.hive.HiveSessionState.analyzer$lzycompute(HiveSessionState.scala:63) at org.apache.spark.sql.hive.HiveSessionState.analyzer(HiveSessionState.scala:62) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:533) at org.apache.spark.sql.SparkSession.createDataFrame(SparkSession.scala:293) at org.apache.spark.sql.api.r.SQLUtils$.createDF(SQLUtils.scala:135) at org.apache.spark.sql.api.r.SQLUtils.createDF(SQLUtils.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.spark.api.r.RBackendHandler.handleMethodCall(RBackendHandler.scala:141) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:86) at org.apache.spark.api.r.RBackendHandler.channelRead0(RBackendHandler.scala:38) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:244) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137) at java.lang.Thread.run(Thread.java:745) 1: createDataFrame(l, c("a", "b")) at /Users/quickmobile/workspace/spark/R/lib/SparkR/tests/testthat/test_sparkSQL.R:277 2: dispatchFunc("createDataFrame(data, schema = NULL, samplingRatio = 1.0)", x, ...) 3: f(x, ...) 4: callJStatic("org.apache.spark.sql.api.r.SQLUtils", "createDF", srdd, schema$jobj, sparkSession) 5: invokeJava(isStatic = TRUE, className, methodName, ...) 6: stop(readString(conn)) DONE === Execution halted {code} Cause: most probably these tests are using 'createDataFrame(sqlContext...)' which is deprecated. 
The tests' method invocations should be updated.
[jira] [Commented] (SPARK-16233) test_sparkSQL.R is failing
[ https://issues.apache.org/jira/browse/SPARK-16233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15351703#comment-15351703 ] Xin Ren commented on SPARK-16233: - I'm working on this > test_sparkSQL.R is failing
[jira] [Updated] (SPARK-16231) PySpark ML DataFrame example fails on Vector conversion
[ https://issues.apache.org/jira/browse/SPARK-16231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16231: -- Assignee: Bryan Cutler > PySpark ML DataFrame example fails on Vector conversion > --- > > Key: SPARK-16231 > URL: https://issues.apache.org/jira/browse/SPARK-16231 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.0.1, 2.1.0 > > > The PySpark example dataframe_example.py fails when attempting to convert an > ML-style Vector (as loaded from libsvm format) to an MLlib-style Vector to be > used in stat calculations. Before the stat calculations, the ML Vectors need > to be converted to the old MLlib style with the utility function > MLUtils.convertVectorColumnsFromML