[jira] [Assigned] (SPARK-12644) Vectorize/Batch decode parquet
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12644: Assignee: Apache Spark (was: Nong Li) > Vectorize/Batch decode parquet > -- > > Key: SPARK-12644 > URL: https://issues.apache.org/jira/browse/SPARK-12644 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Apache Spark > > The parquet encodings are largely designed to decode faster in batches, > column by column. This can speed up the decoding considerably. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12644) Vectorize/Batch decode parquet
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082629#comment-15082629 ] Apache Spark commented on SPARK-12644: -- User 'nongli' has created a pull request for this issue: https://github.com/apache/spark/pull/10593 > Vectorize/Batch decode parquet > -- > > Key: SPARK-12644 > URL: https://issues.apache.org/jira/browse/SPARK-12644 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > > The parquet encodings are largely designed to decode faster in batches, > column by column. This can speed up the decoding considerably. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12644) Vectorize/Batch decode parquet
[ https://issues.apache.org/jira/browse/SPARK-12644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12644: Assignee: Nong Li (was: Apache Spark) > Vectorize/Batch decode parquet > -- > > Key: SPARK-12644 > URL: https://issues.apache.org/jira/browse/SPARK-12644 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nong Li >Assignee: Nong Li > > The parquet encodings are largely designed to decode faster in batches, > column by column. This can speed up the decoding considerably. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
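For context on the batch-decoding idea described in SPARK-12644 above, here is a minimal, heavily simplified sketch (not Spark's actual Parquet reader; the DictionaryEncodedColumn class and its methods are made up for illustration). It contrasts value-at-a-time reads with decoding a whole run of a dictionary-encoded column into a primitive array in one tight loop:
{code}
// Illustrative only: batch, column-at-a-time decoding vs. value-at-a-time reads.
object BatchDecodeSketch {
  // Hypothetical encoded column: a dictionary id per row plus the dictionary itself.
  final class DictionaryEncodedColumn(ids: Array[Int], dictionary: Array[Int]) {
    // Value-at-a-time: one call per row.
    def readValue(row: Int): Int = dictionary(ids(row))

    // Batch: decode a run of rows into a primitive array in a single tight loop,
    // avoiding per-row call overhead and giving the JIT a simple loop to optimize.
    def readBatch(start: Int, len: Int, out: Array[Int]): Unit = {
      var i = 0
      while (i < len) {
        out(i) = dictionary(ids(start + i))
        i += 1
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val col = new DictionaryEncodedColumn(Array(0, 1, 1, 0), Array(10, 20))
    println(col.readValue(1))   // 20, one value at a time
    val out = new Array[Int](4)
    col.readBatch(0, 4, out)
    println(out.mkString(","))  // 10,20,20,10, decoded as a batch
  }
}
{code}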
[jira] [Commented] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update
[ https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082631#comment-15082631 ] Apache Spark commented on SPARK-12570: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10594 > DecisionTreeRegressor: provide variance of prediction: user guide update > > > Key: SPARK-12570 > URL: https://issues.apache.org/jira/browse/SPARK-12570 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Priority: Minor > > See linked JIRA for details. This should update the table of output columns > and text. Examples are probably not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update
[ https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12570: Assignee: (was: Apache Spark) > DecisionTreeRegressor: provide variance of prediction: user guide update > > > Key: SPARK-12570 > URL: https://issues.apache.org/jira/browse/SPARK-12570 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Priority: Minor > > See linked JIRA for details. This should update the table of output columns > and text. Examples are probably not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update
[ https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12570: Assignee: Apache Spark > DecisionTreeRegressor: provide variance of prediction: user guide update > > > Key: SPARK-12570 > URL: https://issues.apache.org/jira/browse/SPARK-12570 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark >Priority: Minor > > See linked JIRA for details. This should update the table of output columns > and text. Examples are probably not needed. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082639#comment-15082639 ] Elazar Gershuni commented on SPARK-12623: - That does not answer the question/feature request. Mapping values to values can be achieved by similar code to the one you suggested: rdd.map { case (key, value) => (key, myFunctionOf(value)) } Yet Spark does provide rdd.mapValues(), for performance reasons (retaining the partitioning - avoiding the need to reshuffle when the key does not change). I would like to enjoy similar benefits for my case too. The code that you suggested does not, since spark cannot know that the key does not change. I'm sorry if that's not the place for the question/feature request, but it really isn't a user question. > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the argument to mapValues() take a key as an agument? > Alternatively, can we have a "mapKeyValuesToValues" that does? > Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyze it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082643#comment-15082643 ] somil deshmukh commented on SPARK-12632: I would like to work on this > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12641) Remove unused code related to Hadoop 0.23
[ https://issues.apache.org/jira/browse/SPARK-12641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12641. - Resolution: Fixed Assignee: Kousuke Saruta Fix Version/s: 2.0.0 > Remove unused code related to Hadoop 0.23 > - > > Key: SPARK-12641 > URL: https://issues.apache.org/jira/browse/SPARK-12641 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 2.0.0 > > > Currently we don't support Hadoop 0.23 but there is a little code related to it, > so let's clean it up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11806) Spark 2.0 deprecations and removals
[ https://issues.apache.org/jira/browse/SPARK-11806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11806: Description: This is an umbrella ticket to track things we are deprecating and removing in Spark 2.0. was: This is an umbrella ticket to track things we are deprecating and removing in Spark 2.0. All sub-tasks are currently assigned to Reynold to prevent others from picking up prematurely. > Spark 2.0 deprecations and removals > --- > > Key: SPARK-11806 > URL: https://issues.apache.org/jira/browse/SPARK-11806 > Project: Spark > Issue Type: Umbrella > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: releasenotes > > This is an umbrella ticket to track things we are deprecating and removing in > Spark 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082639#comment-15082639 ] Elazar Gershuni edited comment on SPARK-12623 at 1/5/16 8:41 AM: - That does not answer the question/feature request. Mapping values to values can be achieved by similar code to the one you suggested: {code} rdd.map { case (key, value) => (key, myFunctionOf(value)) } {code} Yet Spark does provide {{rdd.mapValues()}}, for performance reasons (retaining the partitioning - avoiding the need to reshuffle when the key does not change). I would like to enjoy similar benefits for my case too. The code that you suggested does not, since spark cannot know that the key does not change. I'm sorry if that's not the place for the question/feature request, but it really isn't a user question. was (Author: elazar): That does not answer the question/feature request. Mapping values to values can be achieved by similar code to the one you suggested: rdd.map { case (key, value) => (key, myFunctionOf(value)) } Yet Spark does provide rdd.mapValues(), for performance reasons (retaining the partitioning - avoiding the need to reshuffle when the key does not change). I would like to enjoy similar benefits for my case too. The code that you suggested does not, since spark cannot know that the key does not change. I'm sorry if that's not the place for the question/feature request, but it really isn't a user question. > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the argument to mapValues() take a key as an agument? > Alternatively, can we have a "mapKeyValuesToValues" that does? > Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyze it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it
[ https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082653#comment-15082653 ] Adrian Bridgett commented on SPARK-12622: - Ajesh - that'd be a good improvement (I raised the ticket as it's not obvious what the problem is rather than that I really want spaces to work!) I'd worry that someone would then raise a problem about "file:/tmp/f%20oo.jar" failing :-) Jayadevan - I disliked the space when I saw it (sbt assembly of some in house code) but didn't know if it was invalid or not (but made a mental note to ask if we could lose the space). FYI it looks like it's due to name in our sbt being "foo data" so we get "foo data-assembly-1.0.jar". Interestingly, the sbt example also has spaces: http://www.scala-sbt.org/0.13/docs/Howto-Project-Metadata.html > spark-submit fails on executors when jar has a space in it > -- > > Key: SPARK-12622 > URL: https://issues.apache.org/jira/browse/SPARK-12622 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.0 > Environment: Linux, Mesos >Reporter: Adrian Bridgett >Priority: Minor > > spark-submit --class foo "Foo.jar" works > but when using "f oo.jar" it starts to run and then breaks on the executors > as they cannot find the various functions. > Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this > fails immediately. > {noformat} > spark-submit --class Foo /tmp/f\ oo.jar > ... > spark.jars=file:/tmp/f%20oo.jar > 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at > http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769 > 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 > (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: > Foo$$anonfun$46 > {noformat} > SPARK-6568 is related but maybe specific to the Windows environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12401) Add support for enums in postgres
[ https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12401: Assignee: Apache Spark > Add support for enums in postgres > - > > Key: SPARK-12401 > URL: https://issues.apache.org/jira/browse/SPARK-12401 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jaka Jancar >Assignee: Apache Spark >Priority: Minor > > JSON and JSONB types [are now > converted|https://github.com/apache/spark/pull/8948/files] into strings on > the Spark side instead of throwing. It would be great if [enumerated > types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were > treated similarly instead of failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12401) Add support for enums in postgres
[ https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082659#comment-15082659 ] Apache Spark commented on SPARK-12401: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/10596 > Add support for enums in postgres > - > > Key: SPARK-12401 > URL: https://issues.apache.org/jira/browse/SPARK-12401 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jaka Jancar >Priority: Minor > > JSON and JSONB types [are now > converted|https://github.com/apache/spark/pull/8948/files] into strings on > the Spark side instead of throwing. It would be great if [enumerated > types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were > treated similarly instead of failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12401) Add support for enums in postgres
[ https://issues.apache.org/jira/browse/SPARK-12401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12401: Assignee: (was: Apache Spark) > Add support for enums in postgres > - > > Key: SPARK-12401 > URL: https://issues.apache.org/jira/browse/SPARK-12401 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.6.0 >Reporter: Jaka Jancar >Priority: Minor > > JSON and JSONB types [are now > converted|https://github.com/apache/spark/pull/8948/files] into strings on > the Spark side instead of throwing. It would be great if [enumerated > types|http://www.postgresql.org/docs/current/static/datatype-enum.html] were > treated similarly instead of failing. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API
[ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082674#comment-15082674 ] Mario Briggs commented on SPARK-12177: -- implemented here - https://github.com/mariobriggs/spark/commit/2fcbb721b99b48e336ba7ef7c317c279c9483840 > Update KafkaDStreams to new Kafka 0.9 Consumer API > -- > > Key: SPARK-12177 > URL: https://issues.apache.org/jira/browse/SPARK-12177 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Nikita Tarasenko > Labels: consumer, kafka > > Kafka 0.9 has already been released and it introduces a new consumer API that is not > compatible with the old one. So, I added the new consumer API. I made separate > classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I > didn't remove the old classes, for better backward compatibility. Users will not need > to change their old Spark applications when they upgrade to the new Spark version. > Please review my changes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
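For readers unfamiliar with the API change, here is a minimal standalone sketch of the new Kafka 0.9 consumer API that this work wraps. The broker address, group id and topic name are placeholders, and this is plain Kafka client usage, not the proposed Spark integration itself (see the linked commit for that):
{code}
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

object NewConsumerApiSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "example-group")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

    // The 0.9 consumer subscribes and polls; there is no simple/high-level split
    // as in the old 0.8 API.
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Arrays.asList("example-topic"))
    try {
      val records = consumer.poll(1000)
      val it = records.iterator()
      while (it.hasNext) {
        val r = it.next()
        println(s"${r.partition()}/${r.offset()}: ${r.value()}")
      }
    } finally {
      consumer.close()
    }
  }
}
{code}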
[jira] [Commented] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082687#comment-15082687 ] Lunen commented on SPARK-12403: --- I've managed to get in contact with the people who develop the Spark ODBC drivers. They told me that they OEM the driver to Databricks and that they don't understand why they would not make the latest driver available. I've also tested a trial version of the developer's latest driver and it works perfectly fine. I've asked on Databricks' forum and sent emails to their sales and info departments explaining the situation. Hopefully someone can help. > "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore > > > Key: SPARK-12403 > URL: https://issues.apache.org/jira/browse/SPARK-12403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: ODBC connector query >Reporter: Lunen > > We are unable to query the SPARK tables using the ODBC driver from Simba > Spark(Databricks - "Simba Spark ODBC Driver 1.0") We are able to do a show > databases and show tables, but not any queries. eg. > Working: > Select * from openquery(SPARK,'SHOW DATABASES') > Select * from openquery(SPARK,'SHOW TABLES') > Not working: > Select * from openquery(SPARK,'Select * from lunentest') > The error I get is: > OLE DB provider "MSDASQL" for linked server "SPARK" returned message > "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest". > Msg 7321, Level 16, State 2, Line 2 > An error occurred while preparing the query "Select * from lunentest" for > execution against OLE DB provider "MSDASQL" for linked server "SPARK" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12403) "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore
[ https://issues.apache.org/jira/browse/SPARK-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12403. --- Resolution: Not A Problem OK, but as far as I can tell from this conversation it's an issue with a third-party ODBC driver. > "Simba Spark ODBC Driver 1.0" not working with 1.5.2 anymore > > > Key: SPARK-12403 > URL: https://issues.apache.org/jira/browse/SPARK-12403 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1, 1.5.2 > Environment: ODBC connector query >Reporter: Lunen > > We are unable to query the SPARK tables using the ODBC driver from Simba > Spark(Databricks - "Simba Spark ODBC Driver 1.0") We are able to do a show > databases and show tables, but not any queries. eg. > Working: > Select * from openquery(SPARK,'SHOW DATABASES') > Select * from openquery(SPARK,'SHOW TABLES') > Not working: > Select * from openquery(SPARK,'Select * from lunentest') > The error I get is: > OLE DB provider "MSDASQL" for linked server "SPARK" returned message > "[Simba][SQLEngine] (31740) Table or view not found: spark..lunentest". > Msg 7321, Level 16, State 2, Line 2 > An error occurred while preparing the query "Select * from lunentest" for > execution against OLE DB provider "MSDASQL" for linked server "SPARK" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12645) SparkR add function hash
Yanbo Liang created SPARK-12645: --- Summary: SparkR add function hash Key: SPARK-12645 URL: https://issues.apache.org/jira/browse/SPARK-12645 Project: Spark Issue Type: Improvement Components: SparkR Reporter: Yanbo Liang SparkR add function hash for DataFrame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12645) SparkR add function hash
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12645: Summary: SparkR add function hash (was: SparkR add function hash) > SparkR add function hash > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > SparkR add function hash for DataFrame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12645: Summary: SparkR support hash function (was: SparkR add function hash ) > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > SparkR add function hash for DataFrame -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-12645: Description: Add hash function for SparkR (was: SparkR add function hash for DataFrame) > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082732#comment-15082732 ] Apache Spark commented on SPARK-12645: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/10597 > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12645: Assignee: (was: Apache Spark) > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12645) SparkR support hash function
[ https://issues.apache.org/jira/browse/SPARK-12645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12645: Assignee: Apache Spark > SparkR support hash function > - > > Key: SPARK-12645 > URL: https://issues.apache.org/jira/browse/SPARK-12645 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Yanbo Liang >Assignee: Apache Spark > > Add hash function for SparkR -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12623) map key_values to values
[ https://issues.apache.org/jira/browse/SPARK-12623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082739#comment-15082739 ] Sean Owen commented on SPARK-12623: --- There is a {{preservesPartitioning}} flag on some API methods that lets you specify that your function of {{(key, value)}} pairs won't change keys, or at least won't change the partitioning. Unfortunately, for historical reasons this wasn't exposed on the {{map()}} function, but was exposed on {{mapPartitions}}. It's a little clunky to invoke if you only need map, but not much -- you get an iterator that you then map as before. That would at least let you do what you're trying to do. As to exposing a specialized method for this, yeah it's not crazy or anything but I doubt it would be viewed as worth it when there's a fairly direct way to do what you want. (Or else, I'd say argue for a new param to map, but that has its own obscure issues.) > map key_values to values > > > Key: SPARK-12623 > URL: https://issues.apache.org/jira/browse/SPARK-12623 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Elazar Gershuni >Priority: Minor > Labels: easyfix, features, performance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > Why doesn't the argument to mapValues() take a key as an agument? > Alternatively, can we have a "mapKeyValuesToValues" that does? > Use case: I want to write a simpler analyzer that takes the argument to > map(), and analyze it to see whether it (trivially) doesn't change the key, > e.g. > g = lambda kv: (kv[0], f(kv[0], kv[1])) > rdd.map(g) > Problem is, if I find that it is the case, I can't call mapValues() with that > function, as in `rdd.mapValues(lambda kv: g(kv)[1])`, since mapValues > receives only `v` as an argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
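A minimal, self-contained sketch of the workaround described above (the local master and toy data are assumptions for illustration, not from the issue):
{code}
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object MapWithKeySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MapWithKeySketch").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(2))

    // A key-aware "mapValues": the function sees the key, and because we promise
    // not to change it (preservesPartitioning = true), the HashPartitioner is kept
    // and a later reduceByKey/join on the same key needs no reshuffle.
    val result = rdd.mapPartitions(
      iter => iter.map { case (k, v) => (k, v * 10) },
      preservesPartitioning = true)

    assert(result.partitioner == rdd.partitioner)
    println(result.collect().toSeq)
    sc.stop()
  }
}
{code}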
[jira] [Commented] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification
[ https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082764#comment-15082764 ] Vijay Kiran commented on SPARK-12630: - I've made the changes, after I run the tests, I'll open a PR. > Make Parameter Descriptions Consistent for PySpark MLlib Classification > --- > > Key: SPARK-12630 > URL: https://issues.apache.org/jira/browse/SPARK-12630 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > classification.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12646) Support _HOST in kerberos principal for connecting to secure cluster
Hari Krishna Dara created SPARK-12646: - Summary: Support _HOST in kerberos principal for connecting to secure cluster Key: SPARK-12646 URL: https://issues.apache.org/jira/browse/SPARK-12646 Project: Spark Issue Type: Improvement Components: YARN Reporter: Hari Krishna Dara Priority: Minor Hadoop supports _HOST as a token that is dynamically replaced with the actual hostname at the time Kerberos authentication is done. This is supported in many Hadoop stacks including YARN. When configuring Spark to connect to a secure cluster (e.g., yarn-cluster or yarn-client as master), it would be natural to extend support for this token to Spark as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
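For reference, the substitution being requested is the one Hadoop already performs for its own services; a small sketch using Hadoop's own utility (the principal and realm below are examples):
{code}
import java.net.InetAddress
import org.apache.hadoop.security.SecurityUtil

object HostTokenSketch {
  def main(args: Array[String]): Unit = {
    val configured = "spark/_HOST@EXAMPLE.COM"
    val hostname = InetAddress.getLocalHost.getCanonicalHostName
    // SecurityUtil replaces the _HOST token with the supplied hostname,
    // e.g. spark/node1.example.com@EXAMPLE.COM
    val resolved = SecurityUtil.getServerPrincipal(configured, hostname)
    println(resolved)
  }
}
{code}
The request is for Spark to apply the same substitution to the principal it is given (e.g. via --principal) before authenticating.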
[jira] [Assigned] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification
[ https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12630: Assignee: (was: Apache Spark) > Make Parameter Descriptions Consistent for PySpark MLlib Classification > --- > > Key: SPARK-12630 > URL: https://issues.apache.org/jira/browse/SPARK-12630 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > classification.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification
[ https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082821#comment-15082821 ] Apache Spark commented on SPARK-12630: -- User 'vijaykiran' has created a pull request for this issue: https://github.com/apache/spark/pull/10598 > Make Parameter Descriptions Consistent for PySpark MLlib Classification > --- > > Key: SPARK-12630 > URL: https://issues.apache.org/jira/browse/SPARK-12630 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > classification.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification
[ https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12630: Assignee: Apache Spark > Make Parameter Descriptions Consistent for PySpark MLlib Classification > --- > > Key: SPARK-12630 > URL: https://issues.apache.org/jira/browse/SPARK-12630 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > classification.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it
[ https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082829#comment-15082829 ] Sean Owen commented on SPARK-12622: --- I don't see details of the actual problem here. Everything so far looks correct. {{file:/tmp/f%20oo.jar}} is a valid URI for the file, so that can't be rejected. What breaks? > spark-submit fails on executors when jar has a space in it > -- > > Key: SPARK-12622 > URL: https://issues.apache.org/jira/browse/SPARK-12622 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.0 > Environment: Linux, Mesos >Reporter: Adrian Bridgett >Priority: Minor > > spark-submit --class foo "Foo.jar" works > but when using "f oo.jar" it starts to run and then breaks on the executors > as they cannot find the various functions. > Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this > fails immediately. > {noformat} > spark-submit --class Foo /tmp/f\ oo.jar > ... > spark.jars=file:/tmp/f%20oo.jar > 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at > http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769 > 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 > (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: > Foo$$anonfun$46 > {noformat} > SPARK-6568 is related but maybe specific to the Windows environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
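A small sketch of the URI handling in question (illustrative only, not Spark's fetch code), showing that the encoded form decodes back to the on-disk path with the space:
{code}
import java.net.URI

object JarUriSketch {
  def main(args: Array[String]): Unit = {
    val uri = new URI("file:/tmp/f%20oo.jar")
    println(uri.getPath)     // /tmp/f oo.jar   (decoded: the real filename)
    println(uri.getRawPath)  // /tmp/f%20oo.jar (still percent-encoded)
    // If the encoded form were ever treated as a literal filename, the lookup
    // would target "f%20oo.jar", which does not exist on disk.
  }
}
{code}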
[jira] [Created] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
Pete Robbins created SPARK-12647: Summary: 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator Key: SPARK-12647 URL: https://issues.apache.org/jira/browse/SPARK-12647 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Pete Robbins Priority: Minor All 1.6 branch builds failing eg https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ 3 did not equal 2 PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082846#comment-15082846 ] Sean Owen commented on SPARK-12647: --- [~robbinspg] rather than make a new JIRA, you should reopen your existing one and provide another PR. The additional change must logically go with your original one. > 1.6 branch test failure > o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of > reducers: aggregate operator > --- > > Key: SPARK-12647 > URL: https://issues.apache.org/jira/browse/SPARK-12647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Pete Robbins >Priority: Minor > > All 1.6 branch builds failing eg > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ > 3 did not equal 2 > PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082847#comment-15082847 ] Apache Spark commented on SPARK-12647: -- User 'robbinspg' has created a pull request for this issue: https://github.com/apache/spark/pull/10599 > 1.6 branch test failure > o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of > reducers: aggregate operator > --- > > Key: SPARK-12647 > URL: https://issues.apache.org/jira/browse/SPARK-12647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Pete Robbins >Priority: Minor > > All 1.6 branch builds failing eg > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ > 3 did not equal 2 > PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12647: Assignee: (was: Apache Spark) > 1.6 branch test failure > o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of > reducers: aggregate operator > --- > > Key: SPARK-12647 > URL: https://issues.apache.org/jira/browse/SPARK-12647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Pete Robbins >Priority: Minor > > All 1.6 branch builds failing eg > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ > 3 did not equal 2 > PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12647: Assignee: Apache Spark > 1.6 branch test failure > o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of > reducers: aggregate operator > --- > > Key: SPARK-12647 > URL: https://issues.apache.org/jira/browse/SPARK-12647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Pete Robbins >Assignee: Apache Spark >Priority: Minor > > All 1.6 branch builds failing eg > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ > 3 did not equal 2 > PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12633: Assignee: Apache Spark > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082853#comment-15082853 ] Apache Spark commented on SPARK-12633: -- User 'vijaykiran' has created a pull request for this issue: https://github.com/apache/spark/pull/10600 > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082854#comment-15082854 ] Vijay Kiran commented on SPARK-12633: - Opened a PR https://github.com/apache/spark/pull/10600 > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12633: Assignee: (was: Apache Spark) > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vijay Kiran updated SPARK-12633: Comment: was deleted (was: Opened a PR https://github.com/apache/spark/pull/10600) > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it
[ https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082855#comment-15082855 ] Adrian Bridgett commented on SPARK-12622: - The job fails with the ClassNotFoundException; if I rename the jar file and resubmit, it all works. > spark-submit fails on executors when jar has a space in it > -- > > Key: SPARK-12622 > URL: https://issues.apache.org/jira/browse/SPARK-12622 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.0 > Environment: Linux, Mesos >Reporter: Adrian Bridgett >Priority: Minor > > spark-submit --class foo "Foo.jar" works > but when using "f oo.jar" it starts to run and then breaks on the executors > as they cannot find the various functions. > Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this > fails immediately. > {noformat} > spark-submit --class Foo /tmp/f\ oo.jar > ... > spark.jars=file:/tmp/f%20oo.jar > 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at > http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769 > 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 > (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: > Foo$$anonfun$46 > {noformat} > SPARK-6568 is related but maybe specific to the Windows environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082859#comment-15082859 ] Kazuaki Ishizaki commented on SPARK-3785: - Using CUDA is an intermediate approach to evaluate idea A. A future version will drive GPU code from a Spark program without writing CUDA code by hand. That version may generate a GPU binary through CUDA or OpenCL by using it as a backend in a compiler. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to add support for off-loading computations to the > GPU, e.g. via an open-cl binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it
[ https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082862#comment-15082862 ] Sean Owen commented on SPARK-12622: --- Oh I see it. Ultimately I assume it's because the JAR isn't found locally, though the question is why. This looks suspicious: {{Added JAR file:/tmpf%20oo.jar at http://10.1.201.77:43888/jars/f%oo.jar}} The second http URL can't be right. I don't have any more ideas but that looks like somewhere to start looking. > spark-submit fails on executors when jar has a space in it > -- > > Key: SPARK-12622 > URL: https://issues.apache.org/jira/browse/SPARK-12622 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 1.6.0 > Environment: Linux, Mesos >Reporter: Adrian Bridgett >Priority: Minor > > spark-submit --class foo "Foo.jar" works > but when using "f oo.jar" it starts to run and then breaks on the executors > as they cannot find the various functions. > Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this > fails immediately. > {noformat} > spark-submit --class Foo /tmp/f\ oo.jar > ... > spark.jars=file:/tmp/f%20oo.jar > 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at > http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769 > 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 > (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: > Foo$$anonfun$46 > {noformat} > SPARK-6568 is related but maybe specific to the Windows environment -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree
[ https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082872#comment-15082872 ] Vijay Kiran commented on SPARK-12634: - I'm editing tree.py. > Make Parameter Descriptions Consistent for PySpark MLlib Tree > - > > Key: SPARK-12634 > URL: https://issues.apache.org/jira/browse/SPARK-12634 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up tree.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it
[ https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082875#comment-15082875 ] Adrian Bridgett commented on SPARK-12622: - Damn - sorry, that's my obfuscation error, so sorry about that - it :-( It should read: {noformat} Added JAR file:/tmp/f%20oo.jar at http://10.1.201.77:35016/jars/f%20oo.jar with timestamp 1451917055779 {noformat} Let me also post the full stack trace: {noformat} [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - 16/01/04 14:23:00 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 0.0 (TID 20, ip-10-1-200-159.ec2.internal): java.lang.ClassNotFoundException: ProcessFoo$$anonfun$46 [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.net.URLClassLoader.findClass(URLClassLoader.java:381) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:424) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.lang.ClassLoader.loadClass(ClassLoader.java:357) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.lang.Class.forName0(Native Method) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.lang.Class.forName(Class.java:348) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) [2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371) [2016-01-04 14:23:00,054] 
{daily_tmo.py:153} INFO - at scala.collection.immutable.$colon$colon.readObject(List.scala:362) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at java.lang.reflect.Method.invoke(Method.java:497) [2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) [2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900) [2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801) [2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351) [2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000) [2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924) [2016-01-04 14:23:00,055]
[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082879#comment-15082879 ] Pete Robbins commented on SPARK-12647: -- @sowen should I close this and move the PR? > 1.6 branch test failure > o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of > reducers: aggregate operator > --- > > Key: SPARK-12647 > URL: https://issues.apache.org/jira/browse/SPARK-12647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Pete Robbins >Priority: Minor > > All 1.6 branch builds failing eg > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ > 3 did not equal 2 > PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082879#comment-15082879 ] Pete Robbins edited comment on SPARK-12647 at 1/5/16 11:30 AM: --- [~sowen] should I close this and move the PR? was (Author: robbinspg): @sowen should I close this and move the PR? > 1.6 branch test failure > o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of > reducers: aggregate operator > --- > > Key: SPARK-12647 > URL: https://issues.apache.org/jira/browse/SPARK-12647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Pete Robbins >Priority: Minor > > All 1.6 branch builds failing eg > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ > 3 did not equal 2 > PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU
[ https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082881#comment-15082881 ] Kazuaki Ishizaki commented on SPARK-3785: - # You can specify cpu-cores by using conventional Spark options like "--executor-cores". # Do you want to execute an operation for a matrix represented by an RDD? The current version has two possible GPU memory limitations: #* Since it copies all of the data in an RDD partition between CPU and GPU, a GPU kernel for a task cannot exceed the capacity of the GPU memory #* Since tasks are executed concurrently, the sum of the GPU memory required by the tasks running at any one time cannot exceed the capacity of the GPU memory. Comment 2 is a very good question. To exploit GPUs in Spark, it is necessary to devise better approaches. > Support off-loading computations to a GPU > - > > Key: SPARK-3785 > URL: https://issues.apache.org/jira/browse/SPARK-3785 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Thomas Darimont >Priority: Minor > > Are there any plans to add support for off-loading computations to the > GPU, e.g. via an open-cl binding? > http://www.jocl.org/ > https://code.google.com/p/javacl/ > http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
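The first point in the comment above amounts to bounding how many tasks run at once per executor so that their combined GPU memory fits on the device. A minimal configuration sketch follows; the values are illustrative assumptions, not recommendations.
{code}
// Sketch: cap concurrent tasks per executor via standard Spark settings (illustrative values).
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("gpu-offload-sketch")
  .set("spark.executor.cores", "4")  // cores available to each executor
  .set("spark.task.cpus", "2")       // cpus reserved per task, so at most 4/2 = 2 tasks at once
{code}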
[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator
[ https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082889#comment-15082889 ] Sean Owen commented on SPARK-12647: --- *shrug* at this point probably doesn't matter; mostly for next time here. The concern is just that someone finds your fix to the first JIRA but not the fix to the fix. I linked them here at least. > 1.6 branch test failure > o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of > reducers: aggregate operator > --- > > Key: SPARK-12647 > URL: https://issues.apache.org/jira/browse/SPARK-12647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Pete Robbins >Priority: Minor > > All 1.6 branch builds failing eg > https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/ > 3 did not equal 2 > PR for SPARK-12470 causes change in partition size so test needs updating -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree
[ https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082890#comment-15082890 ] Apache Spark commented on SPARK-12634: -- User 'vijaykiran' has created a pull request for this issue: https://github.com/apache/spark/pull/10601 > Make Parameter Descriptions Consistent for PySpark MLlib Tree > - > > Key: SPARK-12634 > URL: https://issues.apache.org/jira/browse/SPARK-12634 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up tree.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree
[ https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12634: Assignee: (was: Apache Spark) > Make Parameter Descriptions Consistent for PySpark MLlib Tree > - > > Key: SPARK-12634 > URL: https://issues.apache.org/jira/browse/SPARK-12634 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up tree.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree
[ https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12634: Assignee: Apache Spark > Make Parameter Descriptions Consistent for PySpark MLlib Tree > - > > Key: SPARK-12634 > URL: https://issues.apache.org/jira/browse/SPARK-12634 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up tree.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12331. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10384 [https://github.com/apache/spark/pull/10384] > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Priority: Minor > Fix For: 2.0.0 > > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when the > fitIntercept is false i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum(\hat( y)_i^2)/\sum(y_i^2) > The paper also describes why this should be the case. I've double checked > that the value of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12331) R^2 for regression through the origin
[ https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12331: -- Assignee: Imran Younus > R^2 for regression through the origin > - > > Key: SPARK-12331 > URL: https://issues.apache.org/jira/browse/SPARK-12331 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Imran Younus >Assignee: Imran Younus >Priority: Minor > Fix For: 2.0.0 > > > The value of R^2 (coefficient of determination) obtained from > LinearRegressionModel is not consistent with R and statsmodels when the > fitIntercept is false i.e., regression through the origin. In this case, both > R and statsmodels use the definition of R^2 given by eq(4') in the following > review paper: > https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf > Here is the definition from this paper: > R^2 = \sum(\hat( y)_i^2)/\sum(y_i^2) > The paper also describes why this should be the case. I've double checked > that the value of R^2 from statsmodels and R are consistent with this > definition. On the other hand, scikit-learn doesn't use the above definition. > I would recommend using the above definition in Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1061) allow Hadoop RDDs to be read w/ a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1061. -- Resolution: Won't Fix > allow Hadoop RDDs to be read w/ a partitioner > - > > Key: SPARK-1061 > URL: https://issues.apache.org/jira/browse/SPARK-1061 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Imran Rashid >Assignee: Imran Rashid > > Using partitioners to get narrow dependencies can save tons of time on a > shuffle. However, after saving an RDD to hdfs, and then reloading it, all > partitioner information is lost. This means that you can never get a narrow > dependency when loading data from hadoop. > I think we could get around this by: > 1) having a modified version of hadoop rdd that kept track of original part > file (or maybe just prevent splits altogether ...) > 2) add a "assumePartition(partitioner:Partitioner, verify: Boolean)" function > to RDD. It would create a new RDD, which had the exact same data but just > pretended that the RDD had the given partitioner applied to it. And if > verify=true, it could add a mapPartitionsWithIndex to check that each record > was in the right partition. > http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
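The "assumePartition(partitioner, verify)" idea in the description above can be sketched outside Spark core as a small helper that wraps an RDD and declares a partitioner without shuffling. The helper name assumePartitioned is hypothetical, and the sketch assumes the wrapped RDD already has exactly part.numPartitions partitions laid out to match.
{code}
// Rough sketch of the proposed API (hypothetical helper, not part of Spark):
// declare that an RDD already respects `part`, optionally verifying each record.
import scala.reflect.ClassTag
import org.apache.spark.{Partition, Partitioner, TaskContext}
import org.apache.spark.rdd.RDD

def assumePartitioned[K: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], part: Partitioner, verify: Boolean): RDD[(K, V)] = {
  val base =
    if (verify) {
      rdd.mapPartitionsWithIndex({ (idx, iter) =>
        iter.map { case kv @ (k, _) =>
          require(part.getPartition(k) == idx, s"key $k is not in partition $idx")
          kv
        }
      }, preservesPartitioning = true)
    } else rdd
  // Wrap with a one-to-one dependency and expose the assumed partitioner.
  new RDD[(K, V)](base) {
    override val partitioner: Option[Partitioner] = Some(part)
    override protected def getPartitions: Array[Partition] = base.partitions
    override def compute(split: Partition, context: TaskContext): Iterator[(K, V)] =
      base.iterator(split, context)
  }
}
{code}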
[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception
[ https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082921#comment-15082921 ] Tristan Reid commented on SPARK-12095: -- The SQL syntax doesn't appear to work at all. `select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl` Is that the case? > Window function rowsBetween throws exception > > > Key: SPARK-12095 > URL: https://issues.apache.org/jira/browse/SPARK-12095 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Irakli Machabeli > > From pyspark : > windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', > 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT') > Py4JError: An error occurred while calling o1107.rowsBetween. Trace: > py4j.Py4JException: Method rowsBetween([class java.lang.String, class > java.lang.Long]) does not exist > from SQL query parser fails immediately: > Py4JJavaError: An error occurred while calling o18.sql. > : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found > select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl >^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
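For reference, the error quoted in the description ("Method rowsBetween([class java.lang.String, class java.lang.Long]) does not exist") points at the frame boundaries: rowsBetween takes numeric offsets rather than the SQL-style strings used in the report. Below is a hedged Scala sketch of the intended window; df, the "value" column and the output column name are hypothetical.
{code}
// Sketch (Scala DataFrame API): frame from unbounded preceding to current row, as Longs.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val windowSpec = Window
  .partitionBy("A", "B")
  .orderBy("A", "B", "C")
  .rowsBetween(Long.MinValue, 0L)   // Long.MinValue = unbounded preceding, 0 = current row

val result = df.withColumn("runningSum", sum(col("value")).over(windowSpec))
{code}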
[jira] [Commented] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082943#comment-15082943 ] Vijay Kiran commented on SPARK-12632: - [~somi...@us.ibm.com] Did you start working on this already ? I opened PRs for other three, and made changes to these files as well. > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082943#comment-15082943 ] Vijay Kiran edited comment on SPARK-12632 at 1/5/16 12:20 PM: -- [~somi...@us.ibm.com] Did you start working on this already ? I opened PRs for other three, and made changes to these files as well. WIP: https://github.com/vijaykiran/spark/commit/f7c6c49638710cc62d36dbf3b306abed0983b30f was (Author: vijaykiran): [~somi...@us.ibm.com] Did you start working on this already ? I opened PRs for other three, and made changes to these files as well. > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082964#comment-15082964 ] Apache Spark commented on SPARK-12632: -- User 'somideshmukh' has created a pull request for this issue: https://github.com/apache/spark/pull/10602 > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12632: Assignee: Apache Spark > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12632: Assignee: (was: Apache Spark) > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15082973#comment-15082973 ] Vijay Kiran commented on SPARK-12632: - [~somi...@us.ibm.com] I added a couple of comments, I guess `recommendation.py` needs to be fixed as well, but I think [~bryanc] will have more to say on this :) > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7831) Mesos dispatcher doesn't deregister as a framework from Mesos when stopped
[ https://issues.apache.org/jira/browse/SPARK-7831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083142#comment-15083142 ] Nilanjan Raychaudhuri commented on SPARK-7831: -- I am working on a possible fix for this. I will submit a pull request soon > Mesos dispatcher doesn't deregister as a framework from Mesos when stopped > -- > > Key: SPARK-7831 > URL: https://issues.apache.org/jira/browse/SPARK-7831 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.4.0 > Environment: Spark 1.4.0-rc1, Mesos 0.2.2 (compiled from source) >Reporter: Luc Bourlier > > To run Spark on Mesos in cluster mode, a Spark Mesos dispatcher has to be > running. > It is launched using {{sbin/start-mesos-dispatcher.sh}}. The Mesos dispatcher > registers as a framework in the Mesos cluster. > After using {{sbin/stop-mesos-dispatcher.sh}} to stop the dispatcher, the > application is correctly terminated locally, but the framework is still > listed as {{active}} in the Mesos dashboard. > I would expect the framework to be de-registered when the dispatcher is > stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12648) UDF with Option[Double] throws ClassCastException
Mikael Valot created SPARK-12648: Summary: UDF with Option[Double] throws ClassCastException Key: SPARK-12648 URL: https://issues.apache.org/jira/browse/SPARK-12648 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Mikael Valot I can write an UDF that returns an Option[Double], and the DataFrame's schema is correctly inferred to be a nullable double. However I cannot seem to be able to write a UDF that takes an Option as an argument: import org.apache.spark.sql.SQLContext import org.apache.spark.{SparkContext, SparkConf} val conf = new SparkConf().setMaster("local[4]").setAppName("test") val sc = new SparkContext(conf) val sqlc = new SQLContext(sc) import sqlc.implicits._ val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", "weight") import org.apache.spark.sql.functions._ val addTwo = udf((d: Option[Double]) => d.map(_+2)) df.withColumn("plusTwo", addTwo(df("weight"))).show() => 2016-01-05T14:41:52 Executor task launch worker-0 ERROR org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1) java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) ~[na:na] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) ~[na:na] at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51) ~[spark-sql_2.10-1.6.0.jar:1.6.0] at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49) ~[spark-sql_2.10-1.6.0.jar:1.6.0] at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) ~[scala-library-2.10.5.jar:na] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
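A possible workaround sketch for the report above (not verified against 1.6.0): declare the UDF parameter as the boxed java.lang.Double, which arrives as null for missing values, and build the Option inside the function body. The df and column names are the ones from the report.
{code}
// Workaround sketch: take the boxed Double (nullable) and wrap it in Option ourselves.
import org.apache.spark.sql.functions._

val addTwo = udf((d: java.lang.Double) => Option(d).map(_ + 2))
df.withColumn("plusTwo", addTwo(df("weight"))).show()
{code}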
[jira] [Created] (SPARK-12649) support reading bucketed table
Wenchen Fan created SPARK-12649: --- Summary: support reading bucketed table Key: SPARK-12649 URL: https://issues.apache.org/jira/browse/SPARK-12649 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12649) support reading bucketed table
[ https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12649: Assignee: (was: Apache Spark) > support reading bucketed table > -- > > Key: SPARK-12649 > URL: https://issues.apache.org/jira/browse/SPARK-12649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12649) support reading bucketed table
[ https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083208#comment-15083208 ] Apache Spark commented on SPARK-12649: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/10604 > support reading bucketed table > -- > > Key: SPARK-12649 > URL: https://issues.apache.org/jira/browse/SPARK-12649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12649) support reading bucketed table
[ https://issues.apache.org/jira/browse/SPARK-12649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12649: Assignee: Apache Spark > support reading bucketed table > -- > > Key: SPARK-12649 > URL: https://issues.apache.org/jira/browse/SPARK-12649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
[ https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083258#comment-15083258 ] Anson Abraham commented on SPARK-11227: --- I am having this issue as well, in my environment. But i'm not running mesos or yarn. it only occurs w/ spark-submit. It works with spark 1.4.x, but 1.5.x > i get the same error, when my cluster is in HA mode (but non-yarn or mesos). I double checked configs and it is correct. Any help would be appreciated here. > Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1 > > > Key: SPARK-11227 > URL: https://issues.apache.org/jira/browse/SPARK-11227 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1 > Environment: OS: CentOS 6.6 > Memory: 28G > CPU: 8 > Mesos: 0.22.0 > HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager) >Reporter: Yuri Saito > > When running jar including Spark Job at HDFS HA Cluster, Mesos and > Spark1.5.1, the job throw Exception as "java.net.UnknownHostException: > nameservice1" and fail. > I do below in Terminal. > {code} > /opt/spark/bin/spark-submit \ > --class com.example.Job /jobs/job-assembly-1.0.0.jar > {code} > So, job throw below message. > {code} > 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 > (TID 0, spark003.example.com): java.lang.IllegalArgumentException: > java.net.UnknownHostException: nameservice1 > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169) > at > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) 
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:
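One hedged workaround sometimes used for the SPARK-11227 symptom above is to pass the HDFS HA client settings to Spark explicitly through the spark.hadoop.* passthrough, so the driver and executors can resolve the nameservice even when hdfs-site.xml is not visible to them. The namenode hostnames and ports below are placeholders, and whether this addresses the 1.5.x regression itself is not established here.
{code}
// Workaround sketch (placeholder hostnames): make the HA nameservice resolvable by
// shipping the HDFS client settings through SparkConf's spark.hadoop.* passthrough.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("hdfs-ha-example")
  .set("spark.hadoop.dfs.nameservices", "nameservice1")
  .set("spark.hadoop.dfs.ha.namenodes.nameservice1", "nn1,nn2")
  .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn1", "namenode1.example.com:8020")
  .set("spark.hadoop.dfs.namenode.rpc-address.nameservice1.nn2", "namenode2.example.com:8020")
  .set("spark.hadoop.dfs.client.failover.proxy.provider.nameservice1",
    "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
{code}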
[jira] [Created] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
John Vines created SPARK-12650: -- Summary: No means to specify Xmx settings for SparkSubmit in yarn-cluster mode Key: SPARK-12650 URL: https://issues.apache.org/jira/browse/SPARK-12650 Project: Spark Issue Type: Bug Affects Versions: 1.5.2 Environment: Hadoop 2.6.0 Reporter: John Vines Background- I have an app master designed to do some work and then launch a spark job. Issue- If I use yarn-cluster, then the SparkSubmit does not Xmx itself at all, leading to the jvm taking a default heap which is relatively large. This causes a large amount of vmem to be taken, so that it is killed by yarn. This can be worked around by disabling Yarn's vmem check, but that is a hack. If I run it in yarn-client mode, it's fine as long as my container has enough space for the driver, which is manageable. But I feel that the utter lack of Xmx settings for what I believe is a very small jvm is a problem. I believe this was introduced with the fix for SPARK-3884 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083311#comment-15083311 ] Daniel Darabos commented on SPARK-11293: I have a somewhat contrived example that still leaks in 1.6.0. I started {{spark-shell --master 'local-cluster[2,2,1024]'}} and ran: {code} sc.parallelize(0 to 1000, 2).map(x => x % 1 -> x).groupByKey.asInstanceOf[org.apache.spark.rdd.ShuffledRDD[Int, Int, Iterable[Int]]].setKeyOrdering(implicitly[Ordering[Int]]).mapPartitions { it => it.take(1) }.collect {code} I've added extra logging around task memory acquisition so I would be able to see what is not released. These are the logs: {code} 16/01/05 17:02:45 INFO Executor: Running task 0.0 in stage 13.0 (TID 24) 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Updating epoch to 7 and clearing cache 16/01/05 17:02:45 INFO TorrentBroadcast: Started reading broadcast variable 13 16/01/05 17:02:45 INFO MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 2.3 KB, free 7.6 KB) 16/01/05 17:02:45 INFO TorrentBroadcast: Reading broadcast variable 13 took 6 ms 16/01/05 17:02:45 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 4.5 KB, free 12.1 KB) 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Don't have map outputs for shuffle 6, fetching them 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@192.168.0.32:55147) 16/01/05 17:02:45 INFO MapOutputTrackerWorker: Got the output locations 16/01/05 17:02:45 INFO ShuffleBlockFetcherIterator: Getting 2 non-empty blocks out of 2 blocks 16/01/05 17:02:45 INFO ShuffleBlockFetcherIterator: Started 1 remote fetches in 1 ms 16/01/05 17:02:45 ERROR TaskMemoryManager: Task 24 acquire 5.0 MB for null 16/01/05 17:02:45 ERROR TaskMemoryManager: Stack trace: java.lang.Exception: here at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:187) at org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:82) at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 16/01/05 17:02:47 ERROR TaskMemoryManager: Task 24 acquire 15.0 MB for null 16/01/05 17:02:47 ERROR TaskMemoryManager: Stack trace: java.lang.Exception: here at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:187) at 
org.apache.spark.util.collection.Spillable$class.maybeSpill(Spillable.scala:82) at org.apache.spark.util.collection.ExternalAppendOnlyMap.maybeSpill(ExternalAppendOnlyMap.scala:55) at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:158) at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:45) at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:89) at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.conc
[jira] [Commented] (SPARK-11293) Spillable collections leak shuffle memory
[ https://issues.apache.org/jira/browse/SPARK-11293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083326#comment-15083326 ] Daniel Darabos commented on SPARK-11293: Sorry, my example was overly complicated. This one triggers the same leak. {code} sc.parallelize(0 to 1000, 2).map(x => x % 1 -> x).groupByKey.mapPartitions { it => it.take(1) }.collect {code} > Spillable collections leak shuffle memory > - > > Key: SPARK-11293 > URL: https://issues.apache.org/jira/browse/SPARK-11293 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.3.1, 1.4.1, 1.5.1 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Critical > Fix For: 1.6.0 > > > I discovered multiple leaks of shuffle memory while working on my memory > manager consolidation patch, which added the ability to do strict memory leak > detection for the bookkeeping that used to be performed by the > ShuffleMemoryManager. This uncovered a handful of places where tasks can > acquire execution/shuffle memory but never release it, starving themselves of > memory. > Problems that I found: > * {{ExternalSorter.stop()}} should release the sorter's shuffle/execution > memory. > * BlockStoreShuffleReader should call {{ExternalSorter.stop()}} using a > {{CompletionIterator}}. > * {{ExternalAppendOnlyMap}} exposes no equivalent of {{stop()}} for freeing > its resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
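The description above mentions wiring cleanup through a CompletionIterator. As a generic illustration (a standalone sketch, not Spark's internal class), the pattern runs a callback the first time the wrapped iterator is exhausted; note that the repro's it.take(1) never exhausts the iterator, which suggests cleanup also has to be tied to task completion rather than iterator completion alone.
{code}
// Generic sketch of the CompletionIterator pattern (not Spark's internal implementation):
// run a cleanup callback once, when the wrapped iterator reports it is exhausted.
class CompletionIterator[A](underlying: Iterator[A])(onCompletion: => Unit) extends Iterator[A] {
  private[this] var completed = false
  def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !completed) {
      completed = true
      onCompletion      // e.g. release the spillable map's execution memory
    }
    more
  }
  def next(): A = underlying.next()
}
// Usage sketch (names hypothetical): new CompletionIterator(records)(releaseMemory())
{code}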
[jira] [Commented] (SPARK-12648) UDF with Option[Double] throws ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083397#comment-15083397 ] kevin yu commented on SPARK-12648: -- I can recreate the problem, I will look into this issue. Thanks. Kevin > UDF with Option[Double] throws ClassCastException > - > > Key: SPARK-12648 > URL: https://issues.apache.org/jira/browse/SPARK-12648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Mikael Valot > > I can write an UDF that returns an Option[Double], and the DataFrame's > schema is correctly inferred to be a nullable double. > However I cannot seem to be able to write a UDF that takes an Option as an > argument: > import org.apache.spark.sql.SQLContext > import org.apache.spark.{SparkContext, SparkConf} > val conf = new SparkConf().setMaster("local[4]").setAppName("test") > val sc = new SparkContext(conf) > val sqlc = new SQLContext(sc) > import sqlc.implicits._ > val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", > "weight") > import org.apache.spark.sql.functions._ > val addTwo = udf((d: Option[Double]) => d.map(_+2)) > df.withColumn("plusTwo", addTwo(df("weight"))).show() > => > 2016-01-05T14:41:52 Executor task launch worker-0 ERROR > org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option > at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) > ~[na:na] > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) ~[na:na] > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51) > ~[spark-sql_2.10-1.6.0.jar:1.6.0] > at > org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49) > ~[spark-sql_2.10-1.6.0.jar:1.6.0] > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > ~[scala-library-2.10.5.jar:na] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12431) add local checkpointing to GraphX
[ https://issues.apache.org/jira/browse/SPARK-12431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083419#comment-15083419 ] David Youd commented on SPARK-12431: Since localCheckpoint() was partially implemented, but in a way that doesn’t work, should this be changed from “Improvement” to “Bug”? > add local checkpointing to GraphX > - > > Key: SPARK-12431 > URL: https://issues.apache.org/jira/browse/SPARK-12431 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.5.2 >Reporter: Edward Seidl > > local checkpointing was added to RDD to speed up iterative spark jobs, but > this capability hasn't been added to GraphX. Adding localCheckpoint to > GraphImpl, EdgeRDDImpl, and VertexRDDImpl greatly improved the speed of a > k-core algorithm I'm using (at the cost of fault tolerance, of course). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12651) mllib deprecation messages mention non-existent version 1.7.0
Marcelo Vanzin created SPARK-12651: -- Summary: mllib deprecation messages mention non-existent version 1.7.0 Key: SPARK-12651 URL: https://issues.apache.org/jira/browse/SPARK-12651 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.0 Reporter: Marcelo Vanzin Priority: Trivial Might be a problem in 1.6 also? {code} @Since("1.4.0") @deprecated("Support for runs is deprecated. This param will have no effect in 1.7.0.", "1.6.0") def getRuns: Int = runs {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12651) mllib deprecation messages mention non-existent version 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-12651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083474#comment-15083474 ] Sean Owen commented on SPARK-12651: --- I've got this covered in SPARK-12618 / https://github.com/apache/spark/pull/10570 already > mllib deprecation messages mention non-existent version 1.7.0 > - > > Key: SPARK-12651 > URL: https://issues.apache.org/jira/browse/SPARK-12651 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Trivial > > Might be a problem in 1.6 also? > {code} > @Since("1.4.0") > @deprecated("Support for runs is deprecated. This param will have no effect > in 1.7.0.", "1.6.0") > def getRuns: Int = runs > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12426) Docker JDBC integration tests are failing again
[ https://issues.apache.org/jira/browse/SPARK-12426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083482#comment-15083482 ] Mark Grover commented on SPARK-12426: - Sean and Josh, I got to the bottom of this. This is because docker does a poor job of surfacing the error that the docker engine is not running on the machine running the unit tests. The instructions for installing the docker engine on various OSs are at https://docs.docker.com/engine/installation/ Once installed, the docker service needs to be started, if it's not already running. On Linux, this is simply {{sudo service docker start}} and then our docker integration tests pass. Sorry that I didn't get a chance to look into it around 1.6 RC time, holidays got in the way. I am thinking of adding this info on [this wiki page|https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IntelliJ]. Please let me know if you think there is a better place; that's the best I could find. I don't seem to have access to edit that page, so can one of you please give me access? Also, I was trying to search the code for any puppet recipes we maintain for setting up the build slaves. In other words, if our Jenkins infra were wiped out, how do we make sure docker-engine is installed and running? How do we keep track of build dependencies? Thanks in advance! > Docker JDBC integration tests are failing again > --- > > Key: SPARK-12426 > URL: https://issues.apache.org/jira/browse/SPARK-12426 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 1.6.0 >Reporter: Mark Grover > > The Docker JDBC integration tests were fixed in SPARK-11796 but they seem to > be failing again on my machine (Ubuntu Precise). This was the same box that I > tested my previous commit on. Also, I am not confident this failure has much > to do with Spark, since a well-known commit where the tests were passing now > fails in the same environment. > [~sowen] mentioned on the Spark 1.6 voting thread that the tests were failing > on his Ubuntu 15 box as well. > Here's the error, fyi: > {code} > 15/12/18 10:12:50 INFO SparkContext: Successfully stopped SparkContext > 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Shutting > down remote daemon. > 15/12/18 10:12:50 INFO RemoteActorRefProvider$RemotingTerminator: Remote > daemon shut down; proceeding with flushing remote transports. 
> *** RUN ABORTED *** > com.spotify.docker.client.DockerException: > java.util.concurrent.ExecutionException: > com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: > java.io.IOException: No such file or directory > at > com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:1141) > at > com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1082) > at > com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) > at > org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) > at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) > at org.scalatest.Suite$class.callExecuteOnSuite$1(Suite.scala:1492) > at org.scalatest.Suite$$anonfun$runNestedSuites$1.apply(Suite.scala:1528) > ... > Cause: java.util.concurrent.ExecutionException: > com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: > java.io.IOException: No such file or directory > at > jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) > at > jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) > at > jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) > at > com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:1080) > at > com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:281) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:76) > at > org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:58) > at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253) > at > org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.run(DockerJDBCIntegrationSuite.scala:58) > ... > Cause: com.sp
[jira] [Commented] (SPARK-12609) Make R to JVM timeout configurable
[ https://issues.apache.org/jira/browse/SPARK-12609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083506#comment-15083506 ] Felix Cheung commented on SPARK-12609: -- It looks like the timeout to socketConnection is merely the time from establishment - we could hardcode it to something arbitrarily long (e.g. 1 day). We should also have some way to check whether the JVM backend is alive - for that we could add some kind of keep-alive ping for each request? > Make R to JVM timeout configurable > --- > > Key: SPARK-12609 > URL: https://issues.apache.org/jira/browse/SPARK-12609 > Project: Spark > Issue Type: Improvement > Components: SparkR >Reporter: Shivaram Venkataraman > > The timeout from R to the JVM is hardcoded at 6000 seconds in > https://github.com/apache/spark/blob/6c5bbd628aaedb6efb44c15f816fea8fb600decc/R/pkg/R/client.R#L22 > This results in Spark jobs that take more than 100 minutes always failing. We > should make this timeout configurable through SparkConf. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12438) Add SQLUserDefinedType support for encoder
[ https://issues.apache.org/jira/browse/SPARK-12438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-12438. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10390 [https://github.com/apache/spark/pull/10390] > Add SQLUserDefinedType support for encoder > -- > > Key: SPARK-12438 > URL: https://issues.apache.org/jira/browse/SPARK-12438 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Liang-Chi Hsieh > Fix For: 2.0.0 > > > We should add SQLUserDefinedType support for encoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12643) Set lib directory for antlr
[ https://issues.apache.org/jira/browse/SPARK-12643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-12643. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.0.0 > Set lib directory for antlr > --- > > Key: SPARK-12643 > URL: https://issues.apache.org/jira/browse/SPARK-12643 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Minor > Fix For: 2.0.0 > > > Without setting lib directory for antlr, the updates of imported grammar > files can not be detected. So SparkSqlParser.g will not be rebuilt > automatically. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12652) Upgrade py4j to the incoming version 0.9.1
Shixiong Zhu created SPARK-12652: Summary: Upgrade py4j to the incoming version 0.9.1 Key: SPARK-12652 URL: https://issues.apache.org/jira/browse/SPARK-12652 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu Upgrade py4j when py4j 0.9.1 is out. Mostly because it fixes two critical issues: SPARK-12511 and SPARK-12617 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12521) DataFrame Partitions in java does not work
[ https://issues.apache.org/jira/browse/SPARK-12521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083556#comment-15083556 ] Apache Spark commented on SPARK-12521: -- User 'xguo27' has created a pull request for this issue: https://github.com/apache/spark/pull/10473 > DataFrame Partitions in java does not work > -- > > Key: SPARK-12521 > URL: https://issues.apache.org/jira/browse/SPARK-12521 > Project: Spark > Issue Type: Bug > Components: Java API, SQL >Affects Versions: 1.5.2 >Reporter: Sergey Podolsky > > Hello, > Partition does not work in Java interface of the DataFrame: > {code} > SQLContext sqlContext = new SQLContext(sc); > Map options = new HashMap<>(); > options.put("driver", ORACLE_DRIVER); > options.put("url", ORACLE_CONNECTION_URL); > options.put("dbtable", > "(SELECT * FROM JOBS WHERE ROWNUM < 1) tt"); > options.put("lowerBound", "2704225000"); > options.put("upperBound", "2704226000"); > options.put("partitionColumn", "ID"); > options.put("numPartitions", "10"); > DataFrame jdbcDF = sqlContext.load("jdbc", options); > List jobsRows = jdbcDF.collectAsList(); > System.out.println(jobsRows.size()); > {code} > gives while expected 1000. Is it because of big decimal of boundaries or > partitioins does not work at all in Java? > Thanks. > Sergey -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12653) Re-enable test "SPARK-8489: MissingRequirementError during reflection"
Reynold Xin created SPARK-12653: --- Summary: Re-enable test "SPARK-8489: MissingRequirementError during reflection" Key: SPARK-12653 URL: https://issues.apache.org/jira/browse/SPARK-12653 Project: Spark Issue Type: Bug Components: SQL Reporter: Reynold Xin This test case was disabled in https://github.com/apache/spark/pull/10569#discussion-diff-48813840 I think we need to rebuild the jar because it was compiled against an old version of Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-12615) Remove some deprecated APIs in RDD/SparkContext
[ https://issues.apache.org/jira/browse/SPARK-12615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-12615. Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10569 [https://github.com/apache/spark/pull/10569] > Remove some deprecated APIs in RDD/SparkContext > --- > > Key: SPARK-12615 > URL: https://issues.apache.org/jira/browse/SPARK-12615 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12577) better support of parentheses in partition by and order by clause of window function's over clause
[ https://issues.apache.org/jira/browse/SPARK-12577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083590#comment-15083590 ] Thomas Sebastian commented on SPARK-12577: -- Hi Reynold, Would you share some thoughts on how you replicated this issue? - using sqlContext or the API? - which version of Spark? - a bit more detail on the failure message (what sort of exception)? Also, I see a closing-parenthesis mismatch in the PASS conditions mentioned. > better support of parentheses in partition by and order by clause of window > function's over clause > -- > > Key: SPARK-12577 > URL: https://issues.apache.org/jira/browse/SPARK-12577 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Right now, Hive's parser supports > {code} > -- PASS > SELECT SUM(1) OVER (PARTITION BY a + 1 - b * c / d FROM src; > SELECT SUM(1) OVER (PARTITION BY (a + 1 - b * c / d) FROM src; > {code} > But, the following one is not accepted > {code} > -- FAIL > SELECT SUM(1) OVER (PARTITION BY (a) + 1 - b * c / d) FROM src; > {code} > We should fix it in our own parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12317) Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf
[ https://issues.apache.org/jira/browse/SPARK-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] kevin yu updated SPARK-12317: - Summary: Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf (was: Support configurate value with unit(e.g. kb/mb/gb) in SQL) > Support configurate value for AUTO_BROADCASTJOIN_THRESHOLD and > SHUFFLE_TARGET_POSTSHUFFLE_INPUT_SIZE with unit(e.g. kb/mb/gb) in SQLConf > > > Key: SPARK-12317 > URL: https://issues.apache.org/jira/browse/SPARK-12317 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yadong Qi >Priority: Minor > > e.g. `spark.sql.autoBroadcastJoinThreshold` should be configurable as `10MB` > instead of `10485760`, because `10MB` is easier to read than `10485760`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
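A small sketch of what the proposal means in practice; the unit-suffix form is the proposed behavior, not something current SQLConf accepts, and an existing SparkContext `sc` is assumed.

{code}
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Today the threshold must be spelled out in raw bytes:
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10485760")

// The proposal would additionally accept a unit suffix, e.g.:
// sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", "10MB")
{code}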
[jira] [Commented] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083604#comment-15083604 ] Joseph K. Bradley commented on SPARK-4036: -- [~hujiayin] Thanks very much for your work on this, but I think we need to discuss this more before putting it into MLlib. The primary reasons are: * We have limited review bandwidth, and we need to focus on non-feature items currently (API improvements and completeness, bugs, etc.). * For a big new feature like this, we would need to do a proper design document and discussion before a PR. CRFs in particular are a very broad field, so it would be important to discuss scope and generality (linear vs general CRFs, applications such as NLP, vision, etc., or even a more general graphical model framework). In the meantime, I'd recommend you create a Spark package based on your work. That will let users take advantage of it, and you can encourage them to post feedback on the package site or here to continue the discussion. I'd like to close this JIRA for now, but I'll continue to watch the discussion on it. > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf, dig-hair-eye-train.model, > features.hair-eye, sample-input, sample-output > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4036) Add Conditional Random Fields (CRF) algorithm to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-4036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-4036. Resolution: Later > Add Conditional Random Fields (CRF) algorithm to Spark MLlib > > > Key: SPARK-4036 > URL: https://issues.apache.org/jira/browse/SPARK-4036 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Guoqiang Li >Assignee: Kai Sasaki > Attachments: CRF_design.1.pdf, dig-hair-eye-train.model, > features.hair-eye, sample-input, sample-output > > > Conditional random fields (CRFs) are a class of statistical modelling method > often applied in pattern recognition and machine learning, where they are > used for structured prediction. > The paper: > http://www.seas.upenn.edu/~strctlrn/bib/PDF/crf.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083610#comment-15083610 ] Joseph K. Bradley commented on SPARK-12098: --- [~yinxusen] Thanks for your work on this, but I think we need to delay this feature. It's something we'll probably want to add in the future, but we just don't have the bandwidth right now for it. Could you publish your work as a Spark package for the time being? It would be great if you could get some feedback about the package from users, so that we can get more info about how much it improves on CrossValidator. Thanks for your understanding. > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin > > Classic cross-validation requires every inner classifier to run for a fixed > number of iterations, or until convergence. This is costly, especially on massive data. The paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf) describes a promising way to reduce > the total number of cross-validation iterations with multi-armed bandit search. > Multi-armed bandit search for cross-validation (bandit search for short) > requires warm-starting of ML algorithms and fine-grained control over the inner > behavior of the cross validator. > Since there are many bandit-search algorithms for finding the best > parameter set, we intend to provide only a few of them at first, to reduce the > test/perf-test work and keep the feature stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
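For readers less familiar with the existing tuning API, here is a minimal sketch of the ml CrossValidator that this proposal would build on; `sqlContext` and a DataFrame `training` with the usual "label"/"features" columns are assumed to exist (e.g. in spark-shell). The point is that today every parameter combination is trained to completion in every fold, which is exactly the work bandit search would cut short.

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// `training`: assumed DataFrame with "label" and "features" columns.
val lr = new LogisticRegression()

val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.maxIter, Array(10, 50, 100))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// 3 folds x 9 parameter combinations = 27 full training runs today;
// bandit search would stop unpromising combinations early instead.
val cvModel = cv.fit(training)
{code}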
[jira] [Closed] (SPARK-12098) Cross validator with multi-arm bandit search
[ https://issues.apache.org/jira/browse/SPARK-12098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-12098. - Resolution: Later > Cross validator with multi-arm bandit search > > > Key: SPARK-12098 > URL: https://issues.apache.org/jira/browse/SPARK-12098 > Project: Spark > Issue Type: New Feature > Components: ML, MLlib >Reporter: Xusen Yin > > Classic cross-validation requires every inner classifier to run for a fixed > number of iterations, or until convergence. This is costly, especially on massive data. The paper > Non-stochastic Best Arm Identification and Hyperparameter Optimization > (http://arxiv.org/pdf/1502.07943v1.pdf) describes a promising way to reduce > the total number of cross-validation iterations with multi-armed bandit search. > Multi-armed bandit search for cross-validation (bandit search for short) > requires warm-starting of ML algorithms and fine-grained control over the inner > behavior of the cross validator. > Since there are many bandit-search algorithms for finding the best > parameter set, we intend to provide only a few of them at first, to reduce the > test/perf-test work and keep the feature stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3872) Rewrite the test for ActorInputStream.
[ https://issues.apache.org/jira/browse/SPARK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083632#comment-15083632 ] Josh Rosen commented on SPARK-3872: --- Is this now "Won't Fix" for 2.0? > Rewrite the test for ActorInputStream. > --- > > Key: SPARK-3872 > URL: https://issues.apache.org/jira/browse/SPARK-3872 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Prashant Sharma >Assignee: Prashant Sharma > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083631#comment-15083631 ] Joseph K. Bradley commented on SPARK-2344: -- Hi everyone, thanks a lot for your work and discussion about this. However, I think we'll need to postpone this feature because of limited review bandwidth and a need to focus on other items such as language API completeness, etc. Would you be able to post your implementations as Spark packages? For a less common algorithm such as this, it will also be important to collect feedback about how much it improves upon existing MLlib algorithms, so if you get feedback or results from users about your package, please post here. I'll close this JIRA for now but will follow it. Thanks for your understanding. > Add Fuzzy C-Means algorithm to MLlib > > > Key: SPARK-2344 > URL: https://issues.apache.org/jira/browse/SPARK-2344 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Alex >Priority: Minor > Labels: clustering > Original Estimate: 1m > Remaining Estimate: 1m > > I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib. > FCM is very similar to K-Means, which is already implemented; they differ > only in the degree of relationship each point has with each cluster (in FCM > the relationship is in the range [0..1], whereas in K-Means it is 0/1). > As part of the implementation I would like to: > - create a base class for K-Means and FCM > - implement the relationship for each algorithm differently (in its own class) > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
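As a quick illustration of the difference described in the issue (a sketch only, not the proposed MLlib implementation, with illustrative names), the fuzzy membership degrees that FCM computes in place of k-means' hard 0/1 assignment might look like this; `m` (> 1) is the usual fuzzifier and the degrees for each point sum to 1.

{code}
// Fuzzy membership of one point across all cluster centers:
// u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)), where d_j is the distance
// from the point to center j.
def memberships(point: Array[Double],
                centers: Seq[Array[Double]],
                m: Double = 2.0): Seq[Double] = {
  def dist(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Clamp distances away from zero to avoid division by zero when a point
  // coincides with a center.
  val d = centers.map(c => math.max(dist(point, c), 1e-12))
  d.map { dj =>
    1.0 / d.map(dk => math.pow(dj / dk, 2.0 / (m - 1.0))).sum
  }
}
{code}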
[jira] [Closed] (SPARK-2344) Add Fuzzy C-Means algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-2344. Resolution: Later > Add Fuzzy C-Means algorithm to MLlib > > > Key: SPARK-2344 > URL: https://issues.apache.org/jira/browse/SPARK-2344 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Alex >Priority: Minor > Labels: clustering > Original Estimate: 1m > Remaining Estimate: 1m > > I would like to add an FCM (Fuzzy C-Means) algorithm to MLlib. > FCM is very similar to K-Means, which is already implemented; they differ > only in the degree of relationship each point has with each cluster (in FCM > the relationship is in the range [0..1], whereas in K-Means it is 0/1). > As part of the implementation I would like to: > - create a base class for K-Means and FCM > - implement the relationship for each algorithm differently (in its own class) > I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
[ https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-12654: - Assignee: Thomas Graves > sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop > - > > Key: SPARK-12654 > URL: https://issues.apache.org/jira/browse/SPARK-12654 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Thomas Graves >Assignee: Thomas Graves > > On a secure hadoop cluster using pyspark or spark-shell in yarn client mode > with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute. > Then try to use: > val files = sc.wholeTextFiles("dir") > files.collect() > and it fails with: > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation > Token can be issued only with kerberos or web authentication > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963) > at > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090) > > at org.apache.hadoop.ipc.Client.call(Client.java:1451) > at org.apache.hadoop.ipc.Client.call(Client.java:1382) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) > at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source) > at > org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434) > at > org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529) > at > org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507) > at > org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120) > at > 
org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) > at > org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) > at > org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242) > at > org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55) > at > org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
Thomas Graves created SPARK-12654: - Summary: sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop Key: SPARK-12654 URL: https://issues.apache.org/jira/browse/SPARK-12654 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.0 Reporter: Thomas Graves On a secure hadoop cluster using pyspark or spark-shell in yarn client mode with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute. Then try to use: val files = sc.wholeTextFiles("dir") files.collect() and it fails with: py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090) at org.apache.hadoop.ipc.Client.call(Client.java:1451) at org.apache.hadoop.ipc.Client.call(Client.java:1382) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:483) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source) at org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029) at org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434) at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529) at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507) at org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100) at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80) at 
org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242) at org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55) at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
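A minimal self-contained sketch of the reproduction described above (the app name, wait time, and HDFS path are placeholders). Since spark.hadoop.* keys are copied into the Hadoop Configuration, setting the property on the SparkConf mirrors passing --conf spark.hadoop.cloneConf=true to spark-shell or pyspark.

{code}
import org.apache.spark.{SparkConf, SparkContext}

object WholeTextFilesCloneConfRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("wholeTextFiles-cloneConf-repro")
      .set("spark.hadoop.cloneConf", "true")
    val sc = new SparkContext(conf)

    // Per the report: on a secure cluster, wait over a minute after startup,
    // then list and read a directory on HDFS (placeholder path).
    Thread.sleep(90 * 1000L)
    val files = sc.wholeTextFiles("hdfs:///user/someuser/dir")
    println(files.collect().length)

    sc.stop()
  }
}
{code}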
[jira] [Commented] (SPARK-11798) Datanucleus jars is missing under lib_managed/jars
[ https://issues.apache.org/jira/browse/SPARK-11798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083640#comment-15083640 ] Josh Rosen commented on SPARK-11798: Datanucleus is only added as a dependency when the Hive build profile is enabled. Are you sure that you enabled that flag? > Datanucleus jars is missing under lib_managed/jars > -- > > Key: SPARK-11798 > URL: https://issues.apache.org/jira/browse/SPARK-11798 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Reporter: Jeff Zhang > > I noticed the comment in https://github.com/apache/spark/pull/9575 saying that > the Datanucleus-related jars will still be copied to lib_managed/jars, but I > don't see any jars under lib_managed/jars. The weird thing is that I see the > jars on another machine, but cannot see them on my laptop even after I > delete the whole spark project and start from scratch. Is it related to the > environment? I tried adding the following code to SparkBuild.scala to track > the issue, and it shows that `jars` is empty. > {code} > deployDatanucleusJars := { > val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data) > .filter(_.getPath.contains("org.datanucleus")) > // this is what I added > println("*") > println("fullClasspath:"+fullClasspath) > println("assembly:"+assembly) > println("jars:"+jars.map(_.getAbsolutePath()).mkString(",")) > // > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12577) better support of parentheses in partition by and order by clause of window function's over clause
[ https://issues.apache.org/jira/browse/SPARK-12577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083590#comment-15083590 ] Thomas Sebastian edited comment on SPARK-12577 at 1/5/16 7:41 PM: -- Hi Reynold, Would you share some thoughts on how you replicated this issue? - Which version of Spark? - Could you give a bit more detail on the failure (what sort of exception)? Do you mean that when the sqlContext-based queries (spark-shell) are fired as in the above FAIL condition, they do not go through, whereas they are accepted via HiveQL? Also, I see a missing closing parenthesis in the PASS examples mentioned. was (Author: thomastechs): Hi Reynold, Would you share some thoughts on how you replicated this issue? - Using sqlContext or the API? - Which version of Spark? - Could you give a bit more detail on the failure (what sort of exception)? Also, I see a missing closing parenthesis in the PASS examples mentioned. > better support of parentheses in partition by and order by clause of window > function's over clause > -- > > Key: SPARK-12577 > URL: https://issues.apache.org/jira/browse/SPARK-12577 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > > Right now, Hive's parser supports > {code} > -- PASS > SELECT SUM(1) OVER (PARTITION BY a + 1 - b * c / d FROM src; > SELECT SUM(1) OVER (PARTITION BY (a + 1 - b * c / d) FROM src; > {code} > But the following one is not accepted > {code} > -- FAIL > SELECT SUM(1) OVER (PARTITION BY (a) + 1 - b * c / d) FROM src; > {code} > We should fix it in our own parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8108) Build Hive module by default (i.e. remove -Phive profile)
[ https://issues.apache.org/jira/browse/SPARK-8108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083645#comment-15083645 ] Josh Rosen commented on SPARK-8108: --- +1 on this change; it'd let us simplify certain build scripts. Would be great if someone could investigate this. Note that we might still want to have a dummy no-op {{-Phive}} profile for compatibility with third-party packaging scripts, but maybe that's not a huge deal. > Build Hive module by default (i.e. remove -Phive profile) > - > > Key: SPARK-8108 > URL: https://issues.apache.org/jira/browse/SPARK-8108 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Reporter: Reynold Xin > > I think this is blocked by a jline conflict between Scala 2.11 and Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org